NVIDIA Creates Interactive World with Its Deep Learning-Based AI Model: ‘It Wouldn’t Have Been Possible Before Tensor Cores’

Dec 3

Today’s headline NVIDIA announcement was the TITAN RTX GPU. The company also put out another interesting press release, however, offering a first look at an interactive, AI-rendered virtual world based on a deep learning model.

A team of researchers at NVIDIA used a neural network, previously trained on real-world videos, to render synthetic three-dimensional environments in real time. The result is a basic driving game, as you can see below; the full demo will be showcased at the NeurIPS conference in Montreal, Canada.


Bryan Catanzaro, vice president of Applied Deep Learning Research at NVIDIA and leader of the research team, said:

NVIDIA has been inventing new ways to generate interactive graphics for 25 years, and this is the first time we can do so with a neural network. Neural networks — specifically generative models — will change how graphics are created. This will enable developers to create new scenes at a fraction of the traditional cost.

One of the main obstacles developers face when creating virtual worlds, whether for game development, telepresence, or other applications is that creating the content is expensive. This method allows artists and developers to create at a much lower cost, by using AI that learns from the real world. Before Tensor Cores, this demo would not have been possible.

Clearly, the potential here is massive. Building vast virtual worlds is the basis of modern gaming, and it is a highly time- and resource-consuming process. Being able to speed it up through AI would do wonders for game developers, but NVIDIA also expects applications in fields like virtual reality, automotive, robotics and architecture.

Those of you who aren’t afraid to get really technical can dive into the entire research paper, available here. Anyone else can read the summary below, where the researchers have outlined a couple of current limitations of their model and how they could be overcome.


We present a general video-to-video synthesis framework based on conditional GANs. Through carefully-designed generator and discriminator networks as well as a spatiotemporal adversarial objective, we can synthesize high-resolution, photorealistic, and temporally consistent videos. Extensive experiments demonstrate that our results are significantly better than the results by state-of-the-art methods. Its extension to the future video prediction task also compares favorably against the competing approaches.

Limitations and future work

Although our approach outperforms previous methods, our model still fails in a couple of situations. For example, our model struggles in synthesizing turning cars due to insufficient information in label maps. We speculate that this could be potentially addressed by adding additional 3D cues, such as depth maps.

Furthermore, our model still can not guarantee that an object has a consistent appearance across the whole video. Occasionally, a car may change its color gradually. This issue might be alleviated if object tracking information is used to enforce that the same object shares the same appearance throughout the entire video. Finally, when we perform semantic manipulations such as turning trees into buildings, visible artifacts occasionally appear as building and trees have different label shapes. This might be resolved if we train our model with coarser semantic labels, as the trained model would be less sensitive to label shapes.
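
To make the last suggestion a bit more concrete, here is a minimal Python sketch of one possible way to produce "coarser" semantic labels: replacing each small tile of a label map with its most common label, so that fine shape detail is smoothed away. This is only an illustration of the idea and not the researchers' method; the function name `coarsen_label_map`, the `block` parameter, and the toy label names are all assumptions made for this example.

```python
from collections import Counter


def coarsen_label_map(label_map, block=2):
    """Return a copy of a 2D label map where every block x block tile
    is filled with the most common label found in that tile.

    Hypothetical helper for illustration; not taken from the paper.
    """
    height, width = len(label_map), len(label_map[0])
    out = [row[:] for row in label_map]  # copy so the input stays untouched
    for top in range(0, height, block):
        for left in range(0, width, block):
            rows = range(top, min(top + block, height))
            cols = range(left, min(left + block, width))
            tile = [label_map[r][c] for r in rows for c in cols]
            majority = Counter(tile).most_common(1)[0][0]
            for r in rows:
                for c in cols:
                    out[r][c] = majority
    return out


# Toy 4x4 label map with made-up class names.
fine = [
    ["tree", "tree", "sky", "sky"],
    ["tree", "building", "sky", "sky"],
    ["road", "road", "car", "car"],
    ["road", "road", "road", "car"],
]
coarse = coarsen_label_map(fine, block=2)
for row in coarse:
    print(row)
```

In this toy input, the single "building" cell in the top-left tile and the stray "road" cell in the bottom-right tile are absorbed by their neighbors, which is the kind of shape detail a model trained on coarser labels would no longer depend on.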