NVIDIA Ampere GPU - 2nd Gen RT and 3rd Gen Tensor Cores Deep Dive
NVIDIA has also introduced its 3rd Generation Tensor core architecture and 2nd Generation RT cores on Ampere GPUs. Tensor cores have been available since Volta, and consumers got their first taste of them with the Turing GPUs. One of the key areas where Tensor cores are put to use in AAA games is DLSS. There's a whole software stack that leverages Tensor cores, known as NVIDIA NGX. These software-based technologies help enhance graphics fidelity with features such as Deep Learning Super Sampling (DLSS), AI InPainting, AI Super Rez, RTX Voice, and AI Slow-Mo.
While its initial debut was a bit flawed, DLSS in its 2nd iteration (DLSS 2.0) has done wonders to improve not only gaming performance but also image quality. In titles such as Death Stranding and Control, the technique has been shown to offer higher visual fidelity than native resolution while running at much higher framerates. With Ampere, we can expect an even bigger boost in DLSS 2.0 (and DLSS Next-Gen) performance as the deep-learning model continues working its magic in DLSS-supported titles. NVIDIA will also be adding 8K DLSS support to its Ampere GPU lineup, which will be great to test out with the 24 GB RTX 3090 graphics card.
With Ampere, Tensor cores support INT8 and INT4 precision alongside FP16, which remains fully supported. NVIDIA has been at the forefront of the deep learning revolution, supporting it since its Kepler generation of graphics cards. Today, NVIDIA has some of the most powerful AI graphics accelerators and a software stack that is widely adopted by this fast-growing industry.
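To illustrate what those reduced-precision formats mean in practice, here is a minimal sketch of symmetric linear quantization, the kind of INT8/INT4 inference arithmetic these Tensor cores accelerate. The weight values and helper functions are purely illustrative, not NVIDIA's implementation.

```python
def quantize(x, bits):
    """Symmetric linear quantization of a list of floats to signed ints."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in x) / qmax      # map the largest value to qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in x]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the quantized ints."""
    return [v * scale for v in q]

weights = [0.91, -0.42, 0.07, -0.88]
q8, s8 = quantize(weights, 8)   # every value fits in INT8: [-128, 127]
q4, s4 = quantize(weights, 4)   # every value fits in INT4: [-8, 7]

# INT4 keeps the sign and rough magnitude but loses fine detail,
# which is why these narrow formats target inference, not training.
```

Lower-precision operands let the hardware pack more multiply-accumulates into the same silicon, which is exactly the trade-off the INT8/INT4 Tensor core paths exploit.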
For its 3rd Gen Tensor cores, NVIDIA is using the same sparsity architecture it introduced on the Ampere HPC line of GPUs. While Ampere features 4 Tensor cores per SM compared to Turing's 8, they are not only based on the new 3rd Generation design but also grow in total count with the larger SM array. Each Ampere Tensor core can execute 128 FP16 FMA operations per clock, and up to 256 with sparsity. That brings the totals per SM to 512 FP16 FMA operations, or 1024 with sparsity: a 2x increase in inference performance over the Turing GPU with the updated Tensor design.
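The throughput figures above can be sanity-checked with a little arithmetic. The per-core and per-SM numbers come from the text; the SM count and boost clock below are illustrative values (roughly an RTX 3080 configuration), and each FMA is counted as two floating-point operations by convention.

```python
# Dense vs. sparse FP16 FMA throughput per SM, from the figures above.
TENSOR_CORES_PER_SM = 4
FMA_PER_TC_DENSE = 128           # FP16 FMA ops per Tensor core per clock
FMA_PER_TC_SPARSE = 256          # with structured sparsity

dense_per_sm = TENSOR_CORES_PER_SM * FMA_PER_TC_DENSE     # 512
sparse_per_sm = TENSOR_CORES_PER_SM * FMA_PER_TC_SPARSE   # 1024

# Scaling up (illustrative: 68 SMs at 1.71 GHz), counting each FMA
# as 2 floating-point operations:
sms, clock_hz = 68, 1.71e9
dense_tflops = dense_per_sm * sms * clock_hz * 2 / 1e12    # ~119 TFLOPS
sparse_tflops = sparse_per_sm * sms * clock_hz * 2 / 1e12  # ~238 TFLOPS
```

Note that the dense per-SM rate matches Turing's 8 smaller Tensor cores; the 2x gain comes from sparsity and from Ampere's larger SM array.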
2nd Gen RT Cores, RTX, and Real-Time Ray Tracing Dissected
Next up, we have the RT Cores, which are what power real-time raytracing. NVIDIA isn't distancing itself from traditional rasterization-based rendering, but is instead following a hybrid rendering model. The new 2nd Generation RT cores offer increased performance, including double the ray/triangle intersection testing rate of Turing's RT cores.
There's one RT core per SM, and all of them combined accelerate Bounding Volume Hierarchy (BVH) traversal and ray/triangle intersection testing (ray casting) functions. RT Cores work together with advanced denoising filtering, a highly efficient BVH acceleration structure developed by NVIDIA Research, and RTX-compatible APIs to achieve real-time ray tracing on a single Ampere GPU.
RT Cores traverse the BVH autonomously, and by accelerating traversal and ray/triangle intersection tests, they offload the SM, allowing it to handle other vertex, pixel, and compute shading work. Functions such as BVH building and refitting are handled by the driver, while ray generation and shading are managed by the application through new types of shaders.
To better understand the function of RT Cores, and what exactly they accelerate, we should first explain how ray tracing is performed on GPUs or CPUs without a dedicated hardware ray tracing engine. Essentially, BVH traversal would have to be performed by shader operations, taking thousands of instruction slots per ray cast to test against bounding box intersections in the BVH until finally hitting a triangle. The color at the point of intersection then contributes to the final pixel color (or, if no triangle is hit, the background color may be used to shade the pixel).
Ray tracing without hardware acceleration requires thousands of software instruction slots per ray to test successively smaller bounding boxes in the BVH structure until possibly hitting a triangle. It's a computationally intensive process, making it impossible to do on GPUs in real time without hardware-based ray tracing acceleration.
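To make the cost concrete, here is a minimal software BVH traversal sketch: a slab test against each bounding box, descending the tree until a leaf is reached. This is a simplified illustration (the node layout and a string standing in for a triangle are made up for the example), but it shows the per-ray loop that, without RT cores, would burn shader instructions for every pixel.

```python
import math
from dataclasses import dataclass, field

@dataclass
class AABB:
    lo: tuple   # (x, y, z) minimum corner
    hi: tuple   # (x, y, z) maximum corner

@dataclass
class Node:
    box: AABB
    children: list = field(default_factory=list)  # inner node
    triangle: object = None                       # leaf payload

def hits_box(origin, inv_dir, box):
    """Slab test: does the ray intersect the axis-aligned bounding box?"""
    tmin, tmax = -math.inf, math.inf
    for axis in range(3):
        t1 = (box.lo[axis] - origin[axis]) * inv_dir[axis]
        t2 = (box.hi[axis] - origin[axis]) * inv_dir[axis]
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmax >= max(tmin, 0.0)

def traverse(node, origin, inv_dir):
    """Walk the BVH, returning the first leaf the ray reaches."""
    if not hits_box(origin, inv_dir, node.box):
        return None
    if node.triangle is not None:
        return node.triangle  # real code would also do a ray/triangle test
    for child in node.children:
        hit = traverse(child, origin, inv_dir)
        if hit is not None:
            return hit
    return None

# A tiny two-level BVH: one leaf inside a root bounding box.
leaf = Node(box=AABB((0, 0, 5), (1, 1, 6)), triangle="tri0")
root = Node(box=AABB((-2, -2, 0), (2, 2, 10)), children=[leaf])

origin, direction = (0.5, 0.5, 0.0), (0.0, 0.0, 1.0)
inv_dir = tuple(1.0 / d if d != 0 else math.inf for d in direction)
hit = traverse(root, origin, inv_dir)
```

Even this toy version does a handful of comparisons per box per ray; a real scene has millions of rays against a deep tree, which is the work the RT core's fixed-function units take off the SM.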
The RT Cores in Ampere can process all the BVH traversal and ray-triangle intersection testing, saving the SM from spending the thousands of instruction slots per ray, which could be an enormous amount of instructions for an entire scene. The RT Core includes two specialized units. The first unit does bounding box tests, and the second unit does ray-triangle intersection tests.
The SM only has to launch a ray probe; the RT core performs the BVH traversal and ray-triangle tests and returns a hit or no hit to the SM. Also, unlike the last generation, the Ampere SM can process two workloads simultaneously, allowing ray-tracing and graphics/compute workloads to run concurrently.
In a visual demonstration, NVIDIA has shown how RT and Tensor cores help speed up ray tracing and shader workloads significantly, using a fully ray-traced frame from Wolfenstein: Youngblood as an example. The last-gen RTX 2080 SUPER takes 51ms to render the frame if it does it all with its shaders (CUDA cores). With RT cores and shaders working in tandem, the processing time is reduced to just 20ms, less than half. Adding Tensor cores brings the rendering time down further to just 12ms (~83 FPS).
With Ampere, however, each of these processing stages receives a huge performance uplift. On an RTX 3080, the same frame can be rendered in 37ms on the shader cores alone, 11ms with the RT and shader cores, and 6.7ms (~150 FPS) with all three core technologies working together. That's roughly half the time Turing took to render the same scene.