NVIDIA Ada GPU - 3rd Gen RT and 4th Gen Tensor Cores Deep Dive

NVIDIA has also introduced its 4th Generation Tensor Core architecture and 3rd Generation RT Cores on Ada GPUs. Tensor Cores have been available since Volta, and consumers got a taste of them with the Turing and Ampere GPUs. One of the key areas where Tensor Cores are put to use in AAA games is DLSS. There's a whole software stack that leverages Tensor Cores, known as NVIDIA NGX. These software-based technologies help enhance graphics fidelity with features such as Deep Learning Super Sampling (DLSS), AI InPainting, AI Super Rez, RTX Voice, and AI Slow-Mo.

While its initial debut was a bit flawed, DLSS in its 2nd iteration (DLSS 2.x) has done wonders to not only improve gaming performance but also image quality.

Let's dive into the technological advancements that allow these incredible achievements. To begin with, NVIDIA engineers started with DLSS Super Resolution and added something called Optical Multi Frame Generation based on Ada's Optical Flow Accelerator.

This accelerator analyzes two sequential frames from a particular game, capturing pixel details such as particles, reflections, lighting, and shadows.

On top of that, NVIDIA DLSS 3 also takes into account conventional game engine information such as motion vectors. The DLSS Frame Generation AI convolutional autoencoder network will then decide how to use each of the four inputs (current and prior frames, optical flow field, and motion vectors) to recreate intermediate frames in the best possible way.

NVIDIA DLSS 3 is said to reconstruct 3/4 of the first frame with DLSS Super Resolution and the full second frame with the help of the aforementioned DLSS Frame Generation. Overall, NVIDIA DLSS 3 reconstructs 7/8 of the two total frames displayed, which explains the massive performance uplift.
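The 7/8 figure follows directly from averaging the two per-frame shares quoted above. A minimal sketch of that arithmetic (the variable names are purely illustrative):

```python
from fractions import Fraction

# Share of each displayed frame that is reconstructed rather than
# conventionally rendered, as quoted in the text above.
frame_1_reconstructed = Fraction(3, 4)  # DLSS Super Resolution
frame_2_reconstructed = Fraction(1, 1)  # DLSS Frame Generation (whole frame)

frames = [frame_1_reconstructed, frame_2_reconstructed]
overall = sum(frames) / len(frames)  # average over the two displayed frames
rendered = 1 - overall               # what the GPU still renders traditionally

print(overall)   # 7/8
print(rendered)  # 1/8
```

In other words, only one-eighth of the displayed pixels across the two frames come from traditional rendering.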

Additionally, the new version of the Deep Learning Super Sampling image reconstruction technique also includes the latency-lowering NVIDIA Reflex technology.

Cyberpunk 2077 has been shown running NVIDIA DLSS 3, the brand new Ray Tracing Overdrive, and NVIDIA Reflex with up to 4x improved performance and up to 2x reduced latency. That's not all, as NVIDIA is even promising benefits for CPU-bound games, which generally didn't run much faster with DLSS 2.0. For example, the notoriously CPU-heavy Microsoft Flight Simulator gets up to 2x improved performance with the new DLSS.

Overall, NVIDIA said that over 35 games and apps have already pledged support for NVIDIA DLSS 3:

  • A Plague Tale: Requiem
  • Atomic Heart
  • Black Myth: Wukong
  • Bright Memory: Infinite
  • Chernobylite
  • Conqueror's Blade
  • Cyberpunk 2077
  • Dakar Rally
  • Deliver Us Mars
  • Destroy All Humans! 2 - Reprobed
  • Dying Light 2 Stay Human
  • F1 22
  • F.I.S.T.: Forged In Shadow Torch
  • Frostbite Engine
  • HITMAN 3
  • Hogwarts Legacy
  • ICARUS
  • Jurassic World Evolution 2
  • Justice
  • Loopmancer
  • Marauders
  • Microsoft Flight Simulator
  • Midnight Ghost Hunt
  • Mount & Blade II: Bannerlord
  • Naraka: Bladepoint
  • NVIDIA Omniverse
  • NVIDIA Racer RTX
  • PERISH
  • Portal with RTX
  • Ripout
  • S.T.A.L.K.E.R. 2: Heart of Chornobyl
  • Scathe
  • Sword and Fairy 7
  • SYNCED
  • The Lord of the Rings: Gollum
  • The Witcher 3: Wild Hunt
  • THRONE AND LIBERTY
  • Tower of Fantasy
  • Unity
  • Unreal Engine 4 & 5
  • Warhammer 40,000: Darktide

The green company also released a performance chart on some of those games running on NVIDIA DLSS 3; check it out below.

3rd Gen RT Cores, RTX, and Real-Time Ray Tracing Dissected

Next up, we have the RT Cores, which power real-time ray tracing. NVIDIA isn't going to distance itself from traditional rasterization-based rendering but will instead follow a hybrid rendering model. The new 3rd Generation RT Cores double the ray/triangle intersection testing rate of Ampere's RT Cores.

The Third-Generation RT Core found in Ada GPUs includes dedicated units known as the Opacity Micromap Engine and the Displaced Micro-Mesh Engine. The Opacity Micromap Engine evaluates Opacity Micromaps (represented by the triangle with foliage on the bottom left), which are used to accelerate alpha traversal. The Displaced Micro-Mesh Engine generates meshes of micro-triangles that are known as Displaced Micro-Meshes (represented by the triangle on the bottom right in the diagram below). Displaced Micro-Meshes allow the Ada RT Core to ray trace geometrically complex objects and environments with significantly lower BVH build time and storage costs. Finally, ray-triangle intersection testing is 2x faster in Ada’s Third-Generation RT Core compared to the Ampere GPU generation.

NVIDIA engineers have developed three new features in the Ada RT Core to enable high-performance ray tracing of highly complex geometry:

  • First, Ada’s Third-Generation RT Core features 2x Faster Ray-Triangle Intersection Throughput relative to Ampere; this enables developers to add more detail to their virtual worlds.
  • Second, Ada’s RT Core has 2x Faster Alpha Traversal; the RT Core features a new Opacity Micromap Engine to directly alpha-test geometry and significantly reduce shader-based alpha computations. With this new functionality, developers can very compactly describe irregularly shaped or translucent objects, like ferns or fences, and directly and more efficiently ray trace them with the Ada RT Core.
  • Third, the new Ada RT Core supports 10x Faster BVH Build in 20X Less BVH Space when using its new Displaced Micro-Mesh Engine to generate micro-triangles from micro-meshes on-demand. The micro-mesh is a new primitive that represents a structured mesh of micro-triangles that the Ada RT Core processes natively, saving the storage and processing compared to what is normally required when describing complex geometries using only basic triangles.

Taken together, these three advances incorporated into the Ada RT Core enable order-of-magnitude increases in richness without commensurate increases in processing time or memory consumption.

2x Faster Ray-Triangle Intersection Testing

Ray-triangle intersection testing is a computationally expensive operation that is commonly performed when rendering a ray-traced scene. Recognizing the importance of this function, with each new RTX GPU NVIDIA engineers have strived to improve intersection testing performance and efficiency. The Third-Generation RT Core in the Ada architecture provides double the throughput for ray-triangle intersection testing over Ampere (and 4x faster than the first-generation RT Core used in Turing GPUs).
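To give a sense of the work involved in a single test, here is a minimal software sketch using the well-known Möller-Trumbore algorithm. This is a textbook formulation chosen for illustration; the source does not say how the RT Core implements the test in hardware, and the function name is our own.

```python
def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-9):
    """Return (t, u, v) for a hit, or None if the ray misses the triangle.

    t is the distance along the ray; u and v are the barycentric
    weights of v1 and v2 at the hit point.
    """
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    edge1, edge2 = sub(v1, v0), sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle's plane
    inv_det = 1.0 / det

    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det
    if u < 0.0 or u > 1.0:
        return None  # outside the triangle along the first edge

    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None  # outside the triangle along the second edge

    t = dot(edge2, qvec) * inv_det
    if t <= eps:
        return None  # triangle is behind the ray origin
    return t, u, v


tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
hit = ray_triangle_intersect((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri)
miss = ray_triangle_intersect((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *tri)
print(hit, miss)
```

Each test costs a couple of cross products, several dot products, and a division, and a ray-traced frame runs it many times per ray, which is why dedicated hardware for it pays off.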

2x Faster Alpha Traversal Performance with Opacity Micromap Engine

Developers frequently use a texture’s alpha channel to economically cut out complex shapes or more generally to represent translucency. A leaf might be described using a couple of triangles, employing a texture’s alpha channel to economically capture the complex shape. A flame’s complex shape and translucency can also be approximated by alpha.

Prior to Ada’s RT Core, a developer could incorporate these kinds of content into a ray-traced scene by tagging them as not opaque. When a leaf is hit by a ray, a shader is invoked to determine how to treat the intersection, even if the ray is simply characterized as a hit or a miss. This incurs a noticeable cost. Specifically, when a warp of rays is cast towards non-opaque objects, individual ray queries may require multiple shader invocations to resolve, while other rays terminate immediately. The result is lingering live threads and commensurate inefficiency.

To efficiently handle these kinds of content, NVIDIA engineers have added an Opacity Micromap Engine to Ada’s RT Core. An opacity micromap is a virtual mesh of micro-triangles, each with an opacity state that the RT Core uses to directly resolve ray intersections with non-opaque triangles. Specifically, the barycentric coordinates of an intersection are used to address the corresponding micro-triangle’s opacity state. The opacity state may be opaque, transparent, or unknown. If opaque, then a hit is recorded and returned. If transparent, the intersection is ignored and the search for an intersection continues. If unknown, then the control is returned to the SM, invoking a shader (“anyhit”) to programmatically resolve the intersection.
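The three-way decision described above can be sketched in a few lines. The addressing scheme below (a `(column, row, is_upper)` key on a regular grid) is a simplification we made up for readability, and `resolve_hit` and the dictionary-based micromap are hypothetical names, not NVIDIA's data layout or API.

```python
from enum import Enum


class Opacity(Enum):
    OPAQUE = 0
    TRANSPARENT = 1
    UNKNOWN = 2


def micro_triangle_key(u, v, level):
    """Map barycentrics (u, v) to a micro-triangle of a 2**level subdivision.

    Hypothetical addressing: (column, row, is_upper). Real hardware
    uses its own ordering, which is not described in the article.
    """
    n = 2 ** level
    su, sv = u * n, v * n
    col, row = min(int(su), n - 1), min(int(sv), n - 1)
    is_upper = (su - col) + (sv - row) > 1.0
    return col, row, is_upper


def resolve_hit(u, v, micromap, level, any_hit_shader):
    """Return True to record a hit, False to keep searching."""
    # Entries missing from the map are treated as UNKNOWN (conservative).
    state = micromap.get(micro_triangle_key(u, v, level), Opacity.UNKNOWN)
    if state is Opacity.OPAQUE:
        return True               # hit recorded, no shader needed
    if state is Opacity.TRANSPARENT:
        return False              # intersection ignored, traversal continues
    return any_hit_shader(u, v)   # unknown: hand control back to a shader


# Level-1 micromap: the triangle is split into four micro-triangles.
micromap = {
    (0, 0, False): Opacity.OPAQUE,
    (0, 0, True): Opacity.UNKNOWN,
    (1, 0, False): Opacity.TRANSPARENT,
    (0, 1, False): Opacity.OPAQUE,
}

shader_calls = []


def any_hit(u, v):
    shader_calls.append((u, v))  # stand-in for an alpha-texture lookup
    return True


results = [resolve_hit(u, v, micromap, 1, any_hit)
           for u, v in [(0.1, 0.1), (0.8, 0.1), (0.4, 0.4)]]
print(results, len(shader_calls))
```

Of the three sample intersections, only the one landing on an unknown micro-triangle invokes the shader; the other two are resolved by the table lookup alone, which is the saving the Opacity Micromap Engine targets.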

The new Opacity Micromap Engine evaluates the opacity mask, which is a regular triangular mesh defined using the barycentric coordinate system used for reporting ray/triangle intersections. These meshes may be sized from one to sixteen million micro-triangles, with one or two bits associated with each micro-triangle. As a simple illustrative example, consider a detailed maple leaf described using two triangles and an alpha texture.
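A rough sense of the storage involved can be worked out from those figures. The sketch below assumes that each subdivision level splits every triangle into four, which matches the quoted range of one to roughly sixteen million micro-triangles (4^12 = 16,777,216), and it ignores any padding or compression. The function name is hypothetical.

```python
def micromap_size(subdivision_level, bits_per_micro_triangle):
    """Return (micro-triangle count, bytes of opacity state) for one micromap."""
    # Assumption: every level splits each triangle into four.
    micro_triangles = 4 ** subdivision_level
    total_bits = micro_triangles * bits_per_micro_triangle
    return micro_triangles, (total_bits + 7) // 8  # round up to whole bytes


for level in (0, 6, 12):
    for bits in (1, 2):
        count, size = micromap_size(level, bits)
        print(f"level {level:2d}, {bits} bit(s): "
              f"{count:>10,} micro-triangles, {size:>9,} bytes")
```

One bit per micro-triangle can only distinguish two states, while two bits are enough to encode all three states listed earlier (opaque, transparent, unknown). Even the largest case under these assumptions fits in about 4 MiB.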

10x Faster BVH Build in 20X Less BVH Space with Ada’s Displaced Micro-Mesh Engine

Geometric complexity continues to rise with every new generation. Ray tracing performance scales attractively with increases in scene complexity. When we ray trace complex environments, tracing costs increase slowly: a one-hundred-fold increase in geometry might only double tracing time.

However, creating the data structure (BVH) that makes that small increase in time possible requires roughly linear time and memory; 100x more geometry could mean 100x more BVH build time and 100x more memory. Ada’s Third-Generation RT Core with Displaced Micro-Meshes (DMM) helps significantly with both of the challenges of high geometric complexity: BVH build performance and memory/storage footprint. Asset storage and transmission costs are reduced as well.
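The gap between the two growth rates is easy to see with a toy cost model. The sketch below assumes a balanced binary BVH, so that traversal depth grows with the logarithm of the triangle count, and models build cost as simply proportional to triangle count. Both are simplifications for illustration, and the scene sizes are made up.

```python
import math


def traversal_steps(num_triangles):
    """Approximate root-to-leaf steps in a balanced binary BVH."""
    return math.log2(num_triangles)


def build_cost(num_triangles):
    """Build time and memory modeled as proportional to triangle count."""
    return num_triangles


base, scaled = 1_000_000, 100_000_000  # hypothetical scenes, 100x apart
trace_ratio = traversal_steps(scaled) / traversal_steps(base)
build_ratio = build_cost(scaled) / build_cost(base)

print(f"tracing cost grows ~{trace_ratio:.2f}x, build cost grows {build_ratio:.0f}x")
```

Under this model the per-ray cost rises by about a third while the build cost rises a hundredfold, which is why the build step and its memory footprint, rather than tracing itself, become the bottleneck that Displaced Micro-Meshes are meant to address.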

Secondary rays are generated at each primary ray hit point in the middle scene. Starting at the primary hit surfaces, they shoot off in different directions, hitting different objects. Secondary hit shading tends to be less ordered and less efficient when executing on the GPU, because different shader programs are running on different threads and often must serialize execution. Examples of secondary rays that can benefit from Shader Execution Reordering (SER) include those used for path tracing, reflections, indirect lighting, and translucency effects.

Shader Execution Reordering adds a new stage in the ray tracing pipeline which reorders and groups the secondary hit shading to have better execution locality, thus much higher overall ray-traced shading efficiency. SER can often provide up to 2X performance improvement for RT shaders in cases with a high level of divergence (such as path tracing). In testing with Cyberpunk 2077 running in RT: Overdrive Mode, we’ve measured overall performance gains of up to 44% from SER.
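The idea of regrouping work for better locality can be illustrated with a small simulation. This is a conceptual sketch only: it sorts hits by a made-up shader ID and counts how many different shaders each group of 32 threads (one warp) would have to run. It does not model the actual hardware stage or any developer-facing API, and the shader count and hit distribution are arbitrary.

```python
import random

WARP_SIZE = 32  # threads that execute together on NVIDIA GPUs


def shaders_per_warp(hit_shader_ids):
    """Count distinct shaders in each consecutive group of WARP_SIZE hits."""
    return [len(set(hit_shader_ids[i:i + WARP_SIZE]))
            for i in range(0, len(hit_shader_ids), WARP_SIZE)]


def reorder(hit_shader_ids):
    """Return hit indices grouped so that equal shader IDs sit together."""
    return sorted(range(len(hit_shader_ids)), key=lambda i: hit_shader_ids[i])


rng = random.Random(0)
hits = [rng.randrange(8) for _ in range(1024)]  # 8 hypothetical hit shaders
order = reorder(hits)
reordered = [hits[i] for i in order]

num_warps = len(hits) // WARP_SIZE
before = sum(shaders_per_warp(hits)) / num_warps
after = sum(shaders_per_warp(reordered)) / num_warps
print(f"avg distinct shaders per warp: {before:.2f} before, {after:.2f} after")
```

With incoherent secondary hits, almost every warp has to step through most of the shaders one after another; after grouping, most warps run a single shader, which is the kind of divergence reduction that reordering is aiming for.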
