NVIDIA GeForce RTX 20 Series Review Ft. RTX 2080 Ti & RTX 2080 Founders Edition Graphics Cards – Turing Ray Traces The Gaming Industry

Hassan Mujtaba & Keith May

NVIDIA Turing GPU - Turing RT and Tensor Cores Deep Dive

The other significant part of the Turing GPU, and the most talked-about feature of the whole Turing family, is its support for Tensor Cores. Tensor Cores have been available since Volta, but not on consumer cards, let alone gaming products. With Turing, Tensor Cores add INT8 and INT4 precision in addition to FP16, which is still fully supported. NVIDIA has been at the helm of the deep learning revolution, supporting it since its Kepler generation of graphics cards. Today, NVIDIA has some of the most powerful AI graphics accelerators and a software stack that is widely adopted by this fast-growing industry.

A whole software stack, known as NVIDIA NGX, leverages the Tensor Cores. These software-based technologies will help enhance graphics fidelity with features such as Deep Learning Super Sampling (DLSS), AI InPainting, AI Super Rez, and AI Slow-Mo.

As mentioned earlier, there are 8 Tensor Cores per SM and 16 in a single TPC. The flagship TU102 GPU contains 576 Tensor Cores. The Tensor Cores in a single SM can perform a total of 512 FP16 fused multiply-add (FMA) operations per clock cycle, or 1024 FP16 floating-point operations, as well as 2048 INT8 operations or 4096 INT4 operations per clock cycle. This level of performance is useful for both deep learning training and inference. The recent announcement of the Tesla T4 at GTC Japan 2018 shows that the Turing GPU has use cases in the Tesla market too.
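
The per-SM figures follow from simple multiplication. The sketch below assumes that one fused multiply-add (FMA) counts as two operations, which is how 512 FP16 operations become 1024, and that INT8 and INT4 each double the previous rate. The function names are illustrative, and the 1.5 GHz clock in the example is a hypothetical value chosen only to show the formula, not the spec of any shipping card.

```python
# Per-clock Tensor Core throughput on Turing, derived from the figures above.
# Assumption: one fused multiply-add (FMA) counts as two operations.

TENSOR_CORES_PER_SM = 8
SMS_PER_TPC = 2                      # 16 Tensor Cores per TPC / 8 per SM
FP16_FMA_PER_CORE_PER_CLOCK = 64     # 512 FP16 FMAs per SM / 8 cores

def per_sm_ops_per_clock() -> dict:
    """Per-SM, per-clock Tensor Core throughput for each precision."""
    fp16_fma = TENSOR_CORES_PER_SM * FP16_FMA_PER_CORE_PER_CLOCK   # 512
    fp16_ops = 2 * fp16_fma          # multiply + add = 2 ops per FMA -> 1024
    int8_ops = 2 * fp16_ops          # INT8 doubles the FP16 rate -> 2048
    int4_ops = 2 * int8_ops          # INT4 doubles it again -> 4096
    return {"fp16_fma": fp16_fma, "fp16_ops": fp16_ops,
            "int8_ops": int8_ops, "int4_ops": int4_ops}

def peak_tera_ops(num_sms: int, clock_ghz: float, ops_per_sm_per_clock: int) -> float:
    """Peak tera-operations per second for a given SM count and clock (GHz)."""
    return num_sms * ops_per_sm_per_clock * clock_ghz * 1e9 / 1e12

if __name__ == "__main__":
    ops = per_sm_ops_per_clock()
    tu102_sms = 576 // TENSOR_CORES_PER_SM          # full TU102: 72 SMs
    tu102_tpcs = tu102_sms // SMS_PER_TPC           # 36 TPCs
    print(ops, tu102_sms, tu102_tpcs)
    # Hypothetical 1.5 GHz clock, used only to illustrate the formula.
    print(peak_tera_ops(tu102_sms, 1.5, ops["fp16_ops"]))
```

Peak throughput scales linearly with both SM count and clock, so the same function gives a rough ceiling for any Turing configuration once those two numbers are known.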

RT Cores, RTX and Real-Time Ray Tracing Dissected

Next up are the RT Cores, which power real-time ray tracing. NVIDIA isn't distancing itself from traditional rasterization-based rendering; instead, it is following a hybrid rendering model. The reason is that while GeForce RTX cards are miles ahead of previous-generation cards in ray tracing performance, they still lack the horsepower to fully ray-trace an entire frame. Developers therefore have to work with a very small number of rays per pixel, from around 1 at the low end to around 10 at the high end.
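
A quick way to see why a hybrid model is needed is to count rays. The sketch below multiplies resolution, frame rate, and rays per pixel. It assumes primary rays only, which understates real demand since shadows, reflections, and bounces add more, and the function names are illustrative.

```python
# Rough ray budget: pixels x frames per second x rays per pixel.
# Assumption: primary rays only; shadows, reflections and bounces add more.

def rays_per_second(width: int, height: int, fps: int, rays_per_pixel: int) -> int:
    """Rays that must be cast each second for a given resolution and frame rate."""
    return width * height * fps * rays_per_pixel

def to_giga_rays(rays: int) -> float:
    """Convert a ray count per second into Giga Rays/s."""
    return rays / 1e9

if __name__ == "__main__":
    for width, height in [(1920, 1080), (3840, 2160)]:
        for rpp in (1, 10):              # the 1 to 10 rays-per-pixel range above
            rate = to_giga_rays(rays_per_second(width, height, 60, rpp))
            print(f"{width}x{height} @ 60 fps, {rpp} ray(s)/pixel: {rate:.2f} Giga Rays/s")
```

At 4K and 60 fps with 10 rays per pixel, the primary rays alone come to roughly 5 Giga Rays/s, before any shading or secondary rays are counted.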

Each SM contains one RT Core, and together these cores accelerate Bounding Volume Hierarchy (BVH) traversal and ray/triangle intersection testing (ray casting). RT Cores work together with advanced denoising filtering, a highly efficient BVH acceleration structure developed by NVIDIA Research, and RTX-compatible APIs to achieve real-time ray tracing on a single Turing GPU.

RT Cores traverse the BVH autonomously, and by accelerating traversal and ray/triangle intersection tests, they offload the SM, allowing it to handle other vertex, pixel, and compute shading work. Functions such as BVH building and refitting are handled by the driver, while ray generation and shading are managed by the application through new types of shaders.
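
NVIDIA has not published the builder its driver uses, so the following is only a generic illustration of what "building" and "refitting" a BVH mean: a textbook top-down build that splits triangles at the median centroid along the box's longest axis, plus a refit pass that recomputes boxes bottom-up when geometry moves, without changing the tree's topology. `Node`, `build`, `refit`, and `leaf_size` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]

@dataclass
class Node:
    lo: Vec3                                # AABB minimum corner
    hi: Vec3                                # AABB maximum corner
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    tris: Optional[List[int]] = None        # triangle indices; set only on leaves

def bounds(tris: List[Triangle], ids: List[int]) -> Tuple[Vec3, Vec3]:
    """Axis-aligned box enclosing every vertex of the listed triangles."""
    pts = [v for i in ids for v in tris[i]]
    lo = tuple(min(p[a] for p in pts) for a in range(3))
    hi = tuple(max(p[a] for p in pts) for a in range(3))
    return lo, hi

def centroid(t: Triangle) -> Vec3:
    return tuple(sum(v[a] for v in t) / 3.0 for a in range(3))

def build(tris: List[Triangle], ids: List[int], leaf_size: int = 2) -> Node:
    """Top-down median-split build; leaf_size must be at least 1."""
    lo, hi = bounds(tris, ids)
    if len(ids) <= leaf_size:
        return Node(lo, hi, tris=list(ids))
    axis = max(range(3), key=lambda a: hi[a] - lo[a])     # longest box axis
    ids = sorted(ids, key=lambda i: centroid(tris[i])[axis])
    mid = len(ids) // 2
    return Node(lo, hi, left=build(tris, ids[:mid], leaf_size),
                right=build(tris, ids[mid:], leaf_size))

def refit(node: Node, tris: List[Triangle]) -> None:
    """Recompute boxes bottom-up after vertices move; topology is unchanged."""
    if node.tris is not None:
        node.lo, node.hi = bounds(tris, node.tris)
        return
    refit(node.left, tris)
    refit(node.right, tris)
    node.lo = tuple(min(a, b) for a, b in zip(node.left.lo, node.right.lo))
    node.hi = tuple(max(a, b) for a, b in zip(node.left.hi, node.right.hi))

if __name__ == "__main__":
    # Four unit triangles in a row along x (illustrative data only).
    tris = [((float(i), 0.0, 0.0), (i + 1.0, 0.0, 0.0), (float(i), 1.0, 0.0))
            for i in range(4)]
    root = build(tris, list(range(len(tris))))
    print(root.left.tris, root.right.tris)   # [0, 1] [2, 3]
    # Move triangle 3 upward, then refit instead of rebuilding.
    tris[3] = ((3.0, 5.0, 0.0), (4.0, 5.0, 0.0), (3.0, 6.0, 0.0))
    refit(root, tris)
    print(root.hi)                            # (4.0, 6.0, 0.0)
```

Refitting is cheaper than rebuilding because it visits each node once and keeps the tree structure, at the cost of looser boxes if geometry moves a lot.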

To better understand the function of RT Cores, and what exactly they accelerate, we should first explain how ray tracing is performed on GPUs or CPUs without a dedicated hardware ray tracing engine. Essentially, BVH traversal has to be performed by shader operations, taking thousands of instruction slots per ray cast to test against bounding box intersections in the BVH until a triangle is finally hit. The color at the point of intersection then contributes to the final pixel color; if no triangle is hit, the background color may be used to shade the pixel.

This computationally intensive process makes real-time ray tracing impractical on GPUs without hardware-based ray tracing acceleration.
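
To make the software path concrete, here is a minimal sketch of stack-based BVH traversal as a shader might run it, using the common slab test for ray/box intersection. It is not NVIDIA's implementation: the tree in `demo_tree` is a hypothetical two-leaf example, and the traversal stops at collecting candidate triangles, which would still need exact ray-triangle tests. The `box_tests` counter shows where the per-ray work comes from, since every visited node costs a box test and deep trees multiply that across millions of rays.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Node:
    lo: Vec3                                         # AABB minimum corner
    hi: Vec3                                         # AABB maximum corner
    children: List["Node"] = field(default_factory=list)
    tris: List[int] = field(default_factory=list)    # non-empty only on leaves

def ray_hits_box(orig: Vec3, d: Vec3, lo: Vec3, hi: Vec3) -> bool:
    """Slab test: clip the ray's [0, inf) interval against each axis-aligned slab."""
    t_near, t_far = 0.0, float("inf")
    for a in range(3):
        if d[a] == 0.0:
            # Ray runs parallel to this slab, so its origin must already lie inside it.
            if orig[a] < lo[a] or orig[a] > hi[a]:
                return False
            continue
        t1 = (lo[a] - orig[a]) / d[a]
        t2 = (hi[a] - orig[a]) / d[a]
        t_near = max(t_near, min(t1, t2))
        t_far = min(t_far, max(t1, t2))
    return t_near <= t_far

def traverse(root: Node, orig: Vec3, d: Vec3) -> Tuple[List[int], int]:
    """Stack-based traversal returning (candidate triangle ids, number of box tests)."""
    candidates: List[int] = []
    box_tests = 0
    stack = [root]
    while stack:
        node = stack.pop()
        box_tests += 1
        if not ray_hits_box(orig, d, node.lo, node.hi):
            continue                        # miss: skip this node's whole subtree
        if node.tris:
            candidates.extend(node.tris)    # leaf: exact ray-triangle tests come next
        else:
            stack.extend(node.children)     # inner node: descend into smaller boxes
    return candidates, box_tests

def demo_tree() -> Node:
    """Hypothetical two-leaf BVH covering four triangles along the x axis."""
    left = Node((0.0, 0.0, 0.0), (2.0, 1.0, 1.0), tris=[0, 1])
    right = Node((2.0, 0.0, 0.0), (4.0, 1.0, 1.0), tris=[2, 3])
    return Node((0.0, 0.0, 0.0), (4.0, 1.0, 1.0), children=[left, right])

if __name__ == "__main__":
    root = demo_tree()
    print(traverse(root, (0.5, 0.5, -1.0), (0.0, 0.0, 1.0)))   # ([0, 1], 3)
    print(traverse(root, (9.0, 0.5, -1.0), (0.0, 0.0, 1.0)))   # ([], 1)
```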

The RT Cores in Turing can process all of the BVH traversal and ray-triangle intersection testing, saving the SM from spending thousands of instruction slots per ray, which adds up to an enormous number of instructions across an entire scene. The RT Core includes two specialized units: the first performs bounding box tests, and the second performs ray-triangle intersection tests. The SM only has to launch a ray probe; the RT Core does the BVH traversal and ray-triangle tests and returns a hit or no hit to the SM. The SM is thus largely freed up to do other graphics or compute work.
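
NVIDIA does not document the exact math inside the RT Core's ray-triangle unit. As an illustration of the kind of test that unit performs in hardware, the sketch below uses the widely taught Möller-Trumbore algorithm in software, plus a small `closest_hit` helper that returns either the nearest hit or `None`, mirroring the hit/no-hit answer handed back to the SM. All function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]

def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle(orig: Vec3, d: Vec3, v0: Vec3, v1: Vec3, v2: Vec3,
                 eps: float = 1e-9) -> Optional[float]:
    """Moller-Trumbore test: return hit distance t along the ray, or None for a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(d, e2)
    det = dot(e1, p)
    if abs(det) < eps:              # ray parallel to the triangle's plane
        return None
    inv_det = 1.0 / det
    s = sub(orig, v0)
    u = dot(s, p) * inv_det         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(d, q) * inv_det         # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det        # distance along the ray
    return t if t > eps else None   # ignore hits behind the ray origin

def closest_hit(orig: Vec3, d: Vec3,
                triangles: Sequence[Triangle]) -> Optional[Tuple[int, float]]:
    """Return (triangle index, t) of the nearest hit, or None if nothing is hit."""
    best: Optional[Tuple[int, float]] = None
    for i, (v0, v1, v2) in enumerate(triangles):
        t = ray_triangle(orig, d, v0, v1, v2)
        if t is not None and (best is None or t < best[1]):
            best = (i, t)
    return best

if __name__ == "__main__":
    near = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
    far = ((0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 2.0))
    print(closest_hit((0.25, 0.25, -1.0), (0.0, 0.0, 1.0), [far, near]))  # (1, 1.0)
    print(closest_hit((5.0, 5.0, -1.0), (0.0, 0.0, 1.0), [far, near]))    # None
```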

Ray tracing on Turing with RT Cores is significantly faster than ray tracing on Pascal GPUs. Turing can deliver far more Giga Rays/Sec than Pascal on different workloads, as shown in Figure 19. Pascal manages approximately 1.1 Giga Rays/Sec when ray tracing in software, at a cost of about 10 TFLOPS per Giga Ray, whereas Turing can do 10+ Giga Rays/Sec using RT Cores and run ray tracing about 10 times faster.
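
The quoted figures can be sanity-checked with a few lines of arithmetic. Assuming "10 TFLOPS / Giga Ray" means that 10 TFLOPS of shader throughput sustain 1 Giga Ray/s, each ray costs about 10,000 floating-point operations, the same order of magnitude as the thousands of instruction slots per ray described earlier. Likewise, 10 / 1.1 gives a speedup of roughly 9x, which the text rounds to 10x. The constants come from the paragraph above; the function names are illustrative.

```python
# Sanity check of the Pascal vs. Turing ray tracing figures quoted above.
# Assumption: "10 TFLOPS / Giga Ray" = TFLOPS consumed per 1 Giga Ray/s.

TFLOPS_PER_GIGA_RAY = 10.0   # Pascal, software ray tracing
PASCAL_GIGA_RAYS = 1.1       # Giga Rays/s, software
TURING_GIGA_RAYS = 10.0      # "10+" Giga Rays/s with RT Cores; 10 used as a floor

def flops_per_ray(tflops_per_giga_ray: float) -> float:
    """FLOPs spent per ray: TFLOPS / (Giga Rays/s), i.e. a 1e12 / 1e9 scaling."""
    return tflops_per_giga_ray * 1e12 / 1e9

def software_tflops_needed(giga_rays: float) -> float:
    """Shader TFLOPS a software tracer at this efficiency would need for a ray rate."""
    return giga_rays * TFLOPS_PER_GIGA_RAY

if __name__ == "__main__":
    print(flops_per_ray(TFLOPS_PER_GIGA_RAY))                  # 10000.0
    print(f"{TURING_GIGA_RAYS / PASCAL_GIGA_RAYS:.2f}x")      # 9.09x
    print(software_tflops_needed(TURING_GIGA_RAYS))           # 100.0
```

Read the other way, matching 10 Giga Rays/s in software at Pascal's efficiency would take on the order of 100 TFLOPS of shader throughput.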
