NVIDIA GeForce RTX 20 Series Review Ft. RTX 2080 Ti & RTX 2080 Founders Edition Graphics Cards – Turing Ray Traces The Gaming Industry
NVIDIA GeForce RTX 2080 Ti & GeForce RTX 208019th September, 2018
NVIDIA Turing GPU - Turing RT and Tensor Cores Deep Dive
The other significant part of the Turing GPU and the most talked about feature of the whole Turing family is the support of Tensor Cores. Now Tensor cores have been available since Volta, but not on consumer cards, let alone gaming products. With Turing, Tensor cores add INT8 and INT4 precision in addition to FP16 which is still fully supported. NVIDIA has been at the helm of the deep learning revolution by supporting it since their Kepler generation of graphics cards. Today, NVIDIA has some of the most powerful AI graphics accelerators and a software stack that is widely adopted by this fast-growing industry.
There's a whole software stack that leverages from Tensor cores and that is known as the NVIDIA NGX. These software-based technologies will help enhance graphics fidelity with features such as Deep Learning Super Sampling (DLSS), AI InPainting, AI Super Rez, and AI Slow-Mo.
As mentioned earlier, there are 8 Tensor Cores per SM block and 16 in a single TPC. The flagship TU102 GPU contains 576 Tensor Cores. A single SM can perform a total of 512 FP16 operations, 1024 FP operations, 2048 INT8 operations and 4096 INT4 operations per clock cycle. This level of performance is utilized in both deep learning training and inferencing operations. The recent announcement of the Tesla T4 at GTC Japan 2018 shows that Turing GPU has use cases in the Tesla market too.
RT Cores, RTX and Real-Time Ray Tracing Dissected
Next up, we have the RT Cores which are what will power Real Time Raytracing. NVIDIA isn't going to distance themselves from traditional rasterization-based rendering, but instead following a hybrid rendering model. The reason being that while GeForce RTX cards are miles ahead in ray tracing performance compared to previous generation cards, they still lack the horsepower to fully ray-trace an entire screen so developers would have to use a very small amount of rays per pixel available on the screen which is around 1 at the lower and 10 at the highest end.
There's one RT core per SM and all of them combined accelerate Bounding Volume Hierarchy (BVH) traversal and ray/triangle intersection testing (ray casting) functions. RT Cores work together with advanced denoising filtering, a highly-efficient BVH acceleration structure developed by NVIDIA Research, and RTX compatible APIs to achieve real-time ray tracing on a single Turing GPU.
RT Cores traverse the BVH autonomously, and by accelerating traversal and ray/triangle intersection tests, they offload the SM, allowing it to handle another vertex, pixel, and compute shading work. Functions such as BVH building and refitting are handled by the driver, and ray generation and shading is managed by the application through new types of shaders.
To better understand the function of RT Cores, and what exactly they accelerate, we should first explain how ray tracing is performed on GPUs or CPUs without a dedicated hardware ray tracing engine. Essentially, the process of BVH traversal would need to be performed by shader operations and take thousands of instruction slots per ray cast to test against bounding box intersections in the BVH until finally hitting a triangle and the color at the point of intersection contributes to final pixel color (or if no triangle is hit, background color may be used to shade a pixel).
Ray tracing without hardware acceleration requires thousands of software instruction slots per ray to test successively smaller bounding boxes in the BVH structure until possibly hitting a triangle. It’s a computationally intensive process making it impossible to do on GPUs in real-time without hardware-based ray tracing acceleration.
The RT Cores in Turing can process all the BVH traversal and ray-triangle intersection testing, saving the SM from spending the thousands of instruction slots per ray, which could be an enormous amount of instructions for an entire scene. The RT Core includes two specialized units. The first unit does bounding box tests, and the second unit does ray-triangle intersection tests. The SM only has to launch a ray probe, and the RT core does the BVH traversal and ray-triangle tests, and return a hit or no hit to the SM. The SM is largely freed up to do other graphics or Compute work.
Turing ray tracing performance with RT Cores is significantly faster than ray tracing in Pascal GPUs. Turing can deliver far more Giga Rays/Sec than Pascal on different workloads, as shown in Figure 19. Pascal is spending approximately 1.1 Giga Rays/Sec, or 10 TFLOPS / Giga Ray to do ray tracing in software, whereas Turing can do 10+ Giga Rays/Sec using RT Cores, and run ray tracing 10 times faster.