NVIDIA GeForce RTX 20 Series Review Ft. RTX 2080 Ti & RTX 2080 Founders Edition Graphics Cards – Turing Ray Traces The Gaming Industry
NVIDIA GeForce RTX 2080 Ti & GeForce RTX 2080
19th September, 2018
NVIDIA Turing GPU - Turing Streaming Multiprocessor Deep Dive
Let's take a look back at the road to Turing. In 2016, NVIDIA announced its Pascal GPUs, which would soon power the entire top-to-bottom GeForce 10 series lineup. With Maxwell, NVIDIA had gained a lot of experience in efficiency, an area it has focused on since its Kepler GPUs.
Now, with an enhanced FinFET process available, NVIDIA is pushing its efficiency lead beyond what was previously possible, a lead that is unrivaled by the competition. With Volta, NVIDIA focused on the AI and HPC market, but many of the features that Volta supports aren’t needed for gaming. Take, for instance, the double precision floating point execution units. With Pascal, NVIDIA began separating its consumer and HPC GPUs, and this time it is taking a more aggressive approach, placing the consumer GPU in a category of its own. This is where Turing comes in, a GPU designed solely for the consumer segment.
Starting with the most significant part of the Turing GPU architecture, the Turing SM, we are seeing an entirely new graphics core. The Turing SM is made up of a combination of INT32, FP32, and the new Tensor cores.
Coming to the new execution units, Turing has separate INT32 and FP32 cores that can execute concurrently. This design lets Turing run floating point and integer operations in parallel, which allows for up to 36% higher throughput in standard floating point operations.
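One way to see where a figure like 36% can come from is a simple issue-slot model. The sketch below assumes a hypothetical instruction mix of 36 integer instructions for every 100 floating point instructions (a ratio picked for illustration, not stated above) and compares a single shared pipeline against separate INT32 and FP32 pipelines that issue in parallel. The function names are illustrative only.

```python
def shared_pipe_slots(fp_ops: int, int_ops: int) -> int:
    """Issue slots needed when FP and INT instructions share one pipeline."""
    return fp_ops + int_ops


def concurrent_pipe_slots(fp_ops: int, int_ops: int) -> int:
    """Issue slots needed when FP and INT run on separate pipelines in parallel."""
    return max(fp_ops, int_ops)


def fp_throughput_gain(fp_ops: int, int_ops: int) -> float:
    """Relative FP throughput gain from moving INT work off the FP pipeline."""
    shared = shared_pipe_slots(fp_ops, int_ops)
    concurrent = concurrent_pipe_slots(fp_ops, int_ops)
    return shared / concurrent - 1.0


# Assumed mix: 36 integer instructions per 100 floating point instructions.
gain = fp_throughput_gain(fp_ops=100, int_ops=36)
print(f"FP throughput gain: {gain:.0%}")
```

This is an idealized upper bound: it ignores instruction dependencies, memory stalls and scheduling limits, which is consistent with the "up to" wording.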
The Turing SM is partitioned into four processing blocks, each with 16 FP32 Cores, 16 INT32 Cores, two Tensor Cores, one warp scheduler, and one dispatch unit. This adds up to 64 FP32 Cores, 64 INT32 Cores, 8 Tensor Cores, 4 Warp Schedulers and 4 Dispatch Units on a single Turing SM. Each block also includes a new L0 instruction cache and a 64 KB register file.
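The per-SM totals follow directly from multiplying the per-block counts by four. A minimal sketch of that tally, with a hypothetical `ProcessingBlock` type introduced purely for illustration:

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProcessingBlock:
    """Resources in one Turing SM processing block, as listed in the text."""
    fp32_cores: int = 16
    int32_cores: int = 16
    tensor_cores: int = 2
    warp_schedulers: int = 1
    dispatch_units: int = 1
    register_file_kb: int = 64


BLOCKS_PER_SM = 4


def sm_totals(block: ProcessingBlock, blocks: int = BLOCKS_PER_SM) -> dict:
    """Scale every per-block resource by the number of blocks in the SM."""
    return {name: count * blocks for name, count in asdict(block).items()}


totals = sm_totals(ProcessingBlock())
for name, count in totals.items():
    print(f"{name}: {count}")
```

The same multiplication gives a 256 KB register file per SM (four blocks of 64 KB each).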
The four processing blocks share a combined 96 KB L1 data cache/shared memory. Traditional graphics workloads partition the 96 KB L1/shared memory as 64 KB of dedicated graphics shader RAM and 32 KB for texture cache and register file spill area. Compute workloads can divide the 96 KB into 32 KB shared memory and 64 KB L1 cache, or 64 KB shared memory and 32 KB L1 cache.
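The two compute splits can be written down as a small lookup. The sketch below uses a hypothetical helper, `pick_compute_split`, that returns the split leaving the most L1 cache while still fitting a kernel's shared memory request. It only illustrates the trade-off and is not how the driver or runtime actually chooses a configuration.

```python
L1_SHARED_TOTAL_KB = 96

# (shared memory KB, L1 cache KB) splits described for compute workloads.
COMPUTE_SPLITS = [(32, 64), (64, 32)]


def pick_compute_split(shared_needed_kb: int) -> tuple:
    """Return the (shared, L1) split with the most L1 that still fits the request."""
    for shared_kb, l1_kb in sorted(COMPUTE_SPLITS):
        if shared_needed_kb <= shared_kb:
            return shared_kb, l1_kb
    raise ValueError(f"{shared_needed_kb} KB exceeds the largest shared memory split")


# Every split must account for the full 96 KB.
assert all(shared + l1 == L1_SHARED_TOTAL_KB for shared, l1 in COMPUTE_SPLITS)

print(pick_compute_split(16))  # small request keeps the larger L1 cache
print(pick_compute_split(48))  # larger request trades L1 for shared memory
```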
The entire SM works in harmony, using its different blocks to deliver high performance and better texture caching, enabling up to 50% better performance per CUDA core compared to the previous generation.
Many of these Turing SMs combine to form the Turing GPU. Each TPC inside the Turing GPU houses 2 Turing SMs, which are linked to the raster engine. Each GPC, or Graphics Processing Cluster, contains a total of 6 TPCs, or 12 Turing SMs. The top TU102 configuration comes with 6 GPCs that are connected to 6 MB of L2 cache, ROPs, TMUs, memory controllers and the NVLink high-speed I/O hub. All of this combines to form the massive Turing GPU. The following are some performance figures for the top Turing graphics cards.
NVIDIA GeForce RTX 2080 Ti
- 14.2 TFLOPS of peak single precision (FP32) performance
- 28.5 TFLOPS of peak half-precision (FP16) performance
- 14.2 TIPS concurrent with FP, through independent integer execution units
- 113.8 Tensor TFLOPS
- 10 Giga Rays/sec
- 78 Tera RTX-OPS
NVIDIA Quadro RTX 8000
- 16.3 TFLOPS of peak single precision (FP32) performance
- 32.6 TFLOPS of peak half-precision (FP16) performance
- 16.3 TIPS concurrent with FP, through independent integer execution units
- 130.5 Tensor TFLOPS
- 10 Giga Rays/sec
- 84 Tera RTX-OPS
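The FP32, FP16 and Tensor figures above can be approximated from unit counts and clock speed. The sketch below is a back-of-the-envelope estimate, not NVIDIA's published method, and it relies on several assumptions that are not stated in this article: 68 enabled SMs and a 1635 MHz boost clock for the RTX 2080 Ti Founders Edition, all 72 SMs and a 1770 MHz boost clock for the Quadro RTX 8000, FP16 running at twice the FP32 rate, and 64 fused multiply-adds per Tensor Core per clock. The full TU102 SM count is derived from the GPC/TPC/SM hierarchy described earlier.

```python
# Hierarchy of the full TU102 GPU as described in the text.
GPCS, TPCS_PER_GPC, SMS_PER_TPC = 6, 6, 2
FP32_PER_SM, TENSOR_PER_SM = 64, 8

FULL_TU102_SMS = GPCS * TPCS_PER_GPC * SMS_PER_TPC  # 72 SMs, 4608 FP32 cores

# Assumption (not from the article): FMAs per Tensor Core per clock.
TENSOR_FMA_PER_CLOCK = 64


def fp32_tflops(sms: int, boost_mhz: float) -> float:
    """Peak FP32: each core retires one FMA (2 FLOPs) per clock."""
    return 2 * sms * FP32_PER_SM * boost_mhz / 1e6


def fp16_tflops(sms: int, boost_mhz: float) -> float:
    """Peak FP16, assuming twice the FP32 rate."""
    return 2 * fp32_tflops(sms, boost_mhz)


def tensor_tflops(sms: int, boost_mhz: float) -> float:
    """Peak Tensor throughput, assuming 64 FMAs (128 FLOPs) per core per clock."""
    return 2 * sms * TENSOR_PER_SM * TENSOR_FMA_PER_CLOCK * boost_mhz / 1e6


# Assumed SM counts and boost clocks (MHz); neither is given in the article.
cards = {
    "GeForce RTX 2080 Ti": (68, 1635),
    "Quadro RTX 8000": (FULL_TU102_SMS, 1770),
}

for name, (sms, mhz) in cards.items():
    print(f"{name}: FP32 {fp32_tflops(sms, mhz):.1f}, "
          f"FP16 {fp16_tflops(sms, mhz):.1f}, "
          f"Tensor {tensor_tflops(sms, mhz):.1f} TFLOPS")
```

Under these assumptions the estimates round to the listed values for both cards. The TIPS figure equals the FP32 figure because each SM has as many INT32 cores as FP32 cores.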
In terms of shading performance, which is the direct result of the enhanced core design and the GPU architecture revamp, the Turing GPU offers on average 50% better performance per core than Pascal GPUs. In VR games, shading performance is roughly 2x what Pascal achieved, while many modern gaming titles show a ~50% lead over Pascal with Turing’s enhanced core design.
It should be pointed out that these are per-core performance gains at the same clock speeds, without the benefits of the other technologies that Turing comes with. Those would further increase performance in a wide variety of gaming applications: we have already seen the GeForce RTX 2080 perform 50% faster than the GTX 1080 on average, and twice as fast with the new DLSS technology.