NVIDIA GeForce RTX 3080 10 GB “Ampere” Graphics Card Review
October, 2020
NVIDIA Ampere GPU - Ampere Streaming Multiprocessor, Ampere GPC & Ampere GPUs Deep Dive
Let's retrace the road to Ampere. In 2016, NVIDIA announced its Pascal GPUs, which would soon be featured across its top-to-bottom GeForce 10 series lineup. After the launch of Maxwell, NVIDIA had gained a lot of experience in efficiency, which it has focused on since its Kepler GPUs. Two years ago, rather than offering another standard leap in rasterization performance, NVIDIA took a different approach and introduced two key technologies in its Turing line of consumer GPUs: AI-assisted acceleration with Tensor Cores, and hardware-level acceleration for ray tracing with its brand new RT Cores.
With Ampere and its brand new Samsung 8nm fabrication process, NVIDIA is adding even more to its gaming graphics lineup. Starting with the most significant part of the Ampere GPU architecture, the Ampere SM, we are seeing an entirely new graphics core. The Ampere SM features next-gen FP32 and INT32 units, Tensor Cores, and RT Cores.
Coming to the new execution units or cores, Ampere has both INT32 and FP32 units which can execute concurrently. This architectural design allows Ampere to execute floating-point and integer operations in parallel, which allows for higher throughput in standard floating-point operations. According to NVIDIA, the updated Ampere graphics core delivers up to 1.7x faster traditional rasterization performance and up to 2x faster ray-tracing performance compared to the Turing GPUs.
The Ampere SM is partitioned into four processing blocks, each with 32 FP32 Cores, 16 INT32 Cores, one Tensor Core, one warp scheduler, and one dispatch unit. This is made possible by an updated datapath design: one datapath offers 16 FP32 execution units, while the other offers either 16 FP32 or 16 INT32 execution units. This adds up to 128 FP32 Cores, 64 INT32 Cores, 4 Tensor Cores, 4 Warp Schedulers, and 4 Dispatch Units on a single Ampere SM. Each block also includes a new L0 instruction cache and a 64 KB register file, for a total of 256 KB of register file per SM.
One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.
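The per-clock arithmetic in the paragraph above can be written out as a small sketch. This is only an illustrative model of peak issue rates, not a simulator of the SM; the constant and function names are hypothetical and chosen for this example.

```python
# Illustrative model of the Ampere SM datapaths described above.
# All names here are hypothetical; only the lane counts come from the article.
FP32_ONLY_LANES = 16    # datapath 1: FP32 only
SHARED_LANES = 16       # datapath 2: runs FP32 or INT32 in a given clock
PARTITIONS_PER_SM = 4   # four processing blocks per SM


def ops_per_clock(shared_path_runs_int32: bool) -> dict:
    """Peak operations per clock for one whole SM."""
    fp32 = FP32_ONLY_LANES + (0 if shared_path_runs_int32 else SHARED_LANES)
    int32 = SHARED_LANES if shared_path_runs_int32 else 0
    return {
        "fp32": fp32 * PARTITIONS_PER_SM,
        "int32": int32 * PARTITIONS_PER_SM,
    }


pure_fp32 = ops_per_clock(shared_path_runs_int32=False)
mixed = ops_per_clock(shared_path_runs_int32=True)
print(pure_fp32)  # {'fp32': 128, 'int32': 0}
print(mixed)      # {'fp32': 64, 'int32': 64}
```

The two printed cases match the two peak rates quoted in the text: 128 FP32 operations per clock, or 64 FP32 plus 64 INT32 operations per clock.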
Doubling the processing speed for FP32 improves performance for a number of common graphics and compute operations and algorithms. Modern shader workloads typically have a mixture of FP32 arithmetic instructions such as FFMA, floating point additions (FADD), or floating point multiplications (FMUL), combined with simpler instructions such as integer adds for addressing and fetching data, floating point compare, or min/max for processing results, etc. Performance gains will vary at the shader and application level depending on the mix of instructions. Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.
Doubling math throughput required doubling the data paths supporting it, which is why the Ampere SM also doubled the shared memory and L1 cache performance for the SM (128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.
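As a rough cross-check, the two bandwidth figures quoted above appear to follow from bytes per clock multiplied by the boost clock of a single SM. The sketch below assumes the 1710 MHz boost clock listed for the RTX 3080 in this article's table and an 1815 MHz boost clock for the RTX 2080 Super; the latter is an assumption that does not appear in this article.

```python
def l1_bandwidth_gb_per_s(bytes_per_clock: int, clock_mhz: float) -> float:
    """L1 bandwidth in GB/sec for one SM at the given clock."""
    return bytes_per_clock * clock_mhz * 1e6 / 1e9


# 128 bytes/clock at 1710 MHz (RTX 3080 boost clock from the table)
ampere = l1_bandwidth_gb_per_s(128, 1710)
# 64 bytes/clock at 1815 MHz (assumed RTX 2080 Super boost clock)
turing = l1_bandwidth_gb_per_s(64, 1815)

print(round(ampere), round(turing))  # 219 116
```

Under these assumptions the quoted 219 GB/sec and 116 GB/sec line up with per-SM bandwidth at boost clock rather than a sum across all SMs.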
Like prior NVIDIA GPUs, Ampere is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPS), and memory controllers.
The GPC is the dominant high-level hardware block with all of the key graphics processing units residing inside the GPC. Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs. More details on the NVIDIA Ampere architecture can be found in NVIDIA’s Ampere Architecture White Paper, which will be published in the coming days.
The four processing blocks share a combined 128 KB L1 data cache/shared memory. Traditional graphics workloads partition the 128 KB L1/shared memory as 64 KB of dedicated graphics shader RAM and 64 KB for texture cache and register file spill area. In compute mode, the GA10x SM will support the following configurations:
- 128 KB L1 + 0 KB Shared Memory
- 120 KB L1 + 8 KB Shared Memory
- 112 KB L1 + 16 KB Shared Memory
- 96 KB L1 + 32 KB Shared Memory
- 64 KB L1 + 64 KB Shared Memory
- 28 KB L1 + 100 KB Shared Memory
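To make the list above concrete, here is a minimal sketch that picks the split with the smallest shared memory allocation that still satisfies a kernel's request. The selection policy and the `pick_config` helper are hypothetical and for illustration only; the article does not describe how the driver or runtime actually chooses a configuration.

```python
from typing import Tuple

# (L1 KB, shared memory KB) pairs copied from the list above.
CONFIGS = [(128, 0), (120, 8), (112, 16), (96, 32), (64, 64), (28, 100)]


def pick_config(shared_kb_needed: int) -> Tuple[int, int]:
    """Return the listed split with the least shared memory that still fits.

    Hypothetical policy: keep as much L1 as possible while meeting the request.
    """
    for l1_kb, shared_kb in sorted(CONFIGS, key=lambda c: c[1]):
        if shared_kb >= shared_kb_needed:
            return l1_kb, shared_kb
    raise ValueError("request exceeds the largest listed shared memory size")


print(pick_config(24))   # (96, 32)
print(pick_config(100))  # (28, 100)
```

Every listed configuration sums to the same 128 KB pool, so a larger shared memory carve-out always comes at the expense of L1.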
Ampere also ties its ROPs to the GPC and houses a total of 16 ROP units per GPC. The full GA102 GPU features 112 ROPs, while the GeForce RTX 3080 comes with a total of 96 ROPs.
The entire SM works in harmony, using its different blocks to deliver high performance and better texture caching, enabling up to twice the CUDA core performance of the previous generation.
Many of these Ampere SMs combine to form the Ampere GPU. Each TPC inside the Ampere GPU houses 2 Ampere SMs, which are linked to the raster engine. A total of 6 TPCs, or 12 Ampere SMs, are arranged inside each GPC (Graphics Processing Cluster). The top configured GA102 GPU comes with 7 GPCs, for a total of 42 TPCs and 84 SMs, connected to roughly 10.5 MB of L1 (84 x 128 KB) and 6 MB of L2 cache, ROPs, TMUs, memory controllers, and an NVLink high-speed I/O hub. All of this combines to form the massive Ampere GA102 GPU. The following are some performance figures for the top Ampere graphics cards.
NVIDIA GeForce RTX 3090
- 35.58 TFLOPS of peak single-precision (FP32) performance
- 71.16 TFLOPS of peak half-precision (FP16) performance
- 17.79 TIPS concurrent with FP, through independent integer execution units
- 285 Tensor TFLOPS
- 69 RT-TFLOPs
NVIDIA GeForce RTX 3080
- 30 TFLOPS of peak single-precision (FP32) performance
- 60 TFLOPS of peak half-precision (FP16) performance
- 15 TIPS concurrent with FP, through independent integer execution units
- 238 Tensor TFLOPS
- 58 RT-TFLOPs
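The peak FP32 figures above can be reproduced from core count and boost clock, counting a fused multiply-add as two floating-point operations per core per clock. The sketch below uses the CUDA core counts and boost clocks from this article's specification table; the function name is hypothetical.

```python
def peak_fp32_tflops(cuda_cores: int, boost_mhz: float) -> float:
    """Peak FP32 TFLOPS, counting one FMA as 2 FLOPs per core per clock."""
    return cuda_cores * 2 * boost_mhz * 1e6 / 1e12


rtx_3090 = peak_fp32_tflops(10496, 1695)
rtx_3080 = peak_fp32_tflops(8704, 1710)

print(round(rtx_3090, 2))  # 35.58
print(round(rtx_3080, 2))  # 29.77
```

The RTX 3080 result of about 29.8 TFLOPS is what the list rounds to 30 TFLOPS.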
In terms of shading performance, which is a direct result of the enhanced core design and GPU architecture revamp, the Ampere GPU offers up to 70% better performance per core compared to Turing GPUs.
It should be pointed out that these are just per-core performance gains at the same clock speeds, before adding the benefits of the other technologies that Ampere comes with, which would further increase performance in a wide variety of gaming applications.
NVIDIA Ampere "GeForce RTX 30" GPUs Full Breakdown:
|Graphics Card||NVIDIA GeForce RTX 2070 SUPER||NVIDIA GeForce RTX 3070||NVIDIA GeForce RTX 2080||NVIDIA GeForce RTX 3080||NVIDIA Titan RTX||NVIDIA GeForce RTX 3090|
|GPU Architecture||NVIDIA Turing||NVIDIA Ampere||NVIDIA Turing||NVIDIA Ampere||NVIDIA Turing||NVIDIA Ampere|
|GPCs||5 or 6||6||6||6||6||7|
|CUDA Cores / SM||64||128||64||128||64||128|
|CUDA Cores / GPU||2560||5888||2944||8704||4608||10496|
|Tensor Cores / SM||8 (2nd Gen)||4 (3rd Gen)||8 (2nd Gen)||4 (3rd Gen)||8 (2nd Gen)||4 (3rd Gen)|
|Tensor Cores / GPU||320 (2nd Gen)||184 (3rd Gen)||368 (2nd Gen)||272 (3rd Gen)||576 (2nd Gen)||328 (3rd Gen)|
|RT Cores||40 (1st Gen)||46 (2nd Gen)||46 (1st Gen)||68 (2nd Gen)||72 (1st Gen)||82 (2nd Gen)|
|GPU Boost Clock (MHz)||1770||1725||1800||1710||1770||1695|
|Peak FP32 TFLOPS (non-Tensor)||9.1||20.3||10.6||29.8||16.3||35.6|
|Peak FP16 TFLOPS (non-Tensor)||18.1||20.3||21.2||29.8||32.6||35.6|
|Peak BF16 TFLOPS (non-Tensor)||NA||20.3||NA||29.8||NA||35.6|
|Peak INT32 TOPS (non-Tensor)||9.1||10.2||10.6||14.9||16.3||17.8|
|Peak FP16 Tensor TFLOPS with FP16 Accumulate|
|Peak FP16 Tensor TFLOPS with FP32 Accumulate|
|Peak BF16 Tensor TFLOPS with FP32 Accumulate|
|Peak TF32 Tensor TFLOPS||NA||20.3/40.6||NA||29.8/59.5||NA||35.6/71|
|Peak INT8 Tensor TOPS||145||162.6/325.2||169.6||238/476||261||284/568|
|Peak INT4 Tensor TOPS||290||325.2/650.4||339.1||476/952||522||568/1136|
|Frame Buffer Memory Size and Type||8 GB GDDR6||8 GB GDDR6||8 GB GDDR6||10 GB GDDR6X||24 GB GDDR6||24 GB GDDR6X|
|Memory Clock (Data Rate)||14 Gbps||14 Gbps||14 Gbps||19 Gbps||14 Gbps||19.5 Gbps|
|Memory Bandwidth||448 GB/sec||448 GB/sec||448 GB/sec||760 GB/sec||672 GB/sec||936 GB/sec|
|Pixel Fill-rate (Gigapixels/sec)||113.3||165.6||115.2||164.2||169.9||193|
|Texel Fill-rate (Gigatexels/sec)||283.2||317.4||331.2||465||509.8||566|
|L1 Data Cache/Shared Memory||3840 KB||5888 KB||4416 KB||8704 KB||6912 KB||10496 KB|
|L2 Cache Size||4096 KB||4096 KB||4096 KB||5120 KB||6144 KB||6144 KB|
|Register File Size||10240 KB||11776 KB||11776 KB||17408 KB||18432 KB||20992 KB|
|TGP (Total Graphics Power)||215W||220W||225W||320W||280W||350W|
|Transistor Count||13.6 Billion||17.4 Billion||13.6 Billion||28.3 Billion||18.6 Billion||28.3 Billion|
|Die Size||545 mm2||392.5 mm2||545 mm2||628.4 mm2||754 mm2||628.4 mm2|
|Manufacturing Process||TSMC 12 nm FFN||Samsung 8 nm 8N NVIDIA||TSMC 12 nm FFN||Samsung 8 nm 8N NVIDIA||TSMC 12 nm FFN||Samsung 8 nm 8N NVIDIA|
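Several rows of the table can be derived from others, which makes for a handy consistency check. The sketch below recomputes L1 size, register file size, and memory bandwidth for the three Ampere cards. The per-SM figures and the table values come from this article; the memory bus widths are an assumption taken from the cards' public specifications and are not stated here, and all names in the code are hypothetical.

```python
CORES_PER_SM = 128     # CUDA cores per Ampere SM (from the table)
SM_L1_KB = 128         # L1 data cache/shared memory per SM (from the article)
SM_REGFILE_KB = 256    # register file per SM (from the article)

# cuda_cores and data_rate_gbps come from the table above.
# bus_width_bits is an ASSUMPTION (public card specs, not in this article).
CARDS = {
    "RTX 3070": {"cuda_cores": 5888, "data_rate_gbps": 14.0, "bus_width_bits": 256},
    "RTX 3080": {"cuda_cores": 8704, "data_rate_gbps": 19.0, "bus_width_bits": 320},
    "RTX 3090": {"cuda_cores": 10496, "data_rate_gbps": 19.5, "bus_width_bits": 384},
}


def derive(card: dict) -> dict:
    """Recompute table rows from core count, data rate, and bus width."""
    sms = card["cuda_cores"] // CORES_PER_SM
    return {
        "sms": sms,
        "l1_kb": sms * SM_L1_KB,
        "regfile_kb": sms * SM_REGFILE_KB,
        # Gbps per pin * pins / 8 bits per byte = GB/sec
        "mem_bw_gb_s": card["data_rate_gbps"] * card["bus_width_bits"] / 8,
    }


derived = {name: derive(spec) for name, spec in CARDS.items()}
for name, row in derived.items():
    print(name, row)
```

Under these assumptions the derived values agree with the L1, register file, and memory bandwidth rows listed for the RTX 3070, RTX 3080, and RTX 3090.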
NVIDIA Ampere GPUs - GA102 & GA104 For The First Wave of Gaming Cards
NVIDIA is first introducing two brand new Ampere GPUs, the GA102 and the GA104. The GA102 GPU is featured on the GeForce RTX 3090 and GeForce RTX 3080 graphics cards, while the GA104 GPU is featured on the GeForce RTX 3070 graphics card. The Ampere GPUs are based on Samsung's 8nm process node, customized for NVIDIA, and as such, the resultant GPU dies are slightly smaller than their Turing-based predecessors while packing a denser transistor layout. There will be several variations of each GPU featured across the RTX 30 series lineup. The following is what the complete GA102 and GA104 GPUs have to offer.
NVIDIA Ampere GA102 GPU
The full GA102 GPU is made up of 7 graphics processing clusters with 12 SM units on each cluster. That makes 84 SM units for a total of 10752 cores in a 28.3 billion transistor package measuring 628.4 mm2.
NVIDIA Ampere GA104 GPU
The full GA104 GPU is made up of 6 graphics processing clusters with 8 SM units on each cluster. That makes 48 SM units for a total of 6144 cores in a 17.4 billion transistor package measuring 392.5 mm2.
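The full-die numbers above follow directly from the GPC and SM counts. Here is a minimal sketch of that arithmetic, using 128 CUDA cores per SM and 16 ROPs per GPC as stated earlier in the article; the `FullDie` class and its field names are hypothetical.

```python
from dataclasses import dataclass

CORES_PER_SM = 128   # CUDA cores per Ampere SM
ROPS_PER_GPC = 16    # ROP units per GPC on GA10x


@dataclass(frozen=True)
class FullDie:
    """Hypothetical description of a fully enabled Ampere die."""
    name: str
    gpcs: int
    sms_per_gpc: int

    @property
    def sms(self) -> int:
        return self.gpcs * self.sms_per_gpc

    @property
    def cuda_cores(self) -> int:
        return self.sms * CORES_PER_SM

    @property
    def rops(self) -> int:
        return self.gpcs * ROPS_PER_GPC


ga102 = FullDie("GA102", gpcs=7, sms_per_gpc=12)
ga104 = FullDie("GA104", gpcs=6, sms_per_gpc=8)

print(ga102.sms, ga102.cuda_cores, ga102.rops)  # 84 10752 112
print(ga104.sms, ga104.cuda_cores)              # 48 6144
```

Shipping cards enable fewer SMs than the full die, which is why the RTX 3090 lists 10496 cores rather than 10752.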