NVIDIA GeForce RTX 3090 24 GB “Ampere” Founders Edition Review – A True BFGPU

Keith May • Sep 24, 2020 at 09:00am EDT

NVIDIA Ampere GPU - Ampere Streaming Multiprocessor, Ampere GPC & Ampere GPUs Deep Dive

Let's retrace the road to Ampere. In 2016, NVIDIA announced its Pascal GPUs, which soon featured across the top-to-bottom GeForce 10 series lineup. With Maxwell, NVIDIA had gained a lot of experience in efficiency, an area it has focused on since its Kepler GPUs. Two years ago, rather than offering another standard leap in rasterization performance, NVIDIA took a different approach and introduced two key technologies in its Turing line of consumer GPUs: AI-assisted acceleration with the Tensor Cores, and hardware-level acceleration for ray tracing with its brand new RT cores.

With Ampere and its brand new Samsung 8nm fabrication process, NVIDIA is adding even more to its gaming graphics lineup. Starting with the most significant part of the Ampere GPU architecture, the Ampere SM, we are seeing an entirely new graphics core. The Ampere SM features next-gen FP32 and INT32 units, Tensor Cores, and RT cores.

Coming to the new execution units or cores, Ampere has both INT32 and FP32 units which can execute concurrently. This architectural design allows Ampere to execute floating-point and non-floating-point operations in parallel, which allows for higher throughput in standard floating-point operations. According to NVIDIA, the updated Ampere graphics core delivers up to 1.7x faster traditional rasterization performance and up to 2x faster ray-tracing performance compared to the Turing GPUs.

The Ampere SM is partitioned into four processing blocks, each with 32 FP32 Cores, 16 INT32 Cores, one Tensor Core, one warp scheduler, and one dispatch unit. This is made possible by an updated design with two datapaths: one offers 16 FP32 execution units while the other offers either 16 FP32 or 16 INT32 execution units. This adds up to 128 FP32 Cores, 64 INT32 Cores, 4 Tensor Cores, 4 Warp Schedulers, and 4 Dispatch Units on a single Ampere SM. Each block also includes a new L0 instruction cache and a 64 KB register file, for a total of 256 KB of register file per SM.

One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.
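The per-clock figures in the paragraph above follow from simple lane counting. The sketch below is only an illustration of that arithmetic, not a model of the real scheduler; the function and constant names are made up for this example, and it assumes the shared datapath runs one instruction type per clock across all four partitions.

```python
# Lane counts per SM partition, taken from the description above.
FP32_ONLY_LANES = 16    # datapath 1: FP32 only
SHARED_LANES = 16       # datapath 2: FP32 or INT32 (assumed: one type per clock)
PARTITIONS_PER_SM = 4


def sm_ops_per_clock(shared_path_runs_int32: bool) -> dict:
    """Return peak FP32 and INT32 operations per clock for one whole SM."""
    fp32 = FP32_ONLY_LANES + (0 if shared_path_runs_int32 else SHARED_LANES)
    int32 = SHARED_LANES if shared_path_runs_int32 else 0
    return {"fp32": fp32 * PARTITIONS_PER_SM, "int32": int32 * PARTITIONS_PER_SM}


all_fp32 = sm_ops_per_clock(shared_path_runs_int32=False)
mixed = sm_ops_per_clock(shared_path_runs_int32=True)
print(all_fp32)  # {'fp32': 128, 'int32': 0}
print(mixed)     # {'fp32': 64, 'int32': 64}
```

In practice the mix changes from clock to clock, so real shaders land somewhere between these two extremes.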

Doubling the processing speed for FP32 improves performance for a number of common graphics and compute operations and algorithms. Modern shader workloads typically have a mixture of FP32 arithmetic instructions such as FFMA, floating point additions (FADD), or floating point multiplications (FMUL), combined with simpler instructions such as integer adds for addressing and fetching data, floating point compare, or min/max for processing results, etc. Performance gains will vary at the shader and application level depending on the mix of instructions. Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.

Doubling math throughput required doubling the data paths supporting it, which is why the Ampere SM also doubled the shared memory and L1 cache bandwidth (128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.
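As a rough sanity check, the quoted GB/sec figures can be reproduced by multiplying bytes per clock by the boost clock. The sketch below assumes the RTX 3080 boost clock of 1710 MHz from the breakdown table and a 1815 MHz boost clock for the RTX 2080 Super (the latter is not stated in this article); the helper name is hypothetical. Under those assumptions the numbers appear to correspond to the rate of a single SM.

```python
def l1_bandwidth_gb_per_s(bytes_per_clock: int, clock_mhz: float) -> float:
    """Bytes per clock times clocks per second, in GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_clock * clock_mhz * 1e6 / 1e9


# RTX 3080: 128 bytes/clock at the 1710 MHz boost clock from the table.
ampere_sm = l1_bandwidth_gb_per_s(128, 1710)
# RTX 2080 Super: 64 bytes/clock at an assumed 1815 MHz boost clock.
turing_sm = l1_bandwidth_gb_per_s(64, 1815)

print(round(ampere_sm), round(turing_sm))  # 219 116
```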

Like prior NVIDIA GPUs, Ampere is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPs), and memory controllers.

The GPC is the dominant high-level hardware block with all of the key graphics processing units residing inside the GPC. Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs. More details on the NVIDIA Ampere architecture can be found in NVIDIA’s Ampere Architecture White Paper, which will be published in the coming days.

The four processing blocks share a combined 128 KB L1 data cache/shared memory. Traditional graphics workloads partition the 128 KB L1/shared memory as 64 KB of dedicated graphics shader RAM and 64 KB for texture cache and register file spill area. In compute mode, the GA10x SM will support the following configurations:

128 KB L1 + 0 KB Shared Memory
120 KB L1 + 8 KB Shared Memory
112 KB L1 + 16 KB Shared Memory
96 KB L1 + 32 KB Shared Memory
64 KB L1 + 64 KB Shared Memory
28 KB L1 + 100 KB Shared Memory
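To make the trade-off in the list above concrete, here is a minimal sketch that picks the listed split leaving the most L1 while still satisfying a shared-memory request. This is purely illustrative: `pick_split` is a hypothetical helper, and the actual choice is made by the CUDA driver and runtime, not by logic like this.

```python
# (L1 KB, shared memory KB) splits listed above for GA10x compute mode,
# ordered by increasing shared memory.
COMPUTE_SPLITS_KB = [(128, 0), (120, 8), (112, 16), (96, 32), (64, 64), (28, 100)]


def pick_split(shared_needed_kb: int) -> tuple:
    """Return the listed split with the least shared memory that still fits."""
    for l1_kb, shared_kb in COMPUTE_SPLITS_KB:
        if shared_kb >= shared_needed_kb:
            return (l1_kb, shared_kb)
    raise ValueError(f"{shared_needed_kb} KB exceeds the largest listed split")


print(pick_split(0))    # (128, 0)
print(pick_split(20))   # (96, 32)
print(pick_split(100))  # (28, 100)
```

Every split sums to 128 KB, so any shared memory a kernel asks for comes directly out of the L1 budget.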

Ampere also ties its ROPs to the GPC and houses a total of 16 ROP units per GPC. The full GA102 GPU features 112 ROPs while the GeForce RTX 3080 comes with a total of 96 ROPs.

NVIDIA GeForce RTX 30 "AMPERE" Graphics Card SM Block Diagram — The block diagram of the NVIDIA Ampere SM Gaming GPUs.

The entire SM works in harmony, using its different blocks to deliver high performance and better texture caching, enabling up to twice the CUDA core performance of the previous generation.

NVIDIA GeForce RTX 3080 GA102 GPU Block Diagram — A block diagram of the GA102 GPU featured on the NVIDIA GeForce RTX 3080 graphics card.

Many of these Ampere SMs combine to form the Ampere GPU. Each TPC inside the Ampere GPU houses 2 Ampere SMs, which are linked to the raster engine. A total of 6 TPCs, or 12 Ampere SMs, are arranged inside each GPC (Graphics Processing Cluster). The top-configured GA102 GPU comes with 7 GPCs for a total of 42 TPCs and 84 SMs, which are connected to roughly 10.5 MB of L1 and 6 MB of L2 cache, ROPs, TMUs, memory controllers, and the NVLink high-speed I/O hub. All of this combines to form the massive Ampere GA102 GPU. The following are some performance figures for the top Ampere graphics cards.
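The unit counts for the full GA102 follow directly from the hierarchy described above. The sketch below simply multiplies the levels out; the `AmpereLayout` class and its field names are made up for illustration, with the per-level values taken from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AmpereLayout:
    """Hypothetical helper: per-level counts as described in the article."""
    gpcs: int
    tpcs_per_gpc: int = 6
    sms_per_tpc: int = 2
    fp32_cores_per_sm: int = 128
    rops_per_gpc: int = 16
    l1_kb_per_sm: int = 128

    @property
    def tpcs(self) -> int:
        return self.gpcs * self.tpcs_per_gpc

    @property
    def sms(self) -> int:
        return self.tpcs * self.sms_per_tpc

    @property
    def cuda_cores(self) -> int:
        return self.sms * self.fp32_cores_per_sm

    @property
    def rops(self) -> int:
        return self.gpcs * self.rops_per_gpc

    @property
    def l1_total_kb(self) -> int:
        return self.sms * self.l1_kb_per_sm


full_ga102 = AmpereLayout(gpcs=7)
print(full_ga102.tpcs, full_ga102.sms, full_ga102.cuda_cores)  # 42 84 10752
print(full_ga102.rops, full_ga102.l1_total_kb)                 # 112 10752
```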

NVIDIA GeForce RTX 3090

35.58 TFLOPS of peak single-precision (FP32) performance
71.16 TFLOPS of peak half-precision (FP16) performance
17.79 TIPS concurrent with FP, through independent integer execution units
258 Tensor TFLOPS
69 RT-TFLOPs

NVIDIA GeForce RTX 3080

30 TFLOPS of peak single-precision (FP32) performance
60 TFLOPS of peak half-precision (FP16) performance
15 TIPS concurrent with FP, through independent integer execution units
238 Tensor TFLOPS
58 RT-TFLOPs
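The peak FP32 figures above can be reproduced from the CUDA core counts and boost clocks in the breakdown table. The sketch assumes the usual convention that each core completes one fused multiply-add per clock, counted as two floating-point operations; the function name is hypothetical.

```python
def peak_fp32_tflops(cuda_cores: int, boost_mhz: float) -> float:
    """Cores x 2 FLOPs per FMA x clock rate, expressed in TFLOPS."""
    return cuda_cores * 2 * boost_mhz * 1e6 / 1e12


rtx_3090 = peak_fp32_tflops(10496, 1695)  # cores and boost clock from the table
rtx_3080 = peak_fp32_tflops(8704, 1710)

print(round(rtx_3090, 2))  # 35.58
print(round(rtx_3080, 2))  # 29.77
```

The RTX 3080 result rounds to the 30 TFLOPS quoted in the list and matches the 29.8 TFLOPS shown in the table.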

In terms of shading performance, which is the direct result of the enhanced core design and GPU architecture revamp, the Ampere GPU offers up to 70% better per-core performance than Turing GPUs.

It should be pointed out that these are just per-core performance gains at the same clock speeds, without adding the benefits of the other technologies that Ampere comes with. Those would further increase performance in a wide variety of gaming applications.

NVIDIA Ampere "GeForce RTX 30" GPUs Full Breakdown:

Graphics Card	NVIDIA GeForce RTX 2070 SUPER	NVIDIA GeForce RTX 3070	NVIDIA GeForce RTX 2080	NVIDIA GeForce RTX 3080	NVIDIA Titan RTX	NVIDIA GeForce RTX 3090
GPU Codename	TU104	GA104	TU104	GA102	TU102	GA102
GPU Architecture	NVIDIA Turing	NVIDIA Ampere	NVIDIA Turing	NVIDIA Ampere	NVIDIA Turing	NVIDIA Ampere
GPCs	5 or 6	6	6	6	6	7
TPCs	20	23	23	34	36	41
SMs	40	46	46	68	72	82
CUDA Cores / SM	64	128	64	128	64	128
CUDA Cores / GPU	2560	5888	2944	8704	4608	10496
Tensor Cores / SM	8 (2nd Gen)	4 (3rd Gen)	8 (2nd Gen)	4 (3rd Gen)	8 (2nd Gen)	4 (3rd Gen)
Tensor Cores / GPU	320 (2nd Gen)	184 (3rd Gen)	368 (2nd Gen)	272 (3rd Gen)	576 (2nd Gen)	328 (3rd Gen)
RT Cores	40 (1st Gen)	46 (2nd Gen)	46 (1st Gen)	68 (2nd Gen)	72 (1st Gen)	82 (2nd Gen)
GPU Boost Clock (MHz)	1770	1725	1800	1710	1770	1695
Peak FP32 TFLOPS (non-Tensor)	9.1	20.3	10.6	29.8	16.3	35.6
Peak FP16 TFLOPS (non-Tensor)	18.1	20.3	21.2	29.8	32.6	35.6
Peak BF16 TFLOPS (non-Tensor)	NA	20.3	NA	29.8	NA	35.6
Peak INT32 TOPS (non-Tensor)	9.1	10.2	10.6	14.9	16.3	17.8
Peak FP16 Tensor TFLOPS with FP16 Accumulate	72.5	81.3/162.6	84.8	119/238	130.5	142/284
Peak FP16 Tensor TFLOPS with FP32 Accumulate	36.3	40.6/81.3	42.4	59.5/119	65.2	71/142
Peak BF16 Tensor TFLOPS with FP32 Accumulate	NA	40.6/81.3	NA	59.5/119	NA	71/142
Peak TF32 Tensor TFLOPS	NA	20.3/40.6	NA	29.8/59.5	NA	35.6/71
Peak INT8 Tensor TOPS	145	162.6/325.2	169.6	238/476	261	284/568
Peak INT4 Tensor TOPS	290	325.2/650.4	339.1	476/952	522	568/1136
Frame Buffer Memory Size and Type	8 GB GDDR6	8 GB GDDR6	8 GB GDDR6	10 GB GDDR6X	24 GB GDDR6	24 GB GDDR6X
Memory Interface	256-bit	256-bit	256-bit	320-bit	384-bit	384-bit
Memory Clock (Data Rate)	14 Gbps	14 Gbps	14 Gbps	19 Gbps	14 Gbps	19.5 Gbps
Memory Bandwidth	448 GB/sec	448 GB/sec	448 GB/sec	760 GB/sec	672 GB/sec	936 GB/sec
ROPs	64	96	64	96	96	112
Pixel Fill-rate (Gigapixels/sec)	113.3	165.6	115.2	164.2	169.9	193
Texture Units	160	184	184	272	288	328
Texel Fill-rate (Gigatexels/sec)	283.2	317.4	331.2	465	509.8	566
L1 Data Cache/Shared Memory	3840 KB	5888 KB	4416 KB	8704 KB	6912 KB	10496 KB
L2 Cache Size	4096 KB	4096 KB	4096 KB	5120 KB	6144 KB	6144 KB
Register File Size	10240 KB	11776 KB	11776 KB	17408 KB	18432 KB	20992 KB
TGP (Total Graphics Power)	215 W	220 W	225 W	320 W	280 W	350 W
Transistor Count	13.6 Billion	17.4 Billion	13.6 Billion	28.3 Billion	18.6 Billion	28.3 Billion
Die Size	545 mm2	392.5 mm2	545 mm2	628.4 mm2	754 mm2	628.4 mm2
Manufacturing Process	TSMC 12 nm FFN (FinFET NVIDIA)	Samsung 8 nm 8N NVIDIA Custom Process	TSMC 12 nm FFN (FinFET NVIDIA)	Samsung 8 nm 8N NVIDIA Custom Process	TSMC 12 nm FFN (FinFET NVIDIA)	Samsung 8 nm 8N NVIDIA Custom Process
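The memory bandwidth row in the table above can be derived from the memory interface width and the per-pin data rate. The sketch below is a minimal check of that relationship, using a hypothetical helper name and values copied from the table.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Bus width in bytes times per-pin data rate gives GB/sec."""
    return bus_width_bits / 8 * data_rate_gbps


# (memory interface in bits, data rate in Gbps) taken from the table.
cards = {
    "RTX 3070": (256, 14),
    "RTX 3080": (320, 19),
    "Titan RTX": (384, 14),
    "RTX 3090": (384, 19.5),
}

for name, (bus, rate) in cards.items():
    print(name, memory_bandwidth_gb_s(bus, rate))
# RTX 3070 448.0
# RTX 3080 760.0
# Titan RTX 672.0
# RTX 3090 936.0
```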

NVIDIA Ampere GPUs - GA102 & GA104 For The First Wave of Gaming Cards

NVIDIA is first introducing two brand new Ampere GPUs: the GA102 and the GA104. The GA102 GPU is featured on the GeForce RTX 3090 and GeForce RTX 3080 graphics cards, while the GA104 GPU is featured on the GeForce RTX 3070. The Ampere GPUs are based on a Samsung 8nm process node customized for NVIDIA, so the resulting GPU dies are smaller than their Turing-based predecessors while packing a denser transistor layout. There will be several variations of each GPU featured across the RTX 30 series lineup. The following is what the complete GA102 and GA104 GPUs have to offer.

NVIDIA Ampere GA102 GPU

The full GA102 GPU is made up of 7 graphics processing clusters with 12 SM units in each cluster. That makes 84 SM units for a total of 10752 cores, in a 28.3 billion transistor die measuring 628.4 mm2.

NVIDIA Ampere GA104 GPU

The full GA104 GPU is made up of 6 graphics processing clusters with 8 SM units in each cluster. That makes 48 SM units for a total of 6144 cores, in a 17.4 billion transistor die measuring 392.5 mm2.
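None of the launch cards ships with a fully enabled die. The sketch below compares the SM counts from the breakdown table against the full-die counts given above; the dictionary and function names are made up for this example.

```python
CORES_PER_SM = 128
FULL_DIE_SMS = {"GA102": 84, "GA104": 48}   # full dies as described above

# (die, enabled SMs) for each card, from the breakdown table.
CARDS = {
    "RTX 3090": ("GA102", 82),
    "RTX 3080": ("GA102", 68),
    "RTX 3070": ("GA104", 46),
}


def summarize(card: str) -> tuple:
    """Return (CUDA cores, SMs disabled relative to the full die)."""
    die, enabled_sms = CARDS[card]
    return enabled_sms * CORES_PER_SM, FULL_DIE_SMS[die] - enabled_sms


for card in CARDS:
    print(card, summarize(card))
# RTX 3090 (10496, 2)
# RTX 3080 (8704, 16)
# RTX 3070 (5888, 2)
```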