NVIDIA Ampere GPU - Ampere Streaming Multiprocessor, Ampere GPC & Ampere GPUs Deep Dive

Let's retrace the road to Ampere. In 2016, NVIDIA announced its Pascal GPUs, which would soon be featured across its top-to-bottom GeForce 10 series lineup. After the launch of Maxwell, NVIDIA had gained a lot of experience in power efficiency, which had been a focus since its Kepler GPUs. Two years ago, rather than offering another standard leap in rasterization performance, NVIDIA took a different approach and introduced two key technologies in its Turing line of consumer GPUs: AI-assisted acceleration with Tensor Cores, and hardware-level acceleration for ray tracing with its brand new RT cores.

With Ampere and its brand new Samsung 8nm fabrication process, NVIDIA is adding even more to its gaming graphics lineup. Starting with the most significant part of the Ampere GPU architecture, the Ampere SM, we are seeing an entirely new graphics core. The Ampere SM features next-gen FP32 and INT32 cores, Tensor Cores, and RT cores.

Coming to the new execution units or cores, Ampere has both INT32 and FP32 units which can execute concurrently. This architectural design allows Ampere to execute floating-point and non-floating-point operations in parallel, which allows for higher throughput in standard floating-point operations. According to NVIDIA, the updated Ampere graphics core delivers up to 1.7x faster traditional rasterization performance and up to 2x faster ray-tracing performance compared to the Turing GPUs.

The Ampere SM is partitioned into four processing blocks, each with 32 FP32 Cores, 16 INT32 Cores, one Tensor Core, one warp scheduler, and one dispatch unit. This is made possible by two datapaths per block: one offers 16 FP32 execution units, while the other offers either 16 FP32 or 16 INT32 execution units. This adds up to 128 FP32 Cores, 64 INT32 Cores, 4 Tensor Cores, 4 Warp Schedulers, and 4 Dispatch Units on a single Ampere SM. Each block also includes a new L0 instruction cache and a 64 KB register file, for a total of 256 KB of register file per SM.

 One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.
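The per-clock figures above can be reproduced with a small model. This is only a sketch of the arithmetic in the paragraph, not NVIDIA's scheduler logic; the function and constant names are made up for illustration.

```python
# Per-clock throughput model for one Ampere SM, using the figures quoted above.
PARTITIONS_PER_SM = 4
FP32_ONLY_LANES = 16  # datapath 1: FP32 only
SHARED_LANES = 16     # datapath 2: either FP32 or INT32 in a given clock


def sm_ops_per_clock(shared_path_runs_int32: bool) -> dict:
    """Return FP32 and INT32 operations per clock for a whole SM."""
    fp32 = FP32_ONLY_LANES + (0 if shared_path_runs_int32 else SHARED_LANES)
    int32 = SHARED_LANES if shared_path_runs_int32 else 0
    return {"fp32": fp32 * PARTITIONS_PER_SM, "int32": int32 * PARTITIONS_PER_SM}


pure_fp32 = sm_ops_per_clock(shared_path_runs_int32=False)
mixed = sm_ops_per_clock(shared_path_runs_int32=True)
print(pure_fp32)  # {'fp32': 128, 'int32': 0}
print(mixed)      # {'fp32': 64, 'int32': 64}
```

The 128 FP32 rate is therefore a best case: any clock in which the shared datapath issues integer work drops the SM back to 64 FP32 operations.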

Doubling the processing speed for FP32 improves performance for a number of common graphics and compute operations and algorithms. Modern shader workloads typically have a mixture of FP32 arithmetic instructions such as FFMA, floating point additions (FADD), or floating point multiplications (FMUL), combined with simpler instructions such as integer adds for addressing and fetching data, floating point compare, or min/max for processing results, etc. Performance gains will vary at the shader and application level depending on the mix of instructions. Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.

Doubling math throughput required doubling the data paths supporting it, which is why the Ampere SM also doubled the shared memory and L1 cache performance for the SM (128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.
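The quoted bandwidth figures appear to follow from bytes per clock multiplied by the boost clock of a single SM. The sketch below checks that reading. The 1710 MHz value is the RTX 3080 boost clock listed in this article's table; the 1815 MHz boost clock for the RTX 2080 Super does not appear in this article and is an assumption.

```python
def l1_bandwidth_gb_per_s(bytes_per_clock: int, clock_mhz: float) -> float:
    """L1/shared memory bandwidth of one SM in GB/sec."""
    return bytes_per_clock * clock_mhz * 1e6 / 1e9


rtx_3080 = l1_bandwidth_gb_per_s(128, 1710)        # Ampere: 128 bytes/clock
rtx_2080_super = l1_bandwidth_gb_per_s(64, 1815)   # Turing: 64 bytes/clock (clock assumed)
print(round(rtx_3080), round(rtx_2080_super))  # 219 116
```

Under these assumptions both numbers match the quoted ones, which suggests the 219 GB/sec and 116 GB/sec figures are per SM rather than a sum across the whole GPU.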

Like prior NVIDIA GPUs, Ampere is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPS), and memory controllers.

The GPC is the dominant high-level hardware block with all of the key graphics processing units residing inside the GPC. Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs. More details on the NVIDIA Ampere architecture can be found in NVIDIA’s Ampere Architecture White Paper, which will be published in the coming days.


The four processing blocks share a combined 128 KB L1 data cache/shared memory. Traditional graphics workloads partition the 128 KB L1/shared memory as 64 KB of dedicated graphics shader RAM and 64 KB for texture cache and register file spill area. In compute mode, the GA10x SM will support the following configurations:

  • 128 KB L1 + 0 KB Shared Memory
  • 120 KB L1 + 8 KB Shared Memory
  • 112 KB L1 + 16 KB Shared Memory
  • 96 KB L1 + 32 KB Shared Memory
  • 64 KB L1 + 64 KB Shared Memory
  • 28 KB L1 + 100 KB Shared Memory
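Every configuration in the list splits the same 128 KB pool. As a rough illustration, the sketch below checks that and picks the smallest listed shared-memory size that fits a request. `pick_config` is a hypothetical helper written for this example, not a CUDA API.

```python
# (L1 KB, shared memory KB) pairs exactly as listed above.
GA10X_COMPUTE_CONFIGS = [(128, 0), (120, 8), (112, 16), (96, 32), (64, 64), (28, 100)]


def pick_config(shared_kb_needed: int) -> tuple:
    """Hypothetical helper: smallest listed shared-memory size that fits the request."""
    candidates = [c for c in GA10X_COMPUTE_CONFIGS if c[1] >= shared_kb_needed]
    if not candidates:
        raise ValueError("request exceeds the largest listed shared-memory size")
    return min(candidates, key=lambda c: c[1])


# Every split adds up to the same 128 KB of L1/shared memory.
assert all(l1 + shared == 128 for l1, shared in GA10X_COMPUTE_CONFIGS)

print(pick_config(20))  # (96, 32)
print(pick_config(0))   # (128, 0)
```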

Ampere also ties its ROPs to the GPC and houses a total of 16 ROP units per GPC. The full GA102 GPU features 112 ROPs while the GeForce RTX 3080 comes with a total of 96 ROPs.

The block diagram of the NVIDIA Ampere SM Gaming GPUs.

The entire SM works in harmony, using its different blocks to deliver high performance and better texture caching, enabling up to twice the CUDA core performance of the previous generation.

A block diagram of the GA102 GPU featured on the NVIDIA GeForce RTX 3080 graphics card.

Many of these Ampere SMs combine to form the Ampere GPU. Each TPC inside the Ampere GPU houses 2 Ampere SMs which are linked to the raster engine. There are a total of 6 TPCs or 12 Ampere SMs arranged inside each GPC or Graphics Processing Cluster. The top configured GA102 GPU comes with 7 GPCs with a total of 42 TPCs and 84 SMs that are connected to 10 MB of L1 and 6 MB of L2 cache, ROPs, TMUs, memory controllers, and NVLINK HighSpeed I/O hub. All of this combines to form the massive Ampere GA102 GPU. The following are some perf figures for the top Ampere graphics cards.
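The unit counts for the full GA102 follow directly from the GPC, TPC and SM hierarchy. Here is a minimal sketch of that arithmetic, using only figures quoted in this article (the constant names are illustrative):

```python
GPCS = 7
TPCS_PER_GPC = 6
SMS_PER_TPC = 2
FP32_CORES_PER_SM = 128
L1_KB_PER_SM = 128  # combined L1 data cache / shared memory per SM

tpcs = GPCS * TPCS_PER_GPC        # 42
sms = tpcs * SMS_PER_TPC          # 84
cores = sms * FP32_CORES_PER_SM   # 10752
l1_total_mb = sms * L1_KB_PER_SM / 1024

print(tpcs, sms, cores, l1_total_mb)  # 42 84 10752 10.5
```

The aggregate L1 works out to 10.5 MB, which lines up with the roughly 10 MB quoted for the full chip.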

NVIDIA GeForce RTX 3090

  • 35.58 TFLOPS of peak single-precision (FP32) performance
  • 71.16 TFLOPS of peak half-precision (FP16) performance
  • 17.79 TIPS concurrent with FP, through independent integer execution units
  • 258 Tensor TFLOPS
  • 69 RT-TFLOPs

NVIDIA GeForce RTX 3080

  • 30 TFLOPS of peak single-precision (FP32) performance
  • 60 TFLOPS of peak half-precision (FP16) performance
  • 15 TIPS concurrent with FP, through independent integer execution units
  • 238 Tensor TFLOPS
  • 58 RT-TFLOPs
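The peak FP32 numbers can be reproduced from the CUDA core counts and boost clocks in the breakdown table, assuming the usual convention that a fused multiply-add counts as two floating-point operations per core per clock. The function name is made up for illustration.

```python
def peak_fp32_tflops(cuda_cores: int, boost_mhz: float) -> float:
    """Peak FP32 rate, counting one FMA (2 FLOPs) per core per clock."""
    return cuda_cores * 2 * boost_mhz * 1e6 / 1e12


rtx_3090 = peak_fp32_tflops(10496, 1695)
rtx_3080 = peak_fp32_tflops(8704, 1710)
print(round(rtx_3090, 2))  # 35.58
print(round(rtx_3080, 2))  # 29.77
```

These are theoretical peaks at the rated boost clock. The RTX 3080 result of 29.77 is what the list above rounds to 30 TFLOPS.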

In terms of shading performance, which is the direct result of the enhanced core design and GPU architecture revamp, the Ampere GPU offers up to 70% better performance per core compared to Turing GPUs.

It should be pointed out that these are just per core performance gains at the same clock speeds without adding the benefits of other technologies that Ampere comes with. That would further increase the performance in a wide variety of gaming applications.

NVIDIA Ampere "GeForce RTX 30" GPUs Full Breakdown:

| Graphics Card | NVIDIA GeForce RTX 2070 SUPER | NVIDIA GeForce RTX 3070 | NVIDIA GeForce RTX 2080 | NVIDIA GeForce RTX 3080 | NVIDIA Titan RTX | NVIDIA GeForce RTX 3090 |
| --- | --- | --- | --- | --- | --- | --- |
| GPU Codename | TU106 | GA104 | TU104 | GA102 | TU102 | GA102 |
| GPU Architecture | NVIDIA Turing | NVIDIA Ampere | NVIDIA Turing | NVIDIA Ampere | NVIDIA Turing | NVIDIA Ampere |
| GPCs | 5 or 6 | 6 | 6 | 6 | 6 | 7 |
| CUDA Cores / SM | 64 | 128 | 64 | 128 | 64 | 128 |
| CUDA Cores / GPU | 2560 | 5888 | 2944 | 8704 | 4608 | 10496 |
| Tensor Cores / SM | 8 (2nd Gen) | 4 (3rd Gen) | 8 (2nd Gen) | 4 (3rd Gen) | 8 (2nd Gen) | 4 (3rd Gen) |
| Tensor Cores / GPU | 320 (2nd Gen) | 184 (3rd Gen) | 368 | 272 (3rd Gen) | 576 (2nd Gen) | 328 (3rd Gen) |
| RT Cores | 40 (1st Gen) | 46 (2nd Gen) | 46 (1st Gen) | 68 (2nd Gen) | 72 (1st Gen) | 82 (2nd Gen) |
| GPU Boost Clock (MHz) | 1770 | 1725 | 1800 | 1710 | 1770 | 1695 |
| Peak FP32 TFLOPS (non-Tensor) | 9.1 | 20.3 | 10.6 | 29.8 | 16.3 | 35.6 |
| Peak FP16 TFLOPS (non-Tensor) | 18.1 | 20.3 | 21.2 | 29.8 | 32.6 | 35.6 |
| Peak BF16 TFLOPS (non-Tensor) | NA | 20.3 | NA | 29.8 | NA | 35.6 |
| Peak INT32 TOPS (non-Tensor) | | | | | | |
| Peak FP16 Tensor TFLOPS with FP16 Accumulate | | | | | | |
| Peak FP16 Tensor TFLOPS with FP32 Accumulate | | | | | | |
| Peak BF16 Tensor TFLOPS with FP32 Accumulate | | | | | | |
| Peak TF32 Tensor TFLOPS | NA | 20.3/40.6 | NA | 29.8/59.5 | NA | 35.6/71 |
| Peak INT8 Tensor TOPS | 145 | 162.6/325.2 | 169.6 | 238/476 | 261 | 284/568 |
| Peak INT4 Tensor TOPS | 290 | 325.2/650.4 | 339.1 | 476/952 | 522 | 568/1136 |
| Frame Buffer Memory Size and Memory Interface | 256-bit | 256-bit | 256-bit | 320-bit | 384-bit | 384-bit |
| Memory Clock (Data Rate) | 14 Gbps | 14 Gbps | 14 Gbps | 19 Gbps | 14 Gbps | 19.5 Gbps |
| Memory Bandwidth | 448 GB/sec | 448 GB/sec | 448 GB/sec | 760 GB/sec | 672 GB/sec | 936 GB/sec |
| Pixel Fill-rate (Gigapixels/sec) | 113.3 | 165.6 | 115.2 | 164.2 | 169.9 | 193 |
| Texture Units | 160 | 184 | 184 | 272 | 288 | 328 |
| Texel Fill-rate (Gigatexels/sec) | 283.2 | 317.4 | 331.2 | 465 | 509.8 | 566 |
| L1 Data Cache/Shared Memory | 3840 KB | 5888 KB | 4416 KB | 8704 KB | 6912 KB | 10496 KB |
| L2 Cache Size | 4096 KB | 4096 KB | 4096 KB | 5120 KB | 6144 KB | 6144 KB |
| Register File Size | 10240 KB | 11776 KB | 11776 KB | 17408 KB | 18432 KB | 20992 KB |
| TGP (Total Graphics Power) | 215 W | 220 W | 225 W | 320 W | 280 W | 350 W |
| Transistor Count | 13.6 Billion | 17.4 Billion | 13.6 Billion | 28.3 Billion | 18.6 Billion | 28.3 Billion |
| Die Size | 545 mm² | 392.5 mm² | 545 mm² | 628.4 mm² | 754 mm² | 628.4 mm² |
| Manufacturing Process | TSMC 12 nm FFN | Samsung 8 nm 8N NVIDIA Custom Process | TSMC 12 nm FFN | Samsung 8 nm 8N NVIDIA Custom Process | TSMC 12 nm FFN | Samsung 8 nm 8N NVIDIA Custom Process |
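The memory bandwidth row can be cross-checked against the memory interface and data rate rows: bandwidth is the bus width in bytes multiplied by the per-pin data rate. The snippet is just that arithmetic applied to the three Ampere cards in the table; the names are illustrative.

```python
def memory_bandwidth_gb_per_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Bus width (bits) / 8 gives bytes per transfer; multiply by Gbps per pin."""
    return bus_width_bits / 8 * data_rate_gbps


# (memory interface in bits, data rate in Gbps) as listed in the table
cards = {
    "RTX 3070": (256, 14.0),
    "RTX 3080": (320, 19.0),
    "RTX 3090": (384, 19.5),
}
bandwidth = {name: memory_bandwidth_gb_per_s(bus, rate) for name, (bus, rate) in cards.items()}
print(bandwidth)  # {'RTX 3070': 448.0, 'RTX 3080': 760.0, 'RTX 3090': 936.0}
```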

NVIDIA Ampere GPUs - GA102 & GA104 For The First Wave of Gaming Cards

NVIDIA is first introducing two brand new Ampere GPUs, the GA102 and the GA104. The GA102 GPU is going to be featured on the GeForce RTX 3090 and GeForce RTX 3080 graphics cards, while the GA104 GPU is going to be featured on the GeForce RTX 3070 graphics card. The Ampere GPUs are based on the Samsung 8nm custom process node for NVIDIA and, as such, the resultant GPU dies are smaller than their Turing-based predecessors but come with a denser transistor layout. There will be several variations of each GPU featured across the RTX 30 series lineup. Following is what the complete GA102 and GA104 GPUs have to offer.


The full GA102 GPU is made up of 7 graphics processing clusters with 12 SM units on each cluster. That makes up 84 SM units for a total of 10752 cores in a 28.3 billion transistor package measuring 628.4 mm².


The full GA104 GPU is made up of 6 graphics processing clusters with 8 SM units on each cluster. That makes up 48 SM units for a total of 6144 cores in a 17.4 billion transistor package measuring 392.5 mm².
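None of the launch cards use a fully enabled die. Dividing each card's CUDA core count from the breakdown table by 128 cores per SM shows how many SMs are active on each. This sketch uses only numbers from this article.

```python
CORES_PER_SM = 128

# Full-die SM counts: GPCs x SMs per GPC, as described above.
full_die_sms = {"GA102": 7 * 12, "GA104": 6 * 8}

# (die, CUDA cores per GPU) taken from the breakdown table.
cards = {
    "RTX 3090": ("GA102", 10496),
    "RTX 3080": ("GA102", 8704),
    "RTX 3070": ("GA104", 5888),
}

enabled_sms = {name: cores // CORES_PER_SM for name, (_, cores) in cards.items()}
for name, (die, _) in cards.items():
    print(f"{name}: {enabled_sms[name]} of {full_die_sms[die]} SMs enabled")
# RTX 3090: 82 of 84 SMs enabled
# RTX 3080: 68 of 84 SMs enabled
# RTX 3070: 46 of 48 SMs enabled
```

The resulting SM counts match the RT core counts in the table, which is consistent with one RT core per SM.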