NVIDIA Ada GPU - Ada Streaming Multiprocessor, Ada GPC & Ada GPUs Deep Dive

Let's retrace the road to Ada. In 2016, NVIDIA announced its Pascal GPUs, which would soon be featured in its top-to-bottom GeForce 10 series lineup. By the time Maxwell had launched, NVIDIA had gained a lot of experience in efficiency, an area it had focused on since its Kepler GPUs.

Four years ago, rather than offering another standard leap in rasterization performance, NVIDIA took a different approach and introduced two key technologies in its Turing line of consumer GPUs: AI-assisted acceleration with Tensor Cores and hardware-level acceleration for ray tracing with its brand new RT Cores.

Then came Ampere. Built on a brand new Samsung 8nm fabrication process, it added even more to NVIDIA's gaming graphics lineup. In the Ampere GPU architecture, NVIDIA provided its latest Ampere SM along with next-gen FP32, INT32, Tensor Cores, and RT Cores. The focus was to boost both rasterization and ray tracing capabilities to new heights.

Now enter Ada, a brand new architecture that aims to take everything from the first two RTX generations and perfect it. The graphics architecture is designed for speed, and speed is where it excels. So let's look at the architecture in detail. The main highlights of the Ada Lovelace GPU architecture are:

  • Revolutionary New Architecture: NVIDIA Ada architecture GPUs deliver outstanding performance for graphics, AI, and compute workloads with exceptional architectural and power efficiency. After the baseline design for the Ada SM was established, the chip was scaled up to shatter records. Manufacturing innovations and materials research enabled NVIDIA engineers to craft a GPU with 76.3 billion transistors and 18,432 CUDA Cores capable of running at clocks over 2.5 GHz while maintaining the same 450W TGP as the prior generation flagship GeForce RTX 3090 Ti GPU. The result is the world’s fastest GPU with the power, acoustics, and temperature characteristics expected of a high-end graphics card.
  • New Ada RT Core for Faster Ray Tracing: For decades, rendering ray-traced scenes with physically correct lighting in real-time has been considered the holy grail of graphics. At the same time, the geometric complexity of environments and objects continues to increase as 3D games and graphics continually strive to provide the most accurate representations of the real world. The Ada RT Core has been enhanced to deliver 2x faster ray-triangle intersection testing and includes two important new hardware units. An Opacity Micromap Engine speeds up ray tracing of alpha-tested geometry by a factor of 2x, and a Displaced Micro-Mesh Engine generates Displaced Micro-Triangles on-the-fly to create additional geometry. The Micro-Mesh Engine provides the benefit of increased geometric complexity without the traditional performance and storage costs of complex geometries.
  • Shader Execution Reordering: NVIDIA Ada GPUs support Shader Execution Reordering, which dynamically organizes and reorders shading workloads to improve RT shading efficiency. This improves performance by up to 44% in Cyberpunk 2077 with Ray Tracing Overdrive Mode.
  • NVIDIA DLSS 3: The Ada architecture features an all-new Optical Flow Accelerator and AI frame generation that boosts DLSS 3’s frame rates up to 2x over the previous DLSS 2.0 while maintaining or exceeding native image quality. Compared to traditional brute-force graphics rendering, DLSS 3 is ultimately up to 4x faster while providing low system latency.
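To make the Shader Execution Reordering highlight a little more concrete, here is a toy Python model. It is not how NVIDIA's hardware or driver implements SER; it only illustrates why grouping work that invokes the same shader can help. The 32-thread warp size is the standard NVIDIA figure, while the shader IDs, the hit pattern, and the `divergent_warps` helper are made up for illustration.

```python
WARP_SIZE = 32  # threads that execute together on an NVIDIA SM


def divergent_warps(shader_ids, warp_size=WARP_SIZE):
    """Count warps whose threads need more than one distinct shader."""
    warps = [shader_ids[i:i + warp_size]
             for i in range(0, len(shader_ids), warp_size)]
    return sum(1 for warp in warps if len(set(warp)) > 1)


# Hypothetical workload: 128 ray hits cycling through 4 material shaders,
# a stand-in for the incoherent pattern that ray tracing tends to produce.
hits = [i % 4 for i in range(128)]

before = divergent_warps(hits)          # every warp mixes all 4 shaders
after = divergent_warps(sorted(hits))   # reordered: one shader per warp

print(f"divergent warps before reordering: {before}")
print(f"divergent warps after reordering:  {after}")
```

In this simplified picture, sorting stands in for the reordering step: once threads that run the same shader sit in the same warp, no warp has to execute several shaders back to back.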

The NVIDIA Ada Lovelace AD102 GPU features up to 12 GPCs (Graphics Processing Clusters), 5 more than the Ampere GA102 GPU. Each GPC consists of 6 TPCs with 2 SMs each, which is the same configuration as the existing chip. Each SM (Streaming Multiprocessor) houses four sub-cores, which is also the same as the GA102 GPU. The part worth spelling out is the FP32 and INT32 core configuration. Each SM includes 64 dedicated FP32 units, but the combined FP32+INT32 unit count goes up to 128. This is because half of the units do not share a datapath with the INT32 units: the 64 dedicated FP32 cores are separate from the 64 INT32-capable cores.

So in total, each sub-core consists of 16 FP32 plus 16 INT32 units for a total of 32 units. Each SM has 64 FP32 units plus 64 INT32 units for a total of 128 units. And since there are 144 SMs (12 per GPC), we are looking at a total of 18,432 cores. Each SM also includes two warp schedulers (32 threads/clk) for 64 warps per SM, along with their own L0 i-cache. This is a 33% increase in warps/threads vs the GA102 GPU. The register file is 16,384 x 32-bit per sub-core. Each SM also carries its own 128 KB of L1 data cache and shared memory, which adds up to 18 MB of L1 cache across the full die.
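The arithmetic in the paragraph above is easy to verify. The short Python sketch below simply multiplies out the per-sub-core, per-SM, and per-GPC figures quoted in the text; the variable names are ours and not NVIDIA terminology.

```python
# Figures quoted in the text above
FP32_PER_SUBCORE = 16
INT32_PER_SUBCORE = 16
SUBCORES_PER_SM = 4
SMS_PER_GPC = 12
GPCS = 12
L1_KB_PER_SM = 128

units_per_subcore = FP32_PER_SUBCORE + INT32_PER_SUBCORE
units_per_sm = units_per_subcore * SUBCORES_PER_SM
total_sms = SMS_PER_GPC * GPCS
total_cores = units_per_sm * total_sms
total_l1_mb = L1_KB_PER_SM * total_sms / 1024  # KB -> MB

print(f"units per sub-core: {units_per_subcore}")
print(f"units per SM:       {units_per_sm}")
print(f"SMs on full die:    {total_sms}")
print(f"cores on full die:  {total_cores}")
print(f"L1 on full die:     {total_l1_mb} MB")
```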

Moving over to the cache, this is another segment where NVIDIA has given a big boost over the existing Ampere GPUs. The L2 cache will be increased to 96 MB as mentioned in the leaks. This is a 16x increase over the Ampere GPU that hosts just 6 MB of L2 cache. The cache will be shared across the GPU. The GPU will also feature up to 192 ROPs for the full-die.

There are also going to be the latest 4th Generation Tensor and 3rd Generation RT (Raytracing) cores infused on the Ada Lovelace GPUs which will help boost DLSS & Raytracing performance to the next level. Overall, the Ada Lovelace AD102 GPU will offer:

  • 71% More GPCs (Versus Ampere)
  • 71% More Cores (Versus Ampere)
  • 50% More L1 Cache (Versus Ampere)
  • 16x More L2 Cache (Versus Ampere)
  • 71% More ROPs (Versus Ampere)
  • 4th Gen Tensor & 3rd Gen RT Cores
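Several of the generational uplifts listed above can be reproduced from numbers that appear in this article. The Python sketch below checks only the GPC, L2, and ROP figures: the GA102 baseline of 7 GPCs follows from the "5 more" remark, the 6 MB of L2 is quoted in the text, and the 112 ROPs come from the RTX 3090 column of the Ampere table further down. The `percent_more` helper is just for illustration.

```python
def percent_more(new, old):
    """Return how much larger `new` is than `old`, in percent."""
    return (new / old - 1) * 100


ad102 = {"GPCs": 12, "L2 cache (MB)": 96, "ROPs": 192}
ga102 = {"GPCs": 7, "L2 cache (MB)": 6, "ROPs": 112}

for key in ad102:
    ratio = ad102[key] / ga102[key]
    uplift = percent_more(ad102[key], ga102[key])
    print(f"{key}: {ratio:.2f}x ({uplift:.0f}% more)")
```

The GPC and ROP counts both work out to roughly 71% more, and the L2 cache to exactly 16x, matching the list.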

The full die has not been featured on any GPU so far, not even the L40 which has 2 SMs disabled. It is likely that as yields progress, we will eventually see a gaming and workstation product using the full-fat AD102. Till then, the RTX 4090 is the top gaming graphics card while the RTX 6000 Ada is the top workstation solution.

NVIDIA AD102 'Ada Lovelace' Gaming GPU Block Diagram:

NVIDIA AD102 'Ada Lovelace' Gaming GPU 'SM' Block Diagram:

NVIDIA GeForce RTX 4090

  • 82.6 TFLOPS of peak single-precision (FP32) performance
  • 165.2 TFLOPS of peak half-precision (FP16) performance
  • 660.6 Tensor TFLOPS
  • 1321.2 Tensor TFLOPS with sparsity
  • 191 RT-TFLOPS
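For readers wondering where a figure like 82.6 TFLOPS comes from, peak FP32 throughput is conventionally computed as 2 operations (one fused multiply-add) per CUDA core per clock. The Python sketch below applies that formula. Note that the 16,384 CUDA cores and 2.52 GHz boost clock are the RTX 4090's published specifications and are not stated in this article, so treat them as assumptions here.

```python
# Assumed RTX 4090 specifications (not quoted in the article itself)
CUDA_CORES = 16_384
BOOST_CLOCK_GHZ = 2.52
FLOPS_PER_CORE_PER_CLOCK = 2  # one fused multiply-add counts as 2 FLOPs

fp32_tflops = CUDA_CORES * BOOST_CLOCK_GHZ * FLOPS_PER_CORE_PER_CLOCK / 1000

# Figures from the list above
listed_fp32 = 82.6
listed_fp16 = 165.2
listed_tensor = 660.6
listed_tensor_sparse = 1321.2

print(f"computed FP32: {fp32_tflops:.1f} TFLOPS (listed: {listed_fp32})")
print(f"listed FP16 / FP32 ratio:       {listed_fp16 / listed_fp32:.1f}")
print(f"listed sparse / dense ratio:    {listed_tensor_sparse / listed_tensor:.1f}")
```

The computed value rounds to the listed 82.6 TFLOPS, and the listed FP16 and sparsity figures are each exactly double their FP32 and dense counterparts.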

At the heart of the NVIDIA GeForce RTX 4090 graphics card lies the Ada Lovelace AD102 GPU. The GPU measures 608.5 mm² and utilizes the TSMC 4N process node, which is an optimized version of TSMC's 5nm (N5) node designed for the green team. The GPU features an insane 76.3 billion transistors.

NVIDIA Ampere "GeForce RTX 30" GPUs Full Breakdown:

| Graphics Card | NVIDIA GeForce RTX 2070 SUPER | NVIDIA GeForce RTX 3070 | NVIDIA GeForce RTX 2080 | NVIDIA GeForce RTX 3080 | NVIDIA Titan RTX | NVIDIA GeForce RTX 3090 |
|---|---|---|---|---|---|---|
| GPU Codename | TU106 | GA104 | TU104 | GA102 | TU102 | GA102 |
| GPU Architecture | NVIDIA Turing | NVIDIA Ampere | NVIDIA Turing | NVIDIA Ampere | NVIDIA Turing | NVIDIA Ampere |
| GPCs | 5 or 6 | 6 | 6 | 6 | 6 | 7 |
| TPCs | 20 | 23 | 23 | 34 | 36 | 41 |
| SMs | 40 | 46 | 46 | 68 | 72 | 82 |
| CUDA Cores / SM | 64 | 128 | 64 | 128 | 64 | 128 |
| CUDA Cores / GPU | 2560 | 5888 | 2944 | 8704 | 4608 | 10496 |
| Tensor Cores / SM | 8 (2nd Gen) | 4 (3rd Gen) | 8 (2nd Gen) | 4 (3rd Gen) | 8 (2nd Gen) | 4 (3rd Gen) |
| Tensor Cores / GPU | 320 (2nd Gen) | 184 (3rd Gen) | 368 | 272 (3rd Gen) | 576 (2nd Gen) | 328 (3rd Gen) |
| RT Cores | 40 (1st Gen) | 46 (2nd Gen) | 46 (1st Gen) | 68 (2nd Gen) | 72 (1st Gen) | 82 (2nd Gen) |
| GPU Boost Clock (MHz) | 1770 | 1725 | 1800 | 1710 | 1770 | 1695 |
| Peak FP32 TFLOPS (non-Tensor) | 9.1 | 20.3 | 10.6 | 29.8 | 16.3 | 35.6 |
| Peak FP16 TFLOPS (non-Tensor) | 18.1 | 20.3 | 21.2 | 29.8 | 32.6 | 35.6 |
| Peak BF16 TFLOPS (non-Tensor) | NA | 20.3 | NA | 29.8 | NA | 35.6 |
| Peak INT32 TOPS (non-Tensor) | 9.1 | 10.2 | 10.6 | 14.9 | 16.3 | 17.8 |
| Peak FP16 Tensor TFLOPS with FP16 Accumulate | 72.5 | 81.3/162.6 | 84.8 | 119/238 | 130.5 | 142/284 |
| Peak FP16 Tensor TFLOPS with FP32 Accumulate | 36.3 | 40.6/81.3 | 42.4 | 59.5/119 | 65.2 | 71/142 |
| Peak BF16 Tensor TFLOPS with FP32 Accumulate | NA | 40.6/81.3 | NA | 59.5/119 | NA | 71/142 |
| Peak TF32 Tensor TFLOPS | NA | 20.3/40.6 | NA | 29.8/59.5 | NA | 35.6/71 |
| Peak INT8 Tensor TOPS | 145 | 162.6/325.2 | 169.6 | 238/476 | 261 | 284/568 |
| Peak INT4 Tensor TOPS | 290 | 325.2/650.4 | 339.1 | 476/952 | 522 | 568/1136 |
| Frame Buffer Memory Size and Type | 8 GB GDDR6 | 8 GB GDDR6 | 8 GB GDDR6 | 10 GB GDDR6X | 24 GB GDDR6 | 24 GB GDDR6X |
| Memory Interface | 256-bit | 256-bit | 256-bit | 320-bit | 384-bit | 384-bit |
| Memory Clock (Data Rate) | 14 Gbps | 14 Gbps | 14 Gbps | 19 Gbps | 14 Gbps | 19.5 Gbps |
| Memory Bandwidth | 448 GB/sec | 448 GB/sec | 448 GB/sec | 760 GB/sec | 672 GB/sec | 936 GB/sec |
| ROPs | 64 | 96 | 64 | 96 | 96 | 112 |
| Pixel Fill-rate (Gigapixels/sec) | 113.3 | 165.6 | 115.2 | 164.2 | 169.9 | 193 |
| Texture Units | 160 | 184 | 184 | 272 | 288 | 328 |
| Texel Fill-rate (Gigatexels/sec) | 283.2 | 317.4 | 331.2 | 465 | 509.8 | 566 |
| L1 Data Cache/Shared Memory | 3840 KB | 5888 KB | 4416 KB | 8704 KB | 6912 KB | 10496 KB |
| L2 Cache Size | 4096 KB | 4096 KB | 4096 KB | 5120 KB | 6144 KB | 6144 KB |
| Register File Size | 10240 KB | 11776 KB | 11776 KB | 17408 KB | 18432 KB | 20992 KB |
| TGP (Total Graphics Power) | 215 W | 220 W | 225 W | 320 W | 280 W | 350 W |
| Transistor Count | 13.6 Billion | 17.4 Billion | 13.6 Billion | 28.3 Billion | 18.6 Billion | 28.3 Billion |
| Die Size | 545 mm² | 392.5 mm² | 545 mm² | 628.4 mm² | 754 mm² | 628.4 mm² |
| Manufacturing Process | TSMC 12 nm FFN (FinFET NVIDIA) | Samsung 8 nm 8N NVIDIA Custom Process | TSMC 12 nm FFN (FinFET NVIDIA) | Samsung 8 nm 8N NVIDIA Custom Process | TSMC 12 nm FFN (FinFET NVIDIA) | Samsung 8 nm 8N NVIDIA Custom Process |
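Two rows of the table above can be derived from other rows, which makes for a handy sanity check. The Python sketch below recomputes peak FP32 TFLOPS as 2 x CUDA cores x boost clock, and memory bandwidth as bus width in bytes x data rate, using only values from the table. The two formulas are the conventional ones for these metrics, and the tuple layout and names are ours.

```python
# (name, CUDA cores, boost MHz, listed FP32 TFLOPS, bus bits, Gbps, listed GB/s)
cards = [
    ("RTX 2070 SUPER", 2560, 1770, 9.1, 256, 14.0, 448),
    ("RTX 3070", 5888, 1725, 20.3, 256, 14.0, 448),
    ("RTX 2080", 2944, 1800, 10.6, 256, 14.0, 448),
    ("RTX 3080", 8704, 1710, 29.8, 320, 19.0, 760),
    ("Titan RTX", 4608, 1770, 16.3, 384, 14.0, 672),
    ("RTX 3090", 10496, 1695, 35.6, 384, 19.5, 936),
]


def peak_fp32_tflops(cores, boost_mhz):
    """2 FLOPs (one fused multiply-add) per core per clock."""
    return 2 * cores * boost_mhz / 1e6


def bandwidth_gb_s(bus_bits, data_rate_gbps):
    """Bytes moved per transfer across the bus times the per-pin data rate."""
    return bus_bits / 8 * data_rate_gbps


for name, cores, mhz, tflops, bus, gbps, gbs in cards:
    calc_tflops = peak_fp32_tflops(cores, mhz)
    calc_bw = bandwidth_gb_s(bus, gbps)
    print(f"{name}: {calc_tflops:.1f} TFLOPS (listed {tflops}), "
          f"{calc_bw:.0f} GB/sec (listed {gbs})")
```

Every computed value rounds to the figure listed in the table, for both the Turing and the Ampere cards.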

NVIDIA Ada GPUs - AD102, AD103, AD104 For The First Wave of Gaming Cards

NVIDIA is first introducing three brand new Ada GPUs which include the AD102, AD103 & AD104. The AD102 GPU is going to be featured on the GeForce RTX 4090, the AD103 is going to be used by the GeForce RTX 4080 16 GB graphics cards and the AD104 GPU is going to be featured on the GeForce RTX 4080 12 GB graphics cards.

The Ada GPUs are based on the TSMC 4N process node which is a custom process designed exclusively for NVIDIA. It is essentially an optimized version of the N5 (5nm) process, offering drastic increases in transistors, cores, and frequency. The top AD102 GPU packs 70% more cores and also offers 76.3 Billion transistors while offering over 2x the performance per watt.

NVIDIA Ada AD102 GPU

The full AD102 GPU is made up of 12 graphics processing clusters with 12 SM units on each cluster. That makes up 144 SM units for a total of 18,432 cores, 144 RT cores, 576 Tensor Cores, 576 Texture Units, and a 384-bit bus interface in a 76.3 billion transistor package measuring 608.5 mm².
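Dividing the full-die totals above by the SM count gives the per-SM makeup of the chip. The Python sketch below does only that division, using the numbers from the paragraph; the dictionary keys are ours.

```python
GPCS = 12
SMS_PER_GPC = 12
total_sms = GPCS * SMS_PER_GPC

# Full-die totals quoted in the paragraph above
full_die = {
    "CUDA cores": 18_432,
    "RT cores": 144,
    "Tensor cores": 576,
    "Texture units": 576,
}

# Integer division is safe here because every total is a multiple of 144
per_sm = {unit: count // total_sms for unit, count in full_die.items()}

print(f"SMs on full die: {total_sms}")
for unit, count in per_sm.items():
    print(f"{unit} per SM: {count}")
```

That works out to 128 CUDA cores, 1 RT core, 4 Tensor Cores, and 4 texture units in each SM.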
