NVIDIA Ampere GA100 ‘World’s Biggest 7nm GPU’ Official – Full Architecture Deep Dive, 8192 Cores, 48 GB HBM2, 20X Faster Than Volta

May 14, 2020 at 02:08pm EDT

NVIDIA has officially lifted the curtain on its most powerful GPU to date, the 7nm Ampere GPU. The first product to feature the new Ampere architecture is a GPU called GA100, and this chip is currently the largest GPU produced on TSMC's bleeding-edge 7nm process node. Today, we will be taking a deep dive into the Ampere GA100 GPU architecture, its specifications, and the first products that will feature it.

NVIDIA's Ampere GA100 GPU Official - World's Biggest 7nm GPU With Insane Specs

The Ampere GA100 GPU is by far the largest 7nm GPU ever designed. The GPU is designed entirely for the HPC market, with applications such as scientific research, Artificial Intelligence, Deep Neural Networking, and AI Inferencing. There are a lot of specifications and products to talk about, so let's start.


First of all, the NVIDIA Ampere GA100 GPU will be available in various form factors, ranging from a single mezzanine module to full-length PCIe 4.0 graphics cards. The GPU also comes in various configurations, but the one NVIDIA is highlighting today is the Tesla A100, which is used in the DGX A100 and HGX A100 systems.

The NVIDIA 7nm Ampere GA100 GPU Architecture & Specifications

When it comes to core specifications, the Ampere GA100 GPU from NVIDIA is a complete monster. It measures a massive 826 mm2, which is even bigger than the Volta GV100 GPU at 815 mm2. The GPU also features more than twice the number of transistors, at 54 billion versus 21.1 billion on its predecessor, which is very impressive. Given the die size and the transistor count, the Ampere GA100 GPU is easily the densest GPU ever built.
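To put the density claim into numbers, here is a minimal Python sketch that divides the transistor counts quoted above by the die sizes. The helper name and the MTr/mm2 unit are our own choices for illustration, not something NVIDIA publishes in this form.

```python
# Transistor density from the die sizes and transistor counts quoted in the article.
chips = {
    "Ampere GA100": {"transistors_billion": 54.0, "die_mm2": 826.0},
    "Volta GV100": {"transistors_billion": 21.1, "die_mm2": 815.0},
}

def density_mtr_per_mm2(transistors_billion, die_mm2):
    """Return millions of transistors per square millimetre."""
    return transistors_billion * 1000.0 / die_mm2

densities = {name: density_mtr_per_mm2(**spec) for name, spec in chips.items()}
for name, value in densities.items():
    print(f"{name}: {value:.1f} MTr/mm2")

# How much denser GA100 is than GV100 on these figures.
ratio = densities["Ampere GA100"] / densities["Volta GV100"]
print(f"GA100 vs GV100 density: {ratio:.2f}x")
```

On these figures GA100 lands at roughly 65 million transistors per mm2, about 2.5 times the density of GV100, with only a slightly larger die.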

The full implementation of the NVIDIA Ampere GA100 GPU includes the following units:

The A100 Tensor Core GPU implementation of the NVIDIA Ampere GA100 GPU includes the following units:

Figure 4 shows a full GA100 GPU with 128 SMs. The A100 is base

The Tesla A100 features cut-down specifications due to early 7nm yields, though they are still great considering the size of this 'SUPER GPU'. We're going to look at the full-fat version of the NVIDIA Ampere GA100 GPU first.

Featuring 128 SMs with 8192 CUDA cores, the NVIDIA Ampere GA100 also houses the largest single GPU core count we've ever seen. It comes with 8192 FP32 cores, 4096 FP64 cores, and 512 tensor cores. There are 8 Graphics Processing Clusters on the GPU, each with 16 SM units and 8 TPCs. The GA100 GPU has a TDP of 400W for its Tesla A100 variant.
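The unit counts above all follow from the GPC/TPC/SM hierarchy. The sketch below multiplies it out in Python; the per-SM figures are derived by dividing the quoted totals by 128 SMs, and the two-SMs-per-TPC figure is inferred from 16 SMs and 8 TPCs per GPC, so treat them as our reading of the numbers and not an official listing.

```python
# Full GA100 hierarchy as described in the article.
GPCS = 8
TPCS_PER_GPC = 8
SMS_PER_TPC = 2      # inferred: 16 SMs per GPC / 8 TPCs per GPC

# Per-SM unit counts, derived from the quoted totals divided by 128 SMs.
FP32_PER_SM = 64     # 8192 / 128
FP64_PER_SM = 32     # 4096 / 128
TENSOR_PER_SM = 4    # 512 / 128

def unit_totals(sm_count):
    """Return total FP32, FP64 and Tensor core counts for a given SM count."""
    return {
        "fp32": sm_count * FP32_PER_SM,
        "fp64": sm_count * FP64_PER_SM,
        "tensor": sm_count * TENSOR_PER_SM,
    }

full_sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC
full_ga100 = unit_totals(full_sms)
print(full_sms, full_ga100)
```

The same helper can be reused with any other SM count to see how a cut-down configuration scales.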

The NVIDIA A100 GPU is a technical design breakthrough fueled by five key innovations:

Other specifications for the NVIDIA Ampere GA100 GPU include a huge 6144-bit bus interface and up to 48 GB of HBM2e memory in six stacks placed around the GPU die. Each DRAM die has a capacity of 2 GB, so reaching 48 GB requires 4-hi stacks: each 4-hi stack holds 8 GB, and six stacks add up to 48 GB. The memory is stated to be running at pin speeds of over 2.0 Gbps, which would result in around 1.6 TB/s of bandwidth.

The NVIDIA Ampere GPU will come with several HBM memory configurations, but it maxes out at 48 GB unless NVIDIA wants to offer a 6-hi or 8-hi variant in the future, which would raise the memory capacity to 72 or even 96 GB. NVIDIA's Tesla V100S already doubled the HBM capacity of the Tesla V100, offering 32 GB vs 16 GB, so it's entirely possible NVIDIA could do the same with a future variant of the Tesla A100.
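The capacity and bandwidth figures can be checked with a few lines of Python. This is a rough sketch using only the numbers quoted above (six stacks, 2 GB per DRAM die, a 6144-bit bus, 2.0 Gbps per pin); the function names are ours.

```python
STACKS = 6
GB_PER_DIE = 2
BUS_WIDTH_BITS = 6144

def capacity_gb(dies_per_stack, stacks=STACKS, gb_per_die=GB_PER_DIE):
    """Total HBM capacity for a given stack height."""
    return stacks * dies_per_stack * gb_per_die

def bandwidth_gb_per_s(pin_speed_gbps, bus_width_bits=BUS_WIDTH_BITS):
    """Peak bandwidth: bits per second across the whole bus, divided by 8."""
    return bus_width_bits * pin_speed_gbps / 8

# Stack heights discussed in the article: 4-hi today, 6-hi or 8-hi as possible variants.
capacities = {height: capacity_gb(height) for height in (4, 6, 8)}
print(capacities)

# Exactly 2.0 Gbps per pin gives about 1.5 TB/s; pin speeds above that approach 1.6 TB/s.
print(bandwidth_gb_per_s(2.0), "GB/s")
```

At exactly 2.0 Gbps the full bus works out to 1536 GB/s, so the "over 2.0 Gbps" wording is what pushes the figure toward 1.6 TB/s.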

NVIDIA Ampere GA100 GPU Block Diagram:

NVIDIA Ampere GA100 GPU SM Block Diagram:

NVIDIA Ampere GH100 Compute

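| GPU | Kepler GK110 | Maxwell GM200 | Pascal GP100 | Volta GV100 | Ampere GA100 | Hopper GH100 |
| --- | --- | --- | --- | --- | --- | --- |
| Compute Capability | 3.5 | 5.3 | 6.0 | 7.0 | 8.0 | 9.0 |
| Threads / Warp | 32 | 32 | 32 | 32 | 32 | 32 |
| Max Warps / Multiprocessor | 64 | 64 | 64 | 64 | 64 | 64 |
| Max Threads / Multiprocessor | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 |
| Max Thread Blocks / Multiprocessor | 16 | 32 | 32 | 32 | 32 | 32 |
| Max 32-bit Registers / SM | 65536 | 65536 | 65536 | 65536 | 65536 | 65536 |
| Max Registers / Block | 65536 | 32768 | 65536 | 65536 | 65536 | 65536 |
| Max Registers / Thread | 255 | 255 | 255 | 255 | 255 | 255 |
| Max Thread Block Size | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 |
| CUDA Cores / SM | 192 | 128 | 64 | 64 | 64 | 128 |
| Shared Memory Size / SM Configurations (bytes) | 16K/32K/48K | 96K | 64K | 96K | 164K | 228K |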

The NVIDIA Tesla A100 Accelerator - Specs & Performance

With the specifications of the full-fat NVIDIA Ampere GA100 GPU covered, let's talk about the Tesla A100 accelerator itself. The Tesla A100 makes use of a cut-down variant of the Ampere GA100 GPU that offers 108 SMs featuring 6912 FP32 cores, 3456 FP64 cores, and 432 Tensor cores. The card comes with a 5120-bit bus interface and a maximum VRAM capacity of 40 GB of HBM2. This is interesting because 40 GB of HBM2 would suggest either a 5-hi stack design, which seems unlikely, or a 6-hi stack with a defective DRAM chip on each stack. In the case of the former, a spacer would be introduced on the GA100 HBM stack to fill up its space.
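The cut-down core counts follow directly from the per-SM figures of the full chip, and the bus width offers another possible reading of the 40 GB capacity. The Python sketch below works through both. It assumes each HBM2 stack keeps the 1024-bit interface implied by the full chip's 6144-bit bus across six stacks; that is our inference from the numbers, not something confirmed here.

```python
A100_SMS = 108
FP32_PER_SM, FP64_PER_SM, TENSOR_PER_SM = 64, 32, 4  # derived from the full GA100 totals

a100_units = {
    "fp32": A100_SMS * FP32_PER_SM,
    "fp64": A100_SMS * FP64_PER_SM,
    "tensor": A100_SMS * TENSOR_PER_SM,
}
print(a100_units)

# Assumption: 6144 bits over six stacks means 1024 bits per stack.
BITS_PER_STACK = 6144 // 6
A100_BUS_BITS = 5120
A100_CAPACITY_GB = 40

active_stacks = A100_BUS_BITS // BITS_PER_STACK
gb_per_stack = A100_CAPACITY_GB / active_stacks
print(active_stacks, "active stacks,", gb_per_stack, "GB each")
```

Under that assumption, a 5120-bit bus corresponds to five active stacks of 8 GB each, which matches the 8 GB per 4-hi stack figure given for the full chip.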

The NVIDIA Ampere Tesla A100 features a 400W TDP, which is 100W more than the Tesla V100 Mezzanine unit. The PCIe variant comes with a 300W TDP but runs at lower clock speeds. The Mezzanine board has a GPU-to-GPU connection through the new NVLINK switches, which enables up to 600 GB/s of GPU-to-GPU interconnect bandwidth and a 4.8 TB/s bi-directional channel. The PCIe variant has a Mellanox switch on board along with two next-gen NVLINK connections and two EDR ports.

| Comparison | V100 | A100 | A100 with Sparsity | A100 Speedup | A100 Speedup with Sparsity |
| --- | --- | --- | --- | --- | --- |
| A100 FP16 vs. V100 FP16 | 31.4 TFLOPS | 78 TFLOPS | N/A | 2.5x | N/A |
| A100 FP16 TC vs. V100 FP16 TC | 125 TFLOPS | 312 TFLOPS | 624 TFLOPS | 2.5x | 5x |
| A100 BF16 TC vs. V100 FP16 TC | 125 TFLOPS | 312 TFLOPS | 624 TFLOPS | 2.5x | 5x |
| A100 FP32 vs. V100 FP32 | 15.7 TFLOPS | 19.5 TFLOPS | N/A | 1.25x | N/A |
| A100 TF32 TC vs. V100 FP32 | 15.7 TFLOPS | 156 TFLOPS | 312 TFLOPS | 10x | 20x |
| A100 FP64 vs. V100 FP64 | 7.8 TFLOPS | 9.7 TFLOPS | N/A | 1.25x | N/A |
| A100 FP64 TC vs. V100 FP64 | 7.8 TFLOPS | 19.5 TFLOPS | N/A | 2.5x | N/A |
| A100 INT8 TC vs. V100 INT8 | 62 TOPS | 624 TOPS | 1248 TOPS | 10x | 20x |
| A100 INT4 TC | N/A | 1248 TOPS | 2496 TOPS | N/A | N/A |
| A100 Binary TC | N/A | 4992 TOPS | N/A | N/A | N/A |
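The speedup columns can be re-derived from the raw throughput figures in the table. The short Python sketch below does that for a handful of rows; the row labels and the `speedup` helper are ours, and the quoted multipliers are rounded, so the computed ratios come out slightly off the round numbers.

```python
# (label, V100 throughput, A100 throughput, A100 throughput with sparsity or None)
rows = [
    ("FP16 TC", 125, 312, 624),
    ("TF32 TC vs FP32", 15.7, 156, 312),
    ("FP32", 15.7, 19.5, None),
    ("FP64 TC vs FP64", 7.8, 19.5, None),
    ("INT8 TC", 62, 624, 1248),
]

def speedup(v100, a100):
    """Ratio of A100 throughput to V100 throughput."""
    return a100 / v100

for label, v100, dense, sparse in rows:
    line = f"{label}: {speedup(v100, dense):.2f}x"
    if sparse is not None:
        line += f", {speedup(v100, sparse):.2f}x with sparsity"
    print(line)
```

The headline 20x figures come from the TF32 and INT8 Tensor Core rows with sparsity enabled, while standard FP32 and FP64 rates improve by roughly 1.25x.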

In terms of performance, the NVIDIA Ampere GA100 GPU delivers over 1 Peta-OPs (1248 TOPS of INT8 with sparsity), which is a 20x increase over the Volta GV100 GPU. The double-precision performance is rated at 2.5x that of NVIDIA's Volta GV100 GPU, which works out to around 19.5 TFLOPs FP64 on the Tensor Cores since Volta had around 8 TFLOPs of FP64 compute power. Single-precision performance is rated at 19.5 TFLOPs at the standard FP32 rate and up to 156 TFLOPs using TF32 on the Tensor Cores, which would be mind-blowing for the HPC segment.
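The standard-rate figures can be roughly reproduced from the core counts and clock speed. The Python sketch below assumes the 1410 MHz boost clock listed for the A100 in the comparison table in this article, and assumes each core retires one fused multiply-add (two floating-point operations) per clock; both are labeled assumptions, not an official formula.

```python
A100_FP32_CORES = 6912
A100_FP64_CORES = 3456
BOOST_CLOCK_GHZ = 1.410          # assumption: boost clock from the comparison table
FLOPS_PER_CORE_PER_CLOCK = 2     # assumption: one fused multiply-add per clock

def peak_tflops(cores, clock_ghz, flops_per_clock=FLOPS_PER_CORE_PER_CLOCK):
    """Peak throughput in TFLOPs: cores x operations per clock x clock rate."""
    return cores * flops_per_clock * clock_ghz / 1000.0

fp32_tflops = peak_tflops(A100_FP32_CORES, BOOST_CLOCK_GHZ)
fp64_tflops = peak_tflops(A100_FP64_CORES, BOOST_CLOCK_GHZ)
print(f"FP32: {fp32_tflops:.1f} TFLOPs, FP64: {fp64_tflops:.1f} TFLOPs")
```

Under those assumptions the results line up with the 19.5 TFLOPs FP32 and 9.7 TFLOPs FP64 standard rates quoted for the A100; the higher Tensor Core figures are not covered by this simple model.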

NVIDIA HPC / AI GPUs

| NVIDIA Tesla Graphics Card | NVIDIA B200 | NVIDIA H200 (SXM5) | NVIDIA H100 (SXM5) | NVIDIA H100 (PCIe) | NVIDIA A100 (SXM4) | NVIDIA A100 (PCIe4) | Tesla V100S (PCIe) | Tesla V100 (SXM2) | Tesla P100 (SXM2) | Tesla P100 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla K40 (PCI-Express) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPU | B200 | H200 (Hopper) | H100 (Hopper) | H100 (Hopper) | A100 (Ampere) | A100 (Ampere) | GV100 (Volta) | GV100 (Volta) | GP100 (Pascal) | GP100 (Pascal) | GM200 (Maxwell) | GK110 (Kepler) |
| Process Node | 4nm | 4nm | 4nm | 4nm | 7nm | 7nm | 12nm | 12nm | 16nm | 16nm | 28nm | 28nm |
| Transistors | 208 Billion | 80 Billion | 80 Billion | 80 Billion | 54.2 Billion | 54.2 Billion | 21.1 Billion | 21.1 Billion | 15.3 Billion | 15.3 Billion | 8 Billion | 7.1 Billion |
| GPU Die Size | TBD | 814 mm2 | 814 mm2 | 814 mm2 | 826 mm2 | 826 mm2 | 815 mm2 | 815 mm2 | 610 mm2 | 610 mm2 | 601 mm2 | 551 mm2 |
| SMs | 160 | 132 | 132 | 114 | 108 | 108 | 80 | 80 | 56 | 56 | 24 | 15 |
| TPCs | 80 | 66 | 66 | 57 | 54 | 54 | 40 | 40 | 28 | 28 | 24 | 15 |
| L2 Cache Size | TBD | 51200 KB | 51200 KB | 51200 KB | 40960 KB | 40960 KB | 6144 KB | 6144 KB | 4096 KB | 4096 KB | 3072 KB | 1536 KB |
| FP32 CUDA Cores / SM | TBD | 128 | 128 | 128 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 192 |
| FP64 CUDA Cores / SM | TBD | 128 | 128 | 128 | 32 | 32 | 32 | 32 | 32 | 32 | 4 | 64 |
| FP32 CUDA Cores | TBD | 16896 | 16896 | 14592 | 6912 | 6912 | 5120 | 5120 | 3584 | 3584 | 3072 | 2880 |
| FP64 CUDA Cores | TBD | 16896 | 16896 | 14592 | 3456 | 3456 | 2560 | 2560 | 1792 | 1792 | 96 | 960 |
| Tensor Cores | TBD | 528 | 528 | 456 | 432 | 432 | 640 | 640 | N/A | N/A | N/A | N/A |
| Texture Units | TBD | 528 | 528 | 456 | 432 | 432 | 320 | 320 | 224 | 224 | 192 | 240 |
| Boost Clock | TBD | ~1850 MHz | ~1850 MHz | ~1650 MHz | 1410 MHz | 1410 MHz | 1601 MHz | 1530 MHz | 1480 MHz | 1329 MHz | 1114 MHz | 875 MHz |
| TOPs (DNN/AI) | 20,000 TOPs | 3958 TOPs | 3958 TOPs | 3200 TOPs | 2496 TOPs | 2496 TOPs | 130 TOPs | 125 TOPs | N/A | N/A | N/A | N/A |
| FP16 Compute | 10,000 TFLOPs | 1979 TFLOPs | 1979 TFLOPs | 1600 TFLOPs | 624 TFLOPs | 624 TFLOPs | 32.8 TFLOPs | 30.4 TFLOPs | 21.2 TFLOPs | 18.7 TFLOPs | N/A | N/A |
| FP32 Compute | 90 TFLOPs | 67 TFLOPs | 67 TFLOPs | 800 TFLOPs | 156 TFLOPs (19.5 TFLOPs standard) | 156 TFLOPs (19.5 TFLOPs standard) | 16.4 TFLOPs | 15.7 TFLOPs | 10.6 TFLOPs | 10.0 TFLOPs | 6.8 TFLOPs | 5.04 TFLOPs |
| FP64 Compute | 45 TFLOPs | 34 TFLOPs | 34 TFLOPs | 48 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 19.5 TFLOPs (9.7 TFLOPs standard) | 8.2 TFLOPs | 7.80 TFLOPs | 5.30 TFLOPs | 4.7 TFLOPs | 0.2 TFLOPs | 1.68 TFLOPs |
| Memory Interface | 8192-bit HBM4 | 5120-bit HBM3e | 5120-bit HBM3 | 5120-bit HBM2e | 6144-bit HBM2e | 6144-bit HBM2e | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 384-bit GDDR5 | 384-bit GDDR5 |
| Memory Size | Up To 192 GB HBM3 @ 8.0 Gbps | Up To 141 GB HBM3e @ 6.5 Gbps | Up To 80 GB HBM3 @ 5.2 Gbps | Up To 94 GB HBM2e @ 5.1 Gbps | Up To 40 GB HBM2 @ 1.6 TB/s; Up To 80 GB HBM2 @ 1.6 TB/s | Up To 40 GB HBM2 @ 1.6 TB/s; Up To 80 GB HBM2 @ 2.0 TB/s | 16 GB HBM2 @ 1134 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s; 12 GB HBM2 @ 549 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB GDDR5 @ 288 GB/s |
| TDP | 700W | 700W | 700W | 350W | 400W | 250W | 250W | 300W | 300W | 250W | 250W | 235W |

The NVIDIA Ampere GA100 GPU Hardware

NVIDIA's Tesla A100, which is based on the Ampere GA100 GPU, will be powering both the company's DGX and HGX systems. The DGX systems focus purely on AI research and HPC workloads, while the HGX systems are aimed at cloud computing and datacenter environments. The systems being introduced by NVIDIA include the 3rd Generation DGX-A100 and the HGX-A100.

NVIDIA's partners have already announced their new 1U, 2U, 4U, and up to 10U GPU servers. Each server is outfitted with up to 8 NVIDIA Ampere GA100 based Tesla A100 boards, making use of PCIe Gen 4.0 x16 links. An HGX A100 4-GPU board is also available for improved performance while keeping costs in a more affordable range.

Finally, NVIDIA will be announcing its next-generation DGX-A100 system which Jensen Huang teased a few days ago. The DGX-A100 will deliver 5 Petaflops of peak performance with eight NVIDIA Ampere based Tesla A100 GPUs.
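The 5 Petaflops figure lines up with the per-GPU numbers in the performance table above. The short Python sketch below assumes the headline uses the A100's FP16 Tensor Core rate with sparsity (624 TFLOPS); that pairing is our assumption, since the exact precision behind the figure is not stated here.

```python
GPUS_PER_DGX_A100 = 8
# Assumption: the headline figure is based on FP16 Tensor Core throughput with sparsity.
A100_FP16_TC_SPARSE_TFLOPS = 624

def system_pflops(gpu_count, tflops_per_gpu):
    """Aggregate peak throughput of a multi-GPU system in PFLOPS."""
    return gpu_count * tflops_per_gpu / 1000.0

dgx_a100_pflops = system_pflops(GPUS_PER_DGX_A100, A100_FP16_TC_SPARSE_TFLOPS)
print(f"DGX-A100 peak: {dgx_a100_pflops:.2f} PFLOPS")
```

Eight GPUs at that rate come to just under 5 PFLOPS, which is consistent with the rounded number quoted for the system.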

The system itself is 20x faster than the previous DGX based on NVIDIA's Volta GPU architecture. The reference cluster design features 140 DGX-A100 systems with a 200 Gbps Mellanox InfiniBand interconnect. The NVIDIA Ampere powered DGX-A100 system starts at $199,000 and is shipping as of today.

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
