NVIDIA has officially lifted the curtain on its most powerful GPU to date, the 7nm Ampere GPU. The first product to feature the new Ampere architecture is a GPU called GA100, and this chip is currently the largest GPU produced on TSMC's bleeding-edge 7nm process node. Today, we will be taking a deep dive into the Ampere GA100 GPU architecture, its specifications, and the first products that will feature it.

NVIDIA's Ampere GA100 GPU Official - World's Biggest 7nm GPU With Insane Specs

The Ampere GA100 GPU is by far the largest 7nm GPU ever designed. The GPU is designed entirely for the HPC market, with applications such as scientific research, artificial intelligence, deep neural networks, and AI inferencing. There are a lot of specifications and a lot of products to talk about, so let's start.

NVIDIA Ampere GA100 GPU Powered Tesla A100: World's Largest 7nm GPU, 54 Billion Transistors, 1 Petaflops Compute & Up To 96 GB HBM2 Memory

First of all, the NVIDIA Ampere GA100 GPU will be available in various form factors, ranging from a single Mezzanine module to full-length PCIe 4.0 cards. The GPU also comes in various configurations, but the one NVIDIA is highlighting today is the Tesla A100, which is used in the DGX A100 and HGX A100 systems.

The Ampere GA100 GPU Architecture & Specifications

When it comes to core specifications, the Ampere GA100 GPU from NVIDIA is a complete monster. It measures in at a massive 826 mm2, which is even bigger than the Volta GV100 GPU at 815 mm2. The GPU also features more than twice the number of transistors, at 54 billion versus 21.1 billion on its predecessor, which is very impressive. Given the die size and the transistor count, the Ampere GA100 GPU is easily the densest GPU ever built.
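The density claim can be checked directly from the die sizes and transistor counts quoted above. The following is a minimal sketch; the dictionary and function names are made up for illustration, and only the four input figures come from the article.

```python
# Figures quoted in the article: transistor count and die area.
GA100 = {"transistors": 54e9, "die_mm2": 826}
GV100 = {"transistors": 21.1e9, "die_mm2": 815}


def density_mtr_per_mm2(chip):
    """Return transistor density in millions of transistors per mm^2."""
    return chip["transistors"] / chip["die_mm2"] / 1e6


ga100_density = density_mtr_per_mm2(GA100)
gv100_density = density_mtr_per_mm2(GV100)

print(f"GA100: {ga100_density:.1f} MTr/mm^2")
print(f"GV100: {gv100_density:.1f} MTr/mm^2")
print(f"Ratio: {ga100_density / gv100_density:.2f}x")
```

On these numbers GA100 packs roughly 65 million transistors per mm2 against roughly 26 million for GV100, so the die grew by about 1% while the density grew by about 2.5x.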

The Tesla A100 features cut-down specifications due to early 7nm yields, which are still great considering the size of this 'SUPER GPU', but the full-fat Ampere GA100 GPU is what we're going to look at first.

Featuring 128 SMs with 8192 CUDA cores, the Ampere GA100 also houses the largest single GPU core count we've ever seen. It comes with 8192 FP32 cores, 4096 FP64 cores, and 512 tensor cores. There are 8 Graphics Processing Clusters on the GPU, each with 16 SM units and 8 TPCs. The GA100 GPU has a TDP of 400W for its Tesla A100 variant.
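The layout described above can be cross-checked with a few divisions. This sketch only uses the totals and per-GPC counts quoted in the paragraph; the per-SM figures it prints are derived from them rather than stated by NVIDIA in this article, and the variable names are illustrative.

```python
# Full GA100 layout as quoted in the article.
GPCS = 8
SMS_PER_GPC = 16
TPCS_PER_GPC = 8
TOTALS = {"fp32": 8192, "fp64": 4096, "tensor": 512}

# Total SM count follows from the GPC layout.
total_sms = GPCS * SMS_PER_GPC

# Per-SM unit counts, derived by dividing each total by the SM count.
per_sm = {unit: count // total_sms for unit, count in TOTALS.items()}

# Each TPC groups this many SMs.
sms_per_tpc = SMS_PER_GPC // TPCS_PER_GPC

print(f"Total SMs: {total_sms}")
print(f"Per SM: {per_sm}")
print(f"SMs per TPC: {sms_per_tpc}")
```

The 8 x 16 layout reproduces the 128 SMs quoted, and the totals work out to 64 FP32 cores, 32 FP64 cores, and 4 tensor cores per SM.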

Other specifications include a huge 6144-bit bus interface with up to 48 GB of HBM2e memory in six stacks placed around the GPU die. Each DRAM die in a stack holds 2 GB, so reaching 48 GB requires 4-hi stacks: each 4-hi stack provides 8 GB, and six stacks add up to 48 GB. The memory is stated to be running at 3.2 Gbps pin speeds, which would result in around 2.5 TB/s of bandwidth.
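The bandwidth figure follows from the bus width and the per-pin data rate. Here is a small sketch of that arithmetic; the function name is hypothetical, and it computes a theoretical peak only, ignoring any real-world overhead.

```python
def peak_bandwidth_gbytes(bus_width_bits, pin_speed_gbps):
    """Theoretical peak memory bandwidth in GB/s.

    Every pin moves pin_speed_gbps gigabits per second, so the whole bus
    moves bus_width_bits * pin_speed_gbps gigabits per second; dividing
    by 8 converts bits to bytes.
    """
    return bus_width_bits * pin_speed_gbps / 8


# Full GA100 configuration as quoted: 6144-bit bus at 3.2 Gbps per pin.
full_ga100 = peak_bandwidth_gbytes(6144, 3.2)
print(f"{full_ga100:.1f} GB/s (about {full_ga100 / 1000:.2f} TB/s)")
```

The result is about 2458 GB/s, which is where the roughly 2.5 figure comes from, expressed in terabytes per second.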


The GPU will come with several HBM memory configurations, but it maxes out at 48 GB unless NVIDIA wants to offer a 6-hi or 8-hi variant in the future, which would raise the memory capacity to 72 or even 96 GB. NVIDIA's Tesla V100S already doubled the HBM capacity of the Tesla V100, offering 32 GB vs 16 GB, so it's entirely possible NVIDIA could do the same with a future variant of the Tesla A100.
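The capacity options above all come from the same multiplication: stacks, times dies per stack, times capacity per die. A minimal sketch, assuming the six stacks and 2 GB per die quoted earlier (the function and constant names are illustrative):

```python
STACKS = 6       # HBM stacks around the full GA100 die, per the article
GB_PER_DIE = 2   # capacity of each DRAM die in a stack, per the article


def capacity_gb(stack_height, stacks=STACKS, gb_per_die=GB_PER_DIE):
    """Total memory capacity in GB for a given stack height (dies per stack)."""
    return stacks * stack_height * gb_per_die


# Compare the 4-hi, 6-hi and 8-hi options discussed in the text.
capacities = {height: capacity_gb(height) for height in (4, 6, 8)}
for height, total in capacities.items():
    print(f"{height}-hi stacks: {total} GB")
```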

The Tesla A100 Accelerator - Specs & Performance

With the specifications of the full-fat GA100 GPU covered, let's talk about the Tesla A100 accelerator itself. The Tesla A100 makes use of a cut-down variant of the Ampere GA100 GPU that offers 108 SMs featuring 6912 FP32 cores, 3456 FP64 cores, and 432 Tensor cores. The card comes with a 5120-bit bus interface with a maximum VRAM capacity of 40 GB HBM2. This is interesting because 40 GB of HBM2 would suggest either a 5-hi stack design, which seems unlikely, or a 6-hi stack with a defective DRAM die on each stack. In the former case, a spacer would be added to the GA100 HBM stack to fill up its space.
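The cut-down figures can be sanity-checked against the full chip. The sketch below makes two assumptions that the article does not state: the per-SM unit counts are inferred from the full GA100 totals quoted earlier (8192, 4096 and 512 units over 128 SMs), and each HBM2 stack is taken to have the standard 1024-bit interface. Under those assumptions, one reading that fits both the 5120-bit bus and the 40 GB capacity is five active stacks of 8 GB each, though nothing here confirms that this is how the card is built.

```python
# Per-SM unit counts inferred from the full GA100 totals (assumption).
UNITS_PER_SM = {"fp32": 64, "fp64": 32, "tensor": 4}

FULL_SMS = 128   # full GA100, per the article
A100_SMS = 108   # Tesla A100, per the article

# Unit totals for the cut-down chip.
a100_units = {unit: n * A100_SMS for unit, n in UNITS_PER_SM.items()}
enabled_fraction = A100_SMS / FULL_SMS

# Memory side: standard HBM2 stack interface width (assumption, not from
# the article) and the bus width and capacity quoted for the Tesla A100.
BITS_PER_STACK = 1024
A100_BUS_BITS = 5120
A100_CAPACITY_GB = 40

active_stacks = A100_BUS_BITS // BITS_PER_STACK
gb_per_stack = A100_CAPACITY_GB / active_stacks

print(f"A100 units: {a100_units}")
print(f"SMs enabled: {enabled_fraction:.1%}")
print(f"Active stacks: {active_stacks}, {gb_per_stack:.0f} GB each")
```

The unit totals match the 6912, 3456 and 432 figures quoted, and 8 GB per stack is the same 4-hi, 2 GB per die arrangement described for the full chip.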

The card features a 400W TDP, which is 100W more than the Tesla V100 Mezzanine unit. The PCIe variant comes with a 300W TDP but runs at lower clock speeds. The Mezzanine board connects GPU to GPU through the new NVLINK switches, which enable up to 600 GB/s of GPU-to-GPU interconnect bandwidth and a 4.8 TB/s bi-directional channel. The PCIe variant has a Mellanox switch on board along with two next-gen NVLINK connections and two EDR ports.
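The per-GPU and aggregate interconnect figures quoted above differ by a factor of eight. A quick check, assuming an eight-GPU board in which every GPU contributes the full per-GPU figure (the article does not say how the aggregate number was derived, so this is only one plausible reading):

```python
PER_GPU_LINK = 600   # per-GPU interconnect figure quoted in the article
GPU_COUNT = 8        # assumption: eight GPUs on the board

# Aggregate figure, in the same unit as the per-GPU figure.
aggregate = PER_GPU_LINK * GPU_COUNT

print(f"Aggregate: {aggregate} (i.e. {aggregate / 1000:.1f} thousand)")
```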

In terms of performance, the Ampere GA100 GPU delivers 1 Peta-OPs, which NVIDIA rates as a 20x increase over the Volta GV100 GPU. Double-precision performance is rated at 2.5x that of the Volta GV100 GPU, which should end up somewhere around 20 TFLOPs FP64, since Volta offers around 8 TFLOPs of FP64 compute. This would imply single-precision (FP32) performance of over 40 TFLOPs, which would be mind-blowing for the HPC segment.

NVIDIA Ampere GA100 GPU Based Tesla A100 Specs:

| NVIDIA Tesla Graphics Card | Tesla K40 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla P100 (PCI-Express) | Tesla P100 (SXM2) | Tesla V100 (SXM2) | Tesla V100S (PCIe) | Tesla A100 (SXM3) |
|---|---|---|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) | GA100 (Ampere) |
| Process Node | 28nm | 28nm | 16nm | 16nm | 12nm | 12nm | 7nm |
| Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion | 54 Billion |
| GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 815 mm2 | 815 mm2 | 826 mm2 |
| SMs | 15 | 24 | 56 | 56 | 80 | 80 | 108 |
| TPCs | 15 | 24 | 28 | 28 | 40 | 40 | TBD |
| CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | TBD |
| CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 5120 | 5120 | 6912 |
| Texture Units | 240 | 192 | 224 | 224 | 320 | 320 | TBD |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | TBD |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 2560 | 2560 | 3456 |
| Base Clock | 745 MHz | 948 MHz | 1190 MHz | 1328 MHz | 1297 MHz | TBD | TBD |
| Boost Clock | 875 MHz | 1114 MHz | 1329 MHz | 1480 MHz | 1530 MHz | 1601 MHz | TBD |
| FP16 Compute | N/A | N/A | 18.7 TFLOPs | 21.2 TFLOPs | 30.4 TFLOPs | 32.8 TFLOPs | 624 TOPs (INT8), 1248 TOPs (INT4) |
| FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 15.7 TFLOPs | 16.4 TFLOPs | 156 TFLOPs (19.5 TFLOPs standard) |
| FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.80 TFLOPs | 8.2 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) |
| TOPs (DNN/AI) | N/A | N/A | N/A | N/A | 125 TOPs | 130 TOPs | >1000 TOPs |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 6144-bit HBM2e |
| Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 16 GB HBM2 @ 732 GB/s, 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 1134 GB/s | 40 GB HBM2 @ 1.6 TB/s |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB | TBD |
| TDP | 235W | 250W | 250W | 300W | 300W | 250W | 400W |
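The FP32 figures in the spec comparison can be related to core counts and boost clocks. The sketch below assumes the usual convention that each CUDA core retires one fused multiply-add, counted as 2 FLOPs, per clock; that convention is not stated in the article, and the function names are illustrative. It recomputes the quoted FP32 numbers for several older cards and then works backwards from the A100's 19.5 TFLOPs standard FP32 figure to the boost clock it would imply, since that clock is still listed as TBD.

```python
def peak_tflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Theoretical peak throughput in TFLOPs (assumes 1 FMA = 2 FLOPs/clock)."""
    return cores * flops_per_core_per_clock * clock_mhz * 1e6 / 1e12


def implied_clock_mhz(tflops, cores, flops_per_core_per_clock=2):
    """Clock in MHz that a quoted peak TFLOPs figure would imply."""
    return tflops * 1e12 / (cores * flops_per_core_per_clock) / 1e6


# (CUDA cores, boost clock in MHz, quoted FP32 TFLOPs) from the spec comparison.
ROWS = {
    "Tesla K40": (2880, 875, 5.04),
    "Tesla M40": (3072, 1114, 6.8),
    "Tesla P100 (SXM2)": (3584, 1480, 10.6),
    "Tesla V100 (SXM2)": (5120, 1530, 15.7),
    "Tesla V100S": (5120, 1601, 16.4),
}

for name, (cores, boost_mhz, quoted) in ROWS.items():
    computed = peak_tflops(cores, boost_mhz)
    print(f"{name}: computed {computed:.2f} TFLOPs, quoted {quoted}")

# Boost clock implied by the A100's quoted standard FP32 figure.
a100_clock = implied_clock_mhz(19.5, 6912)
print(f"Tesla A100 implied boost clock: {a100_clock:.0f} MHz")
```

Under this assumption the computed values land within rounding of the quoted ones, and the A100 figure corresponds to a boost clock of roughly 1.41 GHz.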

The Ampere GA100 GPU Hardware

NVIDIA's Tesla A100, which is based on the Ampere GA100 GPU, will be powering both the company's DGX and HGX systems. The DGX systems focus purely on AI research and HPC workloads, while the HGX systems are aimed at cloud computing and datacenter environments. The systems being introduced by NVIDIA include the 3rd Generation DGX-A100 and the HGX-A100.

NVIDIA's partners have already announced their new 1U, 2U, 4U, and up to 10U GPU servers. Each server is outfitted with up to 8 NVIDIA Tesla A100 boards, making use of PCIe Gen 4.0 x16 links. An HGX A100 4-GPU board is also available for improved performance while keeping costs in a more affordable range.