NVIDIA A100 Ampere GPU Launched in PCIe Form Factor, 20 Times Faster Than Volta at 250W & 40 GB HBM2 Memory

Jun 22, 2020 at 08:58am EDT

NVIDIA has added a third variant to its growing Ampere A100 GPU family, the A100 PCIe which is PCIe 4.0 compliant and comes in the standard full-length, full height form factor compared to the mezzanine board we got to see earlier.

NVIDIA's A100 Ampere GPU Gets PCIe 4.0 Ready Form Factor - Same GPU Configuration But at 250W, Up To 90% Performance of the Full 400W A100 GPU

Just like the Pascal P100 and Volta V100 before it, the Ampere A100 GPU was bound to get a PCIe variant sooner or later. Now NVIDIA has announced that its A100 PCIe GPU accelerator is available for a diverse set of use cases with system ranging from a single A100 PCIe GPU to servers utilizing two cards at the same time through the 12 NVLINK channels that deliver 600 GB/s of interconnect bandwidth.

Related Story AMD Believes Unified Memory Architectures Open Up a “World of Possibilities”, Will Shape Their Product Choices & Roadmaps In Future

In terms of specifications, the A100 PCIe GPU accelerator doesn't change much in terms of core configuration. The GA100 GPU retains the specifications we got to see on the 400W variant with 6912 CUDA cores arranged in 108 SM units, 432 Tensor Cores and 40 GB of HBM2 memory that delivers the same memory bandwidth of 1.55 TB/s (rounded off to 1.6 TB/s). The main difference can be seen in the TDP which is rated at 250W for the PCIe variant whereas the standard variant comes with a 400W TDP.

Now we can guess that the card would feature lower clocks to compensate for the less TDP input but NVIDIA has provided the peak compute numbers and those remain unaffected for the PCIe variant. The FP64 performance is still rated at 9.7/19.5 TFLOPs, FP32 performance is rated at 19.5 /156/312 TFLOPs (Sparsity), FP16 performance is rated at 312/624 TFLOPs (Sparsity) & INT8 is rated at 624/1248 TOPs (Sparsity).

According to NVIDIA, the A100 PCIe accelerator can deliver 90% the performance of the A100 HGX card (400W) in top server applications. This is mainly due to the less time it takes for the card to achieve the said tasks however, in complex situations which required sustained GPU capabilities, the GPU can deliver anywhere from up to 90% to down to 50% the performance of the 400W GPU in the most extreme cases. NVIDIA told that the 50% drop will be very rare and only a few tasks can push the card to such extend.

NVIDIA HPC / AI GPUs

NVIDIA Tesla Graphics CardNVIDIA B200NVIDIA H200 (SXM5)NVIDIA H100 (SMX5)NVIDIA H100 (PCIe)NVIDIA A100 (SXM4)NVIDIA A100 (PCIe4)Tesla V100S (PCIe)Tesla V100 (SXM2)Tesla P100 (SXM2)Tesla P100
(PCI-Express)
Tesla M40
(PCI-Express)
Tesla K40
(PCI-Express)
GPUB200H200 (Hopper)H100 (Hopper)H100 (Hopper)A100 (Ampere)A100 (Ampere)GV100 (Volta)GV100 (Volta)GP100 (Pascal)GP100 (Pascal)GM200 (Maxwell)GK110 (Kepler)
Process Node4nm4nm4nm4nm7nm7nm12nm12nm16nm16nm28nm28nm
Transistors208 Billion80 Billion80 Billion80 Billion54.2 Billion54.2 Billion21.1 Billion21.1 Billion15.3 Billion15.3 Billion8 Billion7.1 Billion
GPU Die SizeTBD814mm2814mm2814mm2826mm2826mm2815mm2815mm2610 mm2610 mm2601 mm2551 mm2
SMs160132132114108108808056562415
TPCs806666575454404028282415
L2 Cache SizeTBD51200 KB51200 KB51200 KB40960 KB40960 KB6144 KB6144 KB4096 KB4096 KB3072 KB1536 KB
FP32 CUDA Cores Per SMTBD128128128646464646464128192
FP64 CUDA Cores / SMTBD128128128323232323232464
FP32 CUDA CoresTBD16896168961459269126912512051203584358430722880
FP64 CUDA CoresTBD16896168961459234563456256025601792179296960
Tensor CoresTBD528528456432432640640N/AN/AN/AN/A
Texture UnitsTBD528528456432432320320224224192240
Boost ClockTBD~1850 MHz~1850 MHz~1650 MHz1410 MHz1410 MHz1601 MHz1530 MHz1480 MHz1329MHz1114 MHz875 MHz
TOPs (DNN/AI)20,000 TOPs3958 TOPs3958 TOPs3200 TOPs2496 TOPs2496 TOPs130 TOPs125 TOPsN/AN/AN/AN/A
FP16 Compute10,000 TFLOPs1979 TFLOPs1979 TFLOPs1600 TFLOPs624 TFLOPs624 TFLOPs32.8 TFLOPs30.4 TFLOPs21.2 TFLOPs18.7 TFLOPsN/AN/A
FP32 Compute90 TFLOPs67 TFLOPs67 TFLOPs800 TFLOPs156 TFLOPs
(19.5 TFLOPs standard)
156 TFLOPs
(19.5 TFLOPs standard)
16.4 TFLOPs15.7 TFLOPs10.6 TFLOPs10.0 TFLOPs6.8 TFLOPs5.04 TFLOPs
FP64 Compute45 TFLOPs34 TFLOPs34 TFLOPs48 TFLOPs19.5 TFLOPs
(9.7 TFLOPs standard)
19.5 TFLOPs
(9.7 TFLOPs standard)
8.2 TFLOPs7.80 TFLOPs5.30 TFLOPs4.7 TFLOPs0.2 TFLOPs1.68 TFLOPs
Memory Interface8192-bit HBM45120-bit HBM3e5120-bit HBM35120-bit HBM2e6144-bit HBM2e6144-bit HBM2e4096-bit HBM24096-bit HBM24096-bit HBM24096-bit HBM2384-bit GDDR5384-bit GDDR5
Memory SizeUp To 192 GB HBM3 @ 8.0 GbpsUp To 141 GB HBM3e @ 6.5 GbpsUp To 80 GB HBM3 @ 5.2 GbpsUp To 94 GB HBM2e @ 5.1 GbpsUp To 40 GB HBM2 @ 1.6 TB/s
Up To 80 GB HBM2 @ 1.6 TB/s
Up To 40 GB HBM2 @ 1.6 TB/s
Up To 80 GB HBM2 @ 2.0 TB/s
16 GB HBM2 @ 1134 GB/s16 GB HBM2 @ 900 GB/s16 GB HBM2 @ 732 GB/s16 GB HBM2 @ 732 GB/s
12 GB HBM2 @ 549 GB/s
24 GB GDDR5 @ 288 GB/s12 GB GDDR5 @ 288 GB/s
TDP700W700W700W350W400W250W250W300W300W250W250W235W

There's a wide scale adoption being made possible already by NVIDIA and its server partners for the said PCIe based GPU accelerator which include:

NVIDIA hasn't announced any release date or pricing for the card yet but considering the A100 (400W) Tensor Core GPU is already being shipped since its launch, the A100 (250W) PCIe will be following its footsteps soon.

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.

Follow Wccftech on Google to get more of our news coverage in your feeds.