NVIDIA Hopper H100 GPU Pictured In All Its Glory: The World’s Fastest 4nm GPU & World’s First With HBM3 Memory

Hassan Mujtaba
NVIDIA's flagship Datacenter GPU, the Hopper H100, has been pictured in all its glory. (Image Credits: CNET)
NVIDIA's flagship Datacenter GPU, the Hopper H100, has been pictured in all its glory. (Image Credits: CNET)

At GTC 2022, NVIDIA unveiled its Hopper H100 GPU, a compute powerhouse designed for the next generation of data centers. It's been a while since we talked about this mighty chip but it looks like NVIDIA gave a close-up of its flagship chip to select media.

NVIDIA Hopper H100 GPU: First With 4nm and HBM3 Technology Gets High-Res Pictures

CNET managed to get hold of not only the graphics board on which the H100 GPU is fused but also the H100 chip itself. The H100 GPU is a monster chip that comes packed with the latest 4nm tech and incorporates 80 Billion transistors along with the bleeding-edge HBM3 memory technology. As per the tech outlet, the H100 is built upon the PG520 PCB board which has over 30 power VRMs & a massive integral interposer that uses TSMC's CoWoS tech to combine the Hopper H100 GPU with a 6-stack HBM3 design.

Related Story RTX Spark To Encourage Industry Towards Lighter Laptops With Less Bulky Cooling Solutions, As Surface Laptop Ultra Targets A 110W TDP

NVIDIA Hopper H100 GPU Pictured (Image Credits: CNET):

20220429-nvidia-h100-hopper-ai-gpu-03-low_res-scale-4_00x-custom-custom
20220429-nvidia-h100-hopper-ai-gpu-04-low_res-scale-4_00x-custom-custom
20220429-nvidia-h100-hopper-ai-gpu-01-low_res-scale-4_00x-custom-custom-custom
20220429-nvidia-h100-hopper-ai-gpu-06-low_res-scale-4_00x-custom-2

Out of the six stacks, two stacks are kept to ensure yield integrity. But the new HBM3 standard allows for up to 80 GB capacities at 3 TB/s speeds which are crazy. For comparison, the current fastest gaming graphics card, the RTX 3090 Ti, offers just 1 TB/s of bandwidth and 24 GB VRAM capacities. Other than that, the H100 Hopper GPU also packs in the latest FP8 data format, and through its new SXM connection, it helps accommodate the 700W power design that the chip is designed around.

NVIDIA Hopper H100 GPU Specifications At A Glance

So coming to the specifications, the NVIDIA Hopper GH100 GPU is composed of a massive 144 SM (Streaming Multiprocessor) chip layout which is featured in a total of 8 GPCs. These GPCs rock total of 9 TPCs which are further composed of 2 SM units each. This gives us 18 SMs per GPC and 144 on the complete 8 GPC configuration. Each SM is composed of up to 128 FP32 units which should give us a total of 18,432 CUDA cores. Following are some of the configurations you can expect from the H100 chip:

The full implementation of the GH100 GPU includes the following units:

  • 8 GPCs, 72 TPCs (9 TPCs/GPC), 2 SMs/TPC, 144 SMs per full GPU
  • 128 FP32 CUDA Cores per SM, 18432 FP32 CUDA Cores per full GPU
  • 4 Fourth-Generation Tensor Cores per SM, 576 per full GPU
  • 6 HBM3 or HBM2e stacks, 12 512-bit Memory Controllers
  • 60 MB L2 Cache
  • Fourth-Generation NVLink and PCIe Gen 5

The NVIDIA H100 GPU with SXM5 board form-factor includes the following units:

  • 8 GPCs, 66 TPCs, 2 SMs/TPC, 132 SMs per GPU
  • 128 FP32 CUDA Cores per SM, 16896 FP32 CUDA Cores per GPU
  • 4 Fourth-generation Tensor Cores per SM, 528 per GPU
  • 80 GB HBM3, 5 HBM3 stacks, 10 512-bit Memory Controllers
  • 50 MB L2 Cache
  • Fourth-Generation NVLink and PCIe Gen 5

This is a 2.25x increase over the full GA100 GPU configuration. NVIDIA is also leveraging from more FP64, FP16 & Tensor cores within its Hopper GPU which would drive up performance immensely. And that's going to be a necessity to rival Intel's Ponte Vecchio which is also expected to feature 1:1 FP64.

The cache is another space where NVIDIA has given much attention, upping it to 48 MB in the Hopper GH100 GPU. This is a 20% increase over the 50 MB cache featured on the Ampere GA100 GPU and 3x the size of AMD's flagship Aldebaran MCM GPU, the MI250X.

Rounding up the performance figures, NVIDIA's GH100 Hopper GPU will offer 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32 and 60 TFLOPs of FP64 Compute performance. These record-shattering figures decimate all other HPC accelerators that came before it. For comparison, this is 3.3x faster than NVIDIA's own A100 GPU and 28% faster than AMD's Instinct MI250X in the FP64 compute. In FP16 compute, the H100 GPU is 3x faster than A100 and 5.2x faster than MI250X which is literally bonkers.

The PCIe variant which is a cut-down model was recently listed over in Japan for over $30,000 US so one can imagine that the SXM variant with a beefier configuration will easily cost around $50 grand.

NVIDIA HPC / AI GPUs

NVIDIA Tesla Graphics CardNVIDIA B200NVIDIA H200 (SXM5)NVIDIA H100 (SMX5)NVIDIA H100 (PCIe)NVIDIA A100 (SXM4)NVIDIA A100 (PCIe4)Tesla V100S (PCIe)Tesla V100 (SXM2)Tesla P100 (SXM2)Tesla P100
(PCI-Express)
Tesla M40
(PCI-Express)
Tesla K40
(PCI-Express)
GPUB200H200 (Hopper)H100 (Hopper)H100 (Hopper)A100 (Ampere)A100 (Ampere)GV100 (Volta)GV100 (Volta)GP100 (Pascal)GP100 (Pascal)GM200 (Maxwell)GK110 (Kepler)
Process Node4nm4nm4nm4nm7nm7nm12nm12nm16nm16nm28nm28nm
Transistors208 Billion80 Billion80 Billion80 Billion54.2 Billion54.2 Billion21.1 Billion21.1 Billion15.3 Billion15.3 Billion8 Billion7.1 Billion
GPU Die SizeTBD814mm2814mm2814mm2826mm2826mm2815mm2815mm2610 mm2610 mm2601 mm2551 mm2
SMs160132132114108108808056562415
TPCs806666575454404028282415
L2 Cache SizeTBD51200 KB51200 KB51200 KB40960 KB40960 KB6144 KB6144 KB4096 KB4096 KB3072 KB1536 KB
FP32 CUDA Cores Per SMTBD128128128646464646464128192
FP64 CUDA Cores / SMTBD128128128323232323232464
FP32 CUDA CoresTBD16896168961459269126912512051203584358430722880
FP64 CUDA CoresTBD16896168961459234563456256025601792179296960
Tensor CoresTBD528528456432432640640N/AN/AN/AN/A
Texture UnitsTBD528528456432432320320224224192240
Boost ClockTBD~1850 MHz~1850 MHz~1650 MHz1410 MHz1410 MHz1601 MHz1530 MHz1480 MHz1329MHz1114 MHz875 MHz
TOPs (DNN/AI)20,000 TOPs3958 TOPs3958 TOPs3200 TOPs2496 TOPs2496 TOPs130 TOPs125 TOPsN/AN/AN/AN/A
FP16 Compute10,000 TFLOPs1979 TFLOPs1979 TFLOPs1600 TFLOPs624 TFLOPs624 TFLOPs32.8 TFLOPs30.4 TFLOPs21.2 TFLOPs18.7 TFLOPsN/AN/A
FP32 Compute90 TFLOPs67 TFLOPs67 TFLOPs800 TFLOPs156 TFLOPs
(19.5 TFLOPs standard)
156 TFLOPs
(19.5 TFLOPs standard)
16.4 TFLOPs15.7 TFLOPs10.6 TFLOPs10.0 TFLOPs6.8 TFLOPs5.04 TFLOPs
FP64 Compute45 TFLOPs34 TFLOPs34 TFLOPs48 TFLOPs19.5 TFLOPs
(9.7 TFLOPs standard)
19.5 TFLOPs
(9.7 TFLOPs standard)
8.2 TFLOPs7.80 TFLOPs5.30 TFLOPs4.7 TFLOPs0.2 TFLOPs1.68 TFLOPs
Memory Interface8192-bit HBM45120-bit HBM3e5120-bit HBM35120-bit HBM2e6144-bit HBM2e6144-bit HBM2e4096-bit HBM24096-bit HBM24096-bit HBM24096-bit HBM2384-bit GDDR5384-bit GDDR5
Memory SizeUp To 192 GB HBM3 @ 8.0 GbpsUp To 141 GB HBM3e @ 6.5 GbpsUp To 80 GB HBM3 @ 5.2 GbpsUp To 94 GB HBM2e @ 5.1 GbpsUp To 40 GB HBM2 @ 1.6 TB/s
Up To 80 GB HBM2 @ 1.6 TB/s
Up To 40 GB HBM2 @ 1.6 TB/s
Up To 80 GB HBM2 @ 2.0 TB/s
16 GB HBM2 @ 1134 GB/s16 GB HBM2 @ 900 GB/s16 GB HBM2 @ 732 GB/s16 GB HBM2 @ 732 GB/s
12 GB HBM2 @ 549 GB/s
24 GB GDDR5 @ 288 GB/s12 GB GDDR5 @ 288 GB/s
TDP700W700W700W350W400W250W250W300W300W250W250W235W
Hassan Mujtaba Photo

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.

Follow Wccftech on Google to get more of our news coverage in your feeds.

Button