NVIDIA has officially unveiled its next-gen Blackwell GPU architecture which features up to a 5x performance increase versus Hopper H100 GPUs.
NVIDIA Blackwell GPUs Feature 5x Faster AI Performance Than Hopper H100, Leading The Charge of Next-Gen AI Computing
NVIDIA has gone official with the full details of its next-generation AI & Tensor Core GPU architecture codenamed Blackwell. As expected, Blackwell GPUs are NVIDIA's first MCM design, incorporating two GPU dies on the same package.
- World’s Most Powerful Chip — Packed with 208 billion transistors, Blackwell-architecture GPUs are manufactured using a custom-built 4NP TSMC process with two-reticle limit GPU dies connected by 10 TB/second chip-to-chip link into a single, unified GPU.
- Second-Generation Transformer Engine — Fueled by new micro-tensor scaling support and NVIDIA’s advanced dynamic range management algorithms integrated into NVIDIA TensorRT™-LLM and NeMo Megatron frameworks, Blackwell will support double the compute and model sizes with new 4-bit floating point AI inference capabilities.
- Fifth-Generation NVLink — To accelerate performance for multitrillion-parameter and mixture-of-experts AI models, the latest iteration of NVIDIA NVLink® delivers groundbreaking 1.8TB/s bidirectional throughput per GPU, ensuring seamless high-speed communication among up to 576 GPUs for the most complex LLMs.
- RAS Engine — Blackwell-powered GPUs include a dedicated engine for reliability, availability and serviceability. Additionally, the Blackwell architecture adds capabilities at the chip level to utilize AI-based preventative maintenance to run diagnostics and forecast reliability issues. This maximizes system uptime and improves resiliency for massive-scale AI deployments to run uninterrupted for weeks or even months at a time and to reduce operating costs.
- Secure AI — Advanced confidential computing capabilities protect AI models and customer data without compromising performance, with support for new native interface encryption protocols, which are critical for privacy-sensitive industries like healthcare and financial services.
- Decompression Engine — A dedicated decompression engine supports the latest formats, accelerating database queries to deliver the highest performance in data analytics and data science. In the coming years, data processing, on which companies spend tens of billions of dollars annually, will be increasingly GPU-accelerated.
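To give a rough feel for what "micro-tensor scaling" combined with 4-bit floating point could look like, here is a minimal sketch of block-scaled FP4 quantization. It is not NVIDIA's implementation: the E2M1 number layout, the block size and every function name below are assumptions made purely for illustration.

```python
# Illustrative sketch only: block-scaled 4-bit floating point quantization.
# Assumptions (not from the article): FP4 is modelled as E2M1, and each
# block of values shares one scale factor.

# Non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_MAGNITUDES[-1]


def quantize_block(values):
    """Return (scale, codes) where each code is a signed FP4 magnitude."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest FP4 value.
    scale = max_abs / FP4_MAX if max_abs > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def roundtrip(values, block_size):
    """Quantize in blocks of `block_size`, then dequantize back to floats."""
    restored = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        restored.extend(scale * c for c in codes)
    return restored


def max_error(values, block_size):
    """Largest absolute difference between the input and its round trip."""
    restored = roundtrip(values, block_size)
    return max(abs(a - b) for a, b in zip(values, restored))


if __name__ == "__main__":
    data = [0.1, -0.2, 0.3, 0.4, 10.0, -20.0, 40.0, 60.0]
    print("one scale for all values:", max_error(data, block_size=8))
    print("one scale per 4 values:  ", max_error(data, block_size=4))
```

With these sample values, a single shared scale rounds the four small numbers to zero, while a scale per block of four keeps them to within a few hundredths. That is the general reason finer-grained scaling helps a format with so few representable values.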
Diving into the details, the NVIDIA Blackwell GPU features a total of 104 Billion transistors on each compute die, which is fabricated on the TSMC 4NP process node. Interestingly, both Synopsys and TSMC have utilized NVIDIA's CuLitho technology for the production of Blackwell GPUs, which accelerates the manufacturing of these next-gen AI accelerator chips. The B100 GPUs are equipped with a 10 TB/s high-bandwidth interface which allows super-fast chip-to-chip interconnect. The two dies are unified as one GPU on the same package, offering up to 208 Billion transistors and full GPU cache coherency.
Compared to Hopper, the NVIDIA Blackwell GPU offers 128 Billion more transistors, 5x the AI performance (boosted to 20 petaFLOPS per chip), and 4x the on-die memory. The GPU itself is coupled with 8 HBM3e stacks, the world's fastest memory solution, offering 8 TB/s of memory bandwidth across an 8192-bit bus interface and up to 192 GB of HBM3e memory. To quickly sum up the performance figures versus Hopper, you are getting:
- 20 PFLOPS FP8 (2.5x Hopper)
- 20 PFLOPS FP6 (2.5x Hopper)
- 40 PFLOPS FP4 (5.0x Hopper)
- 740B Parameters (6.0x Hopper)
- 34T Parameters/sec (5.0x Hopper)
- 7.2 TB/s NVLINK (4.0x Hopper)
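Each figure in the list above comes with a multiplier, so the Hopper baseline it implies can be worked back by simple division. The short sketch below does just that. The dictionary and function names are made up for illustration, and the results are only what the quoted numbers imply, not official Hopper specifications.

```python
# Figures copied from the list above: (Blackwell value, unit, multiplier vs Hopper).
BLACKWELL_CLAIMS = {
    "FP8 compute": (20.0, "PFLOPS", 2.5),
    "FP6 compute": (20.0, "PFLOPS", 2.5),
    "FP4 compute": (40.0, "PFLOPS", 5.0),
    "Model size": (740.0, "B parameters", 6.0),
    "Parameter bandwidth": (34.0, "T parameters/sec", 5.0),
    "NVLINK": (7.2, "TB/s", 4.0),
}


def implied_hopper_baseline(value, multiplier):
    """Hopper figure implied by the claim 'value is multiplier x Hopper'."""
    return value / multiplier


def implied_baselines(claims):
    """Map each claim name to the Hopper figure it implies."""
    return {
        name: implied_hopper_baseline(value, multiplier)
        for name, (value, _unit, multiplier) in claims.items()
    }


if __name__ == "__main__":
    for name, (value, unit, multiplier) in BLACKWELL_CLAIMS.items():
        baseline = implied_hopper_baseline(value, multiplier)
        print(f"{name}: {value} {unit} / {multiplier} = {baseline:.1f} {unit}")
```

All three compute rows work back to the same 8 PFLOPS baseline, which suggests the FP6 and FP4 multipliers are measured against Hopper's FP8 rate rather than against a like-for-like precision.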
NVIDIA will be offering Blackwell GPUs as a full-on platform, combining two of these GPUs (four compute dies in total) with a single Grace CPU (72 ARM Neoverse V2 CPU cores). The GPUs will be interconnected with each other and with the Grace CPU using a 900 GB/s NVLINK protocol.
NVIDIA Blackwell B200 GPUs For 2024 - 192 GB HBM3e
First up, we have the NVIDIA Blackwell B200 GPU. This is the first of the two Blackwell chips that will be adopted into various designs ranging from SXM modules, PCIe AICs & Superchip platforms. The B200 GPU will be the first NVIDIA GPU to utilize a chiplet design, featuring two compute dies based on the TSMC 4nm process node.
MCM or Multi-Chip-Module has been a long time coming on the NVIDIA side of things & it's finally here as the company tries to tackle challenges associated with next-gen process nodes such as yields and cost. Chiplets provide a viable alternative where NVIDIA can still achieve faster gen-over-gen performance without compromising its supply or costs, and this is just a stepping stone in its chiplet journey.
The NVIDIA Blackwell B200 GPU will be a monster chip. It incorporates a total of 160 SMs for 20,480 cores. The GPU will feature the latest NVLINK interconnect technology, supporting the same 8 GPU architecture and a 400 GbE networking switch. It's also going to be very power-hungry with a 700W peak TDP though that's also the same as the H100 and H200 chips. Summing this chip up:
- TSMC 4NP Process Node
- Multi-Chip-Package GPU
- 1-GPU 104 Billion Transistors
- 2-GPU 208 Billion Transistors
- 160 SMs (20,480 Cores)
- 8 HBM Packages
- 192 GB HBM3e Memory
- 8 TB/s Memory Bandwidth
- 8192-bit Memory Bus Interface
- 8-Hi Stack HBM3e
- PCIe 6.0 Support
- 700W TDP (Peak)
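The core and transistor counts in this summary can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the article plus the 80 Billion transistor Hopper figure from the comparison table. The constant and function names are illustrative, and the derived cores-per-SM value is an inference from the quoted totals, not a confirmed spec.

```python
# Numbers quoted in the article for the B200.
SM_COUNT = 160
TOTAL_CORES = 20_480
TRANSISTORS_PER_DIE_BN = 104
COMPUTE_DIES = 2
HOPPER_TRANSISTORS_BN = 80  # Hopper entry in the comparison table


def cores_per_sm(total_cores, sm_count):
    """Cores per SM implied by the totals; fails if they do not divide evenly."""
    if total_cores % sm_count != 0:
        raise ValueError("core count is not a whole multiple of the SM count")
    return total_cores // sm_count


def package_transistors_bn(per_die_bn, dies):
    """Total transistors (in billions) across all compute dies on the package."""
    return per_die_bn * dies


if __name__ == "__main__":
    total_bn = package_transistors_bn(TRANSISTORS_PER_DIE_BN, COMPUTE_DIES)
    print("implied cores per SM:", cores_per_sm(TOTAL_CORES, SM_COUNT))
    print("package transistors (Bn):", total_bn)
    print("increase over Hopper (Bn):", total_bn - HOPPER_TRANSISTORS_BN)
```

The implied 128 cores per SM matches the Hopper figure in the comparison table, and the two-die total lines up with both the 208 Billion headline number and the "128 Billion more transistors" claim.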
On the memory side, the Blackwell B200 GPU will pack up to 192 GB of HBM3e memory. This will come as eight 8-hi stacks, each with 24 GB of capacity, connected across an 8192-bit wide bus interface. This is a 2.4x increase over the H100 80 GB GPUs, which allows the chip to run bigger LLMs.
The NVIDIA Blackwell B200 and its respective platforms will pave the way for a new era of AI computing and offer brutal competition to AMD and Intel's latest chip offerings, which are yet to see widespread adoption. With the unveiling of Blackwell, NVIDIA has once again cemented itself as the dominant force of the AI market.
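The memory figures hang together arithmetically, as the sketch below shows. The 8.0 Gbps per-pin rate is taken from the B200 entry in the spec table further down. The names are illustrative, and the bandwidth formula is the usual bus width times pin speed estimate, not an official NVIDIA calculation.

```python
# Memory figures quoted in the article for the B200.
STACKS = 8
GB_PER_STACK = 24
LAYERS_PER_STACK = 8        # "8-hi"
BUS_WIDTH_BITS = 8192
PIN_SPEED_GBPS = 8.0        # per-pin data rate from the spec table
H100_CAPACITY_GB = 80


def total_capacity_gb(stacks, gb_per_stack):
    """Total HBM capacity across all stacks."""
    return stacks * gb_per_stack


def bandwidth_gb_per_s(bus_width_bits, pin_speed_gbps):
    """Peak bandwidth: every bus pin moves pin_speed_gbps gigabits per second."""
    return bus_width_bits * pin_speed_gbps / 8  # 8 bits per byte


if __name__ == "__main__":
    capacity = total_capacity_gb(STACKS, GB_PER_STACK)
    print("capacity (GB):", capacity)
    print("capacity vs H100 80 GB:", capacity / H100_CAPACITY_GB)
    print("bus bits per stack:", BUS_WIDTH_BITS // STACKS)
    print("GB per stacked layer:", GB_PER_STACK / LAYERS_PER_STACK)
    print("bandwidth (GB/s):", bandwidth_gb_per_s(BUS_WIDTH_BITS, PIN_SPEED_GBPS))
```

The result is 8192 GB/s, which the article rounds to 8 TB/s, with each stack contributing a 1024-bit slice of the bus.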
NVIDIA HPC / AI GPUs
| NVIDIA Tesla Graphics Card | NVIDIA B200 | NVIDIA H200 (SXM5) | NVIDIA H100 (SXM5) | NVIDIA H100 (PCIe) | NVIDIA A100 (SXM4) | NVIDIA A100 (PCIe4) | Tesla V100S (PCIe) | Tesla V100 (SXM2) | Tesla P100 (SXM2) | Tesla P100 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla K40 (PCI-Express) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPU | B200 | H200 (Hopper) | H100 (Hopper) | H100 (Hopper) | A100 (Ampere) | A100 (Ampere) | GV100 (Volta) | GV100 (Volta) | GP100 (Pascal) | GP100 (Pascal) | GM200 (Maxwell) | GK110 (Kepler) |
| Process Node | 4nm | 4nm | 4nm | 4nm | 7nm | 7nm | 12nm | 12nm | 16nm | 16nm | 28nm | 28nm |
| Transistors | 208 Billion | 80 Billion | 80 Billion | 80 Billion | 54.2 Billion | 54.2 Billion | 21.1 Billion | 21.1 Billion | 15.3 Billion | 15.3 Billion | 8 Billion | 7.1 Billion |
| GPU Die Size | TBD | 814mm2 | 814mm2 | 814mm2 | 826mm2 | 826mm2 | 815mm2 | 815mm2 | 610 mm2 | 610 mm2 | 601 mm2 | 551 mm2 |
| SMs | 160 | 132 | 132 | 114 | 108 | 108 | 80 | 80 | 56 | 56 | 24 | 15 |
| TPCs | 80 | 66 | 66 | 57 | 54 | 54 | 40 | 40 | 28 | 28 | 24 | 15 |
| L2 Cache Size | TBD | 51200 KB | 51200 KB | 51200 KB | 40960 KB | 40960 KB | 6144 KB | 6144 KB | 4096 KB | 4096 KB | 3072 KB | 1536 KB |
| FP32 CUDA Cores Per SM | TBD | 128 | 128 | 128 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 192 |
| FP64 CUDA Cores / SM | TBD | 128 | 128 | 128 | 32 | 32 | 32 | 32 | 32 | 32 | 4 | 64 |
| FP32 CUDA Cores | TBD | 16896 | 16896 | 14592 | 6912 | 6912 | 5120 | 5120 | 3584 | 3584 | 3072 | 2880 |
| FP64 CUDA Cores | TBD | 16896 | 16896 | 14592 | 3456 | 3456 | 2560 | 2560 | 1792 | 1792 | 96 | 960 |
| Tensor Cores | TBD | 528 | 528 | 456 | 432 | 432 | 640 | 640 | N/A | N/A | N/A | N/A |
| Texture Units | TBD | 528 | 528 | 456 | 432 | 432 | 320 | 320 | 224 | 224 | 192 | 240 |
| Boost Clock | TBD | ~1850 MHz | ~1850 MHz | ~1650 MHz | 1410 MHz | 1410 MHz | 1601 MHz | 1530 MHz | 1480 MHz | 1329MHz | 1114 MHz | 875 MHz |
| TOPs (DNN/AI) | 20,000 TOPs | 3958 TOPs | 3958 TOPs | 3200 TOPs | 2496 TOPs | 2496 TOPs | 130 TOPs | 125 TOPs | N/A | N/A | N/A | N/A |
| FP16 Compute | 10,000 TFLOPs | 1979 TFLOPs | 1979 TFLOPs | 1600 TFLOPs | 624 TFLOPs | 624 TFLOPs | 32.8 TFLOPs | 30.4 TFLOPs | 21.2 TFLOPs | 18.7 TFLOPs | N/A | N/A |
| FP32 Compute | 90 TFLOPs | 67 TFLOPs | 67 TFLOPs | 800 TFLOPs | 156 TFLOPs (19.5 TFLOPs standard) | 156 TFLOPs (19.5 TFLOPs standard) | 16.4 TFLOPs | 15.7 TFLOPs | 10.6 TFLOPs | 10.0 TFLOPs | 6.8 TFLOPs | 5.04 TFLOPs |
| FP64 Compute | 45 TFLOPs | 34 TFLOPs | 34 TFLOPs | 48 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 19.5 TFLOPs (9.7 TFLOPs standard) | 8.2 TFLOPs | 7.80 TFLOPs | 5.30 TFLOPs | 4.7 TFLOPs | 0.2 TFLOPs | 1.68 TFLOPs |
| Memory Interface | 8192-bit HBM3e | 5120-bit HBM3e | 5120-bit HBM3 | 5120-bit HBM2e | 6144-bit HBM2e | 6144-bit HBM2e | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 384-bit GDDR5 | 384-bit GDDR5 |
| Memory Size | Up To 192 GB HBM3e @ 8.0 Gbps | Up To 141 GB HBM3e @ 6.5 Gbps | Up To 80 GB HBM3 @ 5.2 Gbps | Up To 94 GB HBM2e @ 5.1 Gbps | Up To 40 GB HBM2 @ 1.6 TB/s / Up To 80 GB HBM2 @ 1.6 TB/s | Up To 40 GB HBM2 @ 1.6 TB/s / Up To 80 GB HBM2 @ 2.0 TB/s | 16 GB HBM2 @ 1134 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s / 12 GB HBM2 @ 549 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB GDDR5 @ 288 GB/s |
| TDP | 700W | 700W | 700W | 350W | 400W | 250W | 250W | 300W | 300W | 250W | 250W | 235W |
