NVIDIA Blackwell GPU Architecture Official: 208 Billion Transistors, 5x AI Performance, 192 GB HBM3e Memory, 8 TB/s Bandwidth

Mar 18, 2024 at 04:44pm EDT

NVIDIA has officially unveiled its next-gen Blackwell GPU architecture which features up to a 5x performance increase versus Hopper H100 GPUs.

NVIDIA Blackwell GPUs Feature 5x Faster AI Performance Than Hopper H100, Leading The Charge of Next-Gen AI Computing

NVIDIA has gone official with the full details of its next-generation AI & Tensor Core GPU architecture codenamed Blackwell. As expected, Blackwell GPUs are the first to feature an NVIDIA MCM design, which incorporates two GPU dies on the same package.

World’s Most Powerful Chip — Packed with 208 billion transistors, Blackwell-architecture GPUs are manufactured using a custom-built 4NP TSMC process with two-reticle limit GPU dies connected by 10 TB/second chip-to-chip link into a single, unified GPU.
Second-Generation Transformer Engine — Fueled by new micro-tensor scaling support and NVIDIA’s advanced dynamic range management algorithms integrated into NVIDIA TensorRT™-LLM and NeMo Megatron frameworks, Blackwell will support double the compute and model sizes with new 4-bit floating point AI inference capabilities.
Fifth-Generation NVLink — To accelerate performance for multitrillion-parameter and mixture-of-experts AI models, the latest iteration of NVIDIA NVLink® delivers groundbreaking 1.8TB/s bidirectional throughput per GPU, ensuring seamless high-speed communication among up to 576 GPUs for the most complex LLMs.
RAS Engine — Blackwell-powered GPUs include a dedicated engine for reliability, availability and serviceability. Additionally, the Blackwell architecture adds capabilities at the chip level to utilize AI-based preventative maintenance to run diagnostics and forecast reliability issues. This maximizes system uptime and improves resiliency for massive-scale AI deployments to run uninterrupted for weeks or even months at a time and to reduce operating costs.
Secure AI — Advanced confidential computing capabilities protect AI models and customer data without compromising performance, with support for new native interface encryption protocols, which are critical for privacy-sensitive industries like healthcare and financial services.
Decompression Engine — A dedicated decompression engine supports the latest formats, accelerating database queries to deliver the highest performance in data analytics and data science. In the coming years, data processing, on which companies spend tens of billions of dollars annually, will be increasingly GPU-accelerated.

Diving into the details, the NVIDIA Blackwell GPU features a total of 104 Billion transistors on each compute die, which is fabricated on the TSMC 4NP process node. Interestingly, both Synopsys and TSMC have utilized NVIDIA's CuLitho technology for the production of Blackwell GPUs, which accelerates the manufacturing of these next-gen AI accelerator chips. The B100 GPUs are equipped with a 10 TB/s high-bandwidth interface which allows super-fast chip-to-chip interconnect. The two dies are unified as one chip on the same package, offering up to 208 Billion transistors and full GPU cache coherency.

Compared to Hopper, the NVIDIA Blackwell GPU offers 128 Billion more transistors, 5x the AI performance (boosted to 20 petaFLOPS per chip), and 4x the on-die memory. The GPU itself is coupled with 8 HBM3e stacks, the world's fastest memory solution, offering 8 TB/s of memory bandwidth across an 8192-bit bus interface and up to 192 GB of HBM3e memory. To quickly sum up the performance figures versus Hopper, you are getting:

20 PFLOPS FP8 (2.5x Hopper)
20 PFLOPS FP6 (2.5x Hopper)
40 PFLOPS FP4 (5.0x Hopper)
740B Parameters (6.0x Hopper)
34T Parameters/sec (5.0x Hopper)
7.2 TB/s NVLINK (4.0x Hopper)

NVIDIA will be offering Blackwell GPUs as a full-on platform, combining two of these GPUs (four compute dies in total) with a single Grace CPU (72 Arm Neoverse V2 CPU cores). The GPUs will be interconnected with each other and with the Grace CPU using a 900 GB/s NVLINK protocol.

NVIDIA Blackwell B200 GPUs For 2024 - 192 GB HBM3e

First up, we have the NVIDIA Blackwell B200 GPU. This is the first of the two Blackwell chips that will be adopted into various designs ranging from SXM modules to PCIe AICs and Superchip platforms. The B200 GPU will be the first NVIDIA GPU to utilize a chiplet design, featuring two compute dies based on the TSMC 4nm process node.

MCM, or Multi-Chip-Module, has been a long time coming on the NVIDIA side of things, and it's finally here as the company tries to tackle challenges associated with next-gen process nodes such as yields and cost. Chiplets provide a viable alternative where NVIDIA can still achieve faster gen-over-gen performance without compromising its supply or costs, and this is just a stepping stone in its chiplet journey.

The NVIDIA Blackwell B200 GPU will be a monster chip. It incorporates a total of 160 SMs for 20,480 cores. The GPU will feature the latest NVLINK interconnect technology, supporting the same 8-GPU architecture and a 400 GbE networking switch. It's also going to be very power-hungry with a 700W peak TDP, though that's the same as the H100 and H200 chips. Summing this chip up:

TSMC 4NP Process Node
Multi-Chip-Package GPU
1-GPU 104 Billion Transistors
2-GPU 208 Billion Transistors
160 SMs (20,480 Cores)
8 HBM Packages
192 GB HBM3e Memory
8 TB/s Memory Bandwidth
8192-bit Memory Bus Interface
8-Hi Stack HBM3e
PCIe 6.0 Support
700W TDP (Peak)
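
The spec list above is internally consistent, which a few lines of arithmetic can confirm. The sketch below is only a sanity check on the numbers quoted in this article. The variable names are made up for illustration, and the per-SM core count is derived here (assuming cores are spread evenly across SMs) rather than stated by NVIDIA.

```python
# Figures quoted in the spec list above.
TRANSISTORS_PER_DIE_BN = 104   # billions of transistors per compute die
COMPUTE_DIES = 2               # two dies on one package
SM_COUNT = 160
CORE_COUNT = 20_480

# Two dies at 104 billion each should give the headline transistor count.
total_transistors_bn = TRANSISTORS_PER_DIE_BN * COMPUTE_DIES

# Derived value: assumes every SM carries the same number of cores.
cores_per_sm = CORE_COUNT // SM_COUNT

print(f"Total transistors: {total_transistors_bn} billion")
print(f"Cores per SM: {cores_per_sm}")
```

If the even split holds, the derived per-SM figure lines up with the 128 FP32 cores per SM listed for Hopper in the comparison table further down.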

On the memory side, the Blackwell B200 GPU will pack up to 192 GB of HBM3e memory. This will be featured in eight 8-hi stacks, each with 24 GB of capacity, across a combined 8192-bit wide bus interface. This is a 2.4x increase in capacity over the H100 80 GB GPUs, which allows the chip to run bigger LLMs.
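
The memory figures can be reproduced with simple arithmetic. The sketch below uses only numbers quoted in this article, including the 8.0 Gbps per-pin rate from the comparison table. The variable names are illustrative, and the even split of the bus across stacks is an assumption.

```python
# Figures quoted in the article for the B200 memory subsystem.
STACKS = 8
GB_PER_STACK = 24
BUS_WIDTH_BITS = 8192
PIN_RATE_GBPS = 8.0      # per-pin data rate, as listed in the comparison table
H100_CAPACITY_GB = 80

capacity_gb = STACKS * GB_PER_STACK

# Assumes the bus is divided evenly across the eight stacks.
bits_per_stack = BUS_WIDTH_BITS // STACKS

# Bus width (bits) * per-pin rate (Gbit/s) / 8 bits per byte = GB/s.
bandwidth_gb_s = BUS_WIDTH_BITS * PIN_RATE_GBPS / 8

capacity_ratio = capacity_gb / H100_CAPACITY_GB

print(f"Capacity: {capacity_gb} GB ({capacity_ratio:.1f}x the 80 GB H100)")
print(f"Bus width per stack: {bits_per_stack} bits")
print(f"Peak bandwidth: {bandwidth_gb_s / 1000:.1f} TB/s")
```

The computed bandwidth comes out slightly above the rounded 8 TB/s headline figure, which is what you would expect from an 8192-bit bus at 8.0 Gbps per pin.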

The NVIDIA Blackwell B200 and its respective platforms will pave the way for a new era of AI computing and offer brutal competition to AMD and Intel's latest chip offerings, which are yet to see widespread adoption. With the unveiling of Blackwell, NVIDIA has once again cemented itself as the dominant force in the AI market.

NVIDIA HPC / AI GPUs

NVIDIA Tesla Graphics Card	NVIDIA B200	NVIDIA H200 (SXM5)	NVIDIA H100 (SXM5)	NVIDIA H100 (PCIe)	NVIDIA A100 (SXM4)	NVIDIA A100 (PCIe4)	Tesla V100S (PCIe)	Tesla V100 (SXM2)	Tesla P100 (SXM2)	Tesla P100 (PCI-Express)	Tesla M40 (PCI-Express)	Tesla K40 (PCI-Express)
GPU	B200	H200 (Hopper)	H100 (Hopper)	H100 (Hopper)	A100 (Ampere)	A100 (Ampere)	GV100 (Volta)	GV100 (Volta)	GP100 (Pascal)	GP100 (Pascal)	GM200 (Maxwell)	GK110 (Kepler)
Process Node	4nm	4nm	4nm	4nm	7nm	7nm	12nm	12nm	16nm	16nm	28nm	28nm
Transistors	208 Billion	80 Billion	80 Billion	80 Billion	54.2 Billion	54.2 Billion	21.1 Billion	21.1 Billion	15.3 Billion	15.3 Billion	8 Billion	7.1 Billion
GPU Die Size	TBD	814mm2	814mm2	814mm2	826mm2	826mm2	815mm2	815mm2	610 mm2	610 mm2	601 mm2	551 mm2
SMs	160	132	132	114	108	108	80	80	56	56	24	15
TPCs	80	66	66	57	54	54	40	40	28	28	24	15
L2 Cache Size	TBD	51200 KB	51200 KB	51200 KB	40960 KB	40960 KB	6144 KB	6144 KB	4096 KB	4096 KB	3072 KB	1536 KB
FP32 CUDA Cores Per SM	TBD	128	128	128	64	64	64	64	64	64	128	192
FP64 CUDA Cores / SM	TBD	128	128	128	32	32	32	32	32	32	4	64
FP32 CUDA Cores	TBD	16896	16896	14592	6912	6912	5120	5120	3584	3584	3072	2880
FP64 CUDA Cores	TBD	16896	16896	14592	3456	3456	2560	2560	1792	1792	96	960
Tensor Cores	TBD	528	528	456	432	432	640	640	N/A	N/A	N/A	N/A
Texture Units	TBD	528	528	456	432	432	320	320	224	224	192	240
Boost Clock	TBD	~1850 MHz	~1850 MHz	~1650 MHz	1410 MHz	1410 MHz	1601 MHz	1530 MHz	1480 MHz	1329MHz	1114 MHz	875 MHz
TOPs (DNN/AI)	20,000 TOPs	3958 TOPs	3958 TOPs	3200 TOPs	2496 TOPs	2496 TOPs	130 TOPs	125 TOPs	N/A	N/A	N/A	N/A
FP16 Compute	10,000 TFLOPs	1979 TFLOPs	1979 TFLOPs	1600 TFLOPs	624 TFLOPs	624 TFLOPs	32.8 TFLOPs	30.4 TFLOPs	21.2 TFLOPs	18.7 TFLOPs	N/A	N/A
FP32 Compute	90 TFLOPs	67 TFLOPs	67 TFLOPs	800 TFLOPs	156 TFLOPs (19.5 TFLOPs standard)	156 TFLOPs (19.5 TFLOPs standard)	16.4 TFLOPs	15.7 TFLOPs	10.6 TFLOPs	10.0 TFLOPs	6.8 TFLOPs	5.04 TFLOPs
FP64 Compute	45 TFLOPs	34 TFLOPs	34 TFLOPs	48 TFLOPs	19.5 TFLOPs (9.7 TFLOPs standard)	19.5 TFLOPs (9.7 TFLOPs standard)	8.2 TFLOPs	7.80 TFLOPs	5.30 TFLOPs	4.7 TFLOPs	0.2 TFLOPs	1.68 TFLOPs
Memory Interface	8192-bit HBM3e	5120-bit HBM3e	5120-bit HBM3	5120-bit HBM2e	6144-bit HBM2e	6144-bit HBM2e	4096-bit HBM2	4096-bit HBM2	4096-bit HBM2	4096-bit HBM2	384-bit GDDR5	384-bit GDDR5
Memory Size	Up To 192 GB HBM3e @ 8.0 Gbps	Up To 141 GB HBM3e @ 6.5 Gbps	Up To 80 GB HBM3 @ 5.2 Gbps	Up To 94 GB HBM2e @ 5.1 Gbps	Up To 40 GB HBM2 @ 1.6 TB/s Up To 80 GB HBM2 @ 1.6 TB/s	Up To 40 GB HBM2 @ 1.6 TB/s Up To 80 GB HBM2 @ 2.0 TB/s	16 GB HBM2 @ 1134 GB/s	16 GB HBM2 @ 900 GB/s	16 GB HBM2 @ 732 GB/s	16 GB HBM2 @ 732 GB/s 12 GB HBM2 @ 549 GB/s	24 GB GDDR5 @ 288 GB/s	12 GB GDDR5 @ 288 GB/s
TDP	700W	700W	700W	350W	400W	250W	250W	300W	300W	250W	250W	235W
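
To see where the headline "5x" comes from, the B200 and H100 (SXM5) columns of the table can be divided row by row. This is a rough sketch using only values copied from the table above. The dictionary keys and the `ratio` helper are hypothetical names chosen for illustration.

```python
# Values copied from the B200 and H100 (SXM5) columns of the table above.
b200 = {"tops": 20_000, "fp16_tflops": 10_000, "memory_gb": 192, "tdp_w": 700}
h100 = {"tops": 3_958, "fp16_tflops": 1_979, "memory_gb": 80, "tdp_w": 700}


def ratio(key: str) -> float:
    """Return the B200 value divided by the H100 value for one table row."""
    return b200[key] / h100[key]


for key in ("tops", "fp16_tflops", "memory_gb"):
    print(f"{key}: {ratio(key):.2f}x")

# Both parts are listed at the same TDP, so the per-watt gain
# tracks the raw TOPs gain.
tops_per_watt_gain = (b200["tops"] / b200["tdp_w"]) / (h100["tops"] / h100["tdp_w"])
print(f"TOPs per watt: {tops_per_watt_gain:.2f}x")
```

The AI and FP16 rows both land at roughly 5x, in line with the headline claim, while memory capacity grows by 2.4x.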

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for the hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
