NVIDIA Blackwell Ultra “GB300” GPU, The Fastest AI Chip, Detailed: Dual Reticle GPU With Over 20K Cores, 288 GB HBM3e Memory at 8 TB/s & 50% Faster Than GB200

Hassan Mujtaba

NVIDIA has provided an in-depth breakdown of its fastest chip for AI, the Blackwell Ultra GB300, which is 50% faster than GB200 & packs 288 GB memory.

NVIDIA's Blackwell Ultra "GB300" Is The Miracle Chip For AI, 50% Faster Than GB200 And Packs 288 GB of Memory

A few days ago, NVIDIA rolled out an article giving a breakdown of its latest and greatest AI chip, the GB300 Blackwell Ultra. This chip is now in full production and has already been rolled out to key customers. While the chip is an extension of the Blackwell solution, it does offer a significant upgrade in terms of performance and features.


Just as NVIDIA's Super series improves on the original RTX gaming cards, the Ultra series is an enhanced version of the AI chips that were initially introduced. NVIDIA didn't use Ultra branding in previous lineups such as Hopper and Volta, but those also technically had enhanced versions. And although Ultra chips are better at the hardware level, software updates and optimizations also deliver substantial gains on non-Ultra chips.

So, what is Blackwell Ultra GB300? As noted above, it is an enhanced version of Blackwell that uses two reticle-sized dies, connected through NVIDIA's NV-HBI high-bandwidth interface so that they appear as a single GPU. The GPU is quite dense: it is built on the TSMC 4NP node (a 5nm-class process optimized for NVIDIA) and houses a total of 208 billion transistors. NV-HBI provides 10 TB/s of bandwidth between the two GPU dies, which function as a single chip.

The NVIDIA Blackwell Ultra GB300 GPU packs a total of 160 SMs, each with 128 CUDA cores, four 5th Gen Tensor cores with FP8, FP6, and NVFP4 precision compute, 256 KB of Tensor memory (TMEM), and SFUs. This adds up to a total of 20,480 CUDA cores and 640 Tensor cores, plus 40 MB of TMEM.
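The chip-level totals follow directly from the per-SM figures. As a quick sanity check, here is a minimal Python sketch; the variable names are ours, and the TMEM total assumes 1 MB = 1024 KB.

```python
# Per-SM figures quoted in the article.
SM_COUNT = 160
CUDA_CORES_PER_SM = 128
TENSOR_CORES_PER_SM = 4
TMEM_KB_PER_SM = 256

# Chip-level totals are simple products of the per-SM figures.
cuda_cores = SM_COUNT * CUDA_CORES_PER_SM
tensor_cores = SM_COUNT * TENSOR_CORES_PER_SM
tmem_mb = SM_COUNT * TMEM_KB_PER_SM / 1024  # assumes 1 MB = 1024 KB

print(f"CUDA cores:   {cuda_cores}")
print(f"Tensor cores: {tensor_cores}")
print(f"TMEM:         {tmem_mb:.0f} MB")
```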

| Feature | Hopper | Blackwell | Blackwell Ultra |
| --- | --- | --- | --- |
| Manufacturing process | TSMC 4N | TSMC 4NP | TSMC 4NP |
| Transistors | 80B | 208B | 208B |
| Dies per GPU | 1 | 2 | 2 |
| NVFP4 dense / sparse performance | N/A | 10 / 20 PetaFLOPS | 15 / 20 PetaFLOPS |
| FP8 dense / sparse performance | 2 / 4 PetaFLOPS | 5 / 10 PetaFLOPS | 5 / 10 PetaFLOPS |
| Attention acceleration (SFU EX2) | 4.5 TeraExponentials/s | 5 TeraExponentials/s | 10.7 TeraExponentials/s |
| Max HBM capacity | 80 GB HBM (H100), 141 GB HBM3E (H200) | 192 GB HBM3E | 288 GB HBM3E |
| Max HBM bandwidth | 3.35 TB/s (H100), 4.8 TB/s (H200) | 8 TB/s | 8 TB/s |
| NVLink bandwidth | 900 GB/s | 1,800 GB/s | 1,800 GB/s |
| Max power (TGP) | Up to 700 W | Up to 1,200 W | Up to 1,400 W |
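To put the Blackwell versus Blackwell Ultra columns in perspective, the short Python sketch below computes the generational ratios from the figures in the table. The last line is our own derived estimate, not an NVIDIA number: it divides the peak NVFP4 uplift by the peak power increase, so it only describes peak-spec efficiency and not measured workloads.

```python
# (Blackwell, Blackwell Ultra) peak figures taken from the table above.
specs = {
    "NVFP4 dense (PFLOPS)": (10, 15),
    "SFU EX2 (TeraExponentials/s)": (5, 10.7),
    "HBM capacity (GB)": (192, 288),
    "Max power (W)": (1200, 1400),
}

# Ratio of Blackwell Ultra to Blackwell for each row.
ratios = {name: new / old for name, (old, new) in specs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")

# Derived estimate (an assumption of this sketch): peak NVFP4 per watt at max TGP.
nvfp4_per_watt_gain = ratios["NVFP4 dense (PFLOPS)"] / ratios["Max power (W)"]
print(f"Peak NVFP4 per watt: {nvfp4_per_watt_gain:.2f}x")
```

The 1.50x NVFP4 ratio is the "50% faster" figure from the headline, and HBM capacity grows by the same factor.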

The 5th Gen Tensor Cores are where all the magic happens, as they are responsible for all the AI compute operations. NVIDIA has delivered major innovations in each generation of Tensor Cores for its GPUs, such as:

  • NVIDIA Volta: 8-thread MMA units, FP16 with FP32 accumulation for training.
  • NVIDIA Ampere: Full warp-wide MMA, BF16, and TensorFloat-32 formats.
  • NVIDIA Hopper: Warp-group MMA across 128 threads, Transformer Engine with FP8 support.
  • NVIDIA Blackwell: 2nd Gen Transformer Engine with FP8, FP6, and NVFP4 compute, plus Tensor memory (TMEM).

Blackwell Ultra also brings a huge upgrade to memory, offering 288 GB of HBM3e capacity versus a maximum of 192 GB on the previous Blackwell GB200 solutions. This extra capacity is what allows NVIDIA to support multi-trillion-parameter AI models. The memory comes in 8 stacks served by 16 512-bit controllers (an 8192-bit wide interface) and operates at 8 TB/s per GPU. The memory enables:

  • Complete model residence: 300B+ parameter models without memory offloading.
  • Extended context lengths: Larger KV cache capacity for transformer models.
  • Improved compute efficiency: Higher compute-to-memory ratios for diverse workloads.
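The memory figures above can be checked with some simple arithmetic. The Python sketch below derives the bus width and the implied per-pin data rate, then estimates the weight footprint of a 300B-parameter model at several precisions. Note the assumptions: the article does not say which precision the "300B+" claim refers to, the estimate counts weights only (no KV cache, activations, or scale factors), and it uses decimal gigabytes.

```python
HBM_CAPACITY_GB = 288
HBM_BANDWIDTH_TBPS = 8
CONTROLLERS = 16
BITS_PER_CONTROLLER = 512

# Total interface width across all memory controllers.
bus_width_bits = CONTROLLERS * BITS_PER_CONTROLLER

# Implied per-pin data rate in Gb/s: bytes/s -> bits/s, divided by pin count.
per_pin_gbps = HBM_BANDWIDTH_TBPS * 1e12 * 8 / bus_width_bits / 1e9

def weights_gb(params_billion, bits_per_param):
    """Weight-only footprint in decimal GB (ignores KV cache and activations)."""
    return params_billion * bits_per_param / 8

print(f"Bus width: {bus_width_bits} bits, about {per_pin_gbps:.2f} Gb/s per pin")
for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    gb = weights_gb(300, bits)
    verdict = "fits" if gb <= HBM_CAPACITY_GB else "does not fit"
    print(f"300B params at {name}: {gb:.0f} GB, {verdict} in {HBM_CAPACITY_GB} GB")
```

Under these assumptions, a 300B-parameter model only fits on a single GPU at 4-bit precision, which suggests the claim relies on low-precision formats such as NVFP4.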

The interconnect on Blackwell Ultra is the same NVLink 5 used on Blackwell, with GPU-to-GPU communication running through the NVLink Switch. Alongside it, NVLink-C2C handles the connection to the Grace CPU, and a PCIe Gen6 x16 interface is used for connection to host CPUs. The NVLink 5 and host-side connectivity specs are as follows:

  • Per-GPU Bandwidth: 1.8 TB/s bidirectional (18 links x 100 GB/s)
  • Performance Scaling: 2x improvement over NVLink 4 (Hopper GPU)
  • Maximum Topology: 576 GPUs in non-blocking compute fabric
  • Rack-Scale Integration: 72-GPU NVL72 configurations with 130 TB/s aggregate bandwidth
  • PCIe Interface: Gen6 × 16 lanes (256 GB/s bidirectional)
  • NVLink-C2C: Grace CPU-GPU communication with memory coherency (900 GB/s)

| Interconnect (GB/s) | Hopper GPU | Blackwell GPU | Blackwell Ultra GPU |
| --- | --- | --- | --- |
| NVLink (GPU-GPU) | 900 | 1,800 | 1,800 |
| NVLink-C2C (CPU-GPU) | 900 | 900 | 900 |
| PCIe Interface | 128 (Gen 5) | 256 (Gen 6) | 256 (Gen 6) |
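The interconnect numbers are consistent with each other, as the rough Python sketch below shows. The link count, per-link bandwidth, and NVL72 size come from the article. The PCIe Gen6 signaling rate of 64 GT/s per lane is our assumption taken from the PCIe 6.0 specification, and the calculation ignores encoding and protocol overhead.

```python
# NVLink 5 figures from the article.
NVLINK_LINKS = 18
GBS_PER_LINK = 100          # bidirectional, per link
NVL72_GPUS = 72

per_gpu_gbs = NVLINK_LINKS * GBS_PER_LINK
nvl72_aggregate_tbs = NVL72_GPUS * per_gpu_gbs / 1000

# PCIe Gen6: 64 GT/s per lane is assumed from the PCIe 6.0 spec (not the article).
PCIE6_GT_PER_LANE = 64
PCIE_LANES = 16
pcie_per_direction_gbs = PCIE6_GT_PER_LANE * PCIE_LANES / 8   # raw, no overhead
pcie_bidirectional_gbs = 2 * pcie_per_direction_gbs

print(f"NVLink per GPU:  {per_gpu_gbs} GB/s")
print(f"NVL72 aggregate: {nvl72_aggregate_tbs} TB/s")
print(f"PCIe Gen6 x16:   {pcie_bidirectional_gbs:.0f} GB/s bidirectional")
```

The NVL72 aggregate works out to 129.6 TB/s, which the article rounds to 130 TB/s.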

The result is that NVIDIA's Blackwell Ultra GB300 platform achieves a 50% increase in dense low-precision compute using the new NVFP4 format. NVFP4 delivers near-FP8 accuracy, with differences that are often less than 1%. It also reduces the memory footprint by about 1.8x versus FP8 and 3.5x versus FP16.
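The footprint reductions are a little smaller than the naive 2x and 4x because NVFP4 carries scaling metadata. The Python sketch below reproduces the ratios under an assumption that is not stated in the article: each block of 16 four-bit values shares one 8-bit scale factor, and the per-tensor scale is small enough to ignore.

```python
# Assumed NVFP4 layout (not from the article): 4-bit elements,
# one 8-bit scale factor shared by each block of 16 elements.
ELEMENT_BITS = 4
SCALE_BITS = 8
BLOCK_SIZE = 16

# Effective storage cost per value, including the amortized block scale.
nvfp4_bits_per_value = ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE

reduction_vs_fp8 = 8 / nvfp4_bits_per_value
reduction_vs_fp16 = 16 / nvfp4_bits_per_value

print(f"Effective bits per value: {nvfp4_bits_per_value}")
print(f"Reduction vs FP8:  {reduction_vs_fp8:.2f}x")
print(f"Reduction vs FP16: {reduction_vs_fp16:.2f}x")
```

With 4.5 effective bits per value, the ratios come out close to the roughly 1.8x and 3.5x quoted above.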

Blackwell Ultra also sees advanced scheduling management and new Enterprise-grade security features, such as:

  • Enhanced GigaThread Engine: Next-generation work scheduler providing improved context switching performance and optimized workload distribution across all 160 SMs.
  • Multi-Instance GPU (MIG): Blackwell Ultra GPUs can be partitioned into different-sized MIG instances. For example, an administrator can create two instances with 140 GB of memory each, four instances with 70 GB each, or seven instances with 34 GB each, enabling secure multi-tenancy with predictable performance isolation.
  • Confidential computing and secure AI: Secure and performant protection for sensitive AI models and data, extending hardware-based Trusted Execution Environment (TEE) to GPUs with industry-first TEE-I/O capabilities in the Blackwell architecture and inline NVLink protection for near-identical throughput when compared to unencrypted modes.
  • Advanced NVIDIA Reliability, Availability, and Serviceability (RAS) engine: AI-powered reliability system monitoring thousands of parameters to predict failures, optimize maintenance schedules, and maximize system uptime in large-scale deployments.
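For the MIG example in the list above, the following Python sketch simply totals the memory each layout hands out and compares it with the 288 GB on the GPU. It is plain arithmetic on the article's figures, not a MIG API, and the article does not explain how the unallocated remainder is used.

```python
HBM_CAPACITY_GB = 288

# MIG layouts from the article: number of instances -> memory per instance (GB).
mig_layouts = {2: 140, 4: 70, 7: 34}

def allocated_gb(instances, gb_each):
    """Total memory handed out to MIG instances in a given layout."""
    return instances * gb_each

for instances, gb_each in mig_layouts.items():
    total = allocated_gb(instances, gb_each)
    print(f"{instances} x {gb_each} GB = {total} GB "
          f"({HBM_CAPACITY_GB - total} GB not allocated to instances)")
```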

Performance efficiency is another area where Blackwell Ultra GB300 takes charge, offering higher throughput per unit of power, measured in tokens per second per megawatt (TPS/MW), than Blackwell GB200, as shown in the chart below:

All this shows that NVIDIA is firmly at the top of the AI ladder with engineering marvels such as Blackwell and Blackwell Ultra. Its in-depth software support and optimizations have been a major part of that success, and the annual hardware cadence plus increased R&D spending should keep it there for several years.

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
