Here’s How NVIDIA’s Blackwell Ultra GB300 AI Racks Are Dominating Long-Context DeepSeek Workloads, Delivering Impressive Gains Versus GB200

Muhammad Zuhair

NVIDIA's GB300 NVL72 AI racks have been tested on DeepSeek's latest open-source models, and with software tuning and optimized inference, the results are promising.

NVIDIA's Blackwell Ultra Scores Up to a 1.5x Lead Over GB200 NVL72 In Latency-Sensitive Workloads

With GB300, NVIDIA's primary focus has been long-context performance, in order to capitalize on the agentic AI wave. In a recent post, we discussed how Blackwell Ultra delivers a 50x increase in throughput per megawatt compared to Hopper GPUs through its extreme co-design approach. Now, the Large Model Systems Organization (LMSYS) has tested GB300 NVL72 for long-context inference, and the results look very promising. The testing includes infrastructure-level software routing, which we discuss next.


Because long-context workloads shift pressure toward GPU VRAM, the LMSYS team integrated PD (Prefill-Decode) Disaggregation, a widely used mechanism for serving large token contexts. In simple terms, PD Disaggregation splits the work across different hardware "nodes" to avoid bottlenecks. The prefill phase (prompt processing) and the decode phase (token generation) run on separate nodes, so each can be tuned on its own, which improves throughput at scale.
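To make the prefill/decode split concrete, here is a minimal toy sketch in Python. It is not the LMSYS implementation: the `Request` class, the queue-based handoff, and the fake "token" values are all assumptions made for illustration. It only shows the shape of the idea, in which one pool builds the KV cache from the prompt and a separate pool generates tokens from that cache.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    req_id: int
    prompt_tokens: list
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)
    output: list = field(default_factory=list)


def prefill(req):
    # Prefill pool: process the whole prompt once and build the KV cache.
    # A copy of the prompt stands in for real key/value tensors.
    req.kv_cache = list(req.prompt_tokens)
    return req


def decode_step(req):
    # Decode pool: emit one token per step, extending the KV cache.
    # The "token" is just the current cache length (a stand-in for a model).
    token = len(req.kv_cache)
    req.kv_cache.append(token)
    req.output.append(token)
    return len(req.output) >= req.max_new_tokens


def run_disaggregated(requests):
    prefill_queue = deque(requests)   # work waiting for the prefill pool
    decode_queue = deque()            # requests whose KV cache was handed off
    finished = []
    while prefill_queue or decode_queue:
        # Prefill pool handles one prompt per tick, then hands off the cache.
        if prefill_queue:
            decode_queue.append(prefill(prefill_queue.popleft()))
        # Decode pool advances every in-flight request by one token per tick.
        for _ in range(len(decode_queue)):
            req = decode_queue.popleft()
            if decode_step(req):
                finished.append(req)
            else:
                decode_queue.append(req)
    return finished


if __name__ == "__main__":
    done = run_disaggregated([
        Request(0, [11, 12, 13], max_new_tokens=2),
        Request(1, [21, 22, 23, 24, 25], max_new_tokens=1),
    ])
    for r in done:
        print(r.req_id, r.output)
```

In a real deployment the handoff between the two queues is a KV cache transfer between separate GPU nodes, which is where the infrastructure-level routing comes in.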

The LMSYS team also employed several other optimization techniques, including dynamic chunking for optimized prompt responses under long-context windows and effective KV capacity translation. For generational improvements, the team reported three primary benchmarks: throughput, capacity, and latency ratio.
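The article does not describe how LMSYS sizes its chunks, so the following is only a generic sketch of the chunking idea: a long prompt is prefilled in pieces, and the size of each piece follows a per-step token budget rather than a fixed value. The function name `plan_chunks`, the `step_budgets` list, and the `min_chunk` floor are hypothetical.

```python
def plan_chunks(prompt_len, step_budgets, min_chunk=256):
    """Split a prompt of prompt_len tokens into (start, end) prefill chunks.

    step_budgets[i] is the assumed token budget available at step i; the last
    budget is reused once the list runs out. min_chunk keeps chunks from
    becoming too small to be worth scheduling.
    """
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    if not step_budgets:
        raise ValueError("step_budgets must not be empty")
    if min_chunk < 1:
        raise ValueError("min_chunk must be at least 1")

    chunks = []
    start = 0
    step = 0
    while start < prompt_len:
        budget = step_budgets[min(step, len(step_budgets) - 1)]
        size = max(min_chunk, budget)          # never go below the floor
        end = min(prompt_len, start + size)    # last chunk may be shorter
        chunks.append((start, end))
        start = end
        step += 1
    return chunks


if __name__ == "__main__":
    # A 10,000-token prompt with budgets that change from step to step.
    print(plan_chunks(10_000, [4096, 2048, 8192]))
```

The budgets here are supplied by hand; a serving system would derive them from live load, for example from how many decode tokens are already in flight.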

NVIDIA's GB300 NVL72 vs GB200 NVL72:

  • 1.53x Peak Throughput: 226.2 TPS/GPU (Tokens Per Second)
  • 1.87x User Speed: Massive jump in TPS/User via MTP (Multi-Token Prediction).
  • 1.58x Latency Win
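As a quick sanity check on the first bullet, the snippet below backs out the GB200 figure implied by the quoted ratio. It assumes the 226.2 TPS/GPU value is the GB300 result and that 1.53x is measured against GB200 under the same conditions; the source does not state the baseline directly, so treat the output as an estimate.

```python
def implied_baseline(new_value, speedup):
    """Return the baseline implied by a measured value and its speedup ratio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return new_value / speedup


if __name__ == "__main__":
    gb300_tps_per_gpu = 226.2   # from the article (assumed to be the GB300 figure)
    peak_speedup = 1.53         # from the article
    gb200_estimate = implied_baseline(gb300_tps_per_gpu, peak_speedup)
    print(f"Implied GB200 peak throughput: {gb200_estimate:.1f} TPS/GPU")
```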

According to the LMSYS team, GB300 secures a 1.4x to 1.5x lead over GB200 on average, especially in latency-sensitive scenarios, which positions Blackwell Ultra well for agentic workloads. While Blackwell Ultra looks dominant in latency and throughput, TCO figures have not yet been discussed in the industry, which matters because deployment costs have risen with GB300.

NVIDIA's approach with each generation appears to focus not just on architectural advancements but also on addressing industry-specific constraints, and in Blackwell Ultra's case, latency has improved significantly. This is one of the reasons why GB300 is emerging as a leading choice for hyperscalers and neoclouds in agentic environments.


About the author: Muhammad Zuhair is a hardware and technology reporter for Wccftech, specializing in the semiconductor industry and the complex interplay between technology, manufacturing, and geopolitics. His coverage focuses on the corporate strategies and technological roadmaps of industry giants like TSMC, NVIDIA, Samsung, and Intel. Zuhair's expertise lies in deconstructing complex topics such as fabrication nodes (e.g., 2nm process), the economic impact of policies like the CHIPS Act, and the strategic development of AI infrastructure from NVIDIA, AMD and Intel.
