NVIDIA's GB300 NVL72 AI racks have been tested with DeepSeek's latest open-source models, and with fine-tuning and optimized inference, the results look promising.
NVIDIA's Blackwell Ultra Scores Up to a 1.5x Lead Over GB200 NVL72 In Latency-Sensitive Workloads
With GB300, NVIDIA's primary focus has been on delivering optimal long-context performance in order to capitalize on the agentic AI wave. In a recent post, we discussed how Blackwell Ultra delivers a 50x increase in throughput per megawatt compared to Hopper GPUs through its extreme co-design approach. Now, the Large Model Systems Organization (LMSYS) has tested GB300 NVL72 for long-context inference, with results looking extremely promising. The testing does include infrastructure-level software routing, which we'll discuss next.
Because long-context workloads shift pressure toward GPU VRAM, the LMSYS team integrated PD (Prefill-Decode) Disaggregation, a widely used mechanism for maintaining large-scale token context. In simple terms, PD Disaggregation splits work across different hardware "nodes" to avoid bottlenecks. The prefill phase (prompt processing) and the decode phase (token generation) run on separate resources, so each can be optimized on its own, which improves throughput at scale.
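To make the prefill/decode split concrete, here is a minimal toy sketch in Python. It is not LMSYS's implementation or any real serving stack: the `Request` class, the `prefill` and `decode_step` functions, and the arithmetic stand-in for a model are all hypothetical, and the "KV cache" is just a list of integers. It only illustrates the hand-off, in which one pool processes whole prompts while a separate pool advances in-flight requests one token at a time.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    # Hypothetical request record; real systems hold KV tensors, not int lists.
    request_id: int
    prompt_tokens: list
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)
    output_tokens: list = field(default_factory=list)


def prefill(request):
    """Prefill node: process the whole prompt once and build the KV cache."""
    request.kv_cache = list(request.prompt_tokens)
    return request


def decode_step(request):
    """Decode node: emit one token from the cache; return True when finished."""
    next_token = sum(request.kv_cache) % 1000  # toy stand-in for a model
    request.output_tokens.append(next_token)
    request.kv_cache.append(next_token)
    return len(request.output_tokens) >= request.max_new_tokens


def run_disaggregated(requests):
    """Alternate between a prefill pool and a decode pool until all finish."""
    prefill_queue = deque(requests)
    decode_queue = deque()
    finished = []
    while prefill_queue or decode_queue:
        # Prefill pool: take one prompt and hand its KV cache to the decode pool.
        if prefill_queue:
            decode_queue.append(prefill(prefill_queue.popleft()))
        # Decode pool: advance every in-flight request by exactly one token.
        for _ in range(len(decode_queue)):
            req = decode_queue.popleft()
            if decode_step(req):
                finished.append(req)
            else:
                decode_queue.append(req)
    return finished


if __name__ == "__main__":
    done = run_disaggregated([Request(0, [1, 2, 3], 2), Request(1, [4, 5], 3)])
    for req in done:
        print(req.request_id, req.output_tokens)
```

In this toy loop, a long prompt entering prefill never blocks token generation for requests already decoding, which is the bottleneck the split is meant to avoid. In a real deployment the two pools would be separate GPUs or nodes, and the KV cache would be transferred between them.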

The LMSYS team also employed several other optimization techniques, including dynamic chunking for optimized prompt responses under long-context windows and effective KV capacity translation. In terms of generational improvements, the team noted the following primary benchmarks: throughput analysis, capacity, and latency ratio.
NVIDIA's GB300 NVL72 vs GB200 NVL72:
- 1.53x Peak Throughput: 226.2 TPS/GPU (Tokens Per Second)
- 1.87x User Speed: Massive jump in TPS/User via MTP (Multi-Token Prediction).
- 1.58x Latency Win
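As a quick sanity check on how a ratio and an absolute figure relate, the sketch below backs out the GB200 number implied by the first bullet. It assumes that 226.2 TPS/GPU is the GB300 peak and that the 1.53x ratio was measured against GB200 under the same configuration. The result is an implied estimate from those two figures, not a number reported by LMSYS, and `implied_baseline` is a hypothetical helper name.

```python
def implied_baseline(new_value, speedup):
    """Back out the baseline figure implied by a new value and a speedup ratio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return new_value / speedup


# Figures from the list above; which system each belongs to is an assumption.
gb300_peak_tps_per_gpu = 226.2
peak_speedup_vs_gb200 = 1.53

gb200_implied = implied_baseline(gb300_peak_tps_per_gpu, peak_speedup_vs_gb200)
print(f"Implied GB200 peak: {gb200_implied:.1f} TPS/GPU")  # prints 147.8
```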
According to the LMSYS team, GB300 secures a 1.4x to 1.5x lead over GB200 on average, especially in latency-sensitive scenarios, and given the focus on agentic workloads, Blackwell Ultra is well positioned to capitalize on them. While Blackwell Ultra looks dominant in latency and throughput, we haven't yet seen TCO figures discussed in the industry, which matters because deployment costs have risen in parallel with GB300.

NVIDIA's approach with each generation appears to focus not just on architectural advancements but also on addressing industry-specific constraints, and in Blackwell Ultra's case, latency figures have seen significant improvements. This is one of the reasons why, in agentic environments, GB300 is emerging as a leading choice for hyperscalers and neoclouds.





