NVIDIA Shatters MoE AI Performance Records With a Massive 10x Leap on GB200 ‘Blackwell’ NVL72 Servers, Fueled by Co-Design Breakthroughs

Muhammad Zuhair

Scaling performance on 'Mixture of Experts' (MoE) AI models is one of the industry's biggest constraints, but NVIDIA appears to have made a breakthrough, which the company credits to its 'co-design' approach to performance scaling.

NVIDIA's GB200 NVL72 AI Cluster Manages to Bring In 10x Higher Performance on the MoE-Focused Kimi K2 Thinking LLM

The AI world has been racing to scale up foundational LLMs by ramping up parameter counts and ensuring that their models excel in performance and applications, but there is a limit to the compute resources companies can invest in this approach. This is where 'Mixture of Experts' frontier AI models come into play: for a given query, they don't activate all of their parameters for every token, but only a portion of them, depending on the type of request. While MoEs have become dominant among LLMs, scaling them up introduces a massive computing bottleneck, which NVIDIA says it has overcome.
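To make the "only a portion of the parameters" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is an illustration only: the expert count, the value of `top_k`, and the helper names (`softmax`, `route_token`, `active_fraction`) are assumptions for this example and do not describe Kimi K2 Thinking's actual configuration or NVIDIA's software.

```python
import math
import random


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Renormalize so the weights of the selected experts sum to 1.
    return [(i, probs[i] / norm) for i in chosen]


def active_fraction(num_experts, top_k):
    """Share of expert blocks that actually run for a single token."""
    return top_k / num_experts


NUM_EXPERTS = 8  # hypothetical value for illustration
TOP_K = 2        # hypothetical value for illustration

rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selection = route_token(scores, TOP_K)

print("experts chosen for this token:", selection)
print("fraction of experts active:", active_fraction(NUM_EXPERTS, TOP_K))
```

In this toy setup each token touches 2 of 8 expert blocks, so most expert weights sit idle for any single token even though the full model must still be held in memory.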


In a press release, NVIDIA disclosed that its GB200 'Blackwell' NVL72 configuration delivers a 10x performance increase over the Hopper-based HGX H200. The firm tested its computing capabilities on the Kimi K2 Thinking MoE model, an open-source LLM with 32 billion activated parameters per forward pass, which is regarded as a standout option in its segment. Team Green claims that the Blackwell architecture is 'poised' to capitalize on the rise of frontier MoE models.

To address the performance bottlenecks involved in scaling MoE AI models, NVIDIA has employed a 'co-design' approach. By utilizing the 72-chip GB200 configuration, coupled with 30TB of fast shared memory, NVIDIA says it takes expert parallelism to a whole new level. This matters because expert parallelism constantly splits and scatters token batches across GPUs, and the resulting communication volume increases at a non-linear rate. NVIDIA describes the remaining optimizations as follows:

Other full-stack optimizations also play a key role in unlocking high inference performance for MoE models. The NVIDIA Dynamo framework orchestrates disaggregated serving by assigning prefill and decode tasks to different GPUs, allowing decode to run with large expert parallelism, while prefill uses parallelism techniques better suited to its workload. The NVFP4 format helps maintain accuracy while further boosting performance and efficiency.
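The communication pressure that expert parallelism creates can be sketched with a toy simulation. The example below assumes uniform random routing, contiguous placement of experts on GPUs, and hypothetical sizes (144 experts, 8 experts per token, 4,096 tokens); none of these figures come from NVIDIA or from Kimi K2 Thinking, and the function names are invented for this illustration. It only counts how many token-to-expert assignments must leave the GPU where the token started.

```python
import random


def expert_to_gpu(expert_id, num_experts, num_gpus):
    """Experts are placed in contiguous blocks, one block per GPU.

    Assumes num_experts is divisible by num_gpus.
    """
    experts_per_gpu = num_experts // num_gpus
    return expert_id // experts_per_gpu


def count_dispatch(num_tokens, num_experts, num_gpus, top_k, seed=0):
    """Count local vs. remote token-to-expert assignments under random routing."""
    rng = random.Random(seed)
    local = 0
    remote = 0
    gpu_pairs = set()  # distinct (source GPU, destination GPU) links in use
    for t in range(num_tokens):
        src = t % num_gpus  # tokens start out evenly spread across GPUs
        for e in rng.sample(range(num_experts), top_k):
            dst = expert_to_gpu(e, num_experts, num_gpus)
            if dst == src:
                local += 1
            else:
                remote += 1
                gpu_pairs.add((src, dst))
    return local, remote, len(gpu_pairs)


# Hypothetical sizes chosen so that 144 experts divide evenly over 8 or 72 GPUs.
for gpus in (8, 72):
    local, remote, pairs = count_dispatch(4096, 144, gpus, top_k=8)
    print(f"{gpus} GPUs: local={local} remote={remote} gpu_pairs={pairs}")
```

Under these assumptions, spreading the same experts over more GPUs leaves fewer assignments local and makes far more GPU-to-GPU links carry traffic, which is the kind of pressure a large, fast interconnect domain is meant to absorb.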

This achievement is a significant development for NVIDIA and its partners, especially since the GB200 NVL72 configuration is now at the stage of its rollout where developers of frontier models are using these AI servers to enhance their models' capabilities. MoE models are known for being computationally efficient, which is why they are increasingly being deployed across a wide range of environments, and NVIDIA appears to be well positioned to capitalize on this trend.


About the author: Muhammad Zuhair is a hardware and technology reporter for Wccftech, specializing in the semiconductor industry and the complex interplay between technology, manufacturing, and geopolitics. His coverage focuses on the corporate strategies and technological roadmaps of industry giants like TSMC, NVIDIA, Samsung, and Intel. Zuhair's expertise lies in deconstructing complex topics such as fabrication nodes (e.g., 2nm process), the economic impact of policies like the CHIPS Act, and the strategic development of AI infrastructure from NVIDIA, AMD and Intel.
