
NVIDIA Shatters MoE AI Performance Records With a Massive 10x Leap on GB200 ‘Blackwell’ NVL72 Servers, Fueled by Co-Design Breakthroughs

Muhammad Zuhair • Dec 3, 2025 at 11:02am EST

An illustration shows interconnected 'GPU' icons forming a neural network next to a server rack, symbolizing AI processing. — Image Credits: NVIDIA

Scaling performance on 'Mixture of Experts' (MoE) AI models is one of the industry's biggest constraints, but NVIDIA appears to have made a breakthrough, which it credits to its co-design approach to performance scaling.

NVIDIA's GB200 NVL72 AI Cluster Manages to Bring In 10x Higher Performance on the MoE-Focused Kimi K2 Thinking LLM

The AI world has been racing to scale up foundational LLMs by ramping up parameter counts and ensuring that their models excel in performance and applications, but with this approach, there's a limit to the compute resources companies can invest in their AI models. This is where 'Mixture of Experts' frontier AI models come into play: for each token, they don't activate all of their parameters, only a subset of 'experts' chosen by a router. While MoEs have become dominant among LLMs, scaling them up introduces a massive computing bottleneck, which NVIDIA says it has overcome.
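To make the "only a subset of parameters per token" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is an illustration under stated assumptions, not NVIDIA's or Kimi K2's implementation: the expert count, the value of k, the toy scalar "experts", and the random gate scores are all hypothetical (a real router learns its scores from the token).

```python
import math
import random

def top_k_route(gate_scores, k):
    """Return the indices of the k highest-scoring experts and their softmax weights."""
    chosen = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    peak = max(gate_scores[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(gate_scores[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]

def moe_forward(token, experts, gate_scores, k):
    """Run only the k selected experts for one token and blend their outputs."""
    chosen, weights = top_k_route(gate_scores, k)
    output = sum(w * experts[i](token) for i, w in zip(chosen, weights))
    return output, chosen

# Hypothetical toy setup: 8 experts, each a simple scalar function, 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

# Stand-in for a learned router: random scores with a fixed seed.
rng = random.Random(0)
gate_scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

output, used = moe_forward(1.0, experts, gate_scores, TOP_K)
active_fraction = len(used) / NUM_EXPERTS  # 2 of 8 experts ran for this token
print(f"experts used: {used}, active fraction: {active_fraction}")
```

The other six experts are never evaluated for this token, which is where the compute saving comes from. The total parameter count still has to live in memory somewhere, and that is the pressure that pushes large MoE models onto many GPUs.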

In a press release, NVIDIA disclosed that the GB200 'Blackwell' NVL72 configuration delivers a 10x performance increase over the Hopper-based HGX H200. The firm ran its tests on the Kimi K2 Thinking MoE model, an open-source LLM with 32 billion activated parameters per forward pass, which is regarded as a standout option in its segment. Team Green claims that the Blackwell architecture is 'poised' to capitalize on the rise of frontier MoE models.

To address the performance bottlenecks involved in scaling MoE AI models, NVIDIA has employed a 'co-design' approach. Expert parallelism constantly splits and scatters token batches across GPUs, so communication volume increases at a non-linear rate. By pairing the 72-GPU GB200 configuration with 30TB of fast shared memory, NVIDIA says it can take expert parallelism to a whole new level despite that overhead. The company describes the remaining optimizations as follows:

Other full-stack optimizations also play a key role in unlocking high inference performance for MoE models. The NVIDIA Dynamo framework orchestrates disaggregated serving by assigning prefill and decode tasks to different GPUs, allowing decode to run with large expert parallelism, while prefill uses parallelism techniques better suited to its workload. The NVFP4 format helps maintain accuracy while further boosting performance and efficiency.
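The communication cost of expert parallelism mentioned above can be sketched with a small simulation. This is a rough sketch under loud assumptions, not a model of NVL72 or Dynamo: experts are placed round-robin across GPUs, each token picks its experts uniformly at random (real routers are learned and often skewed), and the expert count, top-k value, and token count are hypothetical. It counts how many token-to-expert assignments have to leave the GPU that holds the token.

```python
import random

def remote_fraction(num_gpus, num_experts=72, top_k=8, num_tokens=2000, seed=0):
    """Fraction of token-to-expert assignments that land on a GPU other than the token's own."""
    rng = random.Random(seed)
    remote = 0
    for token in range(num_tokens):
        home_gpu = token % num_gpus              # GPU holding this token's activations
        for expert in rng.sample(range(num_experts), top_k):
            expert_gpu = expert % num_gpus       # experts spread round-robin over GPUs
            if expert_gpu != home_gpu:
                remote += 1                      # this activation must cross the interconnect
    return remote / (num_tokens * top_k)

for gpus in (1, 8, 72):
    print(f"{gpus:>2} GPUs: {remote_fraction(gpus):.3f} of assignments are remote")
```

On a single GPU nothing is remote. Under these assumptions, once experts are spread over many GPUs nearly every assignment crosses the interconnect, which is why a wide expert-parallel deployment leans so heavily on fast GPU-to-GPU links.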

This achievement is a significant development for NVIDIA and its partners, especially since the GB200 NVL72 configuration has now reached the point in its supply-chain ramp where many frontier model developers are using these servers to enhance their models' capabilities. MoE models are known for their computational efficiency, which is why they are being deployed across an increasingly wide range of environments, and NVIDIA appears well positioned to capitalize on this trend.

About the author: Muhammad Zuhair is a hardware and technology reporter for Wccftech, specializing in the semiconductor industry and the complex interplay between technology, manufacturing, and geopolitics. His coverage focuses on the corporate strategies and technological roadmaps of industry giants like TSMC, NVIDIA, Samsung, and Intel. Zuhair's expertise lies in deconstructing complex topics such as fabrication nodes (e.g., 2nm process), the economic impact of policies like the CHIPS Act, and the strategic development of AI infrastructure from NVIDIA, AMD and Intel.
