Scaling performance on 'Mixture of Experts' (MoE) AI models is one of the industry's biggest constraints, but NVIDIA appears to have made a breakthrough, which it credits to its co-design approach to performance scaling.
NVIDIA's GB200 NVL72 AI Cluster Manages to Bring In 10x Higher Performance on the MoE-Focused Kimi K2 Thinking LLM
The AI world has been racing to scale up foundational LLMs by ramping up parameter counts and ensuring that their models excel in performance and applications, but with this approach, there's a limit to the compute resources companies can invest in their AI models. This is where 'Mixture of Experts' frontier AI models come into play: for a given query, they don't activate all of their parameters per token, but only a portion of them, depending upon the type of request. While MoEs have become dominant among LLMs, scaling them up introduces a massive computing bottleneck, which NVIDIA says it has overcome.
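To illustrate what activating only a portion of the parameters looks like, here is a minimal sketch of top-k expert routing in plain Python. It is not how Kimi K2 Thinking or any NVIDIA stack implements routing; the function names (`top_k_route`, `moe_forward`), the toy experts, and the router scores are all assumptions made up for this example.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and blend their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    active = [i for i, _ in routes]
    return output, active

# Hypothetical toy setup: four "experts" that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token

y, active = moe_forward(1.0, experts, router_logits, k=2)
print(active, y)  # only two of the four experts were evaluated
```

In this toy case only experts 1 and 3 run for the token; the other two are never evaluated, which is the source of the compute savings MoE models are known for.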
In a press release, NVIDIA has disclosed that its GB200 'Blackwell' NVL72 configuration delivers a 10x performance gain over the Hopper-based HGX H200. The firm tested its computing capabilities on the Kimi K2 Thinking MoE model, an open-source LLM with 32 billion activated parameters per forward pass, which is regarded as a standout option in its segment. Team Green claims that the Blackwell architecture is 'poised' to capitalize on the rise of frontier MoE models.
To address the performance bottlenecks involved in scaling MoE AI models, NVIDIA has employed a 'co-design' approach. The GB200 NVL72 combines 72 chips with 30TB of fast shared memory, which NVIDIA says takes expert parallelism to a whole new level. With expert parallelism, token batches are constantly split and scattered across GPUs, and communication volume increases at a non-linear rate, which is the overhead the 72-chip configuration is meant to handle. NVIDIA describes the other optimizations as follows:
Other full-stack optimizations also play a key role in unlocking high inference performance for MoE models. The NVIDIA Dynamo framework orchestrates disaggregated serving by assigning prefill and decode tasks to different GPUs, allowing decode to run with large expert parallelism, while prefill uses parallelism techniques better suited to its workload. The NVFP4 format helps maintain accuracy while further boosting performance and efficiency.
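To make the communication cost of expert parallelism more concrete, the sketch below counts how many token copies must cross between GPUs when experts are sharded across devices. It is a simplified illustration, not NVIDIA's or Dynamo's implementation; the `dispatch_plan` helper, the two-GPU layout, and the routing table are hypothetical.

```python
from collections import Counter

def dispatch_plan(token_routes, token_home_gpu, experts_per_gpu):
    """Count token copies each (source GPU, destination GPU) pair must exchange.

    Experts are assumed to be sharded contiguously: expert e lives on
    GPU e // experts_per_gpu.
    """
    traffic = Counter()
    for token_id, expert_ids in token_routes.items():
        src = token_home_gpu[token_id]
        for expert_id in expert_ids:
            dst = expert_id // experts_per_gpu
            if dst != src:  # only cross-GPU sends cost interconnect bandwidth
                traffic[(src, dst)] += 1
    return traffic

# Hypothetical layout: 4 experts over 2 GPUs, each token routed to 2 experts.
token_routes = {0: [0, 3], 1: [2, 3], 2: [0, 1]}
token_home_gpu = {0: 0, 1: 0, 2: 1}

traffic = dispatch_plan(token_routes, token_home_gpu, experts_per_gpu=2)
total_sends = sum(traffic.values())
print(dict(traffic), total_sends)
```

Even in this tiny example, five of the six token-to-expert assignments land on a remote GPU, which hints at why a large, fast interconnect domain matters as the number of experts and GPUs grows.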
This achievement is a significant development for NVIDIA and its partners, especially since the GB200 NVL72 configuration has now reached the point in its supply chain ramp where AI servers based on it are being used to run many frontier models. MoE models are known for their computational efficiency, which is why they are being deployed across an increasingly wide range of environments, and NVIDIA appears well positioned to capitalize on this trend.
