NVIDIA Blackwell Sweeps Every MLPerf 6.0 Benchmark With No Competition In Sight, While GB300 Systems Run Up to 60% Faster Than GB200

Hassan Mujtaba

The latest MLPerf Training 6.0 benchmarks are in, and NVIDIA has once again secured performance records with its Blackwell GPUs.

Blackwell GPUs Make Competition Go Into Hiding at MLPerf 6.0 As NVIDIA Tops Benchmark Charts

MLCommons has shared the latest MLPerf Training v6.0 benchmark results. This round adds two new MoE tests covering large-scale and entry-level AI deployments: DeepSeek V3 (671B) and GPT-OSS 20B (21B). As an open-source, peer-reviewed benchmark suite, MLPerf lets every vendor submit results for its latest and greatest hardware. NVIDIA has been dominating the suite for a while, and that trend continues.

While NVIDIA is getting ready to launch its AI-Supercharged Vera Rubin platform in the coming months, the current-generation Blackwell architectures, especially GB300 NVL72 systems, are showcasing immense potential with no competition in sight. In the latest results, NVIDIA shows:

  • Fastest time to train on every benchmark
  • Largest-scale training across 8,192 GPUs using NVIDIA Blackwell NVL72 systems
  • The only platform with submissions across all seven benchmarks in the suite

Coming to the benchmark results, NVIDIA was the fastest at each one of them and was also the only one to submit results across all benchmarks in MLPerf 6.0.

Model                     NVIDIA Blackwell NVL72    Nearest Alternative
DeepSeek-v3 671B (New)    2.02 mins                 No submission
GPT-OSS 20B (New)         7.43 mins                 No submission
Llama 3.1 405B            7.07 mins                 No submission
Llama 2 70B LoRA          0.40 mins                 8.27 mins
Llama 3.1 8B              4.46 mins                 58.63 mins
FLUX.1                    17.1 mins                 74.44 mins
DLRM-dcnv2                0.67 mins                 No submission

For reference, NVIDIA's Blackwell platforms achieved stellar speeds. On Llama 3.1 8B, NVIDIA's fastest submission finished in 4.46 minutes, while the nearest alternative took 58.63 minutes, a 13.1x gap in time to train. And for the newest benchmarks, the competition didn't submit any results at all.
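
The 13.1x figure is simply the ratio of the two times to train. As a minimal sketch, the same arithmetic can be applied to every row of the results table; the dictionary layout and the `time_ratio` helper are illustrative names, not part of MLPerf or any NVIDIA tooling.

```python
# Times to train in minutes, copied from the results table above.
# None marks "No submission" for the nearest alternative.
results = {
    "DeepSeek-v3 671B": (2.02, None),
    "GPT-OSS 20B": (7.43, None),
    "Llama 3.1 405B": (7.07, None),
    "Llama 2 70B LoRA": (0.40, 8.27),
    "Llama 3.1 8B": (4.46, 58.63),
    "FLUX.1": (17.1, 74.44),
    "DLRM-dcnv2": (0.67, None),
}


def time_ratio(nvidia_mins, alt_mins):
    """Return how many times longer the alternative took, or None."""
    if alt_mins is None:
        return None
    return alt_mins / nvidia_mins


ratios = {model: time_ratio(n, a) for model, (n, a) in results.items()}

for model, ratio in ratios.items():
    label = "no submission" if ratio is None else f"{ratio:.1f}x"
    print(f"{model:18s} {label}")
```

These ratios compare each vendor's fastest entry regardless of system size. In the per-benchmark charts further down, the 4.46-minute run is listed as GB200 (1024) and the 58.63-minute run as MI350X (16), so the ratio measures time to train, not per-accelerator performance.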

Meanwhile, NVIDIA continues to uplift the performance of its existing architectures through further optimizations. Blackwell GB200 is already much faster than it was at launch, but the GB300 systems are up to 60% faster in the same NVL72 configuration thanks to their higher AI compute density with NVFP4.

The Blackwell architecture also scaled to deliver the largest cluster in MLPerf Training, comprising 8,192 GPUs running within Microsoft Azure on Llama 3.1 405B. The system reached the quality target in 7.07 minutes, the fastest time to train on this benchmark.

  • Microsoft Azure scaled Llama 3.1 405B training to 8,192 GPUs using GB200 NVL72 systems, and reached the reference quality target in 7.07 minutes, the fastest time to train for this benchmark.
  • CoreWeave delivered the fastest time to train for DeepSeek-V3 671B, reaching the quality target in 2.02 minutes at 8,192-GPU scale using GB300 NVL72 systems connected with Spectrum-X Ethernet networking. 

And lastly, we wanted to share the full results comparing NVIDIA Blackwell GPUs against AMD's Instinct accelerators, from the MI300X up to the latest MI355X.

MLPerf Training 6.0 DeepSeek v3 671B
Time to train (in minutes), GPU count in parentheses

System                     Time (mins)
GB300 (8192)               2.021
GB300 (4096)               3.092
GB200 (8192)               3.340
GB200 (4096)               4.384
GB300 (2048)               5.535
GB200 (2048)               7.844
GB300 (512) GB200 (512)    17.517
GB300 (256)                27.612
GB200 (256)                33.430

In DeepSeek v3 671b, NVIDIA is the single dominating force, with the competition not even submitting a single benchmark result.
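
The chart also shows how time to train falls as the cluster grows. The sketch below compares the GB300 speedup against ideal linear scaling from the 256-GPU run. It assumes the charted figures are thousandths of a minute (2021 read as 2.021 minutes, which matches the 2.02-minute headline result), and it skips the 512-GPU entry because the chart shows a single value for both GB300 and GB200 there. The `scaling_efficiency` helper is a hypothetical name used only for illustration.

```python
# GB300 time to train in minutes by GPU count, read from the chart above.
# Assumption: charted values are thousandths of a minute (2021 -> 2.021).
# The 512-GPU entry is omitted because its label is ambiguous in the chart.
gb300_mins = {256: 27.612, 2048: 5.535, 4096: 3.092, 8192: 2.021}


def scaling_efficiency(times, base_gpus):
    """Map GPU count -> (speedup vs. base run, fraction of linear scaling)."""
    base_time = times[base_gpus]
    out = {}
    for gpus, mins in sorted(times.items()):
        speedup = base_time / mins   # how much faster than the base run
        ideal = gpus / base_gpus     # speedup under perfect linear scaling
        out[gpus] = (speedup, speedup / ideal)
    return out


eff = scaling_efficiency(gb300_mins, base_gpus=256)

for gpus, (speedup, fraction) in eff.items():
    print(f"{gpus:5d} GPUs: {speedup:5.2f}x speedup, {fraction:.0%} of linear")
```

Going from 256 to 8,192 GPUs is 32x the hardware for roughly a 13.7x shorter run. MLPerf submissions at different scales typically use different batch sizes and tuning, so this is a rough indicator, not a pure hardware scaling measurement.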

MLPerf Training 6.0 Flux1
Time to train (in minutes), accelerator count in parentheses

System          Time (mins)
GB300 (512)     17.11
GB300 (72)      36.53
GB300 (32)      65.97
MI300X (512)    74.43
MI320X (64)     92.36

In Flux1, 32 NVIDIA GB300 GPUs end up faster than 512 MI300X and 64 MI320X accelerators. No submission for the newer MI350 series was made.

MLPerf Training 6.0 Llama 2 70B LoRA
Time to train (in minutes), accelerator count in parentheses

System         Time (mins)
GB300 (512)    0.400
GB300 (72)     1.166
GB300 (64)     1.263
GB300 (32)     2.470
GB200 (32)     2.851
GB300 (16)     4.508
GB200 (16)     5.345
GB300 (8)      5.613
GB200 (8)      7.856
MI355X (8)     8.271
MI350X (16)    8.522
MI350X (8)     10.093
GB300 (4)      19.301
MI300X (8)     28.648

In Llama 2 70b, NVIDIA's GB300 and GB200 8-accelerator systems outpace the competition.

MLPerf Training 6.0 Llama 3.1 8B
Time to train (in minutes), accelerator count in parentheses

System          Time (mins)
GB200 (1024)    4.459
GB300 (512)     4.636
GB300 (72)      11.586
GB300 (64)      12.447
GB200 (64)      16.536
GB300 (32)      20.200
GB300 (16)      33.391
GB200 (32)      39.014
GB200 (16)      49.047
MI350X (16)     58.629
GB300 (8)       63.516
GB200 (8)       82.213
MI355X (8)      86.845
MI350X (8)      108.965
GB300 (4)       123.732
MI325X (8)      238.073

Lastly, we have Llama 3.1 8b, where NVIDIA continues to offer more performance at the same number of accelerators, and pushes things beyond that with scale-up configurations.
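
To put the "same number of accelerators" comparison in numbers, here is a minimal sketch that normalizes the 8-accelerator Llama 3.1 8B results to the GB300 time. The values are copied from the chart above; the `relative_to` helper is an illustrative name, not part of any benchmark tooling.

```python
# Llama 3.1 8B time to train in minutes for 8-accelerator systems,
# copied from the chart above.
eight_way = {
    "GB300": 63.516,
    "GB200": 82.213,
    "MI355X": 86.845,
    "MI350X": 108.965,
    "MI325X": 238.073,
}


def relative_to(times, baseline):
    """Express every time as a multiple of the baseline system's time."""
    base = times[baseline]
    return {name: mins / base for name, mins in times.items()}


rel = relative_to(eight_way, "GB300")

# Print from fastest to slowest.
for name, factor in sorted(rel.items(), key=lambda kv: kv[1]):
    print(f"{name:7s} {factor:.2f}x the GB300 time")
```

At this size the MI355X run takes about 1.37x as long as the GB300 run. The entries come from different submitters with their own software stacks, so the gap reflects whole systems and not only the accelerators.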

Whether at massive scale or modest configurations, NVIDIA consistently outperformed the competition, often delivering results that rivals couldn’t even submit. With continued software optimizations and the upcoming Vera Rubin platform on the horizon, NVIDIA’s leadership in AI training remains stronger than ever.

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for the hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
