NVIDIA has secured wins across all MLPerf training tests, showcasing the leading AI training performance of its Blackwell Ultra-based GB300 NVL72 platform.
NVIDIA Showcases its GB300 NVL72 "Blackwell Ultra" Results in MLPerf AI Training Tests; Up To Five Times the Performance vs Hopper-Based Platform
When it comes to delivering leading AI performance, NVIDIA GPUs have consistently been at the forefront. The Blackwell-based data center GPUs have already showcased their potential several times, and the latest GB300 NVL72 platform is no exception.
Today, NVIDIA has proudly announced that its Blackwell Ultra-powered AI GPUs have secured the first position in every MLPerf training benchmark, proving that its GB300 NVL72 rack-scale system is still the best possible choice for intensive AI workloads.
In its blog post, NVIDIA claims that it is the only player to have submitted results on every MLPerf test and that it has widened the performance gap between itself and its rivals. The graph it shared shows that NVIDIA's GB200 and GB300 platforms have scored numerous MLPerf Training and Inference wins this year. The most recent time-to-train results are these:
- Llama 3.1 405B: 10 min
- Llama 2 70B LoRA: 0.4 min
- Llama 3.1 8B: 5.2 min
- FLUX.1: 12.5 min
- DLRM-dcnv2: 0.71 min
- R-GAT: 1.1 min
- RetinaNet: 1.4 min
The benchmark results show that, at the same GPU count, Blackwell Ultra GPUs significantly outperform Hopper-based GPUs. In Llama 3.1 405B pretraining, the GB300 GPUs deliver over 4X the performance of the H100 and nearly 2X that of the Blackwell GB200. Similarly, in Llama 2 70B fine-tuning, 8 GB300 GPUs delivered 5X the performance vs H100.
NVIDIA also touted its CUDA ecosystem, which gives it considerable leverage over its competitors. The CUDA software stack plays a major role, but the rack system itself, paired with Quantum-X800 InfiniBand networking at 800 Gb/s, is also unmatched. The GB300 NVL72 brings 279 GB of HBM3e memory per GPU and 40 TB of total capacity with GPU and CPU memory combined. Such a large memory configuration speeds up AI workloads, but using FP4 precision for training is also key to the performance.
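To put the memory figures in perspective, here is a quick back-of-the-envelope calculation. It assumes 72 GPUs per rack (inferred from the "NVL72" name) and decimal units (1 TB = 1,000 GB); the split between GPU and CPU memory is derived from the quoted numbers, not stated by NVIDIA in this article.

```python
# Rough check of the memory figures quoted above.
GPUS_PER_RACK = 72       # assumption: inferred from the "NVL72" name
HBM_PER_GPU_GB = 279     # per-GPU HBM3e figure quoted in the article
TOTAL_RACK_TB = 40       # combined GPU + CPU memory quoted in the article

# Total HBM3e across the rack, in decimal terabytes.
hbm_total_tb = GPUS_PER_RACK * HBM_PER_GPU_GB / 1000

# Whatever remains of the 40 TB would have to be CPU-attached memory.
cpu_side_tb = TOTAL_RACK_TB - hbm_total_tb

print(f"HBM3e across the rack: {hbm_total_tb:.1f} TB")
print(f"Implied CPU-attached memory: {cpu_side_tb:.1f} TB")
```

Under these assumptions, roughly half of the 40 TB is GPU memory, with the rest attached to the CPUs.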
NVIDIA says that it has adopted FP4 precision for LLM training at every layer, which doubles the speed of calculations compared to FP8. Blackwell Ultra further boosts that to 3X, which is why NVIDIA was able to beat its competitors and deliver drastically superior performance without increasing the GPU count. The new result, which follows NVIDIA's June submission, was achieved using 5,120 Blackwell GB200 GPUs, which took only 10 minutes to train the 405B-parameter Llama 3.1 model.
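For readers wondering what a 4-bit floating-point value looks like in practice, the sketch below rounds a small block of numbers onto the E2M1 grid (the eight magnitudes a 4-bit float with 2 exponent bits and 1 mantissa bit can represent, plus a sign bit). The per-block scaling by the largest magnitude and the function name `quantize_block_fp4` are illustrative assumptions; the article does not describe NVIDIA's actual training recipe, and this is not it.

```python
# Illustrative only: round values onto a 4-bit E2M1 grid with one scale per block.
# The scaling scheme here is a simplifying assumption, not NVIDIA's method.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # 8 magnitudes x sign = 16 codes


def quantize_block_fp4(block):
    """Return (scale, dequantized values) for one block of floats."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    scale = peak / E2M1_MAGNITUDES[-1]
    dequantized = []
    for x in block:
        target = abs(x) / scale
        # Pick the closest representable magnitude on the grid.
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        dequantized.append((nearest if x >= 0 else -nearest) * scale)
    return scale, dequantized


weights = [0.02, -0.75, 0.31, 1.2, -0.04, 0.6, -1.2, 0.9]  # made-up sample values
scale, approx = quantize_block_fp4(weights)

for original, rounded in zip(weights, approx):
    print(f"{original:+.2f} -> {rounded:+.2f}")
```

Each value needs only 4 bits instead of the 8 used by FP8, so half as much data moves per value, at the cost of the coarser rounding visible in the output.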
Update: The Llama 3.1 405B benchmark was conducted using GB200 NVL72 and not GB300 NVL72.
News Source: NVIDIA
