CoreWeave Demonstrates 6X GPU Throughput With NVIDIA GB300 NVL72 Vs H100 In DeepSeek R1

Aug 26, 2025 at 03:35pm EDT

NVIDIA's latest Blackwell AI superchip easily outperforms the previous-gen H100 GPU, delivering significantly higher throughput because the model can run at a lower degree of tensor parallelism.

NVIDIA GB300's Superior Memory and Bandwidth Reduce Parallelism Overhead, Resulting in Significantly Higher Throughput Gains Over H100

NVIDIA's Blackwell-powered AI superchips bring drastic advantages over previous-gen GPUs such as the H100. The GB300 is NVIDIA's best offering yet, delivering large generational uplifts in compute along with much higher memory capacity and bandwidth, both of which are crucial in heavy AI workloads. A new benchmark from CoreWeave makes this clear: NVIDIA's latest platform delivers significantly higher throughput while running at a lower degree of tensor parallelism.


CoreWeave tested both platforms on the DeepSeek R1 reasoning model, a fairly complex model, and the key difference lay in their starkly different configurations. The H100 setup needed a cluster of 16 NVIDIA H100 GPUs to run DeepSeek R1, whereas the NVIDIA GB300 NVL72 infrastructure needed only 4 GB300 GPUs to do the same job. Despite using one-quarter of the GPUs, the GB300-based system delivered 6X higher raw throughput per GPU, showcasing its large advantage over the H100 in complex AI workloads.
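To put the per-GPU figure in system terms, here is a quick back-of-the-envelope calculation. It is a sketch, not CoreWeave's methodology: only the GPU counts (16 vs 4) and the 6X per-GPU ratio come from the benchmark, the H100's per-GPU throughput is normalized to 1.0, and aggregate throughput is assumed to be simply GPU count times per-GPU throughput.

```python
# Normalized comparison of the two DeepSeek R1 configurations.
# From the article: 16 H100 GPUs vs 4 GB300 GPUs, ~6x per-GPU throughput.
# Assumption: H100 per-GPU throughput is normalized to 1.0 (not a measured
# value), and system throughput = GPU count x per-GPU throughput.

H100_GPUS = 16
GB300_GPUS = 4
H100_PER_GPU = 1.0                    # normalized baseline
GB300_PER_GPU = 6.0 * H100_PER_GPU    # "6X higher raw throughput per GPU"


def aggregate(gpus: int, per_gpu: float) -> float:
    """Total system throughput under a simple linear-scaling assumption."""
    return gpus * per_gpu


h100_total = aggregate(H100_GPUS, H100_PER_GPU)
gb300_total = aggregate(GB300_GPUS, GB300_PER_GPU)

print(f"H100 system total:  {h100_total:.1f}")
print(f"GB300 system total: {gb300_total:.1f}")
print(f"Aggregate ratio:    {gb300_total / h100_total:.2f}x "
      f"with {GB300_GPUS / H100_GPUS:.0%} of the GPUs")
```

Under those assumptions, the four-GPU GB300 setup would push about 1.5X the total throughput of the sixteen-GPU H100 cluster.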

The GB300's advantage over the H100 system comes largely from running the same model with just 4-way tensor parallelism. With fewer splits, there is less inter-GPU communication, and the higher memory capacity and bandwidth also played a crucial role in the large performance uplift. With this architectural leap, the GB300 NVL72 platform looks solid, helped by its high-bandwidth NVLink and NVSwitch interconnects, which let the GPUs exchange data at incredible speeds.
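Why 16 H100s but only 4 GB300s? A rough memory-fit estimate happens to land on the same split. This is a hypothetical sketch, not how CoreWeave sized its deployment: it assumes DeepSeek R1's roughly 671B parameters are stored in FP8 (1 byte each), 80 GB of HBM per H100 and 288 GB per GB300, 30% of HBM held back for KV cache and activations, and weights spread evenly across a power-of-two number of GPUs. `min_tp_degree` and the `HEADROOM` value are illustrative names and numbers.

```python
# Back-of-the-envelope memory fit: smallest power-of-two tensor-parallel (TP)
# degree whose combined HBM can hold DeepSeek R1's weights.
# Assumptions (not from the CoreWeave benchmark): ~671B parameters in FP8
# (1 byte each), 80 GB HBM per H100, 288 GB HBM per GB300, and 30% of HBM
# reserved for KV cache and activations. Real deployments shard differently.

PARAMS_BILLIONS = 671
BYTES_PER_PARAM = 1      # FP8
HEADROOM = 0.30          # hypothetical fraction of HBM kept free

HBM_GB = {"H100": 80, "GB300": 288}


def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight footprint in decimal GB (1e9 params x bytes / 1e9 bytes per GB)."""
    return params_billions * bytes_per_param


def min_tp_degree(hbm_per_gpu_gb: float, model_gb: float, headroom: float) -> int:
    """Smallest power-of-two GPU count whose usable HBM covers the weights."""
    usable_per_gpu = hbm_per_gpu_gb * (1.0 - headroom)
    tp = 1
    while tp * usable_per_gpu < model_gb:
        tp *= 2
    return tp


model_gb = weights_gb(PARAMS_BILLIONS, BYTES_PER_PARAM)
for gpu, hbm in HBM_GB.items():
    tp = min_tp_degree(hbm, model_gb, HEADROOM)
    print(f"{gpu}: ~{model_gb:.0f} GB of weights -> TP{tp} ({tp * hbm} GB HBM total)")
```

Even with zero headroom the answer doesn't change: 8 H100s offer only 640 GB, short of the ~671 GB of weights, and 2 GB300s offer 576 GB, so the next power-of-two step is TP16 and TP4 respectively.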

For customers, this means faster token generation and lower latency, along with more efficient scaling of enterprise AI workloads. CoreWeave highlights the specifications of the NVIDIA GB300 NVL72 rack-scale system, which offers a huge 37 TB of memory (the GB300 NVL72 supports up to 40 TB) for running large and complex AI models, along with blazing-fast NVLink interconnects that deliver 130 TB/s of bandwidth.

All in all, the NVIDIA GB300 isn't just about raw TFLOPs but also about efficiency. Reducing tensor parallelism lets the GB300 minimize GPU-to-GPU communication overhead, which often bottlenecks large-scale AI training and inference. With the GB300, enterprises can achieve much higher throughput with fewer GPUs, which not only reduces overall costs but also helps them scale efficiently.
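The communication argument can be illustrated with a toy cost model. It assumes Megatron-style tensor parallelism (two all-reduces per transformer layer in the forward pass) with a ring all-reduce, and every constant (layer compute time, activation size, link bandwidth, latency) is a hypothetical placeholder rather than an H100 or GB300 measurement. DeepSeek R1 is a mixture-of-experts model, so real deployments also mix in other parallelism schemes; the sketch only shows the trend.

```python
# Toy alpha-beta cost model for one tensor-parallel (TP) transformer layer.
# Assumptions: Megatron-style TP with two all-reduces per layer (forward pass)
# and a ring all-reduce. All constants below are hypothetical placeholders,
# not H100 or GB300 measurements.

def ring_allreduce_time(n_gpus: int, msg_bytes: float,
                        bw_bytes_per_s: float, latency_s: float) -> float:
    """Ring all-reduce: 2(N-1) steps, each moving msg/N bytes per GPU."""
    if n_gpus == 1:
        return 0.0  # nothing to exchange on a single GPU
    steps = 2 * (n_gpus - 1)
    return steps * (latency_s + (msg_bytes / n_gpus) / bw_bytes_per_s)


def comm_fraction(n_gpus: int, layer_compute_s: float, msg_bytes: float,
                  bw_bytes_per_s: float, latency_s: float) -> float:
    """Share of layer time spent communicating when compute is split N ways."""
    compute = layer_compute_s / n_gpus  # per-GPU compute shrinks as 1/N
    comm = 2 * ring_allreduce_time(n_gpus, msg_bytes, bw_bytes_per_s, latency_s)
    return comm / (compute + comm)


# Hypothetical inputs: 1 ms of single-GPU layer compute, 4 MB of activations
# per all-reduce, 400 GB/s effective link bandwidth, 5 us latency per step.
LAYER_COMPUTE_S = 1e-3
MSG_BYTES = 4e6
BW_BYTES_PER_S = 400e9
LATENCY_S = 5e-6

fractions = {}
for tp in (1, 2, 4, 8, 16):
    fractions[tp] = comm_fraction(tp, LAYER_COMPUTE_S, MSG_BYTES,
                                  BW_BYTES_PER_S, LATENCY_S)
    print(f"TP{tp:>2}: {fractions[tp]:.0%} of layer time spent communicating")
```

As the TP degree grows, each GPU's share of the compute shrinks as 1/N while its all-reduce traffic stays roughly constant and the number of ring steps grows, so communication eats a rising share of each layer. On H100, a 16-way split also spans more than one standard 8-GPU server, which would make real communication costs worse than this single-bandwidth model suggests.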

News Source: CoreWeave

About the author: Sarfraz Khan is a hardware reporter with a focus on PC components and the builder community. With years of experience writing about PC hardware and laptops, his work has been featured in several reputable technology publications. Sarfraz's hands-on experience is demonstrated through his first-person accounts of using and comparing different hardware configurations, providing practical and relatable insights for everyday users. His technical analysis is respected by peers in the enthusiast community and has been cited by specialized hardware sites such as Germany's Igor's Lab.

Follow Wccftech on Google to get more of our news coverage in your feeds.