NVIDIA Beats Everyone To DeepSeek V4 With Day-0 Blackwell Support, Pushing 3,500 Tokens Per Second On 1.6T Models


Apr 26, 2026 at 05:10am EDT

A person stands next to a large NVIDIA data center server rack with multiple GPUs and visible branding.

DeepSeek V4 is out, bringing major optimizations and model sizes of up to 1.6T parameters, and NVIDIA is ready with Day-0 support on Blackwell GPUs using NVFP4.

NVIDIA Blackwell NVFP4 Architecture Delivers Major Speed-Ups In DeepSeek V4 With More Optimizations On The Way

The launch of DeepSeek V4 brings major reductions in compute and memory requirements.

The updated AI model needs just 27% of the single-token inference FLOPs and 10% of the KV cache when running a one-million-token context window. Two new models were also introduced: a Pro model with 1.6T parameters and a Flash model with 284B parameters.

Specification	DeepSeek-V4-Pro	DeepSeek-V4-Flash
Modality	Text	Text
Total parameters	1.6T	284B
Active parameters	49B	13B
Context length	1M tokens	1M tokens
Max output length	Up to 384K tokens (per DeepSeek API docs)	Up to 384K tokens (per DeepSeek API docs)
Primary use cases	Advanced reasoning, coding, long-context agents	High-speed efficiency, chat, routing, summarization
License	MIT	MIT
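To put the table's parameter counts in perspective, here is a rough back-of-the-envelope sketch. It assumes every weight is stored in 4 bits (0.5 bytes) and ignores block scale factors, any layers kept at higher precision, the KV cache, and activations, so the real footprint will differ. The function names are illustrative, not part of any DeepSeek or NVIDIA tooling.

```python
# Back-of-the-envelope figures derived from the specification table above.
# Assumption: 4 bits (0.5 bytes) per weight, ignoring block scale factors,
# higher-precision layers, the KV cache, and activations.

MODELS = {
    "DeepSeek-V4-Pro": {"total_params": 1.6e12, "active_params": 49e9},
    "DeepSeek-V4-Flash": {"total_params": 284e9, "active_params": 13e9},
}

BYTES_PER_FP4_WEIGHT = 0.5  # assumption: pure 4-bit storage


def active_fraction(total_params, active_params):
    """Share of the parameters that is active for a single token."""
    return active_params / total_params


def fp4_weight_gigabytes(total_params):
    """Approximate weight storage in GB (10**9 bytes) at 4 bits per weight."""
    return total_params * BYTES_PER_FP4_WEIGHT / 1e9


def summarize(models=MODELS):
    """Return the active-parameter percentage and FP4 weight size per model."""
    rows = {}
    for name, spec in models.items():
        rows[name] = {
            "active_pct": round(100 * active_fraction(**spec), 2),
            "fp4_weight_gb": round(fp4_weight_gigabytes(spec["total_params"]), 1),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize().items():
        print(f"{name}: {row['active_pct']}% active, ~{row['fp4_weight_gb']} GB of FP4 weights")
```

Under these assumptions, only about 3% (Pro) and 4.6% (Flash) of the parameters are active per token, while the full weights come to roughly 800 GB and 142 GB respectively. That gap between active and total parameters is why memory capacity, rather than per-token compute alone, drives how many GPUs a deployment needs.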

With this launch, NVIDIA is showcasing Day-0 support for DeepSeek V4 on Blackwell GPUs, along with early performance figures. The company states that Blackwell GPUs provide the scale and low-latency performance required to run the 1M-token long-context inference and trillion-parameter models that V4 offers.

From data center deployments on NVIDIA Blackwell to managed NIM microservices and fine-tuning workflows, NVIDIA provides a range of options for integrating DeepSeek and other open models across different stages of development and deployment. NVIDIA is an active contributor to the open-source ecosystem and has released several hundred projects under open-source licenses. The company says it is committed to optimizing community software, and that open models let users broadly share work in AI safety and resilience.

via NVIDIA

In the performance slide, NVIDIA shows throughput of almost 3,500 tokens per second (TPS) per GPU on GB300 (Blackwell Ultra). These are preliminary figures that are expected to rise as the co-designed stack is optimized further. The NVIDIA Blackwell stack offers a range of technologies suited to models such as V4, including NVFP4, Dynamo, optimized CUDA kernels, advanced parallelization techniques, and more.

📊 Day 0 performance is here: DeepSeek-V4-Pro running on NVIDIA Blackwell Ultra.

Using @vllm_project's Day 0 recipe, we’ve captured the initial performance Pareto for DeepSeek’s flagship 1M long-context model. This curve highlights the baseline for balancing AI factory… pic.twitter.com/s6wi1Xvegj
— NVIDIA AI (@NVIDIAAI) April 24, 2026

Key to DeepSeek V4 is its use of FP4 (MXFP4) quantization, which accelerates both rollouts and inference passes. With FP4, DeepSeek V4 models reduce memory traffic and sampling latency.
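To illustrate what block-scaled FP4 means in practice, here is a simplified sketch of MXFP4-style quantization as described in the OCP Microscaling format: each block of 32 values shares one power-of-two scale, and each element is stored as a 4-bit E2M1 number. This is not DeepSeek's or NVIDIA's implementation. The function names are made up for illustration, and ties are rounded toward the smaller magnitude for simplicity instead of using round-to-nearest-even.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX_EXPONENT = 2  # the largest E2M1 value is 6 = 1.5 * 2**2
BLOCK_SIZE = 32        # elements that share one scale in MX formats


def quantize_block(values):
    """Return (shared_exponent, quantized_elements) for one block."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0.0] * len(values)
    # Power-of-two scale chosen so the largest value lands near the top
    # of the E2M1 range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_MAX_EXPONENT
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        scaled = min(abs(v) / scale, E2M1_MAGNITUDES[-1])  # saturate at 6
        # Simplification: on a tie, min() keeps the smaller magnitude.
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - scaled))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Rebuild approximate values from the shared exponent and elements."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


def roundtrip(values, block_size=BLOCK_SIZE):
    """Quantize and dequantize a flat list of floats block by block."""
    restored = []
    for start in range(0, len(values), block_size):
        shared_exp, quantized = quantize_block(values[start:start + block_size])
        restored.extend(dequantize_block(shared_exp, quantized))
    return restored
```

With 32 four-bit elements plus one 8-bit shared scale, a block costs 136 bits, or 4.25 bits per value, compared with 16 bits per value for BF16. That smaller footprint is where the reduction in memory traffic comes from, at the cost of rounding error on values that fall between the representable points.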

One thing worth highlighting is that Huawei's latest Ascend chips, the Ascend 950PR and Ascend 950DT, both planned for 2026, feature MXFP4 instructions. This suggests that DeepSeek V4 will also be compatible with China's domestic AI chips.

With NVIDIA's ongoing optimizations, upcoming models should see robust ecosystem support out of the box.

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for the hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
