DeepSeek V4 is out, bringing major optimizations and model sizes of up to 1.6T parameters, and NVIDIA is ready with Day-0 support on Blackwell GPUs using NVFP4.
NVIDIA Blackwell NVFP4 Architecture Delivers Major Speed-Ups In DeepSeek V4 With More Optimizations On The Way
The launch of DeepSeek V4 brings major reductions in compute and memory requirements.
When running a one-million-token context window, the updated AI model needs just 27% of the single-token inference FLOPs and 10% of the KV cache compared with its predecessor. Two new models were introduced: a Pro model with 1.6T parameters and a Flash model with 284B parameters.
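To put the percentages in more familiar terms, the short sketch below converts a "fraction of baseline" figure into a reduction factor. The function name is hypothetical and only the two percentages come from the article. It says nothing about absolute FLOP counts or cache sizes, which are not given here.

```python
def reduction_factor(fraction_of_baseline: float) -> float:
    """Convert 'uses X of the baseline cost' into an 'N times less' factor."""
    if not 0.0 < fraction_of_baseline <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction_of_baseline


# Fractions quoted for the one-million-token context setting.
flops_factor = reduction_factor(0.27)     # single-token inference FLOPs
kv_cache_factor = reduction_factor(0.10)  # KV cache size

print(f"FLOPs: about {flops_factor:.1f}x fewer per token")
print(f"KV cache: about {kv_cache_factor:.0f}x smaller")
```

In other words, 27% of the FLOPs is roughly a 3.7x reduction, and 10% of the KV cache is a 10x reduction.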
| Specification | DeepSeek-V4-Pro | DeepSeek-V4-Flash |
| --- | --- | --- |
| Modality | Text | Text |
| Total parameters | 1.6T | 284B |
| Active parameters | 49B | 13B |
| Context length | 1M tokens | 1M tokens |
| Max output length | Up to 384K tokens (per DeepSeek API docs) | Up to 384K tokens (per DeepSeek API docs) |
| Primary use cases | Advanced reasoning, coding, long-context agents | High-speed efficiency, chat, routing, summarization |
| License | MIT | MIT |
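The table's parameter counts can be turned into two rough figures: the share of parameters active per token, and a back-of-the-envelope weight footprint. This is a minimal sketch, not a sizing guide. The 4-bit-per-parameter default is an assumption made for illustration (the article does not say every weight is stored in 4 bits), and the helper names are hypothetical. Scale-factor overhead, activations, and the KV cache are ignored.

```python
# Parameter counts taken from the specification table.
MODELS = {
    "DeepSeek-V4-Pro": {"total": 1.6e12, "active": 49e9},
    "DeepSeek-V4-Flash": {"total": 284e9, "active": 13e9},
}


def active_fraction(total: float, active: float) -> float:
    """Share of parameters used for each token."""
    return active / total


def weight_footprint_gb(total_params: float, bits_per_param: float = 4.0) -> float:
    """Approximate weight storage in GB (10**9 bytes).

    Assumes every parameter is stored at bits_per_param bits,
    which is an illustrative simplification.
    """
    return total_params * bits_per_param / 8 / 1e9


for name, p in MODELS.items():
    frac = active_fraction(p["total"], p["active"])
    size = weight_footprint_gb(p["total"])
    print(f"{name}: {frac:.1%} active, ~{size:.0f} GB of weights at 4 bits")
```

Under these assumptions, roughly 3% of the Pro model's parameters and 4.6% of the Flash model's parameters are active per token.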
With this launch, NVIDIA is showcasing Day-0 support for DeepSeek V4 and the performance of its Blackwell GPUs on the new models. The company states that Blackwell GPUs provide the scale and low-latency performance required for the 1M-token long-context inference and trillion-parameter models that V4 offers.
From data center deployments on NVIDIA Blackwell to managed NIM microservices and fine-tuning workflows, NVIDIA provides a range of options for integrating DeepSeek and other open models across different stages of development and deployment. NVIDIA is an active contributor to the open-source ecosystem and has released several hundred projects under open-source licenses. The company says it is committed to optimizing community software, and that open models let users broadly share work in AI safety and resilience.
In its performance slide, NVIDIA demonstrates throughput of almost 3,500 tokens per second (TPS) per GPU on GB300 (Blackwell Ultra). These are preliminary figures that are expected to rise as further optimizations are made to the co-design stack. The NVIDIA Blackwell stack offers a range of technologies suited to models such as V4, including NVFP4, Dynamo, optimized CUDA kernels, advanced parallelization techniques, and more.
A key element of DeepSeek V4 is its use of FP4 (MXFP4) quantization, which accelerates both rollouts and inference passes. With FP4, DeepSeek V4 models reduce memory traffic and sampling latency.
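For readers unfamiliar with the format, here is a minimal pure-Python sketch of MXFP4-style block quantization as described in the OCP Microscaling specification: 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 float. It illustrates the number format only. It is not DeepSeek's or NVIDIA's implementation, the function names are hypothetical, and the tie-breaking rule is simplified.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # elements sharing one scale in MXFP4
E2M1_EMAX = 2    # exponent of the largest E2M1 value (6 = 1.5 * 2**2)


def quantize_block(block):
    """Return (shared_exponent, quantized_elements) for one block of floats."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale that places the largest value near the top of the E2M1 range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in block:
        scaled = min(abs(v) / scale, E2M1_VALUES[-1])  # saturate at 6
        # Simplified rounding: nearest value, ties go to the smaller magnitude.
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - scaled))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Rebuild approximate floats from the shared exponent and E2M1 values."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


weights = [0.1 * i for i in range(-16, 16)]  # one block of 32 example values
exp, q = quantize_block(weights)
restored = dequantize_block(exp, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"shared exponent: {exp}, max abs error: {max_err:.3f}")
```

With 32 four-bit elements plus one 8-bit shared scale, storage works out to 4.25 bits per value, which is where the reduction in memory traffic comes from.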
It is also worth noting that Huawei's latest Ascend chips, the Ascend 950PR and Ascend 950DT, both planned for 2026, feature MXFP4 instructions. This suggests that DeepSeek V4 should also run well on China's domestic AI chips.
With NVIDIA's ongoing optimizations, upcoming models should see robust ecosystem support out of the box.
