NVIDIA Slashes DeepSeek v4 Token Costs By Up To 5x Just One Month After Launch, Through Pure Blackwell Software Tuning

Hassan Mujtaba

NVIDIA's Blackwell GPUs continue to receive major software optimizations, which have now cut token costs by up to 5x on DeepSeek V4 AI models.

NVIDIA's Cost-Per-Token Narrative Gains Ground As DeepSeek V4 Sees Up To 5x Lower Token Costs On Blackwell GPUs With Continued Optimizations

"Cost per token" is the fundamental metric for AI total cost of ownership (TCO), as NVIDIA highlighted a few months ago, and the company is now delivering its lowest-ever token cost on DeepSeek V4.

Today, NVIDIA announced that its full-stack inference software has brought further optimizations to its Blackwell hardware, including GB200 and GB300, improving their performance. With the latest optimizations, NVIDIA's Blackwell platform has reduced token costs by up to 5x on DeepSeek V4, just one month after the model's release.
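To see why a software-only throughput gain maps directly onto token cost, here is a minimal sketch of the arithmetic. The GPU hourly price and the tokens-per-second figures are hypothetical placeholders chosen for illustration, not numbers published by NVIDIA, and the function name is our own.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_second: float) -> float:
    """Return the USD cost of generating one million tokens on one GPU."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: same hardware price, only the software-driven throughput changes.
baseline = cost_per_million_tokens(gpu_hour_usd=3.0, tokens_per_second=1000.0)
tuned = cost_per_million_tokens(gpu_hour_usd=3.0, tokens_per_second=5000.0)

# With the hourly price held constant, cost falls by the same factor throughput rises.
reduction = baseline / tuned

print(f"baseline: ${baseline:.4f} per 1M tokens")
print(f"tuned:    ${tuned:.4f} per 1M tokens")
print(f"reduction: {reduction:.1f}x")
```

Because the hardware cost is fixed in this sketch, a 5x throughput gain yields a 5x lower cost per token; real deployments would also need to account for utilization, power, and networking.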

Leading companies and inference providers have already acknowledged these gains on their NVIDIA Blackwell-powered platforms:

  • Baseten used the NVIDIA TensorRT-LLM open source library to serve DeepSeek V4 Pro on Blackwell GPUs for reasoning, coding and long-context workloads, applying proprietary runtime optimizations to deliver up to 50% more tokens per second.
  • Cognition uses the NVIDIA Dynamo inference framework to manage inference GPUs, giving its team a ready-made path to scale reinforcement learning workloads without needing to build that infrastructure from scratch. 
  • Deep Infra uses the NVIDIA inference software stack to serve frontier open-source models performantly on Blackwell from day zero, including DeepSeek V4. 
  • Together AI used NVIDIA TensorRT-LLM on Blackwell to help Cursor accelerate the path from model optimizations to production endpoints for its real-time coding experience. 

The lower token costs come from turning individual optimizations into system-level performance on NVIDIA GPUs. NVIDIA explains that its inference software stacks achieve these gains by connecting three layers:

  • Production Operation: Coordinates distributed serving, orchestration, autoscaling and memory management so inference can run across the right compute and storage resources.
  • Application Acceleration: Runs models with high performance while giving developers room to tune and customize, using runtime optimizations such as overlapping compute and communication and kernel fusion.
  • Infrastructure Access: Exposes NVIDIA GPU, networking, memory, and system capabilities without requiring developers to manage every device instruction set or data-transfer protocol directly.

These layers come together in complete systems, which compounds the individual optimizations. In addition, NVIDIA's NVLink, NVFP4, Multi-Token-Prediction, and other technologies offer meaningful gains of their own, for a combined throughput increase of up to 20x.
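The idea of compounding can be sketched in a few lines. This assumes the individual speedups are independent and multiply cleanly, which real systems do not guarantee, and the factors below are hypothetical values for illustration only, not NVIDIA's per-technology measurements.

```python
from math import prod


def combined_speedup(factors):
    """Multiply independent per-optimization speedups into one overall factor."""
    factors = list(factors)
    if any(f <= 0 for f in factors):
        raise ValueError("speedup factors must be positive")
    return prod(factors)


# Hypothetical multipliers for three unnamed optimizations (assumed independent).
hypothetical_factors = [1.5, 2.0, 2.0]

# Compounded gain: each optimization scales the result of the previous ones.
total = combined_speedup(hypothetical_factors)

# For contrast: what the gain would be if the improvements merely added up.
additive_guess = 1 + sum(f - 1 for f in hypothetical_factors)

print(f"compounded: {total:.1f}x, additive: {additive_guess:.1f}x")
```

The gap between the two results is why system-level integration matters: stacked multipliers grow much faster than the sum of their parts.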

NVIDIA’s Blackwell GPUs, powered by continuous full-stack inference optimizations, have achieved a reduction of up to 5x in cost per token for DeepSeek V4 just one month after its release, reinforcing cost per token as the key metric for AI total cost of ownership.

Through the integration of production operations, application acceleration, and infrastructure access, along with technologies like NVLink and NVFP4, Blackwell delivers compounded system-level gains, resulting in up to 20x higher throughput. Leading inference providers, including Baseten, Cognition, Deep Infra, and Together AI, are already using these advancements to deliver better performance for reasoning, coding, and large-scale workloads, further solidifying NVIDIA’s position in efficient AI inference.

About the author: A software engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for the hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
