NVIDIA Delivers Day-1 Support For DeepMind’s DiffusionGemma Open Model Across RTX & DGX Platforms, 150 Tokens/s With DGX Spark

Hassan Mujtaba
NVIDIA Delivers Day-1 Support For DeepMind's DiffusionGemma Open Model Across RTX & DGX Platforms, 150 Tokens/s With DGX Spark

NVIDIA's entire RTX/DGX lineups are getting full support for Google DeepMind's DiffusionGemma Open AI model.

Google Intros Its Newest Open AI Model: DiffusionGemma - NVIDIA Offers Full Support Across Its DGX & RTX Families

The DiffusionGemma model is an open model designed to offer speedy text generation, and with its launch, NVIDIA is announcing support across its RTX and DGX lineups. What's even better is that while DiffusionGemma is fast, NVIDIA's optimizations for the model and its hardware make it even faster.

Related Story Intel’s Project Firefly Takes The Best of Phones To Build Laptops In A Slim 12.9mm Metal Chassis With No Vents

The following are the main highlights of the model:

  • Parallel generation: DiffusionGemma denoises up to 256 tokens per step instead of predicting one at a time. 
  • Built on Gemma 4: DiffusionGemma is built on Gemma 4, a 26-billion-parameter mixture-of-experts model that activates just 3.8 billion parameters per step, pairing a diffusion head with Google’s Gemma 4 architecture. 
  • Up to 4x faster performance: The boost means fast text generation, where single-user generation usually stalls — on local hardware. 
  • Open and local: DiffusionGemma is open-weight under a permissive Apache 2.0 license and runs entirely on RTX and DGX Spark — no cloud, no per-token cost — with day-zero support in Hugging Face Transformers, vLLM and Unsloth.
Model name DiffusionGemma 
Supported modalities Text, image 
Total parameters 25.2B 
Active parameters 3.8B  
Context length Up to 256K tokens 
Precision format BF16, NVFP4 

On NVIDIA's side, they are offering day-1 support across GeForce RTX GPUs, RTX PRO Platforms, and DGX systems ranging from Spark Mini PCs to workstations powered by their datacenter-grade chips. NVIDIA is utilizing its tensor core architecture and the CUDA software stack, offering robust support that requires no additional tuning.

NVIDIA has shared some stats too. The company states that its H100 Tensor Core GPUs on DGX Stations offer 1000 tokens/s (single GPU), DGX Spark systems offer 150 tokens/s, and DGX Station offers the fastest in-class local inference. The solutions offer roughly 4 times faster performance than an equivalent autoregressive model.

  • Locally on the NVIDIA DGX Spark deskside personal AI supercomputer — powered by the NVIDIA GB10 Grace Blackwell Superchip with 128GB of unified memory — with the preinstalled NVIDIA AI software stack ready for prototyping, fine-tuning and fully local agent workflows. 
  • On NVIDIA RTX PRO 6000 workstations, providing developers, researchers, and AI professionals with the headroom to run local low-latency generation and agentic loops as part of a professional workflow. 
  • On DGX Station, delivering best-in-class, high-speed inference at up to 800 tokens/sec for low-latency text generation and agentic loops with 748GB of coherent memory. 
  • On GeForce RTX GPUs, with llama.cpp support coming soon.
PlatformBest ForKey highlightsGetting started
NVIDIA DGX SparkPersonal AI supercomputer for local AI development, autonomous agents, AI research, and prototypingNVIDIA GB10 Grace Blackwell Superchip, 128 GB unified memory, 1 PFLOP of FP4 AI compute, and a preinstalled NVIDIA AI software stack for fully local OpenClaw workflowsDGX Spark playbooks for vLLM and Unsloth; deployment guides; NVIDIA NeMo Automodel fine-tuning guide; vLLM on DGX Spark guide
NVIDIA DGX StationDeskside AI supercomputer for building, running, and scaling AI workloadsNVIDIA GB300 Grace Blackwell Ultra Superchip, NVIDIA AI software stack, 748 GB coherent memory, up to 20 PFLOPS of FP4 compute, and support for models up to 1T parameters. Frontier AI development, inference, and agents at your desk.DGX Station playbooksvLLM on DGX Station guide
NVIDIA RTX + NVIDIA RTX PRODesktop AI apps, Windows development, and local inferenceOptimized local inference performance across desktop and workstation environments for creators and professionalsNVIDIA GB300 Grace Blackwell Ultra Superchip, NVIDIA AI software stack, 748 GB of coherent memory, up to 20 PFLOPS of FP4 compute, and support for models up to 1T parameters. Frontier AI development, inference, and agents at your desk.

Users who want to try out the DiffusionGemma model out of the box can do so right now on an RTX 5090 or DGX Spark system. NVIDIA offers a full-stack and ready-to-use framework to try out the model right now.

Hassan Mujtaba Photo

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.

Follow Wccftech on Google to get more of our news coverage in your feeds.

Button