AMD Rolls Out Gemma 4 Model Support Across Full Range of GPUs & CPUs

Apr 4, 2026 at 02:00am EDT

AMD has rolled out official support for Google's compact Gemma 4 AI models across its full range of GPUs and CPUs.

AMD Radeon GPUs & Ryzen AI CPUs Fully Support Google's Gemma 4 AI Model

Google has rolled out its latest family of open-weights AI models, Gemma 4, which spans a range of sizes from 2B to 31B. Alongside this announcement, AMD is rolling out support across its entire Radeon GPU and Ryzen AI CPU families.


Press Release: AMD is proud to provide Day Zero support for the full set of Gemma 4 models across our portfolio of AI-enabled hardware.

This includes AMD Instinct GPUs for cloud and enterprise datacenters, AMD Radeon GPUs for AI workstations, and AMD Ryzen AI processors for AI PCs. Support includes integration with the most popular AI applications like LM Studio, and support for open-source software projects, including vLLM, SGLang, llama.cpp, Ollama, and Lemonade.

Deploying with vLLM

Gemma 4 can be deployed on AMD GPUs using vLLM to take advantage of the many optimizations in this inference framework, particularly its support for multiple concurrent requests. The whole range of AMD GPUs supported by vLLM, including multiple generations of both Instinct and Radeon GPUs, can be used with the Gemma 4 models. Support is planned in both the Gemma 4 launch build of upstream vLLM and future nightly builds, which can be installed as either a Docker image or a Python package using the process documented at https://vllm.ai/.

docker pull vllm/vllm-openai-rocm:gemma4

For all AMD GPUs, vLLM can be invoked with the TRITON_ATTN backend:

vllm serve <model> --attention-backend TRITON_ATTN

Support for other attention backends with additional optimizations on MI300 and MI350-series GPUs is planned to be available soon. 
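Since the main draw of vLLM here is handling several requests at once, a small client sketch may help illustrate the idea. It assumes `vllm serve` is running locally on its default port (8000) and exposing its OpenAI-compatible `/v1/chat/completions` endpoint. The helper names (`chat_payload`, `post_json`, `ask_many`) are hypothetical and not part of vLLM or AMD's release.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: a local `vllm serve` instance on its default port.
BASE_URL = "http://localhost:8000/v1"


def chat_payload(model, prompt):
    """Build an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def post_json(url, payload, timeout=120):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def ask_many(model, prompts, post=post_json, max_workers=4):
    """Send prompts concurrently; replies come back in prompt order."""
    url = f"{BASE_URL}/chat/completions"

    def ask(prompt):
        reply = post(url, chat_payload(model, prompt))
        return reply["choices"][0]["message"]["content"]

    # The server batches overlapping requests; the client only has to
    # keep several of them in flight at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, prompts))
```

Calling `ask_many("<model>", ["Hello!", "What is ROCm?"])` with the same model name that was passed to `vllm serve` would issue both requests in parallel. The `post` parameter is injectable so the fan-out logic can be exercised without a running server.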

Deploying with SGLang

Gemma 4 can also be deployed on AMD MI300X/MI325X/MI35X GPUs using SGLang, which provides high-performance serving. 

SGLang supports the full Gemma 4 family, including dense models (E2B, E4B, 31B) and the MoE variant (26B-A4B). This support is available in the Gemma 4 launch build of SGLang as a Docker image; see https://cookbook.sglang.io/ for instructions.

All Gemma 4 models require the Triton attention backend for bidirectional image-token attention. 

SGLang can be invoked as follows: 

python3 -m sglang.launch_server --model-path <model> --attention-backend triton --tp 1

The Gemma 4 model fits on a single MI300X GPU (192 GB HBM) at TP=1 with full context length. For higher-throughput workloads, tensor parallelism can be increased (e.g., --tp 2).
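A back-of-the-envelope estimate shows why a single 192 GB GPU is plausible. The sketch below counts weights only and assumes 16-bit (2-byte) parameters and the nominal 31B parameter count. KV cache, activations, and runtime overhead come on top, so treat the result as a lower bound rather than a measured figure. The function names are hypothetical.

```python
# Rough weight-memory estimate (weights only; KV cache and activations excluded).
GB = 1e9  # decimal gigabytes, matching how HBM capacity is usually quoted


def weight_memory_gb(params_billion, bytes_per_param=2):
    """Approximate weight footprint in GB; 2 bytes/param assumes BF16/FP16."""
    return params_billion * 1e9 * bytes_per_param / GB


def headroom_gb(params_billion, hbm_gb=192, tp=1, bytes_per_param=2):
    """HBM left per GPU, assuming weights split evenly across `tp` GPUs."""
    return hbm_gb - weight_memory_gb(params_billion, bytes_per_param) / tp


if __name__ == "__main__":
    print(weight_memory_gb(31))   # weights for the 31B dense model
    print(headroom_gb(31))        # remaining HBM on one MI300X at TP=1
    print(headroom_gb(31, tp=2))  # remaining HBM per GPU at TP=2
```

Under these assumptions the 31B model needs about 62 GB for weights, leaving roughly 130 GB per GPU at TP=1 for the KV cache and everything else. Raising `--tp` mainly frees more memory per GPU and adds compute, which is where the throughput gain would come from.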

Deploying on local hardware with LM Studio

Gemma 4 models can be deployed easily and efficiently on AMD hardware through the open-source llama.cpp project and LM Studio. Users can quickly spin up these models on supported hardware, such as AMD Ryzen AI and Ryzen AI Max processors, as well as Radeon and Radeon PRO graphics cards, by downloading the popular LM Studio application and pairing it with the latest AMD Software: Adrenalin Edition drivers.

Deploying on local hardware with Lemonade Server

Lemonade Server enables deployment of Gemma 4 models on AMD hardware through an open-source local LLM server with OpenAI‑compatible APIs. It supports acceleration on AMD Radeon and Radeon PRO GPUs via ROCm, and on AMD Ryzen AI processors using the XDNA 2 NPU.

GPU deployment with Lemonade and ROCm

To run Gemma 4 on AMD GPUs with ROCm acceleration:

```
# Point Lemonade at a ROCm build of llama-server
export LEMONADE_LLAMACPP_ROCM_BIN=/path/to/llama-server
# Start the server
lemonade-server serve
# Register and download the model
curl http://localhost:8000/api/v1/pull \
    -H "Content-Type: application/json" \
    -d '{"model_name": "user.Gemma-4-E4B-IT", "checkpoint": "<insert-checkpoint-name>", "recipe": "llamacpp"}'
```
  • Chat with the model via the OpenAI-compatible API:
```
  curl http://localhost:8000/api/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "user.Gemma-4-E4B-IT", "messages": [{"role": "user", "content": "Hello!"}], "llamacpp": "rocm"}'
```
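For scripting, the same curl calls can be mirrored in Python. This is a minimal sketch: the endpoint paths and JSON fields are copied from the commands above, while the helper names are hypothetical. The response shape is assumed to follow the OpenAI chat completion format, since the API is described as OpenAI-compatible.

```python
import json

# Endpoint paths and field names mirror the curl commands above.
API = "http://localhost:8000/api/v1"


def pull_request(model_name, checkpoint, recipe="llamacpp"):
    """Return (url, body) for registering and downloading a model."""
    body = {"model_name": model_name, "checkpoint": checkpoint, "recipe": recipe}
    return f"{API}/pull", body


def chat_request(model, prompt, backend="rocm"):
    """Return (url, body) for a single-turn chat on the given backend."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "llamacpp": backend,
    }
    return f"{API}/chat/completions", body


def reply_text(response_body):
    """Extract the assistant text from an OpenAI-style JSON response string."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]
```

Each `(url, body)` pair can be sent with any HTTP client as a JSON POST. Keeping request construction separate from transport makes the payloads easy to inspect before a server is available.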

NPU deployment with Ryzen AI

Developers will be able to deploy Gemma 4 models on the NPU through Lemonade Server, which supports the latest AMD XDNA 2 NPU. NPU support for the Gemma 4 E2B and E4B models will arrive with the next Ryzen AI Software update. This update will be integrated into Lemonade and will also be available to developers directly through ONNX Runtime APIs.

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for the hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
