AMD has introduced a new plugin called vLLM-ATOM, which supercharges AI LLMs while supporting its Instinct MI350 and MI400 GPUs.
AMD Offers Big Boost To AI LLMs With Its vLLM-ATOM Plugin That Works Seamlessly With vLLM & Accelerates AI Inference Performance
The vLLM-ATOM is a purpose-built plugin that aims to improve inference performance across various AI LLMs. It is designed around AMD's high-performance Instinct GPU accelerators, such as the MI350 and MI400 series, running both as a standalone inference server or through seamless integration as a plugin backend. This allows users to take full advantage of AMD's native model and kernel optimizations without any modifications to the vLLM's core database.
The main highlights of vLLM-ATOM include:
- Zero learning curve: Full compatibility with existing vLLM commands, APIs, and end-to-end workflows. ATOM runs transparently in the background, requiring no new tools or complex configurations—while delivering enhanced kernel performance while preserving a consistent user experience.
- Instant access to AMD innovation: Leverage cutting-edge AMD hardware features (e.g., FP4 on the MI355X GPU, rack-scale inference on the MI400 GPU) and top-tier kernel optimizations (e.g., AITER fused attention, custom AllReduce) out of the box, without waiting for upstream integration into the main vLLM codebase. This drastically shortens the time-to-value for the new AMD GPUs.
- Agile innovation sandbox: A fast validation layer for new technical ideas, hardware enablement, and kernel library testing (e.g., AITER). The plugin aligns flexibly with the AMD product roadmap, including new GPU releases, FP8/FP4 precision support, and next-gen attention mechanisms—unconstrained by vLLM’s upstream release cycles.
- vLLM as a production-grade foundation for ROCm: As the community-standard serving framework, vLLM provides the enterprise-grade stability, broad model coverage, and production-critical features needed to deploy ROCm-based infrastructure at scale.
- Mature optimizations upstreamed for all: ATOM serves as a temporary proving ground for new optimizations; once stabilized, kernels, optimization strategies, and new features are upstreamed to vLLM’s native ROCm backend, benefiting the entire ROCm software user community and strengthening the open-source LLM ecosystem.
The vLLM-ATOM architecture is broken down into three layers:
| Layer | Responsibility |
|---|---|
| vLLM | Request scheduling, KV cache management, continuous batching, OpenAI-compatible API |
| ATOM Plugin | Platform registration, optimized model implementation, attention backends routing, kernel-level optimization tuning |
| AITER | Low-level GPU kernels — fused MoE, flash attention, quantized GEMM, RoPE fusion |
In terms of model support, the vLLM-ATOM plugin supports both AI LLMs and VLMs through a unified serving pipeline. Following is the full list:
| Architecture | Type | Representative Models | ATOM Model Class |
|---|---|---|---|
| Qwen3MoeForCausalLM | MoE | Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 | atom.models.qwen3_moe |
| DeepseekV3ForCausalLM | MoE (MLA) | deepseek-ai/DeepSeek-R1-0528 (FP8), amd/DeepSeek-R1-0528-MXFP4, amd/Kimi-K2-Thinking-MXFP4 | atom.models.deepseek_v2 |
| GptOssForCausalLM | MoE | openai/gpt-oss-120b | atom.models.gpt_oss |
| Glm4MoeForCausalLM | MoE (MLA) | zai-org/GLM-4.7-FP8 | atom.models.glm4_moe |
| Qwen3NextForCausalLM | Hybrid MoE | Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | atom.models.qwen3_next |
| Qwen3_5ForConditionalGeneration | Dense (Text/VLM) | Qwen/Qwen3.5-35B-A3B-FP8 | atom.models.qwen3_5 |
| Qwen3_5MoeForConditionalGeneration | MoE (Text/VLM) | Qwen/Qwen3.5-397B-A17B-FP8 | atom.models.qwen3_5 |
| KimiK25ForConditionalGeneration | MoE (Text/VLM) | amd/Kimi-K2.5-MXFP4 | atom.models.kimi_k25 |
AMD's Note: vLLM-ATOM proves that hardware-specific optimization and framework compatibility are not mutually exclusive. By leveraging vLLM’s out-of-the-box plugin mechanism, ATOM delivers AMD-native kernel optimizations—including fused attention, quantized GEMM, and optimized MoE routing—while preserving the full feature set of vLLM that production LLM deployments rely on.
Beyond immediate performance gains, the plugin’s architecture serves as a critical proving ground for AMD’s hardware and software innovations: optimizations validated in ATOM’s plugin mode are gradually upstreamed to vLLM’s native ROCm backend, benefiting the entire ROCm and open-source LLM community. For end users, this means immediate access to the latest AMD hardware capabilities without waiting for slow upstream integration cycles—creating a virtuous cycle of co-evolution between AMD hardware innovation and the vLLM serving ecosystem.
- ATOM Documentation
- vLLM-ATOM Guide
- RFC: Enable ATOM as vLLM out-of-tree Platform
- ATOM Repository
- AITER - AMD Inference Tensor Engine for ROCm
- vLLM-ATOM Recipes
- Docker Hub - ATOM + vLLM Images
Follow Wccftech on Google to get more of our news coverage in your feeds.
