AMD’s vLLM-ATOM Plugin Supercharges DeepSeek-R1, Kimi-K2, and gpt-oss-120B AI LLM Inference on Instinct MI350 and MI400 Accelerators

Hassan Mujtaba
AMD's vLLM-ATOM Plugin Supercharges DeepSeek-R1, Kimi-K2, and gpt-oss-120B AI LLM Inference on Instinct MI350 and MI400 Accelerators
LEADTOOLS v20.0

AMD has introduced a new plugin called vLLM-ATOM, which supercharges AI LLMs while supporting its Instinct MI350 and MI400 GPUs.

AMD Offers Big Boost To AI LLMs With Its vLLM-ATOM Plugin That Works Seamlessly With vLLM & Accelerates AI Inference Performance

The vLLM-ATOM is a purpose-built plugin that aims to improve inference performance across various AI LLMs. It is designed around AMD's high-performance Instinct GPU accelerators, such as the MI350 and MI400 series, running both as a standalone inference server or through seamless integration as a plugin backend. This allows users to take full advantage of AMD's native model and kernel optimizations without any modifications to the vLLM's core database.

Related Story AMD Reverses Course On Removing TSME From Ryzen Chips; Will Reinstate The Feature Through A New BIOS Update

The main highlights of vLLM-ATOM include:

  • Zero learning curve: Full compatibility with existing vLLM commands, APIs, and end-to-end workflows. ATOM runs transparently in the background, requiring no new tools or complex configurations—while delivering enhanced kernel performance while preserving a consistent user experience.
  • Instant access to AMD innovation: Leverage cutting-edge AMD hardware features (e.g., FP4 on the MI355X GPU, rack-scale inference on the MI400 GPU) and top-tier kernel optimizations (e.g., AITER fused attention, custom AllReduce) out of the box, without waiting for upstream integration into the main vLLM codebase. This drastically shortens the time-to-value for the new AMD GPUs.
  • Agile innovation sandbox: A fast validation layer for new technical ideas, hardware enablement, and kernel library testing (e.g., AITER). The plugin aligns flexibly with the AMD product roadmap, including new GPU releases, FP8/FP4 precision support, and next-gen attention mechanisms—unconstrained by vLLM’s upstream release cycles.
  • vLLM as a production-grade foundation for ROCm: As the community-standard serving framework, vLLM provides the enterprise-grade stability, broad model coverage, and production-critical features needed to deploy ROCm-based infrastructure at scale.
  • Mature optimizations upstreamed for all: ATOM serves as a temporary proving ground for new optimizations; once stabilized, kernels, optimization strategies, and new features are upstreamed to vLLM’s native ROCm backend, benefiting the entire ROCm software user community and strengthening the open-source LLM ecosystem.

The vLLM-ATOM architecture is broken down into three layers:

LayerResponsibility
vLLMRequest scheduling, KV cache management, continuous batching, OpenAI-compatible API
ATOM PluginPlatform registration, optimized model implementation, attention backends routing, kernel-level optimization tuning
AITERLow-level GPU kernels — fused MoE, flash attention, quantized GEMM, RoPE fusion

In terms of model support, the vLLM-ATOM plugin supports both AI LLMs and VLMs through a unified serving pipeline. Following is the full list:

ArchitectureTypeRepresentative ModelsATOM Model Class
Qwen3MoeForCausalLMMoEQwen/Qwen3-235B-A22B-Instruct-2507-FP8atom.models.qwen3_moe
DeepseekV3ForCausalLMMoE (MLA)deepseek-ai/DeepSeek-R1-0528 (FP8), amd/DeepSeek-R1-0528-MXFP4, amd/Kimi-K2-Thinking-MXFP4atom.models.deepseek_v2
GptOssForCausalLMMoEopenai/gpt-oss-120batom.models.gpt_oss
Glm4MoeForCausalLMMoE (MLA)zai-org/GLM-4.7-FP8atom.models.glm4_moe
Qwen3NextForCausalLMHybrid MoEQwen/Qwen3-Next-80B-A3B-Instruct-FP8atom.models.qwen3_next
Qwen3_5ForConditionalGenerationDense (Text/VLM)Qwen/Qwen3.5-35B-A3B-FP8atom.models.qwen3_5
Qwen3_5MoeForConditionalGenerationMoE (Text/VLM)Qwen/Qwen3.5-397B-A17B-FP8atom.models.qwen3_5
KimiK25ForConditionalGenerationMoE (Text/VLM)amd/Kimi-K2.5-MXFP4atom.models.kimi_k25

AMD's Note: vLLM-ATOM proves that hardware-specific optimization and framework compatibility are not mutually exclusive. By leveraging vLLM’s out-of-the-box plugin mechanism, ATOM delivers AMD-native kernel optimizations—including fused attention, quantized GEMM, and optimized MoE routing—while preserving the full feature set of vLLM that production LLM deployments rely on.

Beyond immediate performance gains, the plugin’s architecture serves as a critical proving ground for AMD’s hardware and software innovations: optimizations validated in ATOM’s plugin mode are gradually upstreamed to vLLM’s native ROCm backend, benefiting the entire ROCm and open-source LLM community. For end users, this means immediate access to the latest AMD hardware capabilities without waiting for slow upstream integration cycles—creating a virtuous cycle of co-evolution between AMD hardware innovation and the vLLM serving ecosystem.

  1. ATOM Documentation
  2. vLLM-ATOM Guide
  3. RFC: Enable ATOM as vLLM out-of-tree Platform
  4. ATOM Repository
  5. AITER - AMD Inference Tensor Engine for ROCm
  6. vLLM-ATOM Recipes
  7. Docker Hub - ATOM + vLLM Images
Hassan Mujtaba Photo

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.

Follow Wccftech on Google to get more of our news coverage in your feeds.

Button