AMD Instinct MI325X Is The First AI GPU To Pack 256 GB HBM3e Memory, 288 GB MI355X “CDNA 4” Next Year With 8x Performance Uplift

Oct 10, 2024 at 01:00pm EDT

AMD has launched its latest Instinct MI325X AI GPU accelerator which comes packed with 256 GB HBM3e memory while next year's MI355X gets 288 GB.

AMD Goes All Out With HBM3e Memory Capacities: 256 GB on MI325X "CDNA 3" This Year & 288 GB on MI355X "CDNA 4" Next Year

As part of today's "Advancing AI" event, AMD is rolling out its brand new Instinct MI325X AI GPU accelerator, which improves upon the MI300X with new capabilities.

But before we get into the details, we have to talk about AMD's Instinct platform as a whole which has garnered support from the world's top AI companies and is being used by some of the biggest brands such as Meta, OpenAI and Microsoft.

AMD's commitment to performance leadership, easy migration, an open ecosystem, and a customer-focused portfolio has led to huge support from leading OEMs and cloud partners. As a result, the company has fast-tracked the launch of its next solution as AI demand in the industry grows to unparalleled heights.

AMD MI325X With 256 GB Memory & CDNA 3 Architecture

Currently, AMD's MI300X is said to offer up to 30% higher performance across a range of AI-specific workloads against the NVIDIA H100. AMD's added work to their ROCm suite is helping extract more performance out of the flagship accelerator but now's the time to build even better hardware with the same robust software support.

Meet the AMD Instinct MI325X. This brand-new accelerator is built upon the same fundamental design and architecture as the MI300X. Using the CDNA 3 GPU architecture, the MI325X can be seen as a mid-cycle upgrade, offering 256 GB of HBM3e memory made using 16-Hi stacks with up to 6 TB/s of memory bandwidth, 2.6 PFLOPs of FP8 and 1.3 PFLOPs of FP16 performance, all packed within a chip with 153 billion transistors.

AMD expects production of the Instinct MI325X AI GPUs to start in Q4 2024, with the respective server solutions becoming available in Q1 2025 through leading partners. The Instinct AI servers will feature configurations of up to 8 MI325X GPUs with up to 2 TB of HBM3e memory, 896 GB/s of Infinity Fabric bandwidth, 48 TB/s of memory bandwidth, 20.8 PFLOPs of FP8 and 10.4 PFLOPs of FP16 performance. Each GPU is also configured at 1000W, which is a big uptick over the 750W configuration of the MI300X.
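The platform figures above follow from the per-GPU numbers multiplied by eight. The short Python sketch below reproduces that arithmetic; it assumes simple linear scaling across GPUs, and the `platform_totals` helper and dictionary keys are illustrative names rather than anything from AMD.

```python
# Per-GPU MI325X figures as quoted in the article.
MI325X = {
    "hbm_gb": 256,        # HBM3e capacity in GB
    "mem_bw_tbs": 6.0,    # memory bandwidth in TB/s
    "fp8_pflops": 2.6,    # FP8 compute in PFLOPs
    "fp16_pflops": 1.3,   # FP16 compute in PFLOPs
}


def platform_totals(per_gpu, gpu_count=8):
    """Scale per-GPU figures linearly to a multi-GPU platform.

    Linear scaling is an assumption made for this sketch; it only
    mirrors how the headline platform numbers add up.
    """
    return {key: round(value * gpu_count, 2) for key, value in per_gpu.items()}


totals = platform_totals(MI325X)
for name, value in totals.items():
    print(f"{name}: {value}")
```

The 2,048 GB total is what the article rounds to "up to 2 TB", and the bandwidth and compute totals line up with the quoted 48 TB/s, 20.8 PFLOPs and 10.4 PFLOPs.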

Drilling down into the numbers, AMD claims that the Instinct MI325X AI GPU accelerator should be 40% faster than the NVIDIA H200 in Mixtral 8x7B, 30% faster in Mistral 7B, and 20% faster in Meta Llama 3.1 70B LLMs. An 8x MI325X platform will also offer 40% faster performance versus an H200 HGX AI platform in Llama 3.1 405B and 20% faster performance in the 70B inference test. In terms of AI training, the MI325X will offer similar or up to 10% better performance than the H200 platforms.

AMD MI355X With 288 GB Memory & CDNA 4 Architecture

Next year, AMD plans to launch a brand new Instinct MI355X GPU accelerator which will target AI workloads and will be built using a 3nm process node. The GPU will incorporate the CDNA 4 architecture. In terms of specs, the memory will be upgraded to even higher capacities, up to 288 GB of HBM3e, while adding support for FP4/FP6 data types.

AMD says that the CDNA 4 architecture delivers a 35x performance leap over CDNA 3 plus a 7x increase in AI compute, 50% increase in memory capacity/bandwidth, and also comes with the latest networking efficiency advancements.

In terms of performance, the AMD Instinct MI355X AI GPU will offer up to 2.3 PFLOPs of FP16 performance, an 80% increase over the MI325X while the FP8 figures also see an 80% increase to 4.6 PFLOPs versus the MI325X. The new FP6 and FP4 compute performance is rated at 9.2 PFLOPs.
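As a quick check on the quoted uplift, the percentages can be recomputed from the per-GPU figures in this article. The Python sketch below is only illustrative, and `percent_increase` is a helper name introduced here.

```python
def percent_increase(new, old):
    """Return the percentage increase of `new` over `old`."""
    return (new - old) / old * 100


# Per-GPU figures in PFLOPs, as quoted in the article.
fp16_uplift = percent_increase(2.3, 1.3)   # MI355X vs MI325X, FP16
fp8_uplift = percent_increase(4.6, 2.6)    # MI355X vs MI325X, FP8
fp4_over_fp8 = 9.2 / 4.6                   # FP6/FP4 rate relative to FP8

print(f"FP16 uplift: {fp16_uplift:.1f}%")
print(f"FP8 uplift:  {fp8_uplift:.1f}%")
print(f"FP6/FP4 vs FP8: {fp4_over_fp8:.1f}x")
```

Both uplifts work out to roughly 77%, which is in line with the rounded 80% figure, and the FP6/FP4 rating is twice the FP8 rating.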

The MI355X will mark a 50% increase in both memory capacity and memory bandwidth over the current-gen MI300X, with speeds of up to 8 TB/s. The first platforms featuring eight of these MI355X GPUs will be available in the second half of 2025 and offer up to 2.3 TB of HBM3e memory capacity with 64 TB/s of bandwidth, 18.5 PFLOPs of FP16, 37 PFLOPs of FP8, and 74 PFLOPs of FP6/FP4 compute.
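The generational and platform numbers here can be sanity-checked the same way. The sketch below uses the per-GPU MI355X figures quoted above and the MI300X capacity and bandwidth (192 GB, 5.3 TB/s) from the spec table in this article; the variable names are illustrative and linear scaling across eight GPUs is an assumption.

```python
GPU_COUNT = 8

# Per-GPU MI355X figures as quoted in the article.
mi355x = {
    "hbm_gb": 288,
    "mem_bw_tbs": 8.0,
    "fp16_pflops": 2.3,
    "fp8_pflops": 4.6,
    "fp6_fp4_pflops": 9.2,
}
# MI300X capacity and bandwidth, taken from the article's spec table.
mi300x = {"hbm_gb": 192, "mem_bw_tbs": 5.3}

# Eight-GPU platform totals, assuming linear scaling.
platform = {key: round(value * GPU_COUNT, 1) for key, value in mi355x.items()}

# Generational gains over the MI300X, in percent.
capacity_gain = (mi355x["hbm_gb"] / mi300x["hbm_gb"] - 1) * 100
bandwidth_gain = (mi355x["mem_bw_tbs"] / mi300x["mem_bw_tbs"] - 1) * 100

print(platform)
print(f"capacity: +{capacity_gain:.0f}%, bandwidth: +{bandwidth_gain:.0f}%")
```

The computed compute totals (18.4, 36.8 and 73.6 PFLOPs) land slightly under the quoted 18.5, 37 and 74 PFLOPs, which suggests the per-GPU figures are themselves rounded.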

ROCm 6.2 Continues To Dial Up AI Performance For Instinct

Moving back to the software front, AMD is announcing its latest ROCm 6.2 ecosystem, which brings an average performance improvement of 2.4x and up to 2.8x across a range of AI workloads in inferencing, and an average 1.8x improvement in training performance.

Lastly, AMD is still confirming its Instinct MI400, which is slated for release in 2026 as a "CDNA Next" part and does not use the recently disclosed UDNA architecture name. Maybe it's a bit too early to go with the UDNA naming since it hasn't been made official by AMD, despite one of their top representatives confirming it, so we will see how that goes in the future.

With that said, AMD looks to be going all in on the AI craze with its future Instinct offerings, bringing heated competition against the likes of NVIDIA and also tackling Intel, which has been struggling to catch up with the rest.

AMD Instinct AI Accelerators:

| Accelerator Name | AMD Instinct MI500 | AMD Instinct MI400 | AMD Instinct MI350X | AMD Instinct MI325X | AMD Instinct MI300X | AMD Instinct MI250X |
| --- | --- | --- | --- | --- | --- | --- |
| GPU Architecture | CDNA 6 | CDNA 5 | CDNA 4 | Aqua Vanjaram (CDNA 3) | Aqua Vanjaram (CDNA 3) | Aldebaran (CDNA 2) |
| GPU Process Node | 2nm | 2nm+3nm | 3nm | 5nm+6nm | 5nm+6nm | 6nm |
| XCDs (Chiplets) | TBD | 8 (MCM) | 8 (MCM) | 8 (MCM) | 8 (MCM) | 2 (MCM), 1 (Per Die) |
| GPU Cores | TBD | TBD | 16,384 | 19,456 | 19,456 | 14,080 |
| GPU Clock Speed (Max) | TBD | TBD | 2400 MHz | 2100 MHz | 2100 MHz | 1700 MHz |
| INT8 Compute | TBD | TBD | 5200 TOPS | 2614 TOPS | 2614 TOPS | 383 TOPS |
| FP6/FP4 Matrix | TBD | 40 PFLOPs | 20 PFLOPs | N/A | N/A | N/A |
| FP8 Matrix | TBD | 20 PFLOPs | 5 PFLOPs | 2.6 PFLOPs | 2.6 PFLOPs | N/A |
| FP16 Matrix | TBD | 10 PFLOPs | 2.5 PFLOPs | 1.3 PFLOPs | 1.3 PFLOPs | 383 TFLOPs |
| FP32 Vector | TBD | TBD | 157.3 TFLOPs | 163.4 TFLOPs | 163.4 TFLOPs | 95.7 TFLOPs |
| FP64 Vector | TBD | TBD | 78.6 TFLOPs | 81.7 TFLOPs | 81.7 TFLOPs | 47.9 TFLOPs |
| VRAM | HBM4E | 432 GB HBM4 | 288 GB HBM3e | 256 GB HBM3e | 192 GB HBM3 | 128 GB HBM2e |
| Infinity Cache | TBD | TBD | 256 MB | 256 MB | 256 MB | N/A |
| Memory Clock | TBD | 19.6 TB/s | 8.0 Gbps | 5.9 Gbps | 5.2 Gbps | 3.2 Gbps |
| Memory Bus | TBD | TBD | 8192-bit | 8192-bit | 8192-bit | 8192-bit |
| Memory Bandwidth | TBD | TBD | 8 TB/s | 6.0 TB/s | 5.3 TB/s | 3.2 TB/s |
| Form Factor | TBD | TBD | OAM | OAM | OAM | OAM |
| Cooling | TBD | Passive / Liquid | Passive / Liquid | Passive Cooling | Passive Cooling | Passive Cooling |
| TDP (Max) | TBD | TBD | 1400W (355X) | 1000W | 750W | 560W |
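The memory clock, bus width and bandwidth entries in the table are related: peak bandwidth is roughly the per-pin data rate multiplied by the bus width, divided by eight to convert bits to bytes. The Python sketch below applies that standard relation to the table's values; the function name is illustrative, and decimal units (1 TB/s = 1000 GB/s) are assumed.

```python
def bandwidth_tbs(pin_rate_gbps, bus_width_bits):
    """Peak memory bandwidth in TB/s from per-pin rate and bus width.

    Assumes decimal units: 1 TB/s = 1000 GB/s.
    """
    gigabytes_per_second = pin_rate_gbps * bus_width_bits / 8
    return gigabytes_per_second / 1000


# Memory clock (Gbps per pin) for parts with an 8192-bit bus, from the table.
memory_clock_gbps = {
    "MI350X": 8.0,
    "MI325X": 5.9,
    "MI300X": 5.2,
    "MI250X": 3.2,
}

derived = {
    name: round(bandwidth_tbs(rate, 8192), 2)
    for name, rate in memory_clock_gbps.items()
}
for name, value in derived.items():
    print(f"{name}: {value} TB/s")
```

The derived values come out close to the table's rounded bandwidth figures of 8, 6.0, 5.3 and 3.2 TB/s.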

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.