AMD Instinct MI325X Is The First AI GPU To Pack 256 GB HBM3e Memory, 288 GB MI355X “CDNA 4” Next Year With 8x Performance Uplift


Oct 10, 2024 at 01:00pm EDT


AMD has launched its latest Instinct MI325X AI GPU accelerator, which comes packed with 256 GB of HBM3e memory, while next year's MI355X gets 288 GB.

AMD Goes All Out With HBM3e Memory Capacities: 256 GB on MI325X "CDNA 3" This Year & 288 GB on MI355X "CDNA 4" Next Year

As part of today's "Advancing AI" event, AMD is rolling out its brand-new Instinct MI325X AI GPU accelerator, which improves upon the MI300X with new capabilities.

But before we get into the details, we have to talk about AMD's Instinct platform as a whole, which has garnered support from the world's top AI companies and is being used by some of the biggest brands such as Meta, OpenAI, and Microsoft.

AMD's commitment to performance leadership, easy migration, an open ecosystem, and a customer-focused portfolio has led to huge support from leading OEMs and cloud partners. As such, the company has fast-tracked the launch of its next solution as AI demand in the industry grows to unparalleled heights.

AMD MI325X With 256 GB Memory & CDNA 3 Architecture

Currently, AMD's MI300X is said to offer up to 30% higher performance than the NVIDIA H100 across a range of AI-specific workloads. AMD's continued work on its ROCm suite is helping extract more performance out of the flagship accelerator, but now is the time to build even better hardware with the same robust software support.

Meet the AMD Instinct MI325X. This brand-new accelerator is built upon the same fundamental design and architecture as the MI300X. Based on the CDNA 3 GPU architecture, the MI325X can be seen as a mid-cycle upgrade, offering 256 GB of HBM3e memory made using 16-Hi stacks with up to 6 TB/s of memory bandwidth, 2.6 PFLOPs of FP8 and 1.3 PFLOPs of FP16 performance, all packed within a chip with 153 billion transistors.

AMD expects production of the Instinct MI325X AI GPUs to begin in Q4 2024, with the respective server solutions available through leading partners starting in Q1 2025. The Instinct servers will feature configurations of up to eight MI325X GPUs with up to 2 TB of HBM3e memory, 896 GB/s of Infinity Fabric bandwidth, 48 TB/s of memory bandwidth, 20.8 PFLOPs of FP8 and 10.4 PFLOPs of FP16 performance. Each GPU is also configured at 1000W, which is a big uptick over the 700-750W configurations of the MI300X.
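The platform totals quoted above follow from multiplying the per-GPU figures by eight. The short sketch below reproduces that arithmetic; the dictionary keys and the `platform_totals` helper are illustrative names, and the linear scaling is an assumption that only holds for peak, on-paper figures rather than measured throughput.

```python
# Per-GPU MI325X figures as quoted in the article (peak values).
MI325X = {
    "hbm3e_gb": 256,
    "mem_bw_tbps": 6.0,
    "fp8_pflops": 2.6,
    "fp16_pflops": 1.3,
}


def platform_totals(per_gpu, gpu_count=8):
    """Scale each per-GPU figure linearly by the number of GPUs."""
    return {key: value * gpu_count for key, value in per_gpu.items()}


totals = platform_totals(MI325X)

# Capacity is converted with 1 TB = 1024 GB (an assumption about the rounding used).
print(f"HBM3e capacity  : {totals['hbm3e_gb'] / 1024:.1f} TB")
print(f"Memory bandwidth: {totals['mem_bw_tbps']:.0f} TB/s")
print(f"FP8 compute     : {totals['fp8_pflops']:.1f} PFLOPs")
print(f"FP16 compute    : {totals['fp16_pflops']:.1f} PFLOPs")
```

The results line up with the 2 TB, 48 TB/s, 20.8 PFLOPs and 10.4 PFLOPs platform figures AMD quotes.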

Drilling down into the numbers, AMD claims that the Instinct MI325X AI GPU accelerator should be 40% faster than the NVIDIA H200 in Mixtral 8x7B, 30% faster in Mistral 7B, and 20% faster in Meta Llama 3.1 70B LLMs. An 8x MI325X platform will also offer 40% faster performance versus an H200 HGX AI platform in Llama 3.1 405B and 20% faster in the 70B inference test. In terms of AI training, MI325X will offer similar or 10% better performance than the H200 platforms.

AMD MI355X With 288 GB Memory & CDNA 4 Architecture

Next year, AMD plans to launch the brand-new Instinct MI355X GPU accelerator, which will target AI workloads and be built on a 3nm process node. The GPU will incorporate the CDNA 4 architecture. In terms of specs, memory will be upgraded to even higher capacities, up to 288 GB of HBM3e, along with support for FP4/FP6 data types.

AMD says that the CDNA 4 architecture delivers up to a 35x leap in AI inference performance over CDNA 3, along with a 7x increase in AI compute, a 50% increase in memory capacity and bandwidth, and the latest networking efficiency advancements.

In terms of performance, the AMD Instinct MI355X AI GPU will offer up to 2.3 PFLOPs of FP16 and 4.6 PFLOPs of FP8 performance, each roughly an 80% increase over the MI325X. The new FP6 and FP4 compute performance is rated at 9.2 PFLOPs.

The MI355X will mark a 50% increase in both memory capacity and memory bandwidth over the current-gen MI300X, with speeds of up to 8 TB/s. The first platforms featuring eight of these MI355X GPUs will be available in the second half of 2025 and offer up to 2.3 TB of HBM3e memory capacity with 64 TB/s bandwidth, 18.5 PFLOPs of FP16, 37 PFLOPs of FP8, and 74 PFLOPs of FP6/FP4 compute.
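As a quick sanity check on the generational uplift, the percentages can be recomputed from the per-GPU figures quoted in this article. The `uplift_percent` helper is an illustrative name, not something from AMD's materials.

```python
def uplift_percent(new, old):
    """Relative increase of `new` over `old`, expressed in percent."""
    return (new / old - 1.0) * 100.0


# Peak per-GPU figures in PFLOPs, as quoted in the article.
fp16_uplift = uplift_percent(2.3, 1.3)  # MI355X vs MI325X, FP16
fp8_uplift = uplift_percent(4.6, 2.6)   # MI355X vs MI325X, FP8

print(f"FP16 uplift: {fp16_uplift:.0f}%")  # prints "FP16 uplift: 77%"
print(f"FP8 uplift : {fp8_uplift:.0f}%")   # prints "FP8 uplift : 77%"
```

Both ratios work out to about 77%, which is consistent with the rounded "80%" figure.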

ROCm 6.2 Continues To Dial Up AI Performance For Instinct

Moving back to the software front, AMD is announcing its latest ROCm 6.2 ecosystem, which brings an average performance improvement of 2.4x, and up to 2.8x, across a range of AI inference workloads, plus an average 2.4x improvement in training performance.

Lastly, AMD is still confirming its Instinct MI400, which is slated for release in 2026 as a "CDNA Next" part rather than under the recently disclosed UDNA architecture name. It may be a bit too early to go with the UDNA naming since AMD hasn't made it official, despite one of its top representatives confirming it, so we will see how that goes in the future.

With that said, AMD looks to be going all in on the AI craze with its future Instinct offerings, bringing heated competition to the likes of NVIDIA and also tackling Intel, which has been struggling to catch up with the rest.

AMD Instinct AI Accelerators:

Accelerator Name	AMD Instinct MI600	AMD Instinct MI500	AMD Instinct MI400	AMD Instinct MI350X	AMD Instinct MI325X	AMD Instinct MI300X	AMD Instinct MI250X
GPU Architecture	CDNA Next	CDNA 6	CDNA 5	CDNA 4	Aqua Vanjaram (CDNA 3)	Aqua Vanjaram (CDNA 3)	Aldebaran (CDNA 2)
GPU Process Node	TBD	Sub-2nm	2nm+3nm	3nm	5nm+6nm	5nm+6nm	6nm
XCDs (Chiplets)	TBD	TBD	8 (MCM)	8 (MCM)	8 (MCM)	8 (MCM)	2 (MCM) 1 (Per Die)
GPU Cores	TBD	TBD	~32,000	16,384	19,456	19,456	14,080
GPU Clock Speed (Max)	TBD	TBD	2400 MHz	2400 MHz	2100 MHz	2100 MHz	1700 MHz
FP6/FP4 Matrix	TBD	TBD	40 PFLOPs	20 PFLOPs	N/A	N/A	N/A
INT8 Compute	TBD	TBD	20 PFLOPs	5200 TOPS	2614 TOPS	2614 TOPS	383 TOPS
FP8 Matrix	TBD	TBD	20 PFLOPs	5 PFLOPs	2.6 PFLOPs	2.6 PFLOPs	N/A
FP16 Matrix	TBD	TBD	10 PFLOPs	2.5 PFLOPs	1.3 PFLOPs	1.3 PFLOPs	383 TFLOPs
FP32 Vector	TBD	TBD	315 TFLOPs	157.3 TFLOPs	163.4 TFLOPs	163.4 TFLOPs	95.7 TFLOPs
FP64 Vector	TBD	TBD	288 TFLOPs (MI430X)	78.6 TFLOPs	81.7 TFLOPs	81.7 TFLOPs	47.9 TFLOPs
VRAM	HBM5?	HBM4E	432 GB HBM4	288 GB HBM3e	256 GB HBM3e	192 GB HBM3	128 GB HBM2e
Infinity Cache	TBD	TBD	192 MB	256 MB	256 MB	256 MB	N/A
Memory Clock	TBD	TBD	TBD	8.0 Gbps	5.9 Gbps	5.2 Gbps	3.2 Gbps
Memory Bus	TBD	TBD	24576-bit	8192-bit	8192-bit	8192-bit	8192-bit
Memory Bandwidth	TBD	TBD	23.3 TB/s	8 TB/s	6.0 TB/s	5.3 TB/s	3.2 TB/s
Form Factor	TBD	TBD	EAM	OAM	OAM	OAM	OAM
Cooling	TBD	TBD	Passive / Liquid	Passive / Liquid	Passive Cooling	Passive Cooling	Passive Cooling
TDP (Max)	TBD	TBD	TBD	1400W (355X)	1000W	750W	560W
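
The Memory Bandwidth row in the table can be roughly cross-checked against the Memory Bus and Memory Clock rows. The sketch below assumes the "Memory Clock" column is the per-pin data rate in Gbit/s and uses decimal units (1 TB/s = 1000 GB/s); the function name and the `rows` layout are illustrative.

```python
def peak_bandwidth_tbps(bus_width_bits, pin_rate_gbps):
    """Peak bandwidth in TB/s: bus width (bits) x per-pin rate (Gbit/s),
    divided by 8 bits per byte and by 1000 GB per TB."""
    return bus_width_bits * pin_rate_gbps / 8 / 1000


# (bus width in bits, per-pin rate in Gbps, bandwidth listed in the table in TB/s)
rows = {
    "MI350X": (8192, 8.0, 8.0),
    "MI325X": (8192, 5.9, 6.0),
    "MI300X": (8192, 5.2, 5.3),
    "MI250X": (8192, 3.2, 3.2),
}

for name, (bus, rate, listed) in rows.items():
    computed = peak_bandwidth_tbps(bus, rate)
    print(f"{name}: computed {computed:.2f} TB/s, table lists {listed} TB/s")
```

Under these assumptions, the computed values land within a few percent of the listed ones, with the small differences down to rounding in the table.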

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
