AMD’s Instinct MI350 GPU Is An AI-Hardware Powerhouse: 3nm 3D Chiplet Based on CDNA 4, 185 Billion Transistors, 1400W TBP, Over 4000B LLM Support With Massive 288GB Memory

Aug 26, 2025 at 12:25pm EDT

AMD's Instinct MI350 AI accelerator, featuring the CDNA 4 architecture, has been fully detailed, with its speeds and feeds, at Hot Chips 2025.

AMD Opens Up The Lid of Instinct MI350 Architectural Details, Products, & Solutions At Hot Chips 2025, Ready For Massive LLMs

It's been only two months since AMD launched its Instinct MI350 series, its flagship CDNA 4-based accelerator for AI workloads. Today at Hot Chips, the company went further into the details of this AI powerhouse.

So what kicked off the development of the MI350 series? AI, obviously, but more precisely the growth trajectory of LLMs, with model sizes getting larger each year. AMD saw two key ways to address this: innovate on the data format front and simply go bigger on memory capacity. It implemented both, and a lot more.

As a result, the CDNA 4-based AMD Instinct MI350 series accelerators improve performance and efficiency in AI workloads. AMD extended HBM bandwidth and capacity and increased link speeds, supporting faster AI training and inference on larger models, while also improving power efficiency.

The faster performance is achieved by reducing uncore power, enabling a wider Infinity Fabric for higher bandwidth at more power-efficient frequencies, and supporting lower-precision data formats such as full-access FP8 and the industry-standard micro-scaled MXFP6 and MXFP4 data types.

AMD offers its MI350 series in two flavors: the MI350X, the air-cooled variant with a 1000W TBP and a max clock speed of 2.2 GHz, and the higher-end MI355X, aimed at liquid-cooled datacenters with a max TBP of 1400W and a max clock speed of 2.4 GHz.

The chip is an architectural masterpiece from AMD, drawing on its years of engineering expertise in the chiplet domain and the prowess of its partners in advanced packaging. The chip itself has a total of 185 billion transistors and adopts a 3D multi-chiplet layout with two chiplet types, along with HBM3e memory. The MI350 series combines 3nm and 6nm process technologies on the proven CoWoS-S packaging technology.

Breaking down the chip, we first have the XCDs or Accelerator Complex Dies, which are based on TSMC's N3P "3nm" process technology. There are 8 of these on a single MI350X/MI355X package, 4 on each IOD. The IOD or AMD I/O Base Die is based on TSMC's 6nm FinFET "N6" process technology and is a very cost-effective die thanks to its mature process node, which is optimal in terms of yields and costs. There are two of these per package. The IODs house the Infinity Fabric AP (Advanced Package) interconnect.

There are a total of 8 HBM3E sites on the package, with each IOD connected to 4 sites. And lastly, there's the main interposer or package on which the entire silicon sits.

Diving deeper into the IO dies, there are two of them, each with three Infinity Fabric links and a PCIe Gen5 link to an AMD EPYC host (128 GB/s). Each IO die has four HBM3E memory controllers, and each controller connects to a 36 GB 12-Hi stack operating at 8 Gbps. Across both IO dies, that adds up to 288 GB of HBM3E capacity and up to 8.0 TB/s of bandwidth on the package.
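
As a quick sanity check, the capacity and bandwidth figures can be reproduced from the per-stack numbers. This is only a sketch: the 1024-bit interface per stack is an assumption (it is consistent with the 8192-bit memory bus listed in the spec table at the end of the article), and the constant names are made up for illustration.

```python
# Figures quoted in the article for one MI350-series package.
HBM_STACKS = 8          # 4 stacks per IO die, 2 IO dies
GB_PER_STACK = 36       # one 12-Hi HBM3E stack
PIN_RATE_GBPS = 8.0     # per-pin data rate in Gbit/s

# Assumption: 1024 data pins per stack (8 x 1024 = the 8192-bit bus in the table).
PINS_PER_STACK = 1024

capacity_gb = HBM_STACKS * GB_PER_STACK
bus_width_bits = HBM_STACKS * PINS_PER_STACK
bandwidth_gb_s = bus_width_bits * PIN_RATE_GBPS / 8  # divide by 8: bits -> bytes

print(f"capacity:  {capacity_gb} GB")
print(f"bus width: {bus_width_bits} bits")
print(f"bandwidth: {bandwidth_gb_s:.0f} GB/s")
```

The bandwidth works out to 8192 GB/s, which lines up with the quoted 8.0 TB/s.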

Both IO dies are connected using an Infinity Fabric (Advanced Package) interconnect, which offers 5.5 TB/s of bisection bandwidth. There's also 256 MB of AMD Infinity Cache onboard the IO Dies. The Infinity Fabric Links are based on 4th Gen inter-socket links and offer 1075 GB/s bi-directional aggregate bandwidth to the XCDs.

The MI350 series chips pack 32 AMD CDNA 4 compute units per XCD, or 256 compute units in total, with 64 stream processors per CU for a total of 16,384 cores. That is fewer cores than the MI325 and MI300 series, which came packed with 304 compute units and a max core count of 19,456. The compute units are arranged into eight zones, one per XCD. There are also 1024 Matrix Cores, and the cores can hit a maximum clock speed of 2.4 GHz on the MI355X-class solutions.
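
The core counts are simple multiplication, sketched below. The 64 stream processors per CU is the value implied by the totals (16,384 / 256 and 19,456 / 304 both give 64), and the 4 Matrix Cores per CU is likewise inferred from 1024 / 256 rather than stated in the presentation.

```python
XCDS = 8
CUS_PER_XCD = 32
STREAM_PROCESSORS_PER_CU = 64   # implied by 16,384 cores / 256 CUs
MATRIX_CORES_PER_CU = 4         # inferred from 1024 Matrix Cores / 256 CUs

mi350_cus = XCDS * CUS_PER_XCD
mi350_cores = mi350_cus * STREAM_PROCESSORS_PER_CU
mi350_matrix_cores = mi350_cus * MATRIX_CORES_PER_CU

# Prior generation, using the 304 CU figure from the article.
mi300_cores = 304 * STREAM_PROCESSORS_PER_CU

print(f"MI350 series: {mi350_cus} CUs, {mi350_cores:,} cores, "
      f"{mi350_matrix_cores} Matrix Cores")
print(f"MI300/MI325:  304 CUs, {mi300_cores:,} cores")
```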

The internal memory subsystem onboard the XCD includes 128 KiB of VGPR per SIMD, 512 KiB of Vector Registers per CU, 160 KiB of LDS per CU (537 GB/s), 32 KiB of L1 cache per CU, and 4 MiB of shared L2 cache per XCD. That gives us:

131 MB Vector Registers (Full Chip)
40 MB LDS (Full Chip)
8 MB L1 (Full Chip)
32 MB L2 (Full Chip)
256 MB Infinity Cache (Full Chip)
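
The full-chip totals follow from the per-CU and per-XCD sizes. Here is a rough sketch of that arithmetic, assuming 256 CUs and 8 XCDs as described above; the dictionary names are illustrative only.

```python
CUS = 256
XCDS = 8
KIB_PER_MIB = 1024

# Per-CU sizes in KiB, as quoted in the article.
per_cu_kib = {"Vector registers": 512, "LDS": 160, "L1": 32}

totals_mib = {name: CUS * kib / KIB_PER_MIB for name, kib in per_cu_kib.items()}
totals_mib["L2"] = XCDS * 4.0  # 4 MiB of shared L2 per XCD

vgpr_total_kib = CUS * per_cu_kib["Vector registers"]

for name, mib in totals_mib.items():
    print(f"{name}: {mib:.0f} MiB")
print(f"Vector registers in KiB: {vgpr_total_kib:,}")
```

The vector register total comes to 131,072 KiB (128 MiB), which appears to be where the 131 MB figure in the list comes from.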

Moving down, AMD is sharing the data format and compute performance speedups of its MI355X versus MI300X:

Vector FP16: 157.3 TFLOPs (1.0x)
Matrix FP16/BF16: 2.5 PFLOPs (1.9x)
Matrix FP8: 5.0 PFLOPs (1.9x)
Matrix INT8/INT4: 5.0 PFLOPs (1.9x)
Matrix MXFP6/MXFP4: 10 PFLOPs (New)
Vector FP64: 78.6 TFLOPs (1.0x)
Matrix FP64: 78.6 TFLOPs (0.5x)
Vector FP32: 157.3 TFLOPs (1.0x)
Matrix FP32: 157.3 TFLOPs (1.0x)
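
The vector figures can be approximated from core count and clock speed. The sketch below assumes each stream processor retires one FP64 fused multiply-add (2 FLOPs) per clock and twice that for FP32; those per-clock rates are assumptions chosen because they reproduce the quoted numbers, not something the presentation spells out.

```python
def peak_vector_tflops(stream_processors: int, clock_ghz: float,
                       flops_per_lane_per_clock: int) -> float:
    """Peak throughput in TFLOPs = lanes x FLOPs per clock x clock (GHz) / 1000."""
    return stream_processors * flops_per_lane_per_clock * clock_ghz / 1000

CORES = 16_384       # MI355X stream processors
CLOCK_GHZ = 2.4      # MI355X max clock

fp64 = peak_vector_tflops(CORES, CLOCK_GHZ, 2)  # assumed: 1 FMA per lane per clock
fp32 = peak_vector_tflops(CORES, CLOCK_GHZ, 4)  # assumed: 2 FMAs per lane per clock

print(f"Vector FP64: {fp64:.1f} TFLOPs")
print(f"Vector FP32: {fp32:.1f} TFLOPs")
```

The same formula with 19,456 cores at 2.1 GHz gives roughly 81.7 TFLOPs of FP64, matching the MI300X entry in the table further down.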

Compared to NVIDIA's GB200 SXM systems, AMD says the MI355X OAM solution offers up to 2.1x higher compute output across AI and HPC workloads.

You can see the SoC block diagram of the Instinct MI350 series GPU below:

The AMD Instinct MI350 series AI accelerators also support flexible GPU partitioning per socket, where the memory can be partitioned into two separate clusters. This flexibility also applies to the XCDs, which can be split into quad-XCD clusters or into dual or single-XCD blocks, allowing the chip to support 8 instances of 70B models in CPX+NPS2.
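
To get a feel for the 8-instance claim, here is a back-of-the-envelope sketch. It assumes CPX exposes each of the 8 XCDs as its own partition, NPS2 splits the 288 GB into two equal memory domains, and the 70B model is stored with 4-bit weights. The weight precision in particular is an assumption; AMD's slide does not specify it.

```python
HBM_GB = 288
MEMORY_DOMAINS = 2     # NPS2: memory split into two clusters
XCD_PARTITIONS = 8     # CPX: assumed one partition per XCD

PARAMS_BILLION = 70
BITS_PER_WEIGHT = 4    # assumption: FP4-class weights

gb_per_domain = HBM_GB / MEMORY_DOMAINS
partitions_per_domain = XCD_PARTITIONS // MEMORY_DOMAINS
gb_per_partition = gb_per_domain / partitions_per_domain

# 1 billion parameters at 8 bits each is 1 GB, so scale by bits / 8.
weights_gb = PARAMS_BILLION * BITS_PER_WEIGHT / 8

print(f"memory per partition: {gb_per_partition:.0f} GB")
print(f"70B weights at {BITS_PER_WEIGHT}-bit: {weights_gb:.0f} GB")
```

Under these assumptions the weights just fit in each partition's share of memory, with little room left for anything else, so treat this as an illustration rather than a deployment guide.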

The Infinity Fabric connectivity also enables 8 accelerators to communicate over bi-directional links of 154 GB/s each, a 20% speedup versus the prior generation.
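
The per-link and aggregate Infinity Fabric numbers in the article are consistent with each other if every accelerator in an 8-GPU node links directly to the other seven. That topology is an assumption here, as is reading the 20% uplift as a simple ratio; the sketch just runs the division.

```python
ACCELERATORS = 8
AGGREGATE_GB_S = 1075   # bi-directional aggregate quoted in the article
UPLIFT = 1.20           # "20% speedup versus the prior generation"

# Assumption: a fully connected node, so each GPU has one link per peer.
peer_links = ACCELERATORS - 1

per_link_gb_s = AGGREGATE_GB_S / peer_links
implied_prior_gen_link = per_link_gb_s / UPLIFT

print(f"per link:          {per_link_gb_s:.1f} GB/s")
print(f"implied prior gen: {implied_prior_gen_link:.1f} GB/s")
```

The per-link result rounds to the 154 GB/s quoted above.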

AMD also talks a bit about the assembly of each chip, from 3D packaging of the silicon to the package assembly, to OAM assembly, and the final heatsink attach phase. These OAMs then go into massive UBBs (2.0), which are universal base boards that house up to 8 accelerators. These go into an industry-standard host node, which ends up in a datacenter-ready EIA rack.

Just talking about the AI compute uplift, AMD claims that the Instinct MI350 series offers 20 PFLOPs of FP4/FP6 compute, which is a 4x gen-on-gen performance uplift. With HBM3e, you get faster data transfer speeds with a super-high capacity of 288 GB on both variants. There's also 256 MB of new Infinity Cache on the chips.

The 4U options can also fit into existing UBB8, which currently houses MI300X AC 750W and MI325X AC 1000W accelerators.

There are two finalized systems. The MI350X platform offers up to 36.9 FP16/BF16 and 73.9 FP8 PFLOPs and scales up to 10U air-cooled solutions. The MI355X platform offers up to 40.2 FP16/BF16 and 80.5 FP8 PFLOPs and scales up to 5U DLC (Direct Liquid Cooled) solutions. Both platforms offer 2.25 TB of HBM3e memory and 1075 GB/s of Infinity Fabric Bandwidth. These solutions are equipped with AMD's latest and greatest 5th Gen EPYC CPUs with Zen 5 cores and Pensando UEC-ready NICs.
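
The platform totals are 8-accelerator sums, which the sketch below unpacks. The only assumption is that the 2.25 TB memory figure uses 1024 GB per TB; variable names are illustrative.

```python
GPUS_PER_PLATFORM = 8
HBM_GB_PER_GPU = 288

total_hbm_gb = GPUS_PER_PLATFORM * HBM_GB_PER_GPU
total_hbm_tb = total_hbm_gb / 1024   # assumption: binary TB

# Platform PFLOPs quoted in the article, divided back down per accelerator.
mi355x_platform = {"FP16/BF16": 40.2, "FP8": 80.5}
mi350x_platform = {"FP16/BF16": 36.9, "FP8": 73.9}

mi355x_per_gpu = {k: v / GPUS_PER_PLATFORM for k, v in mi355x_platform.items()}
mi350x_per_gpu = {k: v / GPUS_PER_PLATFORM for k, v in mi350x_platform.items()}

print(f"HBM3e: {total_hbm_gb} GB = {total_hbm_tb:.2f} TB")
print("MI355X per accelerator:", mi355x_per_gpu)
print("MI350X per accelerator:", mi350x_per_gpu)
```

Note that the MI355X per-accelerator shares (about 5.0 and 10.1 PFLOPs) are roughly twice the matrix FP16 and FP8 figures listed earlier; the article does not say what accounts for the difference.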

The following are the numbers compared against the competition:

MI355X vs B200:

Memory: 1.6x Higher
Bandwidth: 1.0x Higher
FP64: 2.1x Higher
FP16: 1.1x Higher
FP8: 1.1x Higher
FP6: 2.2x Higher
FP4: 1.1x Higher

MI355X vs GB200:

Memory: 1.6x Higher
Bandwidth: 1.0x Higher
FP64: 2.0x Higher
FP16: 1.0x Higher
FP8: 1.0x Higher
FP6: 2.0x Higher
FP4: 1.0x Higher

But how does the Instinct MI355X compare to the last-gen MI300 series? AMD showed a massive 35x leap in inference throughput using Llama 3.1 405B, which is a huge increase.

AMD has already confirmed that the MI350 series will be available through various partners starting in Q3 2025. The next-generation MI400 series is already in the works and is planned for launch in 2026.

AMD Instinct AI Accelerators:

Accelerator Name	AMD Instinct MI500	AMD Instinct MI400	AMD Instinct MI350X	AMD Instinct MI325X	AMD Instinct MI300X	AMD Instinct MI250X
GPU Architecture	CDNA 6	CDNA 5	CDNA 4	Aqua Vanjaram (CDNA 3)	Aqua Vanjaram (CDNA 3)	Aldebaran (CDNA 2)
GPU Process Node	2nm	2nm+3nm	3nm	5nm+6nm	5nm+6nm	6nm
XCDs (Chiplets)	TBD	8 (MCM)	8 (MCM)	8 (MCM)	8 (MCM)	2 (MCM) 1 (Per Die)
GPU Cores	TBD	TBD	16,384	19,456	19,456	14,080
GPU Clock Speed (Max)	TBD	TBD	2400 MHz	2100 MHz	2100 MHz	1700 MHz
INT8 Compute	TBD	TBD	5200 TOPS	2614 TOPS	2614 TOPS	383 TOPS
FP6/FP4 Matrix	TBD	40 PFLOPs	20 PFLOPs	N/A	N/A	N/A
FP8 Matrix	TBD	20 PFLOPs	5 PFLOPs	2.6 PFLOPs	2.6 PFLOPs	N/A
FP16 Matrix	TBD	10 PFLOPs	2.5 PFLOPs	1.3 PFLOPs	1.3 PFLOPs	383 TFLOPs
FP32 Vector	TBD	TBD	157.3 TFLOPs	163.4 TFLOPs	163.4 TFLOPs	95.7 TFLOPs
FP64 Vector	TBD	TBD	78.6 TFLOPs	81.7 TFLOPs	81.7 TFLOPs	47.9 TFLOPs
VRAM	HBM4E	432 GB HBM4	288 GB HBM3e	256 GB HBM3e	192 GB HBM3	128 GB HBM2e
Infinity Cache	TBD	TBD	256 MB	256 MB	256 MB	N/A
Memory Clock	TBD	TBD	8.0 Gbps	5.9 Gbps	5.2 Gbps	3.2 Gbps
Memory Bus	TBD	TBD	8192-bit	8192-bit	8192-bit	8192-bit
Memory Bandwidth	TBD	19.6 TB/s	8 TB/s	6.0 TB/s	5.3 TB/s	3.2 TB/s
Form Factor	TBD	TBD	OAM	OAM	OAM	OAM
Cooling	TBD	Passive / Liquid	Passive / Liquid	Passive Cooling	Passive Cooling	Passive Cooling
TDP (Max)	TBD	TBD	1400W (355X)	1000W	750W	560W

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for the hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
