AMD's Instinct MI350 AI accelerator, featuring the CDNA 4 architecture, has been fully detailed, with its speeds and feeds, at Hot Chips 2025.
AMD Opens Up The Lid of Instinct MI350 Architectural Details, Products, & Solutions At Hot Chips 2025, Ready For Massive LLMs
It's been only two months since AMD launched its Instinct MI350 series, the flagship Accelerator and CDNA 4-based GPU for AI workloads. Today at Hot Chips, they went further into the details of this AI powerhouse.
So, starting off with what kicked off the development of the MI350 series, well, AI obviously, but to be more precise, it was the LLM growth trajectory as model sizes were getting larger each year. Two key factors to address these were to innovate on the data type format front, and another was to simply go bigger on the memory scale on chips. AMD implemented both and a lot more.
As a result, the CDNA-4-based AMD Instinct MI350 series accelerators improve performance and efficiency in doing AI workloads. They extended the HBM bandwidth and capacity, supporting faster AI training and inference on larger models with increased link speeds, and also enhanced power efficiency and performance.
The faster performance is achieved by reducing un-core power, enabling a wider infinity fabric for higher bandwidth at more power-efficient frequencies, and supporting lower precision data formats such as full-access FP8, and industry-standard micro-scaled MXFP6 and MXFP4 data types.
AMD offers its MI350 series in two flavors, the MI350X, which is the air-cooled variant with a 1000W TBP and a max clock speed of 2.2 GHz, while the higher-end MI355X is aimed at liquid-cooled datacenters with a max TBP of 1400W and a max clock speed of 2.4 GHz.
The chip is an architectural masterpiece from AMD, utilizing its years of engineering expertise in the chiplet domain, while utilizing the prowess of its partners for advanced packaging. The chip itself has a total of 185 billion transistors and adopts a 3D Multi-Chiplet layout with two chiplet types, along with HBM3e memory. A dual 3nm + 6nm process technology was leveraged for the MI350 series on the proven COWOS-S packaging technology.
Breaking down the chip, we first have the XCDs or Accelerator Complex Dies, which are based on TSMC's N3P "3nm" process technology. There are 8 of these on a single MI350X/MI355X package and 4 each on an IOD. The IOD or AMD I/O Base Die is based on TSMC's 6nm FinFET "N6" process technology and is a very cost-effective die thanks to its mature process node, which is optimal in terms of yields and costs. There are two of these per package. The IOD houses the Infinity Fabric AP interconnect.
There are a total of 8 HBM3E sites on the package, with each IOD connected to 4 sites. And lastly, there's the main interposer or package on which the entire silicon sits.
Diving deeper into the IO die, there are two of these, each with three Infinity Fabric Links and a PCIe Gen5 link to an AMD EPYC Host (128 GB/s). There are four HBM3E memory controllers, each connected to a 12-Hi stack comprised of 36 GB capacities operating at 8 Gbps for up to 8.0 TB/s of bandwidth. There's 288 GB of HBM3e capacity onboard the package.
Both IO dies are connected using an Infinity Fabric (Advanced Package) interconnect, which offers 5.5 TB/s of bisection bandwidth. There's also 256 MB of AMD Infinity Cache onboard the IO Dies. The Infinity Fabric Links are based on 4th Gen inter-socket links and offer 1075 GB/s bi-directional aggregate bandwidth to the XCDs.
The MI350 series chips pack a total of 32 AMD CDNA 4 compute units per XCD or 256 compute units in total with 128 stream processors per CU for a total of 16,384 cores. These are lower cores than the MI325 and MI300 series, which came packed with 304 compute units and a max core count of 19,456. These compute units are adjusted into eight zones, each with its own XCD, with each XCD packing 32 compute units. There are also 1024 Matrix Cores, and the core can hit a maximum clock speed of 2.4 GHz on the MI355X-class solutions.
The internal memory subsystem onboard the XCD includes 129 KiB of VGPR / SIMD, 512 KiB of Vector Registers/CU, 160 KiB of LDS/CU (537 GB/s), 32 KiB of L1 cache per CU, and 4 MiB of shared L2 cache per XCD. That gives us:
- 131 MB Vector Registers (Full Chip)
- 40 MB LDS (Full Chip)
- 8 MB L1 (Full Chip)
- 32 MB L2 (Full Chip)
- 256 MB Infinity Cache (Full Chip)
Moving down, AMD is sharing the data format and compute performance speedups of its MI355X versus MI300X:
- Vector FP16: 157.3 TFLOPs (1.0x)
- Matrix FP16/BF16: 2.5 PFLOPs (1.9x)
- Matrix FP8: 5.0 PFLOPs (1.9x)
- Matrix INT8/INT4: 5.0 PFLOPs (1.9x)
- Matrix MXFP6/MXFP4: 10 PFLOPs (New)
- Vector FP64: 78.6 TFLOPs (1.0x)
- Matrix FP64: 78.6 TFLOPs (0.5x)
- Vector FP32: 157.3 TFLOPs (1.0x)
- Matrix FP32: 157.3 TFLOPs (1.0x)
Compared to NVIDIA's GB200 SXM systems, the MI355X OAM solution offers a 2.1x higher compute output in AI and HPC performance.
You can see the SoC block diagram of the Instinct MI350 series GPU below:
The AMD Instinct MI350 series AI accelerators also support flexible GPU partitioning per socket, where the memory can be partitioned into two separate clusters. This flexibility also applies to the GPUs or XCDs, where you can separate the quad XCD cluster or separate them into dual or singular blocks, allowing the chip to support 8 instances of 70B models in CPX+NPS2.
The Infinity Fabric connectivity also enables 8 accelerators to communicate with a bi-directional link of 154 GB/s, a 20% speedup versus the prior generation.
AMD also talks a bit about the assembly of each chip, from 3D packaging of the silicon to the package assembly, to OAM assembly, and the final heatsink attach phase. These OAMs then go into massive UBBs (2.0), which are universal base boards that house up to 8 accelerators. These go into an industry-standard host node, which ends up in a datacenter-ready EIA rack.
Just talking about the AI compute uplift, AMD claims that the Instinct MI350 series offers 20 PFLOPs of FP4/FP6 compute, which is a 4x gen-on-gen performance uplift. With HBM3e, you get faster data transfer speeds with a super-high capacity of 288 GB on both variants. There's also 256 MB of new Infinity Cache on the chips.
The 4U options can also fit into existing UBB8, which currently houses MI300X AC 750W and MI325X AC 1000W accelerators.
There are two finalized systems. The MI350X platform offers up to 36.9 FP16/BF16 and 73.9 FP8 PFLOPs and scales up to 10U air-cooled solutions. The MI355X platform offers up to 40.2 FP16/BF16 and 80.5 FP8 PFLOPs and scales up to 5U DLC (Direct Liquid Cooled) solutions. Both platforms offer 2.25 TB of HBM3e memory and 1075 GB/s of Infinity Fabric Bandwidth. These solutions are equipped with AMD's latest and greatest 5th Gen EPYC CPUs with Zen 5 cores and Pensando UEC-ready NICs.
The following are the numbers compared against the competition:
MI355x vs B200:
- Memory: 1.6x Higher
- Bandwidth: 1.0x Higher
- FP64: 2.1x Higher
- FP16: 1.1x Higher
- FP8: 1.1x Higher
- FP6: 2.2x Higher
- FP4: 1.1x Higher
MI355x vs GB200:
- Memory: 1.6x Higher
- Bandwidth: 1.0x Higher
- FP64: 2.0x Higher
- FP16: 1.0x Higher
- FP8: 1.0x Higher
- FP6: 2.0x Higher
- FP4: 1.0x Higher
But how does Instinct MI355X compare to the last-gen MI300 series? Well, AMD just showed a massive 35x leap in Inference performance using Llama 3.1 405B (Throughput), and that's a huge increase.
AMD has already confirmed that the MI350 series will be available through various partners starting in Q3 2025. The next-generation MI400 series is already in the works and is planned for launch in 2026.
AMD Instinct AI Accelerators:
| Accelerator Name | AMD Instinct MI500 | AMD Instinct MI400 | AMD Instinct MI350X | AMD Instinct MI325X | AMD Instinct MI300X | AMD Instinct MI250X |
|---|---|---|---|---|---|---|
| GPU Architecture | CDNA 6 | CDNA 5 | CDNA 4 | Aqua Vanjaram (CDNA 3) | Aqua Vanjaram (CDNA 3) | Aldebaran (CDNA 2) |
| GPU Process Node | 2nm | 2nm+3nm | 3nm | 5nm+6nm | 5nm+6nm | 6nm |
| XCDs (Chiplets) | TBD | 8 (MCM) | 8 (MCM) | 8 (MCM) | 8 (MCM) | 2 (MCM) 1 (Per Die) |
| GPU Cores | TBD | TBD | 16,384 | 19,456 | 19,456 | 14,080 |
| GPU Clock Speed (Max) | TBD | TBD | 2400 MHz | 2100 MHz | 2100 MHz | 1700 MHz |
| INT8 Compute | TBD | TBD | 5200 TOPS | 2614 TOPS | 2614 TOPS | 383 TOPs |
| FP6/FP4 Matrix | TBD | 40 PFLOPs | 20 PFLOPs | N/A | N/A | N/A |
| FP8 Matrix | TBD | 20 PFLOPs | 5 PFLOPs | 2.6 PFLOPs | 2.6 PFLOPs | N/A |
| FP16 Matrix | TBD | 10 PFLOPs | 2.5 PFLOPs | 1.3 PFLOPs | 1.3 PFLOPs | 383 TFLOPs |
| FP32 Vector | TBD | TBD | 157.3 TFLOPs | 163.4 TFLOPs | 163.4 TFLOPs | 95.7 TFLOPs |
| FP64 Vector | TBD | TBD | 78.6 TFLOPs | 81.7 TFLOPs | 81.7 TFLOPs | 47.9 TFLOPs |
| VRAM | HBM4E | 432 GB HBM4 | 288 GB HBM3e | 256 GB HBM3e | 192 GB HBM3 | 128 GB HBM2e |
| Infinity Cache | TBD | TBD | 256 MB | 256 MB | 256 MB | N/A |
| Memory Clock | TBD | 19.6 TB/s | 8.0 Gbps | 5.9 Gbps | 5.2 Gbps | 3.2 Gbps |
| Memory Bus | TBD | TBD | 8192-bit | 8192-bit | 8192-bit | 8192-bit |
| Memory Bandwidth | TBD | TBD | 8 TB/s | 6.0 TB/s | 5.3 TB/s | 3.2 TB/s |
| Form Factor | TBD | TBD | OAM | OAM | OAM | OAM |
| Cooling | TBD | Passive / Liquid | Passive / Liquid | Passive Cooling | Passive Cooling | Passive Cooling |
| TDP (Max) | TBD | TBD | 1400W (355X) | 1000W | 750W | 560W |
Follow Wccftech on Google to get more of our news coverage in your feeds.
