AMD’s Instinct MI350 GPU Is An AI-Hardware Powerhouse: 3nm 3D Chiplet Based on CDNA 4, 185 Billion Transistors, 1400W TBP, Over 4000B LLM Support With Massive 288GB Memory

Aug 26, 2025 at 12:25pm EDT

AMD's Instinct MI350 AI accelerator, featuring the CDNA 4 architecture, has been fully detailed, with its speeds and feeds, at Hot Chips 2025.

AMD Opens Up The Lid of Instinct MI350 Architectural Details, Products, & Solutions At Hot Chips 2025, Ready For Massive LLMs

It's been only two months since AMD launched its Instinct MI350 series, its flagship CDNA 4-based GPU accelerator for AI workloads. Today at Hot Chips, the company went further into the details of this AI powerhouse.


So, what kicked off the development of the MI350 series? AI, obviously, but to be more precise, it was the LLM growth trajectory, with model sizes getting larger each year. There were two key ways to address this: innovate on the data type format front, and simply go bigger on on-package memory. AMD implemented both and a lot more.

As a result, the CDNA 4-based AMD Instinct MI350 series accelerators improve performance and efficiency in AI workloads. AMD extended the HBM bandwidth and capacity and increased link speeds, supporting faster AI training and inference on larger models, while also enhancing power efficiency.

The faster performance is achieved by reducing un-core power, enabling a wider Infinity Fabric for higher bandwidth at more power-efficient frequencies, and supporting lower-precision data formats such as full-access FP8 and the industry-standard micro-scaled MXFP6 and MXFP4 data types.

AMD offers its MI350 series in two flavors. The MI350X is the air-cooled variant with a 1000W TBP and a max clock speed of 2.2 GHz, while the higher-end MI355X is aimed at liquid-cooled datacenters with a max TBP of 1400W and a max clock speed of 2.4 GHz.

The chip is an architectural masterpiece from AMD, drawing on its years of engineering expertise in the chiplet domain and on the prowess of its partners in advanced packaging. The chip itself has a total of 185 billion transistors and adopts a 3D multi-chiplet layout with two chiplet types, along with HBM3e memory. The MI350 series combines 3nm and 6nm process technologies on the proven CoWoS-S packaging technology.

Breaking down the chip, we first have the XCDs or Accelerator Complex Dies, which are based on TSMC's N3P "3nm" process technology. There are 8 of these on a single MI350X/MI355X package, with 4 sitting on each IOD. The IOD or AMD I/O Base Die is based on TSMC's 6nm FinFET "N6" process technology and is a very cost-effective die thanks to its mature process node, which is optimal in terms of yields and costs. There are two IODs per package, and they house the Infinity Fabric AP interconnect.

There are a total of 8 HBM3E sites on the package, with each IOD connected to 4 sites. And lastly, there's the main interposer or package on which the entire silicon sits.

Diving deeper into the IO dies, there are two of them, each with three Infinity Fabric Links and a PCIe Gen5 link to an AMD EPYC host (128 GB/s). Each IO die has four HBM3E memory controllers, each connected to a 12-Hi stack with 36 GB of capacity operating at 8 Gbps. Across the package, that adds up to 288 GB of HBM3E capacity and up to 8.0 TB/s of bandwidth.
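The capacity and bandwidth totals above can be reproduced with some quick arithmetic. This is a minimal sketch; the 1024-bit interface per HBM3E stack is an assumption not stated in the text, although it agrees with the 8192-bit memory bus listed in the spec table at the end of the article.

```python
# Figures quoted in the article
STACKS = 8            # HBM3E sites per package (4 per IOD, 2 IODs)
GB_PER_STACK = 36     # capacity of one 12-Hi stack
PIN_RATE_GBPS = 8.0   # per-pin data rate

# Assumption (not stated in the article): 1024 data pins per stack
PINS_PER_STACK = 1024

capacity_gb = STACKS * GB_PER_STACK
bus_width_bits = STACKS * PINS_PER_STACK
# bits per second across the whole bus, converted to bytes
bandwidth_gbs = bus_width_bits * PIN_RATE_GBPS / 8
bandwidth_tbs = bandwidth_gbs / 1000

print(f"Capacity:  {capacity_gb} GB")
print(f"Bus width: {bus_width_bits}-bit")
print(f"Bandwidth: {bandwidth_tbs:.3f} TB/s")
```

Under that assumption the raw figure works out slightly above the rounded 8.0 TB/s that AMD quotes.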

Both IO dies are connected using an Infinity Fabric (Advanced Package) interconnect, which offers 5.5 TB/s of bisection bandwidth. There's also 256 MB of AMD Infinity Cache onboard the IO Dies. The Infinity Fabric Links are based on 4th Gen inter-socket links and offer 1075 GB/s bi-directional aggregate bandwidth to the XCDs.

The MI350 series chips pack 32 AMD CDNA 4 compute units per XCD, or 256 compute units in total, with 64 stream processors per CU for a total of 16,384 cores. That is fewer cores than the MI325 and MI300 series, which came packed with 304 compute units and a max core count of 19,456. The compute units are arranged into eight zones, one per XCD, with each XCD packing 32 compute units. There are also 1024 Matrix Cores, and the cores can hit a maximum clock speed of 2.4 GHz on the MI355X-class solutions.
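As a sanity check on the core counts, the totals quoted above can be divided out. This sketch uses only the article's numbers; the totals work out to 64 stream processors and 4 Matrix Cores per compute unit, and the same 64-per-CU ratio holds for the previous generation.

```python
# Figures quoted in the article
XCDS = 8
CUS_PER_XCD = 32
TOTAL_CORES = 16_384         # MI350 series stream processors
TOTAL_MATRIX_CORES = 1024

PREV_CUS = 304               # MI300X / MI325X
PREV_CORES = 19_456

total_cus = XCDS * CUS_PER_XCD
sp_per_cu = TOTAL_CORES // total_cus
matrix_per_cu = TOTAL_MATRIX_CORES // total_cus
prev_sp_per_cu = PREV_CORES // PREV_CUS
core_ratio = TOTAL_CORES / PREV_CORES

print(f"Compute units:            {total_cus}")
print(f"Stream processors per CU: {sp_per_cu} (previous gen: {prev_sp_per_cu})")
print(f"Matrix Cores per CU:      {matrix_per_cu}")
print(f"Core count vs MI300X:     {core_ratio:.1%}")
```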

The internal memory subsystem onboard the XCD includes 128 KiB of VGPRs per SIMD, 512 KiB of vector registers per CU, 160 KiB of LDS per CU (537 GB/s), 32 KiB of L1 cache per CU, and 4 MiB of shared L2 cache per XCD.
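To get a feel for how much on-die storage this is across the whole package, the per-CU and per-XCD figures can be multiplied out. This is a rough sketch; the four SIMD units per CU are an assumption not given in the text, chosen because it makes the 512 KiB per-CU register file divide into 128 KiB per SIMD.

```python
KIB_PER_MIB = 1024

# Figures quoted in the article
XCDS = 8
CUS_PER_XCD = 32
VGPR_PER_CU_KIB = 512
LDS_PER_CU_KIB = 160
L1_PER_CU_KIB = 32
L2_PER_XCD_MIB = 4

# Assumption (not stated in the article): 4 SIMD units per CU
SIMDS_PER_CU = 4

total_cus = XCDS * CUS_PER_XCD
vgpr_per_simd_kib = VGPR_PER_CU_KIB // SIMDS_PER_CU

# Package-wide totals in MiB
total_vgpr_mib = total_cus * VGPR_PER_CU_KIB // KIB_PER_MIB
total_lds_mib = total_cus * LDS_PER_CU_KIB // KIB_PER_MIB
total_l1_mib = total_cus * L1_PER_CU_KIB // KIB_PER_MIB
total_l2_mib = XCDS * L2_PER_XCD_MIB

print(f"VGPR per SIMD: {vgpr_per_simd_kib} KiB")
print(f"VGPR total:    {total_vgpr_mib} MiB")
print(f"LDS total:     {total_lds_mib} MiB")
print(f"L1 total:      {total_l1_mib} MiB")
print(f"L2 total:      {total_l2_mib} MiB")
```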

Moving down, AMD is sharing the data format and compute performance speedups of its MI355X versus MI300X:

Compared to NVIDIA's GB200 SXM systems, the MI355X OAM solution offers a 2.1x higher compute output in AI and HPC performance.

You can see the SoC block diagram of the Instinct MI350 series GPU below:

The AMD Instinct MI350 series AI accelerators also support flexible GPU partitioning per socket, where the memory can be partitioned into two separate clusters. This flexibility also applies to the XCDs, which can be split into quad-XCD clusters or further into dual or single-XCD blocks, allowing the chip to support 8 instances of 70B models in CPX+NPS2.
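The partitioning options can be sketched as simple arithmetic. Everything here is illustrative: the helper names are hypothetical, the even per-instance memory split is an assumption (the text only says memory is partitioned into two clusters), and the weight footprint counts weights only, ignoring KV cache, activations, and any scaling metadata. The article also does not say which precision AMD assumed for the 70B instances.

```python
# Figures quoted in the article
TOTAL_XCDS = 8
CUS_PER_XCD = 32
HBM_GB = 288

def partition(xcds_per_instance):
    """Hypothetical helper: resources per instance for a given XCD grouping."""
    instances = TOTAL_XCDS // xcds_per_instance
    return {
        "instances": instances,
        "cus_per_instance": xcds_per_instance * CUS_PER_XCD,
        # Assumption: memory is shared evenly between instances
        "even_memory_share_gb": HBM_GB / instances,
    }

def weight_footprint_gb(params_billion, bits_per_weight):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

# Full package, quad, dual, and single-XCD groupings
for xcds in (8, 4, 2, 1):
    print(xcds, partition(xcds))

# 70B-parameter model at a few weight precisions
for bits in (8, 6, 4):
    print(f"70B @ {bits}-bit: {weight_footprint_gb(70, bits)} GB")
```

With single-XCD instances, an even split gives 36 GB per instance, and of the three precisions shown only the 4-bit weights (35 GB) come in under that figure.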

The Infinity Fabric connectivity also enables 8 accelerators to communicate with a bi-directional link of 154 GB/s, a 20% speedup versus the prior generation.
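The per-link figure can be cross-checked against the 1075 GB/s aggregate quoted elsewhere in the article. This sketch assumes a fully connected topology in which each accelerator has one link to each of its seven peers; that topology is an assumption, not something the text states.

```python
# Figures quoted in the article
ACCELERATORS = 8
AGGREGATE_GBS = 1075     # Infinity Fabric aggregate bandwidth
PER_LINK_GBS = 154       # bi-directional, per link
GEN_ON_GEN_SPEEDUP = 1.20

# Assumption: one direct link to every other accelerator
peers = ACCELERATORS - 1

implied_per_link = AGGREGATE_GBS / peers
implied_previous_gen = PER_LINK_GBS / GEN_ON_GEN_SPEEDUP

print(f"Peers per accelerator:        {peers}")
print(f"Per-link from aggregate:      {implied_per_link:.1f} GB/s")
print(f"Implied prior-gen link speed: {implied_previous_gen:.1f} GB/s")
```

Under that assumption the aggregate and per-link numbers agree after rounding, and the 20% uplift implies a prior-generation link of roughly 128 GB/s.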

AMD also talks a bit about the assembly of each chip, from 3D packaging of the silicon to the package assembly, to OAM assembly, and the final heatsink attach phase. These OAMs then go into massive UBBs (2.0), which are universal base boards that house up to 8 accelerators. These go into an industry-standard host node, which ends up in a datacenter-ready EIA rack.

Just talking about the AI compute uplift, AMD claims that the Instinct MI350 series offers 20 PFLOPs of FP4/FP6 compute, which is a 4x gen-on-gen performance uplift. With HBM3e, you get faster data transfer speeds with a super-high capacity of 288 GB on both variants. There's also 256 MB of new Infinity Cache on the chips.

The 4U options can also fit into existing UBB8, which currently houses MI300X AC 750W and MI325X AC 1000W accelerators.

There are two finalized systems. The MI350X platform offers up to 36.9 FP16/BF16 and 73.9 FP8 PFLOPs and scales up to 10U air-cooled solutions. The MI355X platform offers up to 40.2 FP16/BF16 and 80.5 FP8 PFLOPs and scales up to 5U DLC (Direct Liquid Cooled) solutions. Both platforms offer 2.25 TB of HBM3e memory and 1075 GB/s of Infinity Fabric Bandwidth. These solutions are equipped with AMD's latest and greatest 5th Gen EPYC CPUs with Zen 5 cores and Pensando UEC-ready NICs.
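The platform figures can be broken down per accelerator as a rough check. This sketch assumes eight accelerators per platform (the UBB capacity mentioned earlier). The parameter-count estimate is a weights-only upper bound at 4 bits per weight, which is an assumption and ignores KV cache, activations, and scaling metadata.

```python
GPUS_PER_PLATFORM = 8    # assumption: one fully populated UBB
HBM_PER_GPU_GB = 288

# Platform figures quoted in the article (PFLOPs) and max clocks (GHz)
platforms = {
    "MI350X": {"fp16": 36.9, "fp8": 73.9, "clock_ghz": 2.2},
    "MI355X": {"fp16": 40.2, "fp8": 80.5, "clock_ghz": 2.4},
}

per_gpu_fp8 = {name: p["fp8"] / GPUS_PER_PLATFORM for name, p in platforms.items()}
per_gpu_fp16 = {name: p["fp16"] / GPUS_PER_PLATFORM for name, p in platforms.items()}

# The MI355X advantage should track its higher clock speed
throughput_ratio = platforms["MI355X"]["fp8"] / platforms["MI350X"]["fp8"]
clock_ratio = platforms["MI355X"]["clock_ghz"] / platforms["MI350X"]["clock_ghz"]

total_hbm_gb = GPUS_PER_PLATFORM * HBM_PER_GPU_GB
# Weights-only bound: bytes * 8 bits / 4 bits per weight, in billions of parameters
max_params_billion = total_hbm_gb * 8 / 4

print(per_gpu_fp8, per_gpu_fp16)
print(f"FP8 ratio {throughput_ratio:.3f} vs clock ratio {clock_ratio:.3f}")
print(f"Platform HBM: {total_hbm_gb} GB ({total_hbm_gb / 1024} when divided by 1024)")
print(f"4-bit weights-only bound: {max_params_billion:.0f}B parameters")
```

The 2304 GB total matches the quoted 2.25 TB when divided by 1024, and the 4-bit bound is in line with the "over 4000B" claim in the headline.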

AMD also shared numbers compared against the competition, in two comparisons: MI355X vs B200 and MI355X vs GB200.

But how does the Instinct MI355X compare to the last-gen MI300 series? AMD showed a massive 35x leap in inference throughput using Llama 3.1 405B.

AMD has already confirmed that the MI350 series will be available through various partners starting in Q3 2025. The next-generation MI400 series is already in the works and is planned for launch in 2026.

AMD Instinct AI Accelerators:

| Accelerator Name | AMD Instinct MI500 | AMD Instinct MI400 | AMD Instinct MI350X | AMD Instinct MI325X | AMD Instinct MI300X | AMD Instinct MI250X |
|---|---|---|---|---|---|---|
| GPU Architecture | CDNA 6 | CDNA 5 | CDNA 4 | Aqua Vanjaram (CDNA 3) | Aqua Vanjaram (CDNA 3) | Aldebaran (CDNA 2) |
| GPU Process Node | 2nm | 2nm+3nm | 3nm | 5nm+6nm | 5nm+6nm | 6nm |
| XCDs (Chiplets) | TBD | 8 (MCM) | 8 (MCM) | 8 (MCM) | 8 (MCM) | 2 (MCM), 1 (Per Die) |
| GPU Cores | TBD | TBD | 16,384 | 19,456 | 19,456 | 14,080 |
| GPU Clock Speed (Max) | TBD | TBD | 2400 MHz | 2100 MHz | 2100 MHz | 1700 MHz |
| INT8 Compute | TBD | TBD | 5200 TOPS | 2614 TOPS | 2614 TOPS | 383 TOPS |
| FP6/FP4 Matrix | TBD | 40 PFLOPs | 20 PFLOPs | N/A | N/A | N/A |
| FP8 Matrix | TBD | 20 PFLOPs | 5 PFLOPs | 2.6 PFLOPs | 2.6 PFLOPs | N/A |
| FP16 Matrix | TBD | 10 PFLOPs | 2.5 PFLOPs | 1.3 PFLOPs | 1.3 PFLOPs | 383 TFLOPs |
| FP32 Vector | TBD | TBD | 157.3 TFLOPs | 163.4 TFLOPs | 163.4 TFLOPs | 95.7 TFLOPs |
| FP64 Vector | TBD | TBD | 78.6 TFLOPs | 81.7 TFLOPs | 81.7 TFLOPs | 47.9 TFLOPs |
| VRAM | HBM4E | 432 GB HBM4 | 288 GB HBM3e | 256 GB HBM3e | 192 GB HBM3 | 128 GB HBM2e |
| Infinity Cache | TBD | TBD | 256 MB | 256 MB | 256 MB | N/A |
| Memory Clock | TBD | 19.6 TB/s | 8.0 Gbps | 5.9 Gbps | 5.2 Gbps | 3.2 Gbps |
| Memory Bus | TBD | TBD | 8192-bit | 8192-bit | 8192-bit | 8192-bit |
| Memory Bandwidth | TBD | TBD | 8 TB/s | 6.0 TB/s | 5.3 TB/s | 3.2 TB/s |
| Form Factor | TBD | TBD | OAM | OAM | OAM | OAM |
| Cooling | TBD | Passive / Liquid | Passive / Liquid | Passive Cooling | Passive Cooling | Passive Cooling |
| TDP (Max) | TBD | TBD | 1400W (355X) | 1000W | 750W | 560W |

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for the hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
