AMD's CDNA 2 powered Aldebaran GPU is headed for launch in the Instinct MI200 accelerator later this year. The GPU is going to feature an MCM design and will carry a massive amount of cores and memory. With all that we know about the chip so far, Twitter fellow, Locuza, has come up with a die block diagram of the full Aldebaran GPU and it looks like a beast.
AMD Aldebaran CDNA 2 GPU For Instinct MI200 HPC Accelerator Visualized - Up To 256 Compute Units, 128 GB HBM2e Memory
The block diagram is based on the latest details shared by Kepler_L2 for the CDNA 2 powered GPU. It was already confirmed that the Aldebaran GPU powering the Instinct MI200 will feature two dies, a secondary and a primary. The block diagram shows the two dies with each consisting of 8 shader engines for a total of 16 SE's. Each Shader Engine packs 16 CUs with full-rate FP64, packed FP32 & a 2nd Generation Matrix Engine for FP16 & BF16 operations.
Each die, as such, is composed of 128 compute units or 8192 stream processors. This rounds up to a total of 256 compute units or 16,384 stream processors for the entire chip. The Aldebaran GPU is also powered by a new XGMI interconnect. Each chiplet features a VCN 2.6 engine and the main IO controller.
The full config of Aldebaran is 2 dies x 8 Shader Engines x 16 Compute Units, but MI200 might be only 224 CUs enabled 🧐 https://t.co/RO7RptgYSk
— Kepler (@Kepler_L2) June 10, 2021
One interesting change relates to the vector units supporting full rate FP64 and packed FP32 ops.
64b DPP (Data-Parallel Primitives) were added.
The Matrix Units also support FP64 now, though I'm not sure if you get higher throughput vs. vector ALUs.
Both might do 128 ops/clk.— Locuza (@Locuza_) July 1, 2021
Moving over to DRAM, AMD has gone with an 8-channel interface consisting of 1024-bit interfaces for an 8192-bit wide bus interface. Each interface can support 2GB HBM2e DRAM modules. This should give us up to 16 GB of HBM2e memory capacity per stack and since there are eight stacks in total, the total amount of capacity would be a whopping 128 GB. That's 48 GB more than the A100 which houses 80 GB HBM2e memory. This would really be a juggernaut of an HPC GPU but we also expect some really high power figures when it launches.
As for the product itself, Kepler_L2 states that the actual AMD Instinct MI200 accelerator will utilize a cut-down configuration comprising 224 CUs or 14,336 cores. That's around 14% lower cores than what the full Aldebaran GPU die has to offer.
Here's Everything We Know About AMD's CDNA 2 Architecture Powered Instinct Accelerators
The AMD CDNA 2 architecture will be powering the next-generation AMD Instinct HPC accelerators. We know that one of those accelerators will be the MI200 which will feature the Aldebaran GPU. It's going to be a very powerful chip and possibly the first GPU to feature an MCM design. The Instinct MI200 is going to compete against Intel's 7nm Ponte Vecchio and NVIDIA's refreshed Ampere parts. Intel and NVIDIA are also following the MCM route on their next-generation HPC accelerators but it looks like Ponte Vecchio is going to be available in 2022 and the same can be said for NVIDIA's next-gen HPC accelerator as their own roadmap confirmed.

In the previous Linux patch, it was revealed that l that the AMD Instinct MI200 'Aldebaran' GPU will feature HBM2E memory support. NVIDIA was the first to hop on board the HBM2E standard and will offer a nice boost over the standard HBM2 configuration used on the Arcturus-based MI100 GPU accelerator.
The latest Linux Kernel Patch revealed that the GPU carries 16 KB of L1 cache per CU which makes up 2 MB of the total L1 cache considering that each GPU will be packing 128 Compute Units. The GPU also carries 8 MB of shared L2 cache but carries 14 CUs per Shader Engine compared to 16 CUs per SE in the previous Instinct lineup. Regardless, it is stated that each CU on Aldebaran GPUs will have a significantly higher computing output.

Other features listed include SDMA (System Direct Memory Access) support which will allow data transfers over PCIe and XGMI/Infinity Cache subsystems. As far as Infinity Cache is concerned, it's looking like that won't be happening on HPC GPUs. Do note that AMD's CDNA 2 GPU will be fabricated on a brand new process node & are confirmed to feature a 3rd Generation AMD Infinity architecture that extends to Exascale by allowing up to 8-Way coherent GPU connectivity.
AMD Radeon Instinct Accelerators
| Accelerator Name | AMD Instinct MI400 | AMD Instinct MI350X | AMD Instinct MI300X | AMD Instinct MI300A | AMD Instinct MI250X | AMD Instinct MI250 | AMD Instinct MI210 | AMD Instinct MI100 | AMD Radeon Instinct MI60 | AMD Radeon Instinct MI50 | AMD Radeon Instinct MI25 | AMD Radeon Instinct MI8 | AMD Radeon Instinct MI6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CPU Architecture | Zen 5 (Exascale APU) | N/A | N/A | Zen 4 (Exascale APU) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| GPU Architecture | CDNA 4 | CDNA 3+? | Aqua Vanjaram (CDNA 3) | Aqua Vanjaram (CDNA 3) | Aldebaran (CDNA 2) | Aldebaran (CDNA 2) | Aldebaran (CDNA 2) | Arcturus (CDNA 1) | Vega 20 | Vega 20 | Vega 10 | Fiji XT | Polaris 10 |
| GPU Process Node | 4nm | 4nm | 5nm+6nm | 5nm+6nm | 6nm | 6nm | 6nm | 7nm FinFET | 7nm FinFET | 7nm FinFET | 14nm FinFET | 28nm | 14nm FinFET |
| GPU Chiplets | TBD | TBD | 8 (MCM) | 8 (MCM) | 2 (MCM) 1 (Per Die) | 2 (MCM) 1 (Per Die) | 2 (MCM) 1 (Per Die) | 1 (Monolithic) | 1 (Monolithic) | 1 (Monolithic) | 1 (Monolithic) | 1 (Monolithic) | 1 (Monolithic) |
| GPU Cores | TBD | TBD | 19,456 | 14,592 | 14,080 | 13,312 | 6656 | 7680 | 4096 | 3840 | 4096 | 4096 | 2304 |
| GPU Clock Speed | TBD | TBD | 2100 MHz | 2100 MHz | 1700 MHz | 1700 MHz | 1700 MHz | 1500 MHz | 1800 MHz | 1725 MHz | 1500 MHz | 1000 MHz | 1237 MHz |
| INT8 Compute | TBD | TBD | 2614 TOPS | 1961 TOPS | 383 TOPs | 362 TOPS | 181 TOPS | 92.3 TOPS | N/A | N/A | N/A | N/A | N/A |
| FP16 Compute | TBD | TBD | 1.3 PFLOPs | 980.6 TFLOPs | 383 TFLOPs | 362 TFLOPs | 181 TFLOPs | 185 TFLOPs | 29.5 TFLOPs | 26.5 TFLOPs | 24.6 TFLOPs | 8.2 TFLOPs | 5.7 TFLOPs |
| FP32 Compute | TBD | TBD | 163.4 TFLOPs | 122.6 TFLOPs | 95.7 TFLOPs | 90.5 TFLOPs | 45.3 TFLOPs | 23.1 TFLOPs | 14.7 TFLOPs | 13.3 TFLOPs | 12.3 TFLOPs | 8.2 TFLOPs | 5.7 TFLOPs |
| FP64 Compute | TBD | TBD | 81.7 TFLOPs | 61.3 TFLOPs | 47.9 TFLOPs | 45.3 TFLOPs | 22.6 TFLOPs | 11.5 TFLOPs | 7.4 TFLOPs | 6.6 TFLOPs | 768 GFLOPs | 512 GFLOPs | 384 GFLOPs |
| VRAM | TBD | HBM3e | 192 GB HBM3 | 128 GB HBM3 | 128 GB HBM2e | 128 GB HBM2e | 64 GB HBM2e | 32 GB HBM2 | 32 GB HBM2 | 16 GB HBM2 | 16 GB HBM2 | 4 GB HBM1 | 16 GB GDDR5 |
| Infinity Cache | TBD | TBD | 256 MB | 256 MB | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Memory Clock | TBD | TBD | 5.2 Gbps | 5.2 Gbps | 3.2 Gbps | 3.2 Gbps | 3.2 Gbps | 1200 MHz | 1000 MHz | 1000 MHz | 945 MHz | 500 MHz | 1750 MHz |
| Memory Bus | TBD | TBD | 8192-bit | 8192-bit | 8192-bit | 8192-bit | 4096-bit | 4096-bit bus | 4096-bit bus | 4096-bit bus | 2048-bit bus | 4096-bit bus | 256-bit bus |
| Memory Bandwidth | TBD | TBD | 5.3 TB/s | 5.3 TB/s | 3.2 TB/s | 3.2 TB/s | 1.6 TB/s | 1.23 TB/s | 1 TB/s | 1 TB/s | 484 GB/s | 512 GB/s | 224 GB/s |
| Form Factor | TBD | TBD | OAM | APU SH5 Socket | OAM | OAM | Dual Slot Card | Dual Slot, Full Length | Dual Slot, Full Length | Dual Slot, Full Length | Dual Slot, Full Length | Dual Slot, Half Length | Single Slot, Full Length |
| Cooling | TBD | TBD | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling |
| TDP (Max) | TBD | TBD | 750W | 760W | 560W | 500W | 300W | 300W | 300W | 300W | 300W | 175W | 150W |
Follow Wccftech on Google to get more of our news coverage in your feeds.






