AMD Radeon Instinct MI60, The First 7nm Vega 20 GPU Based 32GB HBM2 Graphics Card Detailed – 13.2 Billion Transistors on a 331mm2 Die, 7.4 TFLOPs Double Precision Compute, 1 TB/s Bandwidth

Nov 6

AMD has announced its latest Radeon Instinct MI60 graphics accelerator, which is also the world’s first 7nm graphics card to be shown publicly. The accelerator is aimed at the HPC market and uses the new 7nm Vega 20 GPU to deliver higher density along with more compute throughput and memory bandwidth than earlier Radeon Instinct parts.

AMD Radeon Instinct MI60, World’s First 7nm Graphics Accelerator Detailed – 64 Compute Units, 32 GB HBM2, 1 TB/s Bandwidth, and PCIe Gen 4.0 Support

There’s a lot to talk about, so let’s start with the specifications. The AMD Radeon Instinct MI60 uses the Vega 20 GPU, which is AMD’s first 7nm GPU. The 14nm Vega design was ported to 7nm and optimized primarily for the HPC sector. This in turn gave AMD a chance to fully utilize the Vega architecture, building on its compute capabilities and taking them a step further.


The Vega 20 GPU features a total of 13.23 billion transistors packed into a 331mm² die. It’s a really dense design, and AMD has also slightly optimized its GCN cores on Vega 20. With 7nm, AMD can run the chip at faster clock speeds, allowing for up to 7.4 TFLOPs of double precision compute, twice that in single precision at 14.8 TFLOPs and, similarly, twice that again in half precision, rated at 29.5 TFLOPs.


There are still 64 compute units making up 4096 stream processors but, as I mentioned before, they have been optimized for the HPC market, delivering faster compute operations and adding DL/ML instruction sets. Talking about deep learning operations, the Instinct MI60 now supports both INT8 and INT4, with maximum theoretical throughput rated at 118 TOPS in INT4 and 59.0 TOPS in INT8.
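The peak figures quoted above can be roughly reproduced from the stream processor count and the clock speeds listed in the comparison table at the end of this article. The sketch below is only back-of-the-envelope arithmetic, not an AMD tool: it assumes each stream processor retires one fused multiply-add (two operations) per clock in FP32, and that the other precisions scale by the factors implied by the quoted numbers. The names `RATE_VS_FP32` and `peak_tera_ops` are made up for this illustration.

```python
# Back-of-the-envelope peak throughput for Vega 20 (illustrative only).
# Assumption: one fused multiply-add (2 ops) per stream processor per clock in FP32.
# Assumption: other precisions scale as implied by the figures quoted in the article.
RATE_VS_FP32 = {
    "FP64": 0.5,  # half rate
    "FP32": 1.0,
    "FP16": 2.0,  # double rate
    "INT8": 4.0,
    "INT4": 8.0,
}


def peak_tera_ops(stream_processors: int, clock_mhz: float, precision: str) -> float:
    """Return theoretical peak throughput in tera-operations per second."""
    fp32_ops_per_second = stream_processors * 2 * clock_mhz * 1e6
    return fp32_ops_per_second * RATE_VS_FP32[precision] / 1e12


if __name__ == "__main__":
    # Core counts and clocks are taken from the table at the end of the article.
    for name, cores, clock_mhz in (("MI60", 4096, 1800.0), ("MI50", 3840, 1746.0)):
        for precision in RATE_VS_FP32:
            value = peak_tera_ops(cores, clock_mhz, precision)
            print(f"{name} {precision}: {value:.2f}")
```

Under these assumptions the MI60 works out to about 14.75 in FP32, which sits close to the 14.8 TFLOPs quoted above, and the FP64, FP16, INT8 and INT4 results round to the 7.4, 29.5, 59.0 and 118 figures.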

In terms of memory, we are looking at 32 GB of HBM2 VRAM with 1 TB/s of bandwidth. AMD is using four stacks of HBM2 in an 8-Hi design, allowing for the biggest and densest VRAM capacity AMD has featured on a single-chip GPU. In addition to the specifications, the Radeon Instinct MI60 is fully compliant with AMD’s ROCm software stack and makes use of a new machine learning engine that will extend AMD’s efforts in the Deep Learning and Artificial Intelligence space.

Key features of the AMD Radeon Instinct MI60 and MI50 accelerators include:

  • Optimized Deep Learning Operations: Provides flexible mixed-precision FP16, FP32, and INT4/INT8 capabilities to meet growing demand for dynamic and ever-changing workloads, from training complex neural networks to running inference against those trained networks.
  • World’s Fastest Double Precision PCIe Accelerator: The AMD Radeon Instinct MI60 is the world’s fastest double precision PCIe 4.0 capable accelerator, delivering up to 7.4 TFLOPS peak FP64 performance allowing scientists and researchers to more efficiently process HPC applications across a range of industries including life sciences, energy, finance, automotive, aerospace, academics, government, defense and more. The AMD Radeon Instinct MI50 delivers up to 6.7 TFLOPS FP64 peak performance while providing an efficient, cost-effective solution for a variety of deep learning workloads, as well as enabling high reuse in Virtual Desktop Infrastructure (VDI), Desktop-as-a-Service (DaaS) and cloud environments.
  • Up to 6X Faster Data Transfer: Two Infinity Fabric Links per GPU deliver up to 200 GB/s of peer-to-peer bandwidth – up to 6X faster than PCIe 3.0 alone – and enable the connection of up to 4 GPUs in a hive ring configuration (2 hives in 8 GPU servers).
  • Ultra-Fast HBM2 Memory: The AMD Radeon Instinct MI60 provides 32GB of HBM2 Error-correcting code (ECC) memory, and the Radeon Instinct MI50 provides 16GB of HBM2 ECC memory. Both GPUs provide full-chip ECC and Reliability, Accessibility and Serviceability (RAS) technologies, which are critical to delivering more accurate compute results for large-scale HPC deployments.
  • Secure Virtualized Workload Support: AMD MxGPU Technology, the industry’s only hardware-based GPU virtualization solution, which is based on the industry-standard SR-IOV (Single Root I/O Virtualization) technology, makes it difficult for hackers to attack at the hardware level, helping provide security for virtualized cloud deployments.
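The “up to 6X” figure in the Infinity Fabric bullet can be sanity-checked with some simple arithmetic. The sketch below assumes the comparison is against a PCIe 3.0 x16 slot counted in both directions, and that the 200 GB/s is likewise an aggregate bidirectional number; neither assumption is stated in the list above. The helper name `pcie_x16_gb_per_s` is made up for this illustration.

```python
# Rough check of the "up to 6X faster than PCIe 3.0" claim (illustrative only).
# Assumption: 200 GB/s is aggregate bidirectional Infinity Fabric bandwidth.
# Assumption: the baseline is a PCIe 3.0 x16 link counted in both directions.


def pcie_x16_gb_per_s(gt_per_s: float, lanes: int = 16) -> float:
    """One-direction bandwidth in GB/s, using the 128b/130b encoding of PCIe 3.0/4.0."""
    return gt_per_s * lanes * (128 / 130) / 8


pcie3_one_way = pcie_x16_gb_per_s(8.0)    # PCIe 3.0 runs at 8 GT/s per lane
pcie3_both_ways = 2 * pcie3_one_way
pcie4_one_way = pcie_x16_gb_per_s(16.0)   # PCIe 4.0 doubles the per-lane rate

infinity_fabric_gb_per_s = 200.0          # figure quoted in the list above
ratio = infinity_fabric_gb_per_s / pcie3_both_ways

if __name__ == "__main__":
    print(f"PCIe 3.0 x16, one direction:   {pcie3_one_way:.2f} GB/s")
    print(f"PCIe 3.0 x16, both directions: {pcie3_both_ways:.2f} GB/s")
    print(f"PCIe 4.0 x16, one direction:   {pcie4_one_way:.2f} GB/s")
    print(f"Infinity Fabric vs PCIe 3.0:   {ratio:.2f}x")
```

With those assumptions the ratio comes out a little above six, which is consistent with the “up to 6X” wording.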

AMD has also shared a roadmap showing that a new Radeon Instinct product, currently labeled “MI-Next”, will be launching next year, featuring higher performance, increased connectivity, and better software compatibility. As for the Radeon Instinct MI60, it is expected to ship this quarter, which would make it the first 7nm graphics card to hit the market, as there’s no other 7nm GPU product from the competition on the near horizon.

There will also be the Radeon Instinct MI50 accelerator, a slightly toned-down variant of the MI60 with 3840 cores, 16 GB of HBM2 and slightly lower compute rates, aimed at the machine inferencing market at a lower price point. Both cards feature a TDP of 300W. As for power connectors, the MI60 is equipped with dual 8-pin connectors while the MI50 uses an 8+6 pin configuration.

AMD Radeon Instinct MI60/MI50 GPU Block Diagram and Performance Slides:

AMD Radeon Instinct Accelerators:

Accelerator Name | AMD Radeon Instinct MI6 | AMD Radeon Instinct MI8 | AMD Radeon Instinct MI25 | AMD Radeon Instinct MI50 | AMD Radeon Instinct MI60
GPU Architecture | Polaris 10 | Fiji XT | Vega 10 | Vega 20 | Vega 20
GPU Process Node | 14nm FinFET | 28nm | 14nm FinFET | 7nm FinFET | 7nm FinFET
GPU Cores | 2304 | 4096 | 4096 | 3840 | 4096
GPU Clock Speed | 1237 MHz | 1000 MHz | 1500 MHz | 1746 MHz | 1800 MHz
FP16 Compute | 5.7 TFLOPs | 8.2 TFLOPs | 24.6 TFLOPs | 26.8 TFLOPs | 29.6 TFLOPs
FP32 Compute | 5.7 TFLOPs | 8.2 TFLOPs | 12.3 TFLOPs | 13.4 TFLOPs | 14.8 TFLOPs
FP64 Compute | 384 GFLOPs | 512 GFLOPs | 768 GFLOPs | 6.7 TFLOPs | 7.4 TFLOPs
Memory Clock | 1750 MHz | 500 MHz | 472 MHz | 500 MHz | 500 MHz
Memory Bus | 256-bit | 4096-bit | 2048-bit | 4096-bit | 4096-bit
Memory Bandwidth | 224 GB/s | 512 GB/s | 484 GB/s | 1 TB/s | 1 TB/s
Form Factor | Single Slot, Full Length | Dual Slot, Half Length | Dual Slot, Full Length | Dual Slot, Full Length | Dual Slot, Full Length
Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling
TDP | 150W | 175W | 300W | 300W | 300W