
NVIDIA Volta Tesla V100 and AMD Vega 10 GPUs Detailed at Hot Chips 2017 – The Flagship Compute Powerhouses of Team Green and Red


NVIDIA and AMD, the graphics giants of the modern world, have detailed their latest GPU architectures at Hot Chips 2017. The details include an in-depth look at the NVIDIA Volta and AMD Vega GPUs, which power high-performance machines tackling some of the most demanding compute tasks known to mankind.

NVIDIA Volta and AMD Vega GPUs Detailed at Hot Chips 2017 - The Most In-Depth Look You'll Ever Get at the Modern Day Powerhouses of the Graphics World

The NVIDIA Volta and AMD Vega GPUs were introduced this year, and both architectures are built for the most demanding tasks the HPC market can think of. We have already covered the NVIDIA Volta and AMD Vega architectures in depth, but here's an even closer look at the latest generation of FinFET GPUs from each company.


NVIDIA Tesla V100 - Built With the Volta GV100 Chip, Up To 84 SMs With 5376 CUDA Cores

First up, we have the NVIDIA Tesla V100, which is built using the Volta GV100 GPU. The GPU packs 21.1 billion transistors into a die measuring 815mm2. The Tesla V100 specifically uses 80 of the 84 SMs available on the GV100 GPU, which enables higher yields without sacrificing much compute capability. With 80 SMs on board, the GPU gets 5120 CUDA cores, 2560 double precision cores and 640 tensor cores. There's also 16 GB of HBM2 on board the chip, and by chip we are referring to the interposer that houses the GPU and VRAM.

The HBM2 featured on Tesla V100, developed by Samsung, is by far the fastest we have seen to date. It delivers 900 GB/s, which is almost a terabyte per second of bandwidth. There's also the NVLINK interconnect, rated at 300 GB/s per GPU, offering much higher communication speeds than the previous-generation link on Pascal GPUs, which was rated at 160 GB/s.

When we compare raw performance stats, the results look absolutely stunning. The Tesla P100 only released last year, yet the Tesla V100 is faster and better in every possible way. Deep learning training throughput has gone up by a factor of 12x (120 TFLOPs vs 10 TFLOPs), deep learning inference by a factor of roughly 6x (120 TFLOPs vs 21 TFLOPs), single and half precision compute are up by around 50%, while both cache and bandwidth have received significant upgrades.
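Those multipliers are straightforward ratios of the peak-throughput numbers quoted in this article; a quick sketch to double-check the arithmetic:

```python
# Peak throughput figures quoted in this article (TFLOPs)
p100 = {"dl_training": 10, "dl_inference": 21, "fp32": 10.6}
v100 = {"tensor": 120, "fp32": 15.7}

# V100's tensor cores vs. P100's general-purpose FLOPs
training_speedup = v100["tensor"] / p100["dl_training"]    # 12x
inference_speedup = v100["tensor"] / p100["dl_inference"]  # ~6x
fp32_gain = v100["fp32"] / p100["fp32"] - 1                # ~50% (48%)

print(f"{training_speedup:.0f}x training, "
      f"{inference_speedup:.1f}x inference, "
      f"{fp32_gain:.0%} FP32 uplift")
```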


NVIDIA mentions that it has achieved a 50% increase in efficiency per SM with Tesla V100 compared to Tesla P100. The improved SIMT architecture, along with tensor acceleration that can deliver up to a 9.3x speedup in certain workloads (given CUDA 9 software optimization), makes Volta a revolutionary step in GPU engineering for the company. You can read more about the NVIDIA Volta GV100 GPU and the Tesla V100 accelerator here.

NVIDIA Tesla Accelerator Specs:

| NVIDIA Tesla Graphics Card | Tesla K40 | Tesla M40 | Tesla P100 (PCIe) | Tesla P100 (SXM2) | Tesla V100 (PCIe) | Tesla V100 (SXM2) | Tesla V100S (PCIe) |
|---|---|---|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) | GV100 (Volta) |
| Process Node | 28nm | 28nm | 16nm | 16nm | 12nm | 12nm | 12nm |
| Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion | 21.1 Billion |
| GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 815 mm2 | 815 mm2 | 815 mm2 |
| CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64 |
| CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 5120 | 5120 | 5120 |
| Texture Units | 240 | 192 | 224 | 224 | 320 | 320 | 320 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 2560 | 2560 | 2560 |
| Base Clock | 745 MHz | 948 MHz | 1190 MHz | 1328 MHz | 1230 MHz | 1297 MHz | TBD |
| Boost Clock | 875 MHz | 1114 MHz | 1329 MHz | 1480 MHz | 1380 MHz | 1530 MHz | 1601 MHz |
| FP16 Compute | N/A | N/A | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.4 TFLOPs | 32.8 TFLOPs |
| FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.7 TFLOPs | 16.4 TFLOPs |
| FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.80 TFLOPs | 8.2 TFLOPs |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 16 GB HBM2 @ 732 GB/s / 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 1134 GB/s |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB | 6144 KB |
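The FP32 figures in the table follow directly from the core count and boost clock: each CUDA core can retire one fused multiply-add (two floating-point operations) per cycle. A quick check against the SXM2 entries:

```python
def peak_fp32_tflops(cuda_cores, boost_mhz):
    # One FMA = 2 floating-point operations per core per cycle
    return cuda_cores * 2 * boost_mhz * 1e6 / 1e12

# Tesla V100 (SXM2): 5120 CUDA cores at 1530 MHz boost
v100_sxm2 = peak_fp32_tflops(5120, 1530)   # ~15.7 TFLOPs
# Tesla P100 (SXM2): 3584 CUDA cores at 1480 MHz boost
p100_sxm2 = peak_fp32_tflops(3584, 1480)   # ~10.6 TFLOPs

print(round(v100_sxm2, 1), round(p100_sxm2, 1))
```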

AMD Vega 10 - The First Flagship Radeon GPU in More Than Two Years Detailed

AMD released its flagship Vega GPU, the Vega 64, earlier this month. It's aimed at the high-end segment, which has seen no action from team Radeon for the last two years, but now the wait is over. The AMD Vega 10 GPU, which powers Vega 64 as the full-fat die, is built on GlobalFoundries' 14nm FinFET process and has a die size of 486mm2, housing 12.5 billion transistors. The package measures 2256mm2 compared to the 2500mm2 of the Fiji chip. The power envelope is stated at 150-300W, which means there will be significantly more variants of this chip.

The last part is confirmed by another interesting detail in the slide, which mentions that the two stacks of HBM2 can incorporate 4 GB, 8 GB or 16 GB of VRAM. We have already seen 8 GB and 16 GB variants of the chip, but this shows that there are still more to come. A 4 GB HBM2 variant of a Vega 10 SKU that ships with a TDP under 200W would be pretty sweet.

When it comes to the architectural layout, Vega differs from Fiji in many respects. The block diagram of the Vega 10 core shows that the chip consists of a single graphics engine with 4 ACE (Asynchronous Compute Engine) units, 2 SDMA (System DMA) units and a fully operational IF (Infinity Fabric) interconnect that runs within the GPU. The graphics engine consists of 4 DSBRs (Draw Stream Binning Rasterizers), Flexible Geometry Engines, 64 Pixel Units and 256 texture units. The Unified Compute Engine is made up of 64 NCUs (Next-Generation Compute Units) that house 4096 stream processors, along with 4 MB of L2 cache.
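Those unit counts multiply out to the familiar Vega 64 configuration. A quick sanity check, assuming the standard GCN arrangement of 64 stream processors per NCU and the 1500 MHz clock listed for the Vega 10-based Instinct MI25:

```python
ncus = 64
sps_per_ncu = 64  # standard GCN compute unit width, carried over to Vega's NCUs
stream_processors = ncus * sps_per_ncu          # 4096, as stated above

# Peak FP32 at 1500 MHz (MI25's listed clock): 2 FLOPs (FMA) per SP per cycle
mi25_fp32_tflops = stream_processors * 2 * 1500e6 / 1e12   # ~12.3 TFLOPs

print(stream_processors, round(mi25_fp32_tflops, 1))
```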

AMD Radeon Instinct Accelerators:

| Accelerator Name | AMD Radeon Instinct MI6 | AMD Radeon Instinct MI8 | AMD Radeon Instinct MI25 | AMD Radeon Instinct MI50 | AMD Radeon Instinct MI60 |
|---|---|---|---|---|---|
| GPU Architecture | Polaris 10 | Fiji XT | Vega 10 | Vega 20 | Vega 20 |
| GPU Process Node | 14nm FinFET | 28nm | 14nm FinFET | 7nm FinFET | 7nm FinFET |
| GPU Cores | 2304 | 4096 | 4096 | 3840 | 4096 |
| GPU Clock Speed | 1237 MHz | 1000 MHz | 1500 MHz | 1746 MHz | 1800 MHz |
| FP16 Compute | 5.7 TFLOPs | 8.2 TFLOPs | 24.6 TFLOPs | 26.8 TFLOPs | 29.6 TFLOPs |
| FP32 Compute | 5.7 TFLOPs | 8.2 TFLOPs | 12.3 TFLOPs | 13.4 TFLOPs | 14.8 TFLOPs |
| FP64 Compute | 384 GFLOPs | 512 GFLOPs | 768 GFLOPs | 6.7 TFLOPs | 7.4 TFLOPs |
| Memory Clock | 1750 MHz | 500 MHz | 472 MHz | 500 MHz | 500 MHz |
| Memory Bus | 256-bit | 4096-bit | 2048-bit | 4096-bit | 4096-bit |
| Memory Bandwidth | 224 GB/s | 512 GB/s | 484 GB/s | 1 TB/s | 1 TB/s |
| Form Factor | Single Slot, Full Length | Dual Slot, Half Length | Dual Slot, Full Length | Dual Slot, Full Length | Dual Slot, Full Length |
| Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling |

AMD has put a lot of emphasis on SR-IOV (Single Root I/O Virtualization) and revealed that the Vega 10 GPU can support up to 16 virtual machines at once. AMD is also highlighting its Rapid Packed Math technology, which allows two 16-bit math operations to execute in place of a single 32-bit one. There's also talk of the ROCm software stack, which is constantly being updated and improved to support the latest professional workloads, so that AMD GPUs are properly optimized for various tasks, especially the Radeon Instinct line of GPU-based accelerators built on the Vega architecture. You can learn more about the Radeon Vega Instinct accelerators here.
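Rapid Packed Math is what doubles half-precision throughput on Vega, and the effect is visible in the Radeon Instinct table above: only the Vega-based parts list FP16 at twice their FP32 rate, while the older Polaris and Fiji accelerators run FP16 at the same rate as FP32. A quick sketch using the table's figures:

```python
# (FP32 TFLOPs, FP16 TFLOPs) from the Radeon Instinct table above
instinct = {
    "MI6 (Polaris 10)": (5.7, 5.7),
    "MI8 (Fiji XT)":    (8.2, 8.2),
    "MI25 (Vega 10)":   (12.3, 24.6),
}

for name, (fp32, fp16) in instinct.items():
    ratio = fp16 / fp32
    mode = "Rapid Packed Math (2:1 FP16)" if ratio == 2 else "1:1 FP16"
    print(f"{name}: {mode}")
```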