NVIDIA Volta Tesla V100 and AMD Vega 10 GPUs Detailed at Hot Chips 2017 – The Flagship Compute Powerhouses of Team Green and Red

Aug 29, 2017

NVIDIA and AMD, the graphics giants of the modern world, have detailed their next-generation GPU architectures at Hot Chips 2017. The latest details include an in-depth look at the NVIDIA Volta and AMD Vega GPUs, which power high-performance machines built for the most demanding compute tasks.

NVIDIA Volta and AMD Vega GPUs Detailed at Hot Chips 2017 – The Most In-Depth Look You’ll Ever Get at the Modern Day Powerhouses of the Graphics World

The NVIDIA Volta and AMD Vega GPUs were both introduced this year. Both architectures are built for the most demanding compute and graphics tasks that the HPC market can think of. We have already covered the NVIDIA Volta and AMD Vega architectures, but here’s an even closer look at the latest generation of FinFET GPUs from each company.

NVIDIA Tesla V100 – Built With the Tesla GV100 Chip, Up To 84 SMs With 5376 CUDA Cores

First up, we have the NVIDIA Tesla V100, which is built using the Volta GV100 GPU. The GPU consists of 21 billion transistors packed into a die measuring 815 mm2. The Tesla V100 uses 80 of the 84 SMs available on the GV100 GPU, which enables higher yields without sacrificing much compute capability. With 80 SMs enabled, we get 5120 CUDA cores, 2560 double precision cores and 640 tensor cores. There’s also 16 GB of HBM2 on board the chip, and by chip we mean the interposer package that houses the GPU and VRAM.
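The unit counts above follow directly from the per-SM layout. Here is a minimal sketch of that arithmetic, assuming 64 CUDA cores and 32 double precision cores per SM (as listed in the spec table) and 8 tensor cores per SM (640 divided by 80). The function and variable names are our own and purely illustrative.

```python
# Per-SM resource counts implied by the figures quoted in the article.
FP32_PER_SM = 64     # CUDA cores per SM (from the spec table)
FP64_PER_SM = 32     # double precision cores per SM (from the spec table)
TENSOR_PER_SM = 8    # assumption derived from 640 tensor cores / 80 SMs


def unit_totals(sm_count):
    """Return total execution units for a given number of enabled SMs."""
    return {
        "cuda_cores": sm_count * FP32_PER_SM,
        "fp64_cores": sm_count * FP64_PER_SM,
        "tensor_cores": sm_count * TENSOR_PER_SM,
    }


tesla_v100 = unit_totals(80)  # SMs enabled on the shipping Tesla V100
full_gv100 = unit_totals(84)  # SMs physically present on the GV100 die

print("Tesla V100:", tesla_v100)
print("Full GV100:", full_gv100)
print(f"Enabled fraction of SMs: {80 / 84:.1%}")
```

The full 84 SM configuration works out to the 5376 CUDA cores mentioned in the heading, so the shipping part gives up a little under 5% of the die for yield.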

The HBM2 featured on the Tesla V100 is developed by Samsung and is by far the fastest we have seen to date. It delivers 900 GB/s, which is almost a terabyte per second of bandwidth. There’s also the NVLINK interconnect that runs at 300 GB/s per GPU, a much higher rate than the 160 GB/s offered a generation earlier on Pascal GPUs.
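To put those bandwidth numbers in perspective, here is a rough sketch that converts aggregate memory bandwidth into a per-pin data rate and compares the two NVLINK generations. The 4096-bit bus width and the 732 GB/s Tesla P100 figure come from the spec table further down; the helper name is hypothetical.

```python
def per_pin_gbps(bandwidth_gb_s, bus_width_bits):
    """Per-pin data rate in Gbit/s for a given aggregate bandwidth in GB/s."""
    return bandwidth_gb_s * 8 / bus_width_bits


v100_pin_rate = per_pin_gbps(900, 4096)  # Tesla V100 HBM2
p100_pin_rate = per_pin_gbps(732, 4096)  # Tesla P100 HBM2 (16 GB models)
nvlink_gain = 300 / 160                  # Volta NVLINK vs Pascal NVLINK

print(f"V100 HBM2 per-pin rate: {v100_pin_rate:.2f} Gbit/s")
print(f"P100 HBM2 per-pin rate: {p100_pin_rate:.2f} Gbit/s")
print(f"NVLINK generational gain: {nvlink_gain:.3f}x")
```

Since the bus width stays at 4096 bits, the whole memory bandwidth gain over Pascal comes from faster HBM2 signalling rather than a wider interface.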

When we compare raw performance stats, the results look striking. The Tesla P100 was released only last year, but the Tesla V100 is faster and better in every possible way. Deep learning training speed has gone up by a factor of 12x (120 TFLOPs vs 10 TFLOPs), deep learning inference has gone up by roughly 6x (120 TFLOPs vs 21 TFLOPs), single and half precision compute is up by around 50%, and both cache and bandwidth have received significant upgrades.
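The quoted multipliers are easy to sanity check. The sketch below simply divides the figures given above. The FP32 comparison uses the 15.0, 10.6 and 10.0 TFLOPs values from the spec table, and which Tesla P100 variant NVIDIA used as its baseline is an assumption on our part.

```python
def speedup(new, old):
    """Ratio of new throughput to old throughput."""
    return new / old


# (Tesla V100 TFLOPs, Tesla P100 TFLOPs) as quoted in the article
claims = {
    "deep learning training": (120.0, 10.0),
    "deep learning inference": (120.0, 21.0),
}
speedups = {name: speedup(new, old) for name, (new, old) in claims.items()}

# FP32 peak from the spec table: V100 (SXM2) against both P100 variants
fp32_vs_p100_sxm2 = speedup(15.0, 10.6)
fp32_vs_p100_pcie = speedup(15.0, 10.0)

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
print(f"FP32 vs P100 (SXM2): {fp32_vs_p100_sxm2:.2f}x")
print(f"FP32 vs P100 (PCIe): {fp32_vs_p100_pcie:.2f}x")
```

The training figure is an exact 12x, the inference figure rounds up from about 5.7x, and the 50% compute uplift only holds exactly against the 10.0 TFLOPs Tesla P100.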

NVIDIA mentions that they have achieved a 50% increase in efficiency per SM with Tesla V100 compared to Tesla P100. The improved SIMT architecture, along with tensor acceleration that can deliver up to a 9.3x speedup in certain workloads (given CUDA 9 software optimizations), makes Volta a revolutionary step in GPU engineering for them. You can read more about the NVIDIA Volta GV100 GPU and the Tesla V100 accelerator here.

NVIDIA Volta Tesla V100 Specs:

NVIDIA Tesla Graphics Card | Tesla K40 | Tesla M40 | Tesla P100 | Tesla P100 | Tesla P100 (SXM2) | Tesla V100 (PCI-Express) | Tesla V100 (SXM2)
GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta)
Process Node | 28nm | 28nm | 16nm | 16nm | 16nm | 12nm | 12nm
Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion
GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 610 mm2 | 815 mm2 | 815 mm2
SMs | 15 | 24 | 56 | 56 | 56 | 80 | 80
TPCs | 15 | 24 | 28 | 28 | 28 | 40 | 40
CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64
CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 3584 | 5120 | 5120
FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32
FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 1792 | 2560 | 2560
Base Clock | 745 MHz | 948 MHz | TBD | TBD | 1328 MHz | TBD | 1370 MHz
Boost Clock | 875 MHz | 1114 MHz | 1300 MHz | 1300 MHz | 1480 MHz | 1370 MHz | 1455 MHz
FP16 Compute | N/A | N/A | 18.7 TFLOPs | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.0 TFLOPs
FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.0 TFLOPs
FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.50 TFLOPs
Texture Units | 240 | 192 | 224 | 224 | 224 | 320 | 320
Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2
Memory Size and Bandwidth | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 900 GB/s
L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB
TDP | 235W | 250W | 250W | 250W | 300W | 250W | 300W
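
The compute rows in the table can be approximated from the core counts and boost clocks. Here is a minimal sketch, assuming each core retires one fused multiply-add (2 FLOPs) per clock, which is the usual convention for theoretical peak numbers. The function name and the choice of cards are ours.

```python
def peak_tflops(cores, boost_mhz, flops_per_clock=2):
    """Theoretical peak in TFLOPs: cores x FLOPs per clock x clock rate."""
    return cores * flops_per_clock * boost_mhz * 1e6 / 1e12


# (FP32 cores, FP64 cores, boost clock in MHz), copied from the spec table
cards = {
    "Tesla K40": (2880, 960, 875),
    "Tesla M40": (3072, 96, 1114),
    "Tesla P100 (SXM2)": (3584, 1792, 1480),
    "Tesla V100 (SXM2)": (5120, 2560, 1455),
}

results = {}
for name, (fp32_cores, fp64_cores, boost) in cards.items():
    results[name] = (
        peak_tflops(fp32_cores, boost),
        peak_tflops(fp64_cores, boost),
    )
    fp32, fp64 = results[name]
    print(f"{name}: FP32 {fp32:.2f} TFLOPs, FP64 {fp64:.2f} TFLOPs")
```

The Kepler, Maxwell and Pascal rows reproduce the table almost exactly. The Tesla V100 (SXM2) lands at about 14.9 and 7.45 TFLOPs at the listed 1455 MHz boost clock, so the 15.0 and 7.50 TFLOPs table entries look like rounded figures.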

AMD Vega 10 – The First Flagship Radeon GPU in More Than Two Years Detailed

AMD released their flagship Vega GPU, the Vega 64, earlier this month. It targets the high-end segment, which has seen no action from team Radeon for the last two years, but now the wait is over. The AMD Vega 10 GPU, or Vega 64 in its full-fat form, is built on the 14nm FinFET process from GlobalFoundries and has a die size of 486 mm2. It houses 12.5 billion transistors. The package measures 2256 mm2 compared to 2500 mm2 for the Fiji chip. The power envelope is stated at 150-300W, which suggests there will be more variants of this chip.

The last part is supported by another interesting detail in the slide, which mentions that the 2 stacks of HBM2 can incorporate 4 GB, 8 GB or 16 GB of VRAM. We have already seen 8 GB and 16 GB variants of the chip, but this shows that there are still more to come. A 4 GB HBM2 configuration for a Vega 10 SKU that ships with a TDP under 200W would be pretty sweet.

When it comes to the architectural layout, Vega differs from Fiji in many aspects. The block diagram of the Vega 10 core shows that the chip consists of a single graphics engine with 4 ACE (Asynchronous Compute Engine) units, 2 SDMA (System DMA) units and a fully operational IF (Infinity Fabric) interconnect that runs within the GPU. The graphics engine consists of 4 DSBRs (Draw Stream Binning Rasterizers), Flexible Geometry Engines, 64 Pixel Units and 256 texture units. The Unified Compute Engine is made up of 64 NCUs (Next-Gen Compute Units) that house 4096 stream processors, along with 4 MB of L2 cache.

AMD Radeon Instinct Accelerators:

Accelerator Name | AMD Radeon Instinct MI6 | AMD Radeon Instinct MI8 | AMD Radeon Instinct MI25 | AMD Radeon Instinct MI50 | AMD Radeon Instinct MI60
GPU Architecture | Polaris 10 | Fiji XT | Vega 10 | Vega 20 | Vega 20
GPU Process Node | 14nm FinFET | 28nm | 14nm FinFET | 7nm FinFET | 7nm FinFET
GPU Cores | 2304 | 4096 | 4096 | 3840 | 4096
GPU Clock Speed | 1237 MHz | 1000 MHz | 1500 MHz | 1746 MHz | 1800 MHz
FP16 Compute | 5.7 TFLOPs | 8.2 TFLOPs | 24.6 TFLOPs | 26.8 TFLOPs | 29.6 TFLOPs
FP32 Compute | 5.7 TFLOPs | 8.2 TFLOPs | 12.3 TFLOPs | 13.4 TFLOPs | 14.8 TFLOPs
FP64 Compute | 384 GFLOPs | 512 GFLOPs | 768 GFLOPs | 6.7 TFLOPs | 7.4 TFLOPs
Memory Clock | 1750 MHz | 500 MHz | 472 MHz | 500 MHz | 500 MHz
Memory Bus | 256-bit bus | 4096-bit bus | 2048-bit bus | 4096-bit bus | 4096-bit bus
Memory Bandwidth | 224 GB/s | 512 GB/s | 484 GB/s | 1 TB/s | 1 TB/s
Form Factor | Single Slot, Full Length | Dual Slot, Half Length | Dual Slot, Full Length | Dual Slot, Full Length | Dual Slot, Full Length
Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling | Passive Cooling
TDP | 150W | 175W | 300W | 300W | 300W
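
One way to read the table is through the throughput ratios between precisions, since they show where each architecture spends its silicon. The sketch below only divides the table values by each other; the dictionary layout and names are ours, and "MI60" refers to the last column (4096 cores at 1800 MHz).

```python
# (FP16, FP32, FP64) peak throughput in TFLOPs, copied from the table
instinct = {
    "MI6": (5.7, 5.7, 0.384),
    "MI8": (8.2, 8.2, 0.512),
    "MI25": (24.6, 12.3, 0.768),
    "MI60": (29.6, 14.8, 7.4),
}

ratios = {}
for name, (fp16, fp32, fp64) in instinct.items():
    ratios[name] = {
        "fp16_per_fp32": fp16 / fp32,  # 2.0 means packed 16-bit math
        "fp32_per_fp64": fp32 / fp64,  # lower means stronger double precision
    }
    r = ratios[name]
    print(f"{name}: FP16 is {r['fp16_per_fp32']:.1f}x FP32, "
          f"FP64 is 1/{r['fp32_per_fp64']:.0f} of FP32")
```

Polaris and Fiji run FP16 at the same rate as FP32, while the Vega based parts double it. Vega 10 keeps double precision at roughly 1/16 of the FP32 rate, whereas the Vega 20 column moves to half rate.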

AMD has put a lot of emphasis on SR-IOV (Single Root I/O Virtualization) and revealed that the Vega 10 GPU can support up to 16 virtual machines at once. AMD also highlights its Rapid Packed Math technology, which allows two 16-bit math operations to be processed in place of a single 32-bit operation. There’s also talk of the AMD ROCm stack, which is constantly being updated and improved to support the latest professional workloads so that AMD GPUs are properly optimized for various tasks, especially the Radeon Instinct line of GPU based accelerators which utilize the Vega architecture. You can learn more about the Radeon Vega Instinct accelerators here.
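
To illustrate the idea behind packed 16-bit math, here is a small sketch that stores two half precision values in one 32-bit word and adds two such words lane by lane. It only models the data layout in software using Python's standard struct module. It is not AMD's hardware implementation, and the function names are hypothetical.

```python
import struct


def pack_half2(lo, hi):
    """Pack two FP16 values into one 32-bit word (lo in the low 16 bits)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_half2(word):
    """Split a 32-bit word back into its two FP16 lanes."""
    return struct.unpack("<2e", struct.pack("<I", word))


def packed_add(a, b):
    """Add two packed words lane by lane, rounding each result to FP16."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    return pack_half2(a_lo + b_lo, a_hi + b_hi)


x = pack_half2(1.5, -2.0)
y = pack_half2(0.25, 4.0)
result = unpack_half2(packed_add(x, y))

print(f"x = 0x{x:08X}, y = 0x{y:08X}")
print("lane results:", result)
```

Because both lanes travel through a register of the same width as a single FP32 value, a unit that processes them together can in principle finish two FP16 operations per clock, which is consistent with the MI25 listing 24.6 TFLOPs of FP16 against 12.3 TFLOPs of FP32.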