NVIDIA Volta Tesla V100 and AMD Vega 10 GPUs Detailed at Hot Chips 2017 – The Flagship Compute Powerhouses of Team Green and Red

Aug 29

NVIDIA and AMD, the graphics giants of the modern world, have detailed their next-generation GPU architectures at Hot Chips 2017. The latest details include an in-depth look at the NVIDIA Volta and AMD Vega GPUs, which power high-performance machines built for the most demanding compute tasks.

NVIDIA Volta and AMD Vega GPUs Detailed at Hot Chips 2017 – The Most In-Depth Look You’ll Ever Get at the Modern Day Powerhouses of the Graphics World

The NVIDIA Volta and AMD Vega GPUs were both introduced this year. Both architectures are designed for the most demanding compute and graphics workloads in the HPC market. We have already covered the NVIDIA Volta and AMD Vega architectures, but here’s an even closer look at the latest generation of FinFET GPUs from each company.

NVIDIA Tesla V100 – Built With the Volta GV100 Chip, Up To 84 SMs With 5376 CUDA Cores

First up, we have the NVIDIA Tesla V100, which is built using the Volta GV100 GPU. The GPU consists of 21 billion transistors on a die measuring 815 mm². The Tesla V100 uses 80 of the 84 SMs available on the GV100 GPU, which enables higher yields without sacrificing much compute capability. With 80 SMs enabled, we get 5120 CUDA cores, 2560 double-precision cores and 640 tensor cores. There’s also 16 GB of HBM2 on board the chip, and by chip we are referring to the interposer that houses the GPU and VRAM.
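The core counts quoted above follow from the SM count. As a rough check, here is a minimal Python sketch. The per-SM figures of 64 FP32 and 32 FP64 cores come from the spec table later in the article, while 8 tensor cores per SM is inferred from 640 / 80 rather than stated in the text. The function name `gv100_units` is hypothetical and exists only for illustration.

```python
# Per-SM resources for GV100. The FP32 and FP64 values match the spec table;
# the tensor core value is an inference (640 tensor cores / 80 SMs).
FP32_PER_SM = 64
FP64_PER_SM = 32
TENSOR_PER_SM = 8


def gv100_units(active_sms):
    """Return total unit counts for a GV100 part with `active_sms` enabled SMs."""
    return {
        "fp32": active_sms * FP32_PER_SM,
        "fp64": active_sms * FP64_PER_SM,
        "tensor": active_sms * TENSOR_PER_SM,
    }


tesla_v100 = gv100_units(80)  # shipping Tesla V100 configuration
full_gv100 = gv100_units(84)  # fully enabled die
enabled_fraction = 80 / 84    # share of the die's SMs that are active

print(tesla_v100)
print(full_gv100)
print(f"{enabled_fraction:.1%} of SMs enabled")
```

Under these assumptions, disabling 4 SMs costs the Tesla V100 a little under 5% of the full die's cores, which is the yield trade-off described above.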

The HBM2 featured on the Tesla V100, developed by Samsung, is by far the fastest we have seen to date. It delivers 900 GB/s, which is almost a terabyte per second of bandwidth. There’s also the NVLink interconnect that runs at 300 GB/s per GPU, offering much higher communication speeds than the previous generation on Pascal GPUs, which was rated at 160 GB/s.
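To put those bandwidth figures in perspective, the sketch below works backwards from the aggregate numbers. It assumes the 4096-bit HBM2 interface listed in the spec table later in the article and decimal units (1 GB/s = 8 Gbit/s). The helper name `per_pin_gbps` is hypothetical, and the per-pin rates are derived values, not figures quoted by NVIDIA.

```python
def per_pin_gbps(bandwidth_gb_s, bus_width_bits):
    """Per-pin data rate in Gbit/s implied by an aggregate bandwidth in GB/s."""
    return bandwidth_gb_s * 8 / bus_width_bits


v100_pin_rate = per_pin_gbps(900, 4096)  # Tesla V100 HBM2
p100_pin_rate = per_pin_gbps(732, 4096)  # Tesla P100 (16 GB) HBM2
nvlink_gain = 300 / 160                  # Volta NVLink vs Pascal NVLink

print(f"V100 HBM2: {v100_pin_rate:.2f} Gbit/s per pin")
print(f"P100 HBM2: {p100_pin_rate:.2f} Gbit/s per pin")
print(f"NVLink: {nvlink_gain:.3f}x the Pascal rate")
```

On these numbers, the memory gain comes from faster signalling over the same bus width, while NVLink bandwidth nearly doubles.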

When we compare raw performance stats, the results are striking. The Tesla P100 was released only last year, but the Tesla V100 is faster and better in every possible way. Deep learning training throughput has gone up by a factor of 12 (120 TFLOPs vs 10 TFLOPs), deep learning inference has gone up by a factor of roughly 6 (120 TFLOPs vs 21 TFLOPs), and single- and half-precision compute is up by as much as 50%, while both cache and bandwidth have received significant upgrades.
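The quoted factors are simple ratios of the peak figures, and it helps to see how much rounding is involved. This is a small sketch using only numbers from the article; the FP32 values are taken from the spec table later in the article, and the function name `speedup` is hypothetical.

```python
def speedup(new_tflops, old_tflops):
    """Ratio of two peak throughput figures."""
    return new_tflops / old_tflops


training = speedup(120, 10)          # tensor TFLOPs vs P100 FP32 TFLOPs
inference = speedup(120, 21)         # tensor TFLOPs vs P100 FP16 TFLOPs
fp32_vs_pcie = speedup(15.0, 10.0)   # V100 (SXM2) vs P100 (PCI-Express)
fp32_vs_sxm2 = speedup(15.0, 10.6)   # V100 (SXM2) vs P100 (SXM2)

print(f"training:  {training:.1f}x")
print(f"inference: {inference:.1f}x")
print(f"FP32:      {fp32_vs_pcie:.2f}x to {fp32_vs_sxm2:.2f}x")
```

The inference ratio works out closer to 5.7x than 6x, and the 50% FP32 gain holds against the PCI-Express P100 rather than the SXM2 part.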

NVIDIA mentions that they have achieved a 50% increase in efficiency per SM with Tesla V100 compared to Tesla P100. The improved SIMT architecture, along with tensor acceleration that can deliver up to a 9.3x speedup in certain workloads (given CUDA 9 software optimizations), makes Volta a revolutionary step in GPU engineering for the company. You can read more about the NVIDIA Volta GV100 GPU and the Tesla V100 accelerator here.

NVIDIA Volta Tesla V100 Specs:

| NVIDIA Tesla Graphics Card | Tesla K40 | Tesla M40 | Tesla P100 | Tesla P100 | Tesla P100 (SXM2) | Tesla V100 (PCI-Express) | Tesla V100 (SXM2) |
|---|---|---|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) |
| Process Node | 28nm | 28nm | 16nm | 16nm | 16nm | 12nm | 12nm |
| Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² | 610 mm² | 610 mm² | 815 mm² | 815 mm² |
| CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64 |
| CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 3584 | 5120 | 5120 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 1792 | 2560 | 2560 |
| Base Clock | 745 MHz | 948 MHz | TBD | TBD | 1328 MHz | TBD | 1370 MHz |
| Boost Clock | 875 MHz | 1114 MHz | 1300 MHz | 1300 MHz | 1480 MHz | 1370 MHz | 1455 MHz |
| FP16 Compute | N/A | N/A | 18.7 TFLOPs | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.0 TFLOPs |
| FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.0 TFLOPs |
| FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.50 TFLOPs |
| Texture Units | 240 | 192 | 224 | 224 | 224 | 320 | 320 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 900 GB/s |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB |
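The FP32 figures in the table can be reproduced from the core counts and boost clocks. The sketch below assumes the usual convention that each CUDA core retires one fused multiply-add (counted as 2 floating-point operations) per clock. The function name `peak_tflops` is hypothetical, and the results are theoretical peaks, not measured performance.

```python
def peak_tflops(cores, boost_mhz, flops_per_core_per_clock=2):
    """Theoretical peak in TFLOPs, assuming one FMA (2 FLOPs) per core per clock."""
    return cores * flops_per_core_per_clock * boost_mhz * 1e6 / 1e12


# (CUDA cores, boost clock in MHz), taken from the table above
cards = {
    "Tesla K40": (2880, 875),
    "Tesla M40": (3072, 1114),
    "Tesla P100 (SXM2)": (3584, 1480),
    "Tesla V100 (PCI-Express)": (5120, 1370),
    "Tesla V100 (SXM2)": (5120, 1455),
}

fp32 = {name: peak_tflops(cores, mhz) for name, (cores, mhz) in cards.items()}

for name, tflops in fp32.items():
    print(f"{name}: {tflops:.2f} TFLOPs FP32")
```

With these inputs the Tesla V100 (SXM2) lands at roughly 14.9 TFLOPs, which the table rounds up to 15.0. The same formula applied to the FP64 core counts gives the double-precision rows.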

AMD Vega 10 – The First Flagship Radeon GPU in More Than Two Years Detailed

AMD released their flagship Vega GPU, the Vega 64, earlier this month. It targets the high-end segment, which has seen no action from team Radeon for the last two years, but now the wait is over. The AMD Vega 10 GPU, or Vega 64 in its full-fat configuration, is built on the 14nm FinFET process from GlobalFoundries and has a die size of 486 mm². It houses 12.5 billion transistors. The package measures 2256 mm² compared to 2500 mm² for the Fiji chip. The power envelope is stated at 150-300W, which suggests there will be significantly more variants of this chip.

The last part is supported by another interesting detail in the slide, which mentions that the 2 stacks of HBM2 can incorporate 4 GB, 8 GB or 16 GB of VRAM. We have already seen 8 GB and 16 GB variants of the chip, which suggests there are still more to come. A 4 GB HBM2 configuration for a Vega 10 SKU that ships with a TDP under 200W would be pretty sweet.

When it comes to the architectural layout, Vega is different compared to Fiji in many aspects. The block diagram of the Vega 10 core shows that the chip consists of a single graphics engine with 4 ACE (Asynchronous Compute Engine) units, 2 SDMA (System DMA) units and a fully operational IF (Infinity Fabric) interconnect that runs within the GPU. The graphics engine consists of 4 DSBRs (Draw Stream Binning Rasterizers), Flexible Geometry Engines, 64 Pixel Units and 256 texture units. The Unified Compute Engine is made up of 64 NCUs (Next Compute Units) that house 4096 stream processors and also 4 MB of L2 cache.

AMD Radeon Instinct Accelerators:

| Accelerator Name | AMD Radeon Instinct MI6 | AMD Radeon Instinct MI8 | AMD Radeon Instinct MI25 |
|---|---|---|---|
| GPU Architecture | Polaris 10 | Fiji XT | Vega 10 |
| GPU Process Node | 14nm FinFET | 28nm | 14nm FinFET |
| GPU Cores | 2304 | 4096 | 4096 |
| GPU Clock Speed | 1237 MHz | 1000 MHz | 1500 MHz |
| FP16 Compute | 5.7 TFLOPs | 8.2 TFLOPs | 24.6 TFLOPs |
| FP32 Compute | 5.7 TFLOPs | 8.2 TFLOPs | 12.3 TFLOPs |
| FP64 Compute | 384 GFLOPs | 512 GFLOPs | 768 GFLOPs |
| Memory Clock | 1750 MHz | 500 MHz | 472 MHz |
| Memory Bus | 256-bit bus | 4096-bit bus | 2048-bit bus |
| Memory Bandwidth | 224 GB/s | 512 GB/s | 484 GB/s |
| Form Factor | Single Slot, Full Length | Dual Slot, Half Length | Dual Slot, Full Length |
| Cooling | Passive Cooling | Passive Cooling | Passive Cooling |

AMD has put a lot of emphasis on SR-IOV (Single Root I/O Virtualization) and revealed that the Vega 10 GPU can support up to 16 virtual machines at once. AMD also mentions its Rapid Packed Math technology, which packs two 16-bit math operations into a single 32-bit operation. There’s also talk of the AMD ROCm stack, which is constantly being updated and improved to support the latest professional workloads so that AMD GPUs are properly optimized for various tasks, especially the Radeon Instinct line of GPU-based accelerators that use the Vega architecture. You can learn more about the Radeon Vega Instinct accelerators here.
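The idea behind packed 16-bit math is that two half-precision values share one 32-bit register, so a single operation can work on both. The Python sketch below is only a conceptual model of that storage layout, using the standard library's `struct` module. It is not AMD's instruction set or any ROCm API, and the function names `pack_half2`, `unpack_half2` and `packed_add` are hypothetical. Real hardware processes both lanes in one instruction, whereas this model handles them one after the other.

```python
import struct


def pack_half2(lo, hi):
    """Pack two FP16 values into one 32-bit word (low lane first)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_half2(word):
    """Split a 32-bit word back into its two FP16 lanes."""
    return struct.unpack("<2e", struct.pack("<I", word))


def packed_add(a, b):
    """Add both FP16 lanes of two packed words and return a packed result."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    return pack_half2(a_lo + b_lo, a_hi + b_hi)


x = pack_half2(1.5, 2.0)
y = pack_half2(0.25, 4.0)
result = unpack_half2(packed_add(x, y))

print(result)  # (1.75, 6.0)
```

This two-values-per-register arrangement is consistent with the MI25 row in the table above, where the FP16 figure (24.6 TFLOPs) is twice the FP32 figure (12.3 TFLOPs).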