NVIDIA Volta Tesla V100 and AMD Vega 10 GPUs Detailed at Hot Chips 2017 – The Flagship Compute Powerhouses of Team Green and Red
NVIDIA and AMD, the graphics giants of the modern world, have detailed their next-generation GPU architectures at Hot Chips 2017. The latest details include an in-depth look at the NVIDIA Volta and AMD Vega GPUs, which power high-performance machines built to complete the most demanding tasks known to mankind.
NVIDIA Volta and AMD Vega GPUs Detailed at Hot Chips 2017 – The Most In-Depth Look You’ll Ever Get at the Modern Day Powerhouses of the Graphics World
The NVIDIA Volta and AMD Vega GPUs were both introduced this year. Both architectures are built for the most demanding workloads that the HPC market can think of. We have already covered the NVIDIA Volta and AMD Vega architectures, but here’s an even closer look at the latest generation of FinFET GPUs from each company.
NVIDIA Tesla V100 – Built With the Volta GV100 Chip, Up To 84 SMs With 5376 CUDA Cores
First up, we have the NVIDIA Tesla V100, which is built using the Volta GV100 GPU. The GPU consists of 21.1 billion transistors packed into a die measuring 815mm2. The Tesla V100 uses 80 of the 84 SMs available on the GV100 GPU, which enables higher yields without sacrificing much of the compute capability. With the 80 SMs on board the GPU, we get 5120 CUDA cores, 2560 double precision cores and 640 tensor cores. There’s also 16 GB of HBM2 on board the chip, and by chip we mean the whole package, where an interposer houses both the GPU and the VRAM.
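The unit counts above follow from simple per-SM arithmetic. Here is a minimal Python sketch, assuming 64 FP32 CUDA cores and 32 FP64 cores per SM (as listed in the spec table) and 8 tensor cores per SM (an assumption derived from 640 / 80, not a figure stated in this article). The names `UNITS_PER_SM` and `gpu_units` are purely illustrative.

```python
# Per-SM unit counts for GV100. The FP32 and FP64 figures come from the
# spec table; the tensor figure is an assumption derived from 640 / 80.
UNITS_PER_SM = {"fp32_cuda": 64, "fp64": 32, "tensor": 8}


def gpu_units(active_sms: int) -> dict:
    """Scale the per-SM unit counts up to a whole GPU."""
    return {name: per_sm * active_sms for name, per_sm in UNITS_PER_SM.items()}


tesla_v100 = gpu_units(80)  # SMs enabled on the shipping Tesla V100
full_gv100 = gpu_units(84)  # every SM on the die enabled

print(tesla_v100)
print(full_gv100["fp32_cuda"])
```

Running it with 84 SMs gives the 5376 CUDA cores quoted for the full die, so the four disabled SMs cost the Tesla V100 256 CUDA cores.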
The HBM2 featured on the Tesla V100 is developed by Samsung and is by far the fastest we have seen to date. It runs at 900 GB/s, which is almost a terabyte per second of bandwidth. There’s also the NVLink interconnect that runs at 300 GB/s per GPU, offering much higher communication speeds than the previous-generation link on Pascal GPUs, which was rated at 160 GB/s.
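To put those bandwidth numbers in perspective, the sketch below works out the data rate each memory pin has to sustain, along with the generational NVLink gain. It assumes the 4096-bit memory interface listed in the spec table, and the helper name `per_pin_gbps` is hypothetical.

```python
def per_pin_gbps(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Data rate each memory pin must sustain, in Gbit/s."""
    # GB/s -> Gbit/s (x8), then spread evenly across every pin of the bus.
    return bandwidth_gb_s * 8 / bus_width_bits


v100_pin_rate = per_pin_gbps(900, 4096)  # Tesla V100 HBM2
nvlink_gain = 300 / 160                  # Volta NVLink vs Pascal NVLink

print(f"{v100_pin_rate:.2f} Gbit/s per pin")
print(f"{nvlink_gain:.3f}x NVLink bandwidth")
```

Under these assumptions the HBM2 works out to roughly 1.76 Gbit/s per pin, and NVLink bandwidth is just under double that of Pascal.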
When we compare raw performance stats, the results look absolutely shocking. The Tesla P100 was released only last year, yet the Tesla V100 is faster and better in every possible way. Deep learning training speed has gone up by a factor of 12x (120 TFLOPs vs 10 TFLOPs), deep learning inference has gone up by a factor of roughly 6x (120 TFLOPs vs 21 TFLOPs), and single and half precision compute is up by about 50%, while both cache and bandwidth have received significant upgrades.
NVIDIA mentions that they have achieved a 50% increase in efficiency per SM with Tesla V100 compared to Tesla P100. The improved SIMT architecture, along with tensor acceleration that can deliver up to a 9.3x speedup in certain workloads (given CUDA 9 software optimizations), makes Volta a revolutionary step in GPU engineering for them. You can read more about the NVIDIA Volta GV100 GPU and the Tesla V100 accelerator here.
NVIDIA Volta Tesla V100 Specs:
|NVIDIA Tesla Graphics Card||Tesla K40||Tesla M40||Tesla P100 (PCI-Express, 12 GB)||Tesla P100 (PCI-Express, 16 GB)||Tesla P100 (SXM2)||Tesla V100 (PCI-Express)||Tesla V100 (SXM2)|
|GPU||GK110 (Kepler)||GM200 (Maxwell)||GP100 (Pascal)||GP100 (Pascal)||GP100 (Pascal)||GV100 (Volta)||GV100 (Volta)|
|Transistors||7.1 Billion||8 Billion||15.3 Billion||15.3 Billion||15.3 Billion||21.1 Billion||21.1 Billion|
|GPU Die Size||551 mm2||601 mm2||610 mm2||610 mm2||610 mm2||815mm2||815mm2|
|CUDA Cores Per SM||192||128||64||64||64||64||64|
|CUDA Cores (Total)||2880||3072||3584||3584||3584||5120||5120|
|FP64 CUDA Cores / SM||64||4||32||32||32||32||32|
|FP64 CUDA Cores / GPU||960||96||1792||1792||1792||2560||2560|
|Base Clock||745 MHz||948 MHz||TBD||TBD||1328 MHz||TBD||1370 MHz|
|Boost Clock||875 MHz||1114 MHz||1300MHz||1300MHz||1480 MHz||1370 MHz||1455 MHz|
|FP16 Compute||N/A||N/A||18.7 TFLOPs||18.7 TFLOPs||21.2 TFLOPs||28.0 TFLOPs||30.0 TFLOPs|
|FP32 Compute||5.04 TFLOPs||6.8 TFLOPs||10.0 TFLOPs||10.0 TFLOPs||10.6 TFLOPs||14.0 TFLOPs||15.0 TFLOPs|
|FP64 Compute||1.68 TFLOPs||0.2 TFLOPs||4.7 TFLOPs||4.7 TFLOPs||5.30 TFLOPs||7.0 TFLOPs||7.50 TFLOPs|
|Memory Interface||384-bit GDDR5||384-bit GDDR5||4096-bit HBM2||4096-bit HBM2||4096-bit HBM2||4096-bit HBM2||4096-bit HBM2|
|Memory Size / Bandwidth||12 GB GDDR5 @ 288 GB/s||24 GB GDDR5 @ 288 GB/s||12 GB HBM2 @ 549 GB/s||16 GB HBM2 @ 732 GB/s||16 GB HBM2 @ 732 GB/s||16 GB HBM2 @ 900 GB/s||16 GB HBM2 @ 900 GB/s|
|L2 Cache Size||1536 KB||3072 KB||4096 KB||4096 KB||4096 KB||6144 KB||6144 KB|
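The compute rows in the table can be sanity-checked from the core counts and boost clocks. The sketch below assumes the usual peak-throughput convention of one fused multiply-add (counted as 2 floating-point operations) per core per clock. The function name `peak_tflops` is illustrative and not taken from any NVIDIA tool.

```python
def peak_tflops(cores: int, boost_mhz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak throughput in TFLOPs.

    Assumes each core retires one fused multiply-add (2 FLOPs) per clock.
    """
    # cores * MHz * FLOPs-per-clock gives MFLOPs; divide by 1e6 for TFLOPs.
    return cores * boost_mhz * flops_per_clock / 1e6


# Core counts and boost clocks taken from the spec table above.
v100_sxm2_fp32 = peak_tflops(5120, 1455)
v100_sxm2_fp64 = peak_tflops(2560, 1455)
v100_pcie_fp32 = peak_tflops(5120, 1370)
p100_sxm2_fp32 = peak_tflops(3584, 1480)

for label, value in [
    ("Tesla V100 (SXM2) FP32", v100_sxm2_fp32),
    ("Tesla V100 (SXM2) FP64", v100_sxm2_fp64),
    ("Tesla V100 (PCI-Express) FP32", v100_pcie_fp32),
    ("Tesla P100 (SXM2) FP32", p100_sxm2_fp32),
]:
    print(f"{label}: {value:.2f} TFLOPs")
```

The results land within rounding distance of the table's figures, which suggests the listed TFLOPs are theoretical peaks at boost clock rather than measured throughput.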
AMD Vega 10 – The First Flagship Radeon GPU in More Than Two Years Detailed
AMD released their flagship Vega GPU, the Vega 64, earlier this month. It’s aimed at the high-end segment, which has seen no action from team Radeon for the last two years, but now the wait is over. The AMD Vega 10 GPU, or Vega 64 (the full fat die), is built on the 14nm FinFET process from GlobalFoundries and has a die size of 486mm2. It houses 12.5 billion transistors. The package measures 2256mm2 compared to 2500mm2 for the Fiji chip. The power envelope is stated at 150-300W, which suggests there will be significantly more variants of this chip.
The last part is confirmed through another interesting detail in the slide which mentions that the 2 stacks of HBM2 can incorporate 4 GB, 8 GB and 16 GB of VRAM. We have already seen 8 GB and 16 GB variants of the chip but it shows that there are still more to come. A 4 GB HBM2 VRAM size for a Vega 10 SKU that ships with a TDP under 200W will be pretty sweet.
When it comes to the architectural layout, Vega differs from Fiji in many respects. The block diagram of the Vega 10 core shows that the chip consists of a single graphics engine with 4 ACE (Asynchronous Compute Engine) units, 2 SDMA (System DMA) units and a fully operational IF (Infinity Fabric) interconnect that runs within the GPU. The graphics engine consists of 4 DSBRs (Draw Stream Binning Rasterizers), Flexible Geometry Engines, 64 Pixel Units and 256 texture units. The Unified Compute Engine is made up of 64 NCUs (Next-Gen Compute Units) that house 4096 stream processors, along with 4 MB of L2 cache.
AMD Radeon Instinct Accelerators:
|Accelerator Name||AMD Radeon Instinct MI6||AMD Radeon Instinct MI8||AMD Radeon Instinct MI25||AMD Radeon Instinct MI50||AMD Radeon Instinct MI60|
|GPU Architecture||Polaris 10||Fiji XT||Vega 10||Vega 20||Vega 20|
|GPU Process Node||14nm FinFET||28nm||14nm FinFET||7nm FinFET||7nm FinFET|
|GPU Clock Speed||1237 MHz||1000 MHz||1500 MHz||1746 MHz||1800 MHz|
|FP16 Compute||5.7 TFLOPs||8.2 TFLOPs||24.6 TFLOPs||26.8 TFLOPs||29.6 TFLOPs|
|FP32 Compute||5.7 TFLOPs||8.2 TFLOPs||12.3 TFLOPs||13.4 TFLOPs||14.8 TFLOPs|
|FP64 Compute||384 GFLOPs||512 GFLOPs||768 GFLOPs||6.7 TFLOPs||7.4 TFLOPs|
|VRAM||16 GB GDDR5||4 GB HBM1||16 GB HBM2||16 GB HBM2||32 GB HBM2|
|Memory Clock||1750 MHz||500 MHz||945 MHz||1000 MHz||1000 MHz|
|Memory Bus||256-bit bus||4096-bit bus||2048-bit bus||4096-bit bus||4096-bit bus|
|Memory Bandwidth||224 GB/s||512 GB/s||484 GB/s||1 TB/s||1 TB/s|
|Form Factor||Single Slot, Full Length||Dual Slot, Half Length||Dual Slot, Full Length||Dual Slot, Full Length||Dual Slot, Full Length|
|Cooling||Passive Cooling||Passive Cooling||Passive Cooling||Passive Cooling||Passive Cooling|
AMD has put a lot of emphasis on SR-IOV (Single Root I/O Virtualization) and revealed that the Vega 10 GPU can support up to 16 virtual machines at once. AMD also mentions their Rapid Packed Math technology, which speeds up 16-bit math operations. There’s also talk of the AMD ROCm stack, which is constantly being updated and improved to support the latest professional workloads so that AMD GPUs are properly optimized for various tasks, especially the Radeon Instinct line of GPU-based accelerators that use the Vega architecture. You can learn more about the Radeon Vega Instinct accelerators here.
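The Vega 10 figures quoted for the Radeon Instinct MI25 can be reproduced the same way. This is a rough sketch, assuming 64 stream processors per NCU (4096 / 64), one fused multiply-add (2 FLOPs) per stream processor per clock, and that Rapid Packed Math doubles FP16 throughput by packing two 16-bit operations into each 32-bit lane. The function name `vega_peak_tflops` is illustrative.

```python
def vega_peak_tflops(ncus: int, clock_mhz: float, packed_fp16: bool = False) -> float:
    """Theoretical peak TFLOPs for a Vega-style GPU.

    Assumes 64 stream processors per NCU and one fused multiply-add
    (2 FLOPs) per stream processor per clock. With packed_fp16=True,
    two 16-bit operations are assumed to share each 32-bit lane.
    """
    stream_processors = ncus * 64
    flops_per_clock = 4 if packed_fp16 else 2
    # stream processors * MHz * FLOPs-per-clock gives MFLOPs.
    return stream_processors * clock_mhz * flops_per_clock / 1e6


mi25_fp32 = vega_peak_tflops(64, 1500)
mi25_fp16 = vega_peak_tflops(64, 1500, packed_fp16=True)

print(f"MI25 FP32: {mi25_fp32:.1f} TFLOPs")
print(f"MI25 FP16: {mi25_fp16:.1f} TFLOPs")
```

Under these assumptions the MI25 comes out at about 12.3 TFLOPs FP32 and 24.6 TFLOPs FP16, in line with the table above.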