NVIDIA Volta Tesla V100 and AMD Vega 10 GPUs Detailed at Hot Chips 2017 – The Flagship Compute Powerhouses of Team Green and Red
NVIDIA and AMD, the two giants of the graphics world, have detailed their next-generation GPU architectures at Hot Chips 2017. The latest details include an in-depth look at the NVIDIA Volta and AMD Vega GPUs, which are powering high-performance machines built for the most demanding compute tasks around.
NVIDIA Volta and AMD Vega GPUs Detailed at Hot Chips 2017 – The Most In-Depth Look You’ll Ever Get at the Modern Day Powerhouses of the Graphics World
The NVIDIA Volta and AMD Vega GPUs were introduced this year. Both architectures are designed for the most demanding workloads the HPC market can throw at them. We have already covered the NVIDIA Volta and AMD Vega architectures in depth, but here's an even closer look at the latest generation of FinFET GPUs from each company.
NVIDIA Tesla V100 – Built With the Tesla GV100 Chip, Up To 84 SMs With 5376 CUDA Cores
First up, we have the NVIDIA Tesla V100, which is built using the Volta GV100 GPU. The GPU consists of 21.1 billion transistors packed into a die measuring 815mm2. The Tesla V100 specifically uses 80 of the 84 SMs available on the GV100 GPU, which enables higher yields without sacrificing much compute capability. With 80 SMs on board, the GPU gets 5120 CUDA cores, 2560 double-precision (FP64) cores and 640 tensor cores. There's also 16 GB of HBM2 on board the chip, and by chip we are referring to the interposer that houses both the GPU and the VRAM.
The HBM2 featured on the Tesla V100 is by far the fastest we have seen to date and is supplied by Samsung. It delivers 900 GB/s, which is almost a terabyte per second of bandwidth. There's also the NVLINK interconnect running at 300 GB/s per GPU, offering much higher communication speeds than the previous generation on Pascal GPUs, which was rated at 160 GB/s.
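As a sanity check, the 900 GB/s figure follows directly from the 4096-bit HBM2 interface; the per-pin data rate used below (~1.758 Gbps) is our back-of-the-envelope assumption chosen to match the quoted bandwidth, not an official spec:

```python
# Peak memory bandwidth = bus width (in bytes) x per-pin data rate.
# The 4096-bit interface is from the V100 spec sheet; the per-pin
# rate of ~1.758 Gbps is an assumption that reproduces ~900 GB/s.
bus_width_bits = 4096
data_rate_gbps = 1.758  # per pin, approximate

bandwidth_gbs = (bus_width_bits / 8) * data_rate_gbps
print(f"{bandwidth_gbs:.0f} GB/s")  # ~900 GB/s
```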
When we compare raw performance stats, the results look absolutely shocking. The Tesla P100 was released only last year, yet the Tesla V100 is faster and better in every possible way. Deep learning training throughput has gone up by a factor of 12x (120 TFLOPs vs 10 TFLOPs), deep learning inference has gone up by a factor of 6x (120 TFLOPs vs 21 TFLOPs), single- and half-precision compute are up by 50%, while both cache and memory bandwidth have received significant upgrades.
NVIDIA mentions that they have achieved a 50% increase in efficiency per SM with the Tesla V100 compared to the Tesla P100. The improved SIMT architecture, along with tensor acceleration that can deliver up to a 9.3x speedup in certain workloads (provided the CUDA 9 software optimizations are used), makes Volta a revolutionary step in GPU engineering for the company. You can read more about the NVIDIA Volta GV100 GPU and the Tesla V100 accelerator here.
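The headline TFLOPs numbers can be reproduced from the core counts and the SXM2 boost clock. A quick sketch, assuming the usual 2 ops per FMA for CUDA cores and NVIDIA's stated 64 FMAs per clock per tensor core (a 4x4x4 matrix multiply-accumulate):

```python
# Peak FP32 throughput = CUDA cores x 2 ops per FMA x boost clock.
# Peak tensor throughput = tensor cores x 64 FMAs/clock x 2 ops x boost clock.
cores_fp32 = 5120
tensor_cores = 640
boost_clock_ghz = 1.455  # Tesla V100 SXM2 boost clock

fp32_tflops = cores_fp32 * 2 * boost_clock_ghz / 1000
tensor_tflops = tensor_cores * 64 * 2 * boost_clock_ghz / 1000

print(round(fp32_tflops, 1))    # ~14.9, i.e. the quoted 15 TFLOPs
print(round(tensor_tflops, 1))  # ~119.2, i.e. the quoted 120 TFLOPs
```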
NVIDIA Volta Tesla V100 Specs:
| NVIDIA Tesla Graphics Card | Tesla K40 | Tesla M40 | Tesla P100 (PCI-Express, 12 GB) | Tesla P100 (PCI-Express, 16 GB) | Tesla P100 (SXM2) | Tesla V100 (PCI-Express) | Tesla V100 (SXM2) |
|---|---|---|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) |
| Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion |
| GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 610 mm2 | 815 mm2 | 815 mm2 |
| CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64 |
| CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 3584 | 5120 | 5120 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 1792 | 2560 | 2560 |
| Base Clock | 745 MHz | 948 MHz | TBD | TBD | 1328 MHz | TBD | 1370 MHz |
| Boost Clock | 875 MHz | 1114 MHz | 1300 MHz | 1300 MHz | 1480 MHz | 1370 MHz | 1455 MHz |
| FP16 Compute | N/A | N/A | 18.7 TFLOPs | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.0 TFLOPs |
| FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.0 TFLOPs |
| FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.50 TFLOPs |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 900 GB/s |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB |
AMD Vega 10 – The First Flagship Radeon GPU in More Than Two Years Detailed
AMD released their flagship Vega GPU, the Radeon RX Vega 64, earlier this month. It's aimed at the high-end segment, which had seen no action from team Radeon for the last two years, but now the wait is over. The AMD Vega 10 GPU that powers Vega 64 (the full-fat die) is fabricated on GlobalFoundries' 14nm FinFET process and has a die size of 486mm2, housing 12.5 billion transistors. The package measures 2256mm2, compared to 2500mm2 for the Fiji chip. The power envelope is stated at 150-300W, which suggests there will be significantly more variants of this chip.
That last part is backed by another interesting detail in the slide, which mentions that the two stacks of HBM2 can incorporate 4 GB, 8 GB or 16 GB of VRAM. We have already seen 8 GB and 16 GB variants of the chip, but this shows there are still more to come. A 4 GB HBM2 Vega 10 SKU that ships with a TDP under 200W would be pretty sweet.
When it comes to the architectural layout, Vega differs from Fiji in many respects. The block diagram of the Vega 10 core shows that the chip consists of a single graphics engine with 4 ACE (Asynchronous Compute Engine) units, 2 SDMA (System DMA) units and a fully operational Infinity Fabric interconnect running within the GPU. The graphics engine houses 4 DSBRs (Draw Stream Binning Rasterizers), flexible geometry engines, 64 pixel units and 256 texture units. The unified compute engine is made up of 64 NCUs (Next-Generation Compute Units) housing 4096 stream processors, backed by 4 MB of L2 cache.
AMD Radeon Instinct Accelerators:
| Accelerator Name | AMD Radeon Instinct MI6 | AMD Radeon Instinct MI8 | AMD Radeon Instinct MI25 |
|---|---|---|---|
| GPU Architecture | Polaris 10 | Fiji XT | Vega 10 |
| GPU Process Node | 14nm FinFET | 28nm | 14nm FinFET |
| GPU Clock Speed | 1237 MHz | 1000 MHz | 1500 MHz |
| FP16 Compute | 5.7 TFLOPs | 8.2 TFLOPs | 24.6 TFLOPs |
| FP32 Compute | 5.7 TFLOPs | 8.2 TFLOPs | 12.3 TFLOPs |
| FP64 Compute | 384 GFLOPs | 512 GFLOPs | 768 GFLOPs |
| VRAM | 16 GB GDDR5 | 4 GB HBM1 | 16 GB HBM2 |
| Memory Clock | 1750 MHz | 500 MHz | 472 MHz |
| Memory Bus | 256-bit bus | 4096-bit bus | 2048-bit bus |
| Memory Bandwidth | 224 GB/s | 512 GB/s | 484 GB/s |
| Form Factor | Single Slot, Full Length | Dual Slot, Half Length | Dual Slot, Full Length |
| Cooling | Passive Cooling | Passive Cooling | Passive Cooling |
AMD has put a lot of emphasis on SR-IOV (Single Root I/O Virtualization) and revealed that the Vega 10 GPU can support up to 16 virtual machines at once. AMD also highlighted its Rapid Packed Math technology, which packs two 16-bit operations into a single 32-bit register, doubling FP16 throughput. There's also talk of the ROCm software stack, which is constantly being updated and improved to support the latest professional workloads so that AMD GPUs are properly optimized for various tasks, especially the Radeon Instinct line of GPU-based accelerators, the flagship of which uses the Vega architecture. You can learn more about the Radeon Instinct accelerators here.
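To make the Rapid Packed Math idea concrete, here is a minimal sketch of what "two 16-bit values in one 32-bit register, operated on lane-wise" means, using Python's IEEE half-float support. The helper names and example numbers are ours, purely for illustration; the real operation happens in the GPU's ALUs, not in software:

```python
import struct

def pack2_fp16(a, b):
    """Pack two floats into one 32-bit word as a pair of IEEE halves."""
    return struct.unpack("<I", struct.pack("<2e", a, b))[0]

def unpack2_fp16(word):
    """Split a 32-bit word back into its two FP16 lanes."""
    return struct.unpack("<2e", struct.pack("<I", word))

def packed_add(x, y):
    """Add two packed FP16 pairs lane-wise, as a packed-math ALU would
    do in a single clock instead of two."""
    x0, x1 = unpack2_fp16(x)
    y0, y1 = unpack2_fp16(y)
    return pack2_fp16(x0 + y0, x1 + y1)

a = pack2_fp16(1.5, 2.25)
b = pack2_fp16(0.5, 0.75)
print(unpack2_fp16(packed_add(a, b)))  # (2.0, 3.0)
```

This is also why the MI25's FP16 figure (24.6 TFLOPs) is exactly double its FP32 figure (12.3 TFLOPs): each 32-bit lane does two half-precision operations per clock.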