NVIDIA Volta has just been announced at GTC 2017 and boy, it's a beast. The next-generation graphics processing unit is the world's first chip to use TSMC's industry-leading 12nm FinFET process, so let's cover every detail of this compute powerhouse.
NVIDIA Volta GV100 Unveiled - Tesla V100 With 5120 CUDA Cores, 16 GB HBM2 and 12nm FinFET Process
At last year's GTC, NVIDIA announced the Pascal based GP100 GPU, which was at the time the fastest graphics chip designed for supercomputers. This year, NVIDIA is taking the next leap in graphics performance with the announcement of its Volta based GV100 GPU. We are going to take a very deep look at this next-generation GPU designed for AI and deep learning.
"Artificial intelligence is driving the greatest technology advances in human history," said Jensen Huang, founder and chief executive officer of NVIDIA, who unveiled Volta at his GTC keynote. "It will automate intelligence and spur a wave of social progress unmatched since the industrial revolution.
"Deep learning, a groundbreaking AI approach that creates computer software that learns, has insatiable demand for processing power. Thousands of NVIDIA engineers spent over three years crafting Volta to help meet this need, enabling the industry to realize AI's life-changing potential," he said.
Volta, NVIDIA's seventh-generation GPU architecture, is built with 21 billion transistors and delivers the equivalent performance of 100 CPUs for deep learning.
It provides a 5x improvement over Pascal, the current-generation NVIDIA GPU architecture, in peak teraflops, and 15x over the Maxwell architecture, launched two years ago. This performance surpasses by 4x the improvements that Moore's law would have predicted.
First of all, we need to talk about the workloads this specific chip is designed to handle. The NVIDIA Volta GV100 GPU is designed to power the most computationally intensive HPC, AI, and graphics workloads.
The GV100 GPU includes 21.1 billion transistors with a die size of 815 mm2. It is fabricated on a new TSMC 12 nm FFN high performance manufacturing process customized for NVIDIA. The GPU is much bigger than the 610mm2 Pascal GP100 GPU. NVIDIA Volta GV100 delivers considerably more compute performance, and adds many new features compared to its predecessor, the Pascal GP100 GPU and its architecture family. Further simplifying GPU programming and application porting, GV100 also improves GPU resource utilization. GV100 is an extremely power-efficient processor, delivering exceptional performance per watt.
The chip itself is a behemoth, featuring a brand new architecture with staggering raw specifications. The NVIDIA Volta GV100 GPU is composed of six GPCs (Graphics Processing Clusters). It has a total of 42 TPCs, each including two SMs, for 84 Volta streaming multiprocessors. With 64 CUDA cores per SM, we are looking at a total of 5376 CUDA cores on the complete die. All of the 5376 CUDA cores can be used for FP32 and INT32 instructions, and there are also a total of 2688 FP64 (double precision) cores. Aside from these, we are looking at 672 Tensor Cores and 336 texture units. The Tesla V100 does not enable the full die: it ships with 5120 CUDA cores and 640 Tensor Cores, which works out to 80 active SMs.
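The unit counts above all follow from the per-SM figures, so a quick sketch can reproduce them. The per-SM constants come from the article; the value of 4 texture units per SM is inferred from 336 units across 84 SMs, and the 80-SM configuration for the Tesla V100 follows from its 5120 CUDA cores at 64 per SM. The function and variable names are purely illustrative.

```python
# Per-SM unit counts quoted in the article for the Volta SM.
FP32_CORES_PER_SM = 64
FP64_CORES_PER_SM = 32
TENSOR_CORES_PER_SM = 8
TEXTURE_UNITS_PER_SM = 4  # inferred: 336 texture units / 84 SMs


def gv100_unit_counts(sm_count):
    """Return total execution units for a GV100 with `sm_count` active SMs."""
    return {
        "cuda_cores": sm_count * FP32_CORES_PER_SM,
        "fp64_cores": sm_count * FP64_CORES_PER_SM,
        "tensor_cores": sm_count * TENSOR_CORES_PER_SM,
        "texture_units": sm_count * TEXTURE_UNITS_PER_SM,
    }


full_die = gv100_unit_counts(84)    # 42 TPCs x 2 SMs
tesla_v100 = gv100_unit_counts(80)  # inferred: 5120 CUDA cores / 64 per SM

print("Full GV100:", full_die)
print("Tesla V100:", tesla_v100)
```

The full die comes out at 5376 CUDA cores and 672 Tensor Cores, while the 80-SM configuration gives the 5120 CUDA cores and 640 Tensor Cores quoted for the Tesla V100.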
The memory architecture is updated with eight 512-bit memory controllers. This adds up to a 4096-bit bus interface that supports up to 16 GB of HBM2 VRAM. The memory runs at 878 MHz, which raises the transfer rate to 900 GB/s compared to 720 GB/s on Pascal GP100. Each memory controller is attached to 768 KB of L2 cache, for a total of 6 MB of L2 cache across the entire chip.
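As a rough check on the memory figures, the bandwidth can be estimated from the bus width and memory clock. This sketch assumes HBM2 transfers data twice per clock (double data rate), which the article does not spell out; the helper name is hypothetical.

```python
def hbm2_bandwidth_gbs(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Estimate peak bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 is an assumption (double data rate signalling).
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


bus_width_bits = 8 * 512   # eight 512-bit memory controllers
l2_total_kb = 8 * 768      # 768 KB of L2 per memory controller

v100_bandwidth = hbm2_bandwidth_gbs(bus_width_bits, 878)

print(f"Bus width: {bus_width_bits}-bit")
print(f"Estimated bandwidth: {v100_bandwidth:.1f} GB/s")
print(f"Total L2 cache: {l2_total_kb} KB ({l2_total_kb / 1024:.0f} MB)")
```

Under that assumption the estimate lands at roughly 899 GB/s, in line with the quoted 900 GB/s, and the L2 total comes to 6144 KB.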
NVIDIA Tesla Graphics Cards Comparison:
|Tesla Graphics Card Name||NVIDIA Tesla M2090||NVIDIA Tesla K40||NVIDIA Tesla K80||NVIDIA Tesla P100||NVIDIA Tesla V100|
|GPU Name||GF110||GK110||GK210 x 2||GP100||GV100|
|Transistor Count||3.00 Billion||7.08 Billion||7.08 Billion||15 Billion||21.1 Billion|
|CUDA Cores||512 CCs (16 CUs)||2880 CCs (15 CUs)||2496 CCs (13 CUs) x 2||3584 CCs||5120 CCs|
|Core Clock||Up To 650 MHz||Up To 875 MHz||Up To 875 MHz||Up To 1480 MHz||Up To 1455 MHz|
|FP32 Compute||1.33 TFLOPs||4.29 TFLOPs||8.74 TFLOPs||10.6 TFLOPs||15.0 TFLOPs|
|FP64 Compute||0.66 TFLOPs||1.43 TFLOPs||2.91 TFLOPs||5.30 TFLOPs||7.50 TFLOPs|
|VRAM Size||6 GB||12 GB||12 GB x 2||16 GB||16 GB|
|VRAM Bus||384-bit||384-bit||384-bit x 2||4096-bit||4096-bit|
|VRAM Speed||3.7 GHz||6 GHz||5 GHz||737 MHz||878 MHz|
|Memory Bandwidth||177.6 GB/s||288 GB/s||240 GB/s||720 GB/s||900 GB/s|
NVIDIA Volta SM (Streaming Multiprocessor)
Architected to deliver higher performance, the Volta SM has lower instruction and cache latencies than past SM designs and includes new features to accelerate deep learning applications.
Major Features include:
- New mixed-precision FP16/FP32 Tensor Cores purpose-built for deep learning matrix arithmetic;
- Enhanced L1 data cache for higher performance and lower latency;
- Streamlined instruction set for simpler decoding and reduced instruction latencies;
- Higher clocks and higher power efficiency.
Similar to Pascal GP100, the GV100 SM incorporates 64 FP32 cores and 32 FP64 cores per SM. However, the GV100 SM uses a new partitioning method to improve SM utilization and overall performance. Recall the GP100 SM is partitioned into two processing blocks, each with 32 FP32 Cores, 16 FP64 Cores, an instruction buffer, one warp scheduler, two dispatch units, and a 128 KB Register File. The GV100 SM is partitioned into four processing blocks, each with 16 FP32 Cores, 8 FP64 Cores, 16 INT32 Cores, two of the new mixed-precision Tensor Cores for deep learning matrix arithmetic, a new L0 instruction cache, one warp scheduler, one dispatch unit, and a 64 KB Register File. Note that the new L0 instruction cache is now used in each partition to provide higher efficiency than the instruction buffers used in prior NVIDIA GPUs.
While a GV100 SM has the same number of registers as a Pascal GP100 SM, the entire GV100 GPU has far more SMs, and thus many more registers overall. In aggregate, GV100 supports more threads, warps, and thread blocks in flight compared to prior GPU generations.
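The partitioning described above can be tallied to confirm that both SM designs end up with the same per-SM totals. This is only a bookkeeping sketch using the per-block figures from the text; the dictionary layout and names are illustrative.

```python
# Per-processing-block resources as described in the article.
GP100_SM = {"blocks": 2, "fp32": 32, "fp64": 16, "register_file_kb": 128}
GV100_SM = {"blocks": 4, "fp32": 16, "fp64": 8, "register_file_kb": 64,
            "int32": 16, "tensor": 2}


def sm_totals(sm):
    """Multiply every per-block resource by the number of blocks in the SM."""
    return {name: count * sm["blocks"]
            for name, count in sm.items() if name != "blocks"}


gp100_totals = sm_totals(GP100_SM)
gv100_totals = sm_totals(GV100_SM)

# A 32-bit register occupies 4 bytes.
gv100_registers = gv100_totals["register_file_kb"] * 1024 // 4

print("GP100 SM:", gp100_totals)
print("GV100 SM:", gv100_totals)
print("GV100 32-bit registers per SM:", gv100_registers)
```

Both designs total 64 FP32 cores, 32 FP64 cores and 256 KB of register file per SM, which is why the per-SM register count is unchanged even though the Volta SM is split into twice as many blocks.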
Overall shared memory across the entire GV100 GPU is increased due to the increased SM count and potential for up to 96 KB of Shared Memory per SM, compared to 64 KB in GP100.
Unlike Pascal GPUs, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput. Dependent instruction issue latency is also reduced for core FMA math operations, requiring only four clock cycles on Volta, compared to six cycles on Pascal.
NVIDIA Volta Tesla V100S Specs:
|NVIDIA Tesla Graphics Card||Tesla K40||Tesla M40||Tesla P100 (PCI-Express)||Tesla P100 (SXM2)||Tesla V100 (PCI-Express)||Tesla V100 (SXM2)||Tesla V100S (PCIe)|
|GPU||GK110 (Kepler)||GM200 (Maxwell)||GP100 (Pascal)||GP100 (Pascal)||GV100 (Volta)||GV100 (Volta)||GV100 (Volta)|
|Transistors||7.1 Billion||8 Billion||15.3 Billion||15.3 Billion||21.1 Billion||21.1 Billion||21.1 Billion|
|GPU Die Size||551 mm2||601 mm2||610 mm2||610 mm2||815mm2||815mm2||815mm2|
|CUDA Cores Per SM||192||128||64||64||64||64||64|
|CUDA Cores (Total)||2880||3072||3584||3584||5120||5120||5120|
|FP64 CUDA Cores / SM||64||4||32||32||32||32||32|
|FP64 CUDA Cores / GPU||960||96||1792||1792||2560||2560||2560|
|Base Clock||745 MHz||948 MHz||1190 MHz||1328 MHz||1230 MHz||1297 MHz||TBD|
|Boost Clock||875 MHz||1114 MHz||1329MHz||1480 MHz||1380 MHz||1530 MHz||1601 MHz|
|FP16 Compute||N/A||N/A||18.7 TFLOPs||21.2 TFLOPs||28.0 TFLOPs||30.4 TFLOPs||32.8 TFLOPs|
|FP32 Compute||5.04 TFLOPs||6.8 TFLOPs||10.0 TFLOPs||10.6 TFLOPs||14.0 TFLOPs||15.7 TFLOPs||16.4 TFLOPs|
|FP64 Compute||1.68 TFLOPs||0.2 TFLOPs||4.7 TFLOPs||5.30 TFLOPs||7.0 TFLOPs||7.80 TFLOPs||8.2 TFLOPs|
|Memory Interface||384-bit GDDR5||384-bit GDDR5||4096-bit HBM2||4096-bit HBM2||4096-bit HBM2||4096-bit HBM2||4096-bit HBM2|
|Memory Size||12 GB GDDR5 @ 288 GB/s||24 GB GDDR5 @ 288 GB/s||16 GB HBM2 @ 732 GB/s or 12 GB HBM2 @ 549 GB/s||16 GB HBM2 @ 732 GB/s||16 GB HBM2 @ 900 GB/s||16 GB HBM2 @ 900 GB/s||16 GB HBM2 @ 1134 GB/s|
|L2 Cache Size||1536 KB||3072 KB||4096 KB||4096 KB||6144 KB||6144 KB||6144 KB|
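The compute rows in the table can be approximated from core count and boost clock. The sketch below assumes each core retires one fused multiply-add (counted as two floating point operations) per clock, which is the usual convention behind peak figures but is not stated in the article; the function name is hypothetical.

```python
def peak_tflops(cores, boost_mhz, flops_per_core_per_clock=2):
    """Peak throughput in TFLOPs, assuming one FMA (2 FLOPs) per core per clock."""
    return cores * flops_per_core_per_clock * boost_mhz * 1e6 / 1e12


# (FP32 cores, FP64 cores, boost clock in MHz) taken from the table above.
cards = {
    "Tesla P100 (SXM2)": (3584, 1792, 1480),
    "Tesla V100 (SXM2)": (5120, 2560, 1530),
    "Tesla V100S (PCIe)": (5120, 2560, 1601),
}

results = {
    name: (peak_tflops(fp32, clock), peak_tflops(fp64, clock))
    for name, (fp32, fp64, clock) in cards.items()
}

for name, (fp32_tflops, fp64_tflops) in results.items():
    print(f"{name}: FP32 {fp32_tflops:.1f} TFLOPs, FP64 {fp64_tflops:.1f} TFLOPs")
```

For these three cards the estimates round to the FP32 and FP64 values listed in the table; other columns may differ slightly depending on which clock the published figure was based on.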
NVIDIA VOLTA GV100 GPU WITH ADVANCED TENSOR CORES
Tesla P100 delivered considerably higher performance for training neural networks compared to the prior generation NVIDIA Maxwell and Kepler architectures, but the complexity and size of neural networks have continued to grow. New networks that have thousands of layers and millions of neurons demand even higher performance and faster training times.
New Tensor Cores are the most important feature of the Volta GV100 architecture to help deliver the performance required to train large neural networks. Tesla V100’s Tensor Cores deliver up to 120 Tensor TFLOPS for training and inference applications. Tensor Cores provide up to 12x higher peak TFLOPS on Tesla V100 for deep learning training compared to P100 FP32 operations, and for deep learning inference, up to 6x higher peak TFLOPS compared to P100 FP16 operations. The Tesla V100 GPU contains 640 Tensor Cores: 8 per SM.
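A back-of-the-envelope estimate shows where a figure near 120 Tensor TFLOPS could come from. Two inputs here are assumptions rather than statements from this article: that each Tensor Core performs 64 fused multiply-adds per clock (a 4x4x4 matrix operation, per NVIDIA's architecture material), and that the 1455 MHz boost clock from the first comparison table applies. The P100 figures are the FP32 and FP16 values for the SXM2 card from the spec table.

```python
TENSOR_CORES = 640              # 8 per SM, as stated in the article
FMA_PER_CORE_PER_CLOCK = 64     # assumption: one 4x4x4 matrix FMA per clock
FLOPS_PER_FMA = 2               # a multiply and an add
BOOST_CLOCK_MHZ = 1455          # assumption: boost clock from the first table

tensor_tflops = (TENSOR_CORES * FMA_PER_CORE_PER_CLOCK * FLOPS_PER_FMA
                 * BOOST_CLOCK_MHZ * 1e6 / 1e12)

# Compare the quoted 120 Tensor TFLOPS against P100 (SXM2) peak figures.
P100_FP32_TFLOPS = 10.6
P100_FP16_TFLOPS = 21.2
training_ratio = 120 / P100_FP32_TFLOPS
inference_ratio = 120 / P100_FP16_TFLOPS

print(f"Estimated Tensor throughput: {tensor_tflops:.1f} TFLOPS")
print(f"vs P100 FP32: {training_ratio:.1f}x, vs P100 FP16: {inference_ratio:.1f}x")
```

The estimate comes out just under 120 TFLOPS, and the ratios land in the same range as the "up to 12x" and "up to 6x" claims, which presumably rest on slightly different baseline numbers.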
Matrix-Matrix multiplication (BLAS GEMM) operations are at the core of neural network training and inferencing, and are used to multiply large matrices of input data and weights in the connected layers of the network. As Figure 6 shows, Tensor Cores in the Tesla V100 GPU boost the performance of these operations by more than 9x compared to the Pascal-based GP100 GPU.
|GPU||Kepler GK110||Maxwell GM200||Pascal GP100||Volta GV100|
|Threads / Warp||32||32||32||32|
|Max Warps / Multiprocessor||64||64||64||64|
|Max Threads / Multiprocessor||2048||2048||2048||2048|
|Max Thread Blocks / Multiprocessor||16||32||32||32|
|Max 32-bit Registers / SM||65536||65536||65536||65536|
|Max Registers / Block||65536||32768||65536||65536|
|Max Registers / Thread||255||255||255||255|
|Max Thread Block Size||1024||1024||1024||1024|
|CUDA Cores / SM||192||128||64||64|
|Shared Memory Size / SM Configurations (bytes)||16K/32K/48K||96K||64K||96K|
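The limits in the table above are what determine how many thread blocks can be resident on one SM at a time. The following is a simplified sketch of that calculation using the Volta GV100 column; it deliberately ignores shared memory usage and register allocation granularity, so real occupancy tools may report different numbers. All names are illustrative.

```python
# Volta GV100 limits from the table above.
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64
MAX_BLOCKS_PER_SM = 32
MAX_REGISTERS_PER_SM = 65536
MAX_REGISTERS_PER_THREAD = 255
MAX_BLOCK_SIZE = 1024


def resident_blocks(threads_per_block, regs_per_thread):
    """Blocks that fit on one SM (simplified: no shared memory, no granularity)."""
    if not 1 <= threads_per_block <= MAX_BLOCK_SIZE:
        raise ValueError("threads_per_block out of range")
    if not 1 <= regs_per_thread <= MAX_REGISTERS_PER_THREAD:
        raise ValueError("regs_per_thread out of range")
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    by_registers = MAX_REGISTERS_PER_SM // regs_per_block
    return min(MAX_BLOCKS_PER_SM, by_warps, by_registers)


def occupancy(threads_per_block, regs_per_thread):
    """Fraction of the SM's warp slots that are filled."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    blocks = resident_blocks(threads_per_block, regs_per_thread)
    return blocks * warps_per_block / MAX_WARPS_PER_SM


print(occupancy(256, 32))   # register use fits: all warp slots filled
print(occupancy(256, 64))   # registers become the limiting resource
```

With 256 threads per block, going from 32 to 64 registers per thread halves the number of resident blocks in this model, which is the kind of trade-off the register limits in the table impose.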
NVIDIA VOLTA GV100 GPU WITH ENHANCED L1 DATA CACHE AND SHARED MEMORY
The new combined L1 data cache and shared memory subsystem of the Volta SM significantly improves performance while also simplifying programming and reducing the tuning required to attain at or near-peak application performance.
Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The combined capacity is 128 KB/SM, more than 7 times larger than the GP100 data cache, and all of it is usable as a cache by programs that do not use shared memory. Texture units also use the cache. For example, if shared memory is configured to 64 KB, texture and load/store operations can use the remaining 64 KB of L1.
Integration within the shared memory block ensures the Volta GV100 L1 cache has much lower latency and higher bandwidth than the L1 caches in past NVIDIA GPUs. The L1 in Volta functions as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data, the best of both worlds. This combination is unique to Volta and delivers more accessible performance than in the past.
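The split between shared memory and L1 described in this section is simple subtraction, sketched below. The 128 KB combined capacity and 96 KB shared memory ceiling come from the article; treating any value in between as valid is a simplification, since real hardware may only support specific carve-out sizes. The function name is hypothetical.

```python
COMBINED_CAPACITY_KB = 128  # combined L1 data cache + shared memory per SM
MAX_SHARED_KB = 96          # maximum shared memory per SM on GV100


def l1_capacity_kb(shared_kb):
    """L1 capacity left over once `shared_kb` is reserved as shared memory."""
    if not 0 <= shared_kb <= MAX_SHARED_KB:
        raise ValueError(f"shared memory must be 0..{MAX_SHARED_KB} KB")
    return COMBINED_CAPACITY_KB - shared_kb


for shared in (0, 64, 96):
    print(f"shared = {shared:3d} KB -> L1 = {l1_capacity_kb(shared):3d} KB")
```

A kernel that uses no shared memory gets the full 128 KB as cache, the 64 KB case matches the example in the text, and even at the 96 KB maximum there are still 32 KB left for L1.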
NVIDIA Volta GV100 GPU Key Features:
Key compute features of the NVIDIA Volta GV100 based Tesla V100 include the following:
- New Streaming Multiprocessor (SM) Architecture Optimized for Deep Learning. Volta features a major redesign of the SM processor architecture that is at the center of the GPU. The new Volta SM is 50% more energy efficient than the previous generation Pascal design, enabling major boosts in FP32 and FP64 performance in the same power envelope. New Tensor Cores designed specifically for deep learning deliver up to 12x higher peak TFLOPs for training. With independent, parallel integer and floating point datapaths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations. Volta’s new independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads. Finally, a new combined L1 Data Cache and Shared Memory subsystem significantly improves performance while also simplifying programming.
- Second-Generation NVLink. The second generation of NVIDIA’s NVLink high-speed interconnect delivers higher bandwidth, more links, and improved scalability for multi-GPU and multi-GPU/CPU system configurations. GV100 supports up to 6 NVLink links at 25 GB/s each per direction, for a total bidirectional bandwidth of 300 GB/s. NVLink now supports CPU mastering and cache coherence capabilities with IBM Power 9 CPU-based servers. The new NVIDIA DGX-1 with V100 AI supercomputer uses NVLink to deliver greater scalability for ultra-fast deep learning training.
- HBM2 Memory: Faster, Higher Efficiency. Volta’s highly tuned 16GB HBM2 memory subsystem delivers 900 GB/sec peak memory bandwidth. The combination of new generation HBM2 memory from Samsung and a new generation memory controller in Volta provides 1.5x delivered memory bandwidth versus Pascal GP100 and greater than 95% memory bandwidth efficiency running many workloads.
- Volta Multi-Process Service. Volta Multi-Process Service (MPS) is a new feature of the Volta GV100 architecture providing hardware acceleration of critical components of the CUDA MPS server, enabling improved performance, isolation, and better quality of service (QoS) for multiple compute applications sharing the GPU. Volta MPS also triples the maximum number of MPS clients from 16 on Pascal to 48 on Volta.
- Enhanced Unified Memory and Address Translation Services. Unified Memory technology in Volta GV100 includes new access counters to allow more accurate migration of memory pages to the processor that accesses the pages most frequently, improving efficiency for accessing memory ranges shared between processors. On IBM Power platforms, new Address Translation Services (ATS) support allows the GPU to access the CPU’s page tables directly.
- Cooperative Groups and New Cooperative Launch APIs. Cooperative Groups is a new programming model introduced in CUDA 9 for organizing groups of communicating threads. Cooperative Groups allows developers to express the granularity at which threads are communicating, helping them to express richer, more efficient parallel decompositions. Basic Cooperative Groups functionality is supported on all NVIDIA GPUs since Kepler. Pascal and Volta include support for new Cooperative Launch APIs that support synchronization amongst CUDA thread blocks. Volta adds support for new synchronization patterns.
- Maximum Performance and Maximum Efficiency Modes. In Maximum Performance mode, the Tesla V100 accelerator will operate unconstrained up to its TDP (Thermal Design Power) level of 300W to accelerate applications that require the fastest computational speed and highest data throughput. Maximum Efficiency Mode allows data center managers to tune power usage of their Tesla V100 accelerators to operate with optimal performance per watt. A not-to-exceed power cap can be set across all GPUs in a rack, reducing power consumption dramatically, while still obtaining excellent rack performance.
- Volta Optimized Software. New versions of deep learning frameworks such as Caffe2, MXNet, CNTK, TensorFlow, and others harness the performance of Volta to deliver dramatically faster training times and higher multi-node training performance. Volta-optimized versions of GPU accelerated libraries such as cuDNN, cuBLAS, and TensorRT leverage the new features of the Volta GV100 architecture to deliver higher performance for both deep learning and High Performance Computing (HPC) applications. The NVIDIA CUDA Toolkit version 9.0 includes new APIs and support for Volta features to provide even easier programmability.
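A couple of the numbers in the feature list are easier to follow with the arithmetic written out. The sketch below assumes the 25 GB/s NVLink figure is per link and per direction, which is the reading under which six links reach the quoted 300 GB/s total; the variable names are illustrative.

```python
# NVLink: six links, 25 GB/s per link in each direction (assumed reading).
NVLINK_LINKS = 6
GBS_PER_LINK_PER_DIRECTION = 25

nvlink_per_direction = NVLINK_LINKS * GBS_PER_LINK_PER_DIRECTION
nvlink_bidirectional = nvlink_per_direction * 2

# MPS client limits quoted for Pascal and Volta.
MPS_CLIENTS_PASCAL = 16
MPS_CLIENTS_VOLTA = 48
mps_scaling = MPS_CLIENTS_VOLTA / MPS_CLIENTS_PASCAL

print(f"NVLink per direction: {nvlink_per_direction} GB/s")
print(f"NVLink bidirectional: {nvlink_bidirectional} GB/s")
print(f"MPS client scaling: {mps_scaling:.0f}x")
```

Read this way, each direction carries 150 GB/s and the 300 GB/s headline figure is the sum of both directions.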
|GPU Family||AMD Vega||AMD Navi||NVIDIA Pascal||NVIDIA Volta|
|Flagship GPU||Vega 10||Navi 10||NVIDIA GP100||NVIDIA GV100|
|GPU Process||14nm FinFET||7nm FinFET||TSMC 16nm FinFET||TSMC 12nm FinFET|
|GPU Transistors||15-18 Billion||TBC||15.3 Billion||21.1 Billion|
|GPU Cores (Max)||4096 SPs||TBC||3840 CUDA Cores||5376 CUDA Cores|
|Peak FP32 Compute||13.0 TFLOPs||TBC||12.0 TFLOPs||>15.0 TFLOPs (Full Die)|
|Peak FP16 Compute||25.0 TFLOPs||TBC||24.0 TFLOPs||120 Tensor TFLOPs|
|VRAM||16 GB HBM2||TBC||16 GB HBM2||16 GB HBM2|
|Memory (Consumer Cards)||HBM2||HBM3||GDDR5X||GDDR6|
|Memory (Dual-Chip Professional/ HPC)||HBM2||HBM3||HBM2||HBM2|
|HBM2 Bandwidth||484 GB/s (Frontier Edition)||>1 TB/s?||732 GB/s (Peak)||900 GB/s|
|Graphics Architecture||Next Compute Unit (Vega)||Next Compute Unit (Navi)||5th Gen Pascal CUDA||6th Gen Volta CUDA|
|Successor of (GPU)||Radeon RX 500 Series||Radeon RX 600 Series||GM200 (Maxwell)||GP100 (Pascal)|
NVIDIA has stated that the Volta GV100 GPU based Tesla V100 will start shipping in 2017. We are looking at availability in 2H 2017, so we can expect consumer variants to be ready for launch in early 2018.