NVIDIA Volta GV100 12nm FinFET GPU Detailed – Tesla V100 Specifications Include 21 Billion Transistors, 5120 CUDA Cores, 16 GB HBM2 With 900 GB/s Bandwidth

May 10, 2017

NVIDIA Volta has just been announced at GTC 2017 and boy, it’s a beast. The next-generation graphics processing unit is the world’s first chip to use the industry-leading TSMC 12nm FinFET process, so let’s cover every detail of this compute powerhouse.

NVIDIA Volta GV100 Unveiled – Tesla V100 With 5120 CUDA Cores, 16 GB HBM2 and 12nm FinFET Process

At last year’s GTC, NVIDIA announced the Pascal-based GP100 GPU. It was, back then, the fastest graphics chip designed for supercomputers. This year, NVIDIA is taking the next leap in graphics performance with the announcement of its Volta-based GV100 GPU. We are going to take a very deep look at the next-generation GPU designed for AI and deep learning.

“Artificial intelligence is driving the greatest technology advances in human history,” said Jensen Huang, founder and chief executive officer of NVIDIA, who unveiled Volta at his GTC keynote. “It will automate intelligence and spur a wave of social progress unmatched since the industrial revolution.

“Deep learning, a groundbreaking AI approach that creates computer software that learns, has insatiable demand for processing power. Thousands of NVIDIA engineers spent over three years crafting Volta to help meet this need, enabling the industry to realize AI’s life-changing potential,” he said.

Volta, NVIDIA’s seventh-generation GPU architecture, is built with 21 billion transistors and delivers the equivalent performance of 100 CPUs for deep learning.

It provides a 5x improvement over Pascal, the current-generation NVIDIA GPU architecture, in peak teraflops, and 15x over the Maxwell architecture, launched two years ago. This performance surpasses by 4x the improvements that Moore’s law would have predicted.

via NVIDIA

First of all, we need to talk about the workloads this specific chip is designed to handle. The NVIDIA Volta GV100 GPU is designed to power the most computationally intensive HPC, AI, and graphics workloads.

The GV100 GPU includes 21.1 billion transistors with a die size of 815 mm². It is fabricated on a new TSMC 12 nm FFN high performance manufacturing process customized for NVIDIA. The GPU is much bigger than the 610 mm² Pascal GP100 GPU. NVIDIA Volta GV100 delivers considerably more compute performance, and adds many new features compared to its predecessor, the Pascal GP100 GPU and its architecture family. Further simplifying GPU programming and application porting, GV100 also improves GPU resource utilization. GV100 is an extremely power-efficient processor, delivering exceptional performance per watt.

The chip itself is a behemoth, featuring a brand new architecture that is just insane in terms of raw specifications. The full NVIDIA Volta GV100 GPU is composed of six GPCs (Graphics Processing Clusters) with a total of 42 TPCs, each including two SMs, for 84 Volta streaming multiprocessors. With 64 CUDA cores per SM, we are looking at a total of 5376 CUDA cores on the complete die. Each SM pairs its 64 FP32 cores with 64 separate INT32 cores, and the die also carries a total of 2688 FP64 (double precision) cores. Aside from these, we are looking at 672 Tensor Cores and 336 texture units. The Tesla V100 ships with 80 of those 84 SMs enabled, which is where its 5120 CUDA cores come from.
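
The unit counts above all follow from multiplying per-SM figures by the SM count. The short sketch below reproduces them for the full die and for the 80-SM Tesla V100 configuration listed in the spec table. The constant and function names are illustrative, and the figure of four texture units per SM is an assumption inferred from 336 texture units across 84 SMs.

```python
# Per-SM resources quoted in the article.
FP32_CORES_PER_SM = 64
FP64_CORES_PER_SM = 32
TENSOR_CORES_PER_SM = 8
# Assumption: inferred from 336 texture units / 84 SMs, not stated per SM.
TEXTURE_UNITS_PER_SM = 4


def unit_counts(sm_count):
    """Scale the per-SM resources up to a GPU with `sm_count` SMs."""
    return {
        "fp32_cores": sm_count * FP32_CORES_PER_SM,
        "fp64_cores": sm_count * FP64_CORES_PER_SM,
        "tensor_cores": sm_count * TENSOR_CORES_PER_SM,
        "texture_units": sm_count * TEXTURE_UNITS_PER_SM,
    }


full_die = unit_counts(84)      # complete GV100 die
tesla_v100 = unit_counts(80)    # SMs enabled on Tesla V100

print("Full GV100:", full_die)
print("Tesla V100:", tesla_v100)
```
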

The memory architecture is updated with eight 512-bit memory controllers. This adds up to a 4096-bit bus interface that supports up to 16 GB of HBM2 VRAM. The memory runs at 878 MHz, which delivers 900 GB/s of bandwidth compared to 720 GB/s on Pascal GP100. Each memory controller is attached to 768 KB of L2 cache, for a total of 6 MB of L2 cache across the entire chip.
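
As a rough check on these memory figures, the sketch below derives the bus width, peak bandwidth, and total L2 from the per-controller numbers. It assumes HBM2 transfers data on both clock edges (two transfers per clock), which the article does not state, and the helper name is illustrative.

```python
MEMORY_CONTROLLERS = 8
BITS_PER_CONTROLLER = 512
L2_KB_PER_CONTROLLER = 768
MEMORY_CLOCK_MHZ = 878
# Assumption: double data rate, i.e. two transfers per memory clock.
TRANSFERS_PER_CLOCK = 2


def peak_bandwidth_gb_s(bus_bits, clock_mhz, transfers_per_clock=TRANSFERS_PER_CLOCK):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


bus_bits = MEMORY_CONTROLLERS * BITS_PER_CONTROLLER      # 4096-bit interface
l2_kb = MEMORY_CONTROLLERS * L2_KB_PER_CONTROLLER        # 6144 KB = 6 MB
bandwidth = peak_bandwidth_gb_s(bus_bits, MEMORY_CLOCK_MHZ)

print(f"Bus width: {bus_bits}-bit, L2: {l2_kb} KB, peak bandwidth: {bandwidth:.0f} GB/s")
```

Under that assumption the result lands just under the quoted 900 GB/s, which suggests the headline number is a rounded peak.
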

NVIDIA Tesla Graphics Cards Comparison:

| Tesla Graphics Card Name | NVIDIA Tesla M2090 | NVIDIA Tesla K40 | NVIDIA Tesla K80 | NVIDIA Tesla P100 | NVIDIA Tesla V100 |
|---|---|---|---|---|---|
| GPU Process | 40nm | 28nm | 28nm | 16nm | 12nm |
| GPU Name | GF110 | GK110 | GK210 x 2 | GP100 | GV100 |
| Die Size | 520 mm² | 561 mm² | 561 mm² | 610 mm² | 815 mm² |
| Transistor Count | 3.00 Billion | 7.08 Billion | 7.08 Billion | 15 Billion | 21.1 Billion |
| CUDA Cores | 512 CCs (16 CUs) | 2880 CCs (15 CUs) | 2496 CCs (13 CUs) x 2 | 3840 CCs | 5120 CCs |
| Core Clock | Up To 650 MHz | Up To 875 MHz | Up To 875 MHz | Up To 1480 MHz | Up To 1455 MHz |
| FP32 Compute | 1.33 TFLOPs | 4.29 TFLOPs | 8.74 TFLOPs | 10.6 TFLOPs | 15.0 TFLOPs |
| FP64 Compute | 0.66 TFLOPs | 1.43 TFLOPs | 2.91 TFLOPs | 5.30 TFLOPs | 7.50 TFLOPs |
| VRAM Size | 6 GB | 12 GB | 12 GB x 2 | 16 GB | 16 GB |
| VRAM Type | GDDR5 | GDDR5 | GDDR5 | HBM2 | HBM2 |
| VRAM Bus | 384-bit | 384-bit | 384-bit x 2 | 4096-bit | 4096-bit |
| VRAM Speed | 3.7 GHz | 6 GHz | 5 GHz | 737 MHz | 878 MHz |
| Memory Bandwidth | 177.6 GB/s | 288 GB/s | 240 GB/s | 720 GB/s | 900 GB/s |
| Maximum TDP | 250W | 235W | 300W | 300W | 300W |

NVIDIA Volta SM (Streaming Multiprocessor)

Architected to deliver higher performance, the Volta SM has lower instruction and cache latencies than past SM designs and includes new features to accelerate deep learning applications.

Major Features include:

  • New mixed-precision FP16/FP32 Tensor Cores purpose-built for deep learning matrix arithmetic;
  • Enhanced L1 data cache for higher performance and lower latency;
  • Streamlined instruction set for simpler decoding and reduced instruction latencies;
  • Higher clocks and higher power efficiency.

Similar to Pascal GP100, the GV100 SM incorporates 64 FP32 cores and 32 FP64 cores per SM. However, the GV100 SM uses a new partitioning method to improve SM utilization and overall performance. Recall the GP100 SM is partitioned into two processing blocks, each with 32 FP32 Cores, 16 FP64 Cores, an instruction buffer, one warp scheduler, two dispatch units, and a 128 KB Register File. The GV100 SM is partitioned into four processing blocks, each with 16 FP32 Cores, 8 FP64 Cores, 16 INT32 Cores, two of the new mixed-precision Tensor Cores for deep learning matrix arithmetic, a new L0 instruction cache, one warp scheduler, one dispatch unit, and a 64 KB Register File. Note that the new L0 instruction cache is now used in each partition to provide higher efficiency than the instruction buffers used in prior NVIDIA GPUs.
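
To make the partitioning change concrete, the sketch below models one processing block for each architecture using the figures in the paragraph above and sums them per SM. The class and function names are illustrative, and the register count assumes 4 bytes per 32-bit register.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProcessingBlock:
    fp32_cores: int
    fp64_cores: int
    int32_cores: int
    tensor_cores: int
    warp_schedulers: int
    dispatch_units: int
    register_file_kb: int


def sm_totals(block, blocks_per_sm):
    """Multiply one block's resources by the number of blocks in an SM."""
    totals = {name: value * blocks_per_sm for name, value in asdict(block).items()}
    # 32-bit registers are 4 bytes each.
    totals["registers_32bit"] = totals["register_file_kb"] * 1024 // 4
    return totals


# GP100: two blocks per SM, no separate INT32 or Tensor Cores.
gp100_block = ProcessingBlock(32, 16, 0, 0, 1, 2, 128)
# GV100: four smaller blocks per SM.
gv100_block = ProcessingBlock(16, 8, 16, 2, 1, 1, 64)

gp100_sm = sm_totals(gp100_block, blocks_per_sm=2)
gv100_sm = sm_totals(gv100_block, blocks_per_sm=4)

print("GP100 SM:", gp100_sm)
print("GV100 SM:", gv100_sm)
```

Both SMs end up with 64 FP32 cores, 32 FP64 cores, and a 256 KB register file; the Volta SM doubles the warp schedulers and adds the INT32 and Tensor Cores.
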

While a GV100 SM has the same number of registers as a Pascal GP100 SM, the entire GV100 GPU has far more SMs, and thus many more registers overall. In aggregate, GV100 supports more threads, warps, and thread blocks in flight compared to prior GPU generations.

Overall shared memory across the entire GV100 GPU is increased due to the increased SM count and potential for up to 96 KB of Shared Memory per SM, compared to 64 KB in GP100.

Unlike Pascal GPUs, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput. Dependent instruction issue latency is also reduced for core FMA math operations, requiring only four clock cycles on Volta, compared to six cycles on Pascal.

NVIDIA Volta Tesla V100 Specs:

| NVIDIA Tesla Graphics Card | Tesla K40 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla P100 (PCI-Express) | Tesla P100 (PCI-Express) | Tesla P100 (SXM2) | Tesla V100 (PCI-Express) | Tesla V100 (SXM2) |
|---|---|---|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) |
| Process Node | 28nm | 28nm | 16nm | 16nm | 16nm | 12nm | 12nm |
| Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² | 610 mm² | 610 mm² | 815 mm² | 815 mm² |
| SMs | 15 | 24 | 56 | 56 | 56 | 80 | 80 |
| TPCs | 15 | 24 | 28 | 28 | 28 | 40 | 40 |
| CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64 |
| CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 3584 | 5120 | 5120 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 1792 | 2560 | 2560 |
| Base Clock | 745 MHz | 948 MHz | TBD | TBD | 1328 MHz | TBD | 1370 MHz |
| Boost Clock | 875 MHz | 1114 MHz | 1300 MHz | 1300 MHz | 1480 MHz | 1370 MHz | 1455 MHz |
| FP16 Compute | N/A | N/A | 18.7 TFLOPs | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.0 TFLOPs |
| FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.0 TFLOPs |
| FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.50 TFLOPs |
| Texture Units | 240 | 192 | 224 | 224 | 224 | 320 | 320 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 900 GB/s |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB |
| TDP | 235W | 250W | 250W | 250W | 300W | 250W | 300W |
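
The peak compute rows in the table above can be approximated from core counts and boost clocks. The sketch below assumes each core retires one fused multiply-add (counted as two floating point operations) per clock, which is the usual convention for peak figures but is not spelled out in the article. The function name is illustrative.

```python
# Assumption: one FMA per core per clock, counted as 2 floating point operations.
FLOPS_PER_CORE_PER_CLOCK = 2


def peak_tflops(cores, boost_clock_mhz, flops_per_clock=FLOPS_PER_CORE_PER_CLOCK):
    """Peak throughput in TFLOPs for `cores` units running at `boost_clock_mhz`."""
    return cores * boost_clock_mhz * flops_per_clock / 1e6


v100_sxm2_fp32 = peak_tflops(5120, 1455)
v100_sxm2_fp64 = peak_tflops(2560, 1455)
v100_pcie_fp32 = peak_tflops(5120, 1370)
p100_sxm2_fp32 = peak_tflops(3584, 1480)

print(f"V100 SXM2 FP32: {v100_sxm2_fp32:.2f} TFLOPs")
print(f"V100 SXM2 FP64: {v100_sxm2_fp64:.2f} TFLOPs")
print(f"V100 PCIe FP32: {v100_pcie_fp32:.2f} TFLOPs")
print(f"P100 SXM2 FP32: {p100_sxm2_fp32:.2f} TFLOPs")
```

The results sit within rounding distance of the 15.0, 7.50, 14.0, and 10.6 TFLOPs entries in the table.
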

NVIDIA VOLTA GV100 GPU WITH ADVANCED TENSOR CORES

Tesla P100 delivered considerably higher performance for training neural networks compared to the prior generation NVIDIA Maxwell and Kepler architectures, but the complexity and size of neural networks have continued to grow. New networks that have thousands of layers and millions of neurons demand even higher performance and faster training times.

New Tensor Cores are the most important feature of the Volta GV100 architecture to help deliver the performance required to train large neural networks. Tesla V100’s Tensor Cores deliver up to 120 Tensor TFLOPS for training and inference applications. Tensor Cores provide up to 12x higher peak TFLOPS on Tesla V100 for deep learning training compared to P100 FP32 operations, and for deep learning inference, up to 6x higher peak TFLOPS compared to P100 FP16 operations. The Tesla V100 GPU contains 640 Tensor Cores: 8 per SM.
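
The 120 Tensor TFLOPS figure can be roughly reconstructed from the Tensor Core count and the boost clock. The sketch below assumes each Tensor Core performs 64 fused multiply-adds per clock (one 4x4x4 matrix multiply-accumulate), a detail that comes from NVIDIA's architecture material rather than this article, and compares the result against the P100 SXM2 peak figures from the spec table.

```python
TENSOR_CORES = 640
# Assumption: one 4x4x4 matrix multiply-accumulate per Tensor Core per clock.
FMA_PER_TENSOR_CORE_PER_CLOCK = 64
FLOPS_PER_FMA = 2
BOOST_CLOCK_MHZ = 1455

P100_FP32_TFLOPS = 10.6   # Tesla P100 (SXM2), from the spec table
P100_FP16_TFLOPS = 21.2


def tensor_tflops(tensor_cores=TENSOR_CORES, clock_mhz=BOOST_CLOCK_MHZ):
    """Peak mixed-precision Tensor Core throughput in TFLOPS."""
    flops_per_clock = tensor_cores * FMA_PER_TENSOR_CORE_PER_CLOCK * FLOPS_PER_FMA
    return flops_per_clock * clock_mhz / 1e6


v100_tensor = tensor_tflops()
training_ratio = v100_tensor / P100_FP32_TFLOPS
inference_ratio = v100_tensor / P100_FP16_TFLOPS

print(f"V100 Tensor peak: {v100_tensor:.1f} TFLOPS")
print(f"vs P100 FP32: {training_ratio:.1f}x, vs P100 FP16: {inference_ratio:.1f}x")
```

The computed ratios come out slightly below the quoted "up to 12x" and "up to 6x", which is consistent with those being rounded marketing figures.
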

Matrix-Matrix multiplication (BLAS GEMM) operations are at the core of neural network training and inferencing, and are used to multiply large matrices of input data and weights in the connected layers of the network. As Figure 6 shows, Tensor Cores in the Tesla V100 GPU boost the performance of these operations by more than 9x compared to the Pascal-based GP100 GPU.

| GPU | Kepler GK110 | Maxwell GM200 | Pascal GP100 | Volta GV100 |
|---|---|---|---|---|
| Compute Capability | 3.5 | 5.3 | 6.0 | 7.0 |
| Threads / Warp | 32 | 32 | 32 | 32 |
| Max Warps / Multiprocessor | 64 | 64 | 64 | 64 |
| Max Threads / Multiprocessor | 2048 | 2048 | 2048 | 2048 |
| Max Thread Blocks / Multiprocessor | 16 | 32 | 32 | 32 |
| Max 32-bit Registers / SM | 65536 | 65536 | 65536 | 65536 |
| Max Registers / Block | 65536 | 32768 | 65536 | 65536 |
| Max Registers / Thread | 255 | 255 | 255 | 255 |
| Max Thread Block Size | 1024 | 1024 | 1024 | 1024 |
| CUDA Cores / SM | 192 | 128 | 64 | 64 |
| Shared Memory Size / SM Configurations (bytes) | 16K/32K/48K | 96K | 64K | 96K |
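
The per-SM limits in this table interact: the register file caps how many warps can be resident when each thread uses many registers. The sketch below is a simplified estimate using the GV100 column; it ignores register allocation granularity, shared memory, and thread block limits, so real occupancy can be lower. The function name is illustrative.

```python
REGISTERS_PER_SM = 65536
MAX_WARPS_PER_SM = 64
THREADS_PER_WARP = 32
MAX_REGISTERS_PER_THREAD = 255


def resident_warps(registers_per_thread):
    """Upper bound on resident warps per SM given a kernel's register usage."""
    if not 1 <= registers_per_thread <= MAX_REGISTERS_PER_THREAD:
        raise ValueError("registers_per_thread must be between 1 and 255")
    registers_per_warp = registers_per_thread * THREADS_PER_WARP
    warps_by_registers = REGISTERS_PER_SM // registers_per_warp
    return min(MAX_WARPS_PER_SM, warps_by_registers)


for regs in (32, 64, 128, 255):
    warps = resident_warps(regs)
    print(f"{regs:3d} registers/thread -> {warps:2d} warps ({warps * THREADS_PER_WARP} threads)")
```

At 32 registers per thread the SM can hold the full 64 warps (2048 threads); beyond that, register pressure becomes the limiting factor.
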

NVIDIA VOLTA GV100 GPU WITH ENHANCED L1 DATA CACHE AND SHARED MEMORY

The new combined L1 data cache and shared memory subsystem of the Volta SM significantly improves performance while also simplifying programming and reducing the tuning required to attain at or near-peak application performance.

Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The combined capacity is 128 KB/SM, more than 7 times larger than the GP100 data cache, and all of it is usable as a cache by programs that do not use shared memory. Texture units also use the cache. For example, if shared memory is configured to 64 KB, texture and load/store operations can use the remaining 64 KB of L1.
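
The split described above is simple subtraction from the 128 KB pool. The sketch below computes how much L1 remains for a given shared memory setting, using the 96 KB shared memory ceiling mentioned earlier in the article. It accepts any size in range for illustration; the actual hardware may only support specific carve-out sizes, and the function name is hypothetical.

```python
COMBINED_CAPACITY_KB = 128
MAX_SHARED_MEMORY_KB = 96


def l1_capacity_kb(shared_memory_kb):
    """L1 cache left over once shared memory is carved out of the combined block."""
    if not 0 <= shared_memory_kb <= MAX_SHARED_MEMORY_KB:
        raise ValueError("shared memory must be between 0 and 96 KB per SM")
    return COMBINED_CAPACITY_KB - shared_memory_kb


for shared in (0, 64, 96):
    print(f"shared = {shared:2d} KB -> L1 = {l1_capacity_kb(shared):3d} KB")
```
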

Integration within the shared memory block ensures the Volta GV100 L1 cache has much lower latency and higher bandwidth than the L1 caches in past NVIDIA GPUs. The L1 in Volta functions as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data, the best of both worlds. This combination is unique to Volta and delivers more accessible performance than in the past.

NVIDIA Volta GV100 GPU Key Features:

Key compute features of the NVIDIA Volta GV100 based Tesla V100 include the following:

  • New Streaming Multiprocessor (SM) Architecture Optimized for Deep Learning. Volta features a major new redesign of the SM processor architecture that is at the center of the GPU. The new Volta SM is 50% more energy efficient than the previous generation Pascal design, enabling major boosts in FP32 and FP64 performance in the same power envelope. New Tensor Cores designed specifically for deep learning deliver up to 12x higher peak TFLOPs for training. With independent, parallel integer and floating point datapaths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations. Volta’s new independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads. Finally, a new combined L1 Data Cache and Shared Memory subsystem significantly improves performance while also simplifying programming.
  • Second-Generation NVLink. The second generation of NVIDIA’s NVLink high-speed interconnect delivers higher bandwidth, more links, and improved scalability for multi-GPU and multi-GPU/CPU system configurations. GV100 supports up to six NVLink links at 25 GB/s per direction each, for a total of 300 GB/s of bidirectional bandwidth. NVLink now supports CPU mastering and cache coherence capabilities with IBM Power 9 CPU-based servers. The new NVIDIA DGX-1 with V100 AI supercomputer uses NVLink to deliver greater scalability for ultra-fast deep learning training.
  • HBM2 Memory: Faster, Higher Efficiency. Volta’s highly tuned 16 GB HBM2 memory subsystem delivers 900 GB/s peak memory bandwidth. The combination of both a new generation HBM2 memory from Samsung, and a new generation memory controller in Volta, provides 1.5x delivered memory bandwidth versus Pascal GP100 and greater than 95% memory bandwidth efficiency running many workloads.
  • Volta Multi-Process Service. Volta Multi-Process Service (MPS) is a new feature of the Volta GV100 architecture providing hardware acceleration of critical components of the CUDA MPS server, enabling improved performance, isolation, and better quality of service (QoS) for multiple compute applications sharing the GPU. Volta MPS also triples the maximum number of MPS clients from 16 on Pascal to 48 on Volta.
  • Enhanced Unified Memory and Address Translation Services. Unified Memory technology in Volta GV100 includes new access counters to allow more accurate migration of memory pages to the processor that accesses the pages most frequently, improving efficiency for accessing memory ranges shared between processors. On IBM Power platforms, new Address Translation Services (ATS) support allows the GPU to access the CPU’s page tables directly.
  • Cooperative Groups and New Cooperative Launch APIs. Cooperative Groups is a new programming model introduced in CUDA 9 for organizing groups of communicating threads. Cooperative Groups allows developers to express the granularity at which threads are communicating, helping them to express richer, more efficient parallel decompositions. Basic Cooperative Groups functionality is supported on all NVIDIA GPUs since Kepler. Pascal and Volta include support for new Cooperative Launch APIs that support synchronization amongst CUDA thread blocks. Volta adds support for new synchronization patterns.
  • Maximum Performance and Maximum Efficiency Modes. In Maximum Performance mode, the Tesla V100 accelerator will operate unconstrained up to its TDP (Thermal Design Power) level of 300W to accelerate applications that require the fastest computational speed and highest data throughput. Maximum Efficiency Mode allows data center managers to tune power usage of their Tesla V100 accelerators to operate with optimal performance per watt. A not-to-exceed power cap can be set across all GPUs in a rack, reducing power consumption dramatically, while still obtaining excellent rack performance.
  • Volta Optimized Software. New versions of deep learning frameworks such as Caffe2, MXNet, CNTK, TensorFlow, and others harness the performance of Volta to deliver dramatically faster training times and higher multi-node training performance. Volta-optimized versions of GPU accelerated libraries such as cuDNN, cuBLAS, and TensorRT leverage the new features of the Volta GV100 architecture to deliver higher performance for both deep learning and High Performance Computing (HPC) applications. The NVIDIA CUDA Toolkit version 9.0 includes new APIs and support for Volta features to provide even easier programmability.
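
Two of the figures in the list above are easy to sanity-check. The sketch below works through the NVLink total and the MPS client increase. It assumes the 25 GB/s NVLink figure is per link and per direction, so that the 300 GB/s total counts both directions; that reading is an interpretation of the quoted numbers, and the function name is illustrative.

```python
NVLINK_LINKS = 6
# Assumption: 25 GB/s is the rate of one link in one direction.
NVLINK_GB_S_PER_LINK_PER_DIRECTION = 25

MPS_CLIENTS_PASCAL = 16
MPS_CLIENTS_VOLTA = 48


def nvlink_bandwidth_gb_s(links=NVLINK_LINKS,
                          per_direction=NVLINK_GB_S_PER_LINK_PER_DIRECTION):
    """Return (one-way, bidirectional) aggregate NVLink bandwidth in GB/s."""
    one_way = links * per_direction
    return one_way, 2 * one_way


one_way, bidirectional = nvlink_bandwidth_gb_s()
mps_scaling = MPS_CLIENTS_VOLTA / MPS_CLIENTS_PASCAL

print(f"NVLink: {one_way} GB/s each way, {bidirectional} GB/s bidirectional")
print(f"MPS clients: {mps_scaling:.0f}x Pascal")
```
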

| GPU Family | AMD Vega | AMD Navi | NVIDIA Pascal | NVIDIA Volta |
|---|---|---|---|---|
| Flagship GPU | Vega 10 | Navi 10 | NVIDIA GP100 | NVIDIA GV100 |
| GPU Process | 14nm FinFET | 7nm FinFET | TSMC 16nm FinFET | TSMC 12nm FinFET |
| GPU Transistors | 15-18 Billion | TBC | 15.3 Billion | 21.1 Billion |
| GPU Cores (Max) | 4096 SPs | TBC | 3840 CUDA Cores | 5376 CUDA Cores |
| Peak FP32 Compute | 13.0 TFLOPs | TBC | 12.0 TFLOPs | >15.0 TFLOPs (Full Die) |
| Peak FP16 Compute | 25.0 TFLOPs | TBC | 24.0 TFLOPs | 120 Tensor TFLOPs |
| VRAM | 16 GB HBM2 | TBC | 16 GB HBM2 | 16 GB HBM2 |
| Memory (Consumer Cards) | HBM2 | HBM3 | GDDR5X | GDDR6 |
| Memory (Dual-Chip Professional/ HPC) | HBM2 | HBM3 | HBM2 | HBM2 |
| HBM2 Bandwidth | 484 GB/s (Frontier Edition) | >1 TB/s? | 732 GB/s (Peak) | 900 GB/s |
| Graphics Architecture | Next Compute Unit (Vega) | Next Compute Unit (Navi) | 5th Gen Pascal CUDA | 6th Gen Volta CUDA |
| Successor of (GPU) | Radeon RX 500 Series | Radeon RX 600 Series | GM200 (Maxwell) | GP100 (Pascal) |
| Launch | 2017 | 2019 | 2016 | 2017 |

NVIDIA has stated that the NVIDIA Volta GV100 GPU based Tesla V100 will start shipping in 2017. We are looking at availability in 2H 2017, so consumer variants could be ready for launch in early 2018.
