NVIDIA Volta GV100 12nm FinFET GPU Detailed – Tesla V100 Specifications Include 21 Billion Transistors, 5120 CUDA Cores, 16 GB HBM2 With 900 GB/s Bandwidth

May 10, 2017

NVIDIA Volta has just been announced at GTC 2017, and boy, it's a beast. The next-generation graphics processing unit is the world's first chip built on TSMC's industry-leading 12nm FinFET process, so let's cover every detail of this compute powerhouse.

NVIDIA Volta GV100 Unveiled – Tesla V100 With 5120 CUDA Cores, 16 GB HBM2 and 12nm FinFET Process

Last GTC, NVIDIA announced the Pascal based GP100 GPU. Back then, it was the fastest graphics chip designed for supercomputers. This year, NVIDIA is taking the next leap in graphics performance with its Volta based GV100 GPU. We are going to take a very deep look at the next-generation GPU designed for AI deep learning.


“Artificial intelligence is driving the greatest technology advances in human history,” said Jensen Huang, founder and chief executive officer of NVIDIA, who unveiled Volta at his GTC keynote. “It will automate intelligence and spur a wave of social progress unmatched since the industrial revolution.

“Deep learning, a groundbreaking AI approach that creates computer software that learns, has insatiable demand for processing power. Thousands of NVIDIA engineers spent over three years crafting Volta to help meet this need, enabling the industry to realize AI’s life-changing potential,” he said.

Volta, NVIDIA’s seventh-generation GPU architecture, is built with 21 billion transistors and delivers the equivalent performance of 100 CPUs for deep learning.

It provides a 5x improvement over Pascal, the current-generation NVIDIA GPU architecture, in peak teraflops, and 15x over the Maxwell architecture, launched two years ago. This performance surpasses by 4x the improvements that Moore’s law would have predicted.

via NVIDIA

First of all, we need to talk about the workloads this specific chip is designed to handle. The NVIDIA Volta GV100 GPU is designed to power the most computationally intensive HPC, AI, and graphics workloads.


The GV100 GPU includes 21.1 billion transistors on a die measuring 815mm2, fabricated on a new TSMC 12nm FFN high-performance manufacturing process customized for NVIDIA. That makes it much bigger than the 610mm2 Pascal GP100 GPU. NVIDIA Volta GV100 delivers considerably more compute performance and adds many new features compared to its predecessor, the Pascal GP100 GPU and its architecture family. Further simplifying GPU programming and application porting, GV100 also improves GPU resource utilization. It is an extremely power-efficient processor, delivering exceptional performance per watt.

The chip itself is a behemoth, featuring a brand new architecture that is just insane in terms of raw specifications. The NVIDIA Volta GV100 GPU is composed of six GPCs (Graphics Processing Clusters) holding 84 Volta streaming multiprocessors (SMs) arranged in 42 TPCs (each TPC includes two SMs). The 84 SMs carry 64 CUDA cores per SM, so we are looking at a total of 5376 CUDA cores on the complete die. All 5376 CUDA cores can be used for FP32 and INT32 instructions, while there are also a total of 2688 FP64 (double precision) cores. Aside from these, we are looking at 672 Tensor Cores and 336 texture units.
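Those totals follow directly from the per-SM counts, as a quick sanity check:

  84 SMs x 64 FP32 cores = 5376 CUDA cores
  84 SMs x 32 FP64 cores = 2688 FP64 cores
  84 SMs x 8 Tensor Cores = 672 Tensor Cores
  84 SMs x 4 texture units = 336 texture units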

The memory architecture is updated with eight 512-bit memory controllers, which adds up to a 4096-bit bus interface supporting up to 16 GB of HBM2 VRAM. The HBM2 memory is clocked at 878 MHz, delivering transfer rates of 900 GB/s compared to 720 GB/s on Pascal GP100. Each memory controller is attached to 768 KB of L2 cache, for a total of 6 MB of L2 cache across the entire chip.
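Both headline numbers check out with simple arithmetic. HBM2 transfers data on both clock edges, so:

  4096-bit bus x 878 MHz x 2 ÷ 8 bits per byte ≈ 900 GB/s
  8 memory controllers x 768 KB L2 = 6144 KB = 6 MB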

NVIDIA Tesla Graphics Cards Comparison:

| Tesla Graphics Card Name | NVIDIA Tesla M2090 | NVIDIA Tesla K40 | NVIDIA Tesla K80 | NVIDIA Tesla P100 | NVIDIA Tesla V100 |
|---|---|---|---|---|---|
| GPU Process | 40nm | 28nm | 28nm | 16nm | 12nm |
| GPU Name | GF110 | GK110 | GK210 x 2 | GP100 | GV100 |
| Die Size | 520mm2 | 561mm2 | 561mm2 | 610mm2 | 815mm2 |
| Transistor Count | 3.00 Billion | 7.08 Billion | 7.08 Billion | 15.3 Billion | 21.1 Billion |
| CUDA Cores | 512 (16 SMs) | 2880 (15 SMX) | 2496 x 2 (13 SMX x 2) | 3840 | 5120 |
| Core Clock | Up To 650 MHz | Up To 875 MHz | Up To 875 MHz | Up To 1480 MHz | Up To 1455 MHz |
| FP32 Compute | 1.33 TFLOPs | 4.29 TFLOPs | 8.74 TFLOPs | 10.6 TFLOPs | 15.0 TFLOPs |
| FP64 Compute | 0.66 TFLOPs | 1.43 TFLOPs | 2.91 TFLOPs | 5.30 TFLOPs | 7.50 TFLOPs |
| VRAM Size | 6 GB | 12 GB | 12 GB x 2 | 16 GB | 16 GB |
| VRAM Type | GDDR5 | GDDR5 | GDDR5 | HBM2 | HBM2 |
| VRAM Bus | 384-bit | 384-bit | 384-bit x 2 | 4096-bit | 4096-bit |
| VRAM Speed | 3.7 GHz | 6 GHz | 5 GHz | 737 MHz | 878 MHz |
| Memory Bandwidth | 177.6 GB/s | 288 GB/s | 240 GB/s | 720 GB/s | 900 GB/s |
| Maximum TDP | 250W | 300W | 235W | 300W | 300W |

NVIDIA Volta SM (Streaming Multiprocessor)

Architected to deliver higher performance, the Volta SM has lower instruction and cache latencies than past SM designs and includes new features to accelerate deep learning applications.

Major Features include:

  • New mixed-precision FP16/FP32 Tensor Cores purpose-built for deep learning matrix arithmetic;
  • Enhanced L1 data cache for higher performance and lower latency;
  • Streamlined instruction set for simpler decoding and reduced instruction latencies;
  • Higher clocks and higher power efficiency.

Similar to Pascal GP100, the GV100 SM incorporates 64 FP32 cores and 32 FP64 cores per SM. However, the GV100 SM uses a new partitioning method to improve SM utilization and overall performance. Recall the GP100 SM is partitioned into two processing blocks, each with 32 FP32 Cores, 16 FP64 Cores, an instruction buffer, one warp scheduler, two dispatch units, and a 128 KB Register File. The GV100 SM is partitioned into four processing blocks, each with 16 FP32 Cores, 8 FP64 Cores, 16 INT32 Cores, two of the new mixed-precision Tensor Cores for deep learning matrix arithmetic, a new L0 instruction cache, one warp scheduler, one dispatch unit, and a 64 KB Register File. Note that the new L0 instruction cache is now used in each partition to provide higher efficiency than the instruction buffers used in prior NVIDIA GPUs.
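The partition math works out to the same per-SM totals as before, just sliced more finely:

  4 processing blocks x 16 FP32 cores = 64 FP32 cores per SM
  4 x 8 FP64 cores = 32 FP64 cores; 4 x 16 INT32 cores = 64 INT32 cores; 4 x 2 Tensor Cores = 8 Tensor Cores per SM
  4 x 64 KB register file = 256 KB per SM, matching GP100's 2 x 128 KB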

While a GV100 SM has the same number of registers as a Pascal GP100 SM, the entire GV100 GPU has far more SMs, and thus many more registers overall. In aggregate, GV100 supports more threads, warps, and thread blocks in flight compared to prior GPU generations.

Overall shared memory across the entire GV100 GPU is increased due to the increased SM count and potential for up to 96 KB of Shared Memory per SM, compared to 64 KB in GP100.

Unlike Pascal GPUs, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput. Dependent instruction issue latency is also reduced for core FMA math operations, requiring only four clock cycles on Volta, compared to six cycles on Pascal.
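To make that concrete, here is a minimal CUDA sketch (our own illustration, not NVIDIA code) of the kind of kernel that benefits: the integer address arithmetic and the floating-point math below are independent instruction streams that Volta's separate INT32 and FP32 datapaths can issue simultaneously.

```cpp
// Illustrative kernel: strided gather plus FP32 fused multiply-add.
// On Volta, the INT32 index math and the FP32 FMA execute on separate
// datapaths, so neither starves the other of issue slots.
__global__ void gather_fma(const float* __restrict__ in,
                           const int*   __restrict__ idx,
                           float* __restrict__ out,
                           float scale, float bias, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // INT32 datapath
    if (i < n) {
        int j = idx[i] * 4 + 1;                      // more INT32 work
        // FP32 FMA: dependent issue latency is 4 cycles on Volta vs 6 on Pascal
        out[i] = fmaf(in[j], scale, bias);
    }
}
```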

NVIDIA Volta Tesla V100 Specs:

| NVIDIA Tesla Graphics Card | Tesla K40 (PCI-E) | Tesla M40 (PCI-E) | Tesla P100 (PCI-E, 12 GB) | Tesla P100 (PCI-E, 16 GB) | Tesla P100 (SXM2) | Tesla V100 (PCI-E) | Tesla V100 (SXM2) |
|---|---|---|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) |
| Process Node | 28nm | 28nm | 16nm | 16nm | 16nm | 12nm | 12nm |
| Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion |
| GPU Die Size | 561 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 610 mm2 | 815mm2 | 815mm2 |
| SMs | 15 | 24 | 56 | 56 | 56 | 80 | 80 |
| TPCs | 15 | 24 | 28 | 28 | 28 | 40 | 40 |
| CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64 |
| CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 3584 | 5120 | 5120 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 1792 | 2560 | 2560 |
| Base Clock | 745 MHz | 948 MHz | TBD | TBD | 1328 MHz | TBD | 1370 MHz |
| Boost Clock | 875 MHz | 1114 MHz | 1300 MHz | 1300 MHz | 1480 MHz | 1370 MHz | 1455 MHz |
| FP16 Compute | N/A | N/A | 18.7 TFLOPs | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.0 TFLOPs |
| FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 9.3 TFLOPs | 9.3 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.0 TFLOPs |
| FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.50 TFLOPs |
| Texture Units | 240 | 192 | 224 | 224 | 224 | 320 | 320 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | 12 GB @ 288 GB/s | 24 GB @ 288 GB/s | 12 GB @ 549 GB/s | 16 GB @ 732 GB/s | 16 GB @ 732 GB/s | 16 GB @ 900 GB/s | 16 GB @ 900 GB/s |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB |
| TDP | 235W | 250W | 250W | 250W | 300W | 250W | 300W |

NVIDIA VOLTA GV100 GPU WITH ADVANCED TENSOR CORES

Tesla P100 delivered considerably higher performance for training neural networks compared to the prior generation NVIDIA Maxwell and Kepler architectures, but the complexity and size of neural networks have continued to grow. New networks that have thousands of layers and millions of neurons demand even higher performance and faster training times.

New Tensor Cores are the most important feature of the Volta GV100 architecture, helping it deliver the performance required to train large neural networks. Tesla V100's Tensor Cores deliver up to 120 Tensor TFLOPS for training and inference applications. Tensor Cores provide up to 12x higher peak TFLOPS on Tesla V100 for deep learning training compared to P100 FP32 operations, and for deep learning inference, up to 6x higher peak TFLOPS compared to P100 FP16 operations. The Tesla V100 GPU contains 640 Tensor Cores: 8 per SM.
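That peak figure is easy to derive: each Tensor Core performs a 4x4x4 matrix fused multiply-add per clock, which is 64 FMAs or 128 floating-point operations, and the SXM2 card boosts to 1455 MHz:

  640 Tensor Cores x 128 ops per clock x 1455 MHz ≈ 119.2 TFLOPS ≈ 120 Tensor TFLOPS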

Matrix-matrix multiplication (BLAS GEMM) operations are at the core of neural network training and inferencing, and are used to multiply large matrices of input data and weights in the connected layers of the network. According to NVIDIA, Tensor Cores in the Tesla V100 GPU boost the performance of these operations by more than 9x compared to the Pascal-based GP100 GPU.
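Programmers reach the Tensor Cores through CUDA 9's new warp-level matrix-multiply-accumulate (WMMA) API. Below is a minimal sketch of a single warp computing one 16x16x16 mixed-precision tile; the kernel name and the assumption that the matrices are exactly 16x16 are ours, while the wmma types and calls are the CUDA 9 API.

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of D = A*B + C.
// A and B are FP16; accumulation is FP32 (mixed precision).
__global__ void wmma_tile(const half* a, const half* b, float* d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);        // start with C = 0
    wmma::load_matrix_sync(a_frag, a, 16);    // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // executes on Tensor Cores
    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}
```

Compiled with nvcc -arch=sm_70 and launched with a single warp (e.g. wmma_tile<<<1, 32>>>(a, b, d)), the whole tile multiply maps onto the SM's Tensor Cores.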

| GPU | Kepler GK110 | Maxwell GM200 | Pascal GP100 | Volta GV100 |
|---|---|---|---|---|
| Compute Capability | 3.5 | 5.2 | 6.0 | 7.0 |
| Threads / Warp | 32 | 32 | 32 | 32 |
| Max Warps / Multiprocessor | 64 | 64 | 64 | 64 |
| Max Threads / Multiprocessor | 2048 | 2048 | 2048 | 2048 |
| Max Thread Blocks / Multiprocessor | 16 | 32 | 32 | 32 |
| Max 32-bit Registers / SM | 65536 | 65536 | 65536 | 65536 |
| Max Registers / Block | 65536 | 32768 | 65536 | 65536 |
| Max Registers / Thread | 255 | 255 | 255 | 255 |
| Max Thread Block Size | 1024 | 1024 | 1024 | 1024 |
| CUDA Cores / SM | 192 | 128 | 64 | 64 |
| Shared Memory Size / SM Configurations (bytes) | 16K/32K/48K | 96K | 64K | up to 96K |

NVIDIA VOLTA GV100 GPU WITH ENHANCED L1 DATA CACHE AND SHARED MEMORY

The new combined L1 data cache and shared memory subsystem of the Volta SM significantly improves performance while also simplifying programming and reducing the tuning required to reach peak or near-peak application performance.

Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The combined capacity is 128 KB/SM, more than 7 times larger than the GP100 data cache, and all of it is usable as a cache by programs that do not use shared memory. Texture units also use the cache. For example, if shared memory is configured to 64 KB, texture and load/store operations can use the remaining 64 KB of L1.
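The split is controlled per kernel from the host. Here is a brief sketch using two CUDA 9 runtime attributes added for Volta; my_kernel is a hypothetical placeholder, while the attribute names are the real CUDA 9 API.

```cpp
#include <cuda_runtime.h>

__global__ void my_kernel() { /* hypothetical kernel, for illustration */ }

void configure_carveout()
{
    // Request that 50% of the 128 KB combined block be carved out as
    // shared memory (a hint; the driver picks the nearest supported split).
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);

    // Kernels that want more than the default 48 KB of dynamic shared
    // memory (Volta allows up to 96 KB per block) must opt in explicitly.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, 96 * 1024);
}
```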

Integration within the shared memory block ensures the Volta GV100 L1 cache has much lower latency and higher bandwidth than the L1 caches in past NVIDIA GPUs. The L1 cache in Volta functions as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth, low-latency access to frequently reused data, the best of both worlds. This combination is unique to Volta and delivers more accessible performance than past designs.

NVIDIA Volta GV100 GPU Key Features:

Key compute features of the NVIDIA Volta GV100 based Tesla V100 include the following:

  • New Streaming Multiprocessor (SM) Architecture Optimized for Deep Learning: Volta features a major redesign of the SM processor architecture that is at the center of the GPU. The new Volta SM is 50% more energy efficient than the previous generation Pascal design, enabling major boosts in FP32 and FP64 performance in the same power envelope. New Tensor Cores designed specifically for deep learning deliver up to 12x higher peak TFLOPs for training. With independent, parallel integer and floating-point datapaths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations. Volta's new independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads. Finally, a new combined L1 Data Cache and Shared Memory subsystem significantly improves performance while also simplifying programming.
  • Second-Generation NVLink: The second generation of NVIDIA's NVLink high-speed interconnect delivers higher bandwidth, more links, and improved scalability for multi-GPU and multi-GPU/CPU system configurations. GV100 supports up to six NVLink links at 25 GB/s each for a total of 300 GB/s. NVLink now supports CPU mastering and cache coherence capabilities with IBM Power 9 CPU-based servers. The new NVIDIA DGX-1 with V100 AI supercomputer uses NVLink to deliver greater scalability for ultra-fast deep learning training.
  • HBM2 Memory (Faster, Higher Efficiency): Volta's highly tuned 16 GB HBM2 memory subsystem delivers 900 GB/s peak memory bandwidth. The combination of a new-generation HBM2 memory from Samsung and a new-generation memory controller in Volta provides 1.5x delivered memory bandwidth versus Pascal GP100 and greater than 95% memory bandwidth efficiency running many workloads.
  • Volta Multi-Process Service: Volta Multi-Process Service (MPS) is a new feature of the Volta GV100 architecture providing hardware acceleration of critical components of the CUDA MPS server, enabling improved performance, isolation, and better quality of service (QoS) for multiple compute applications sharing the GPU. Volta MPS also triples the maximum number of MPS clients, from 16 on Pascal to 48 on Volta.
  • Enhanced Unified Memory and Address Translation Services: Unified Memory technology in Volta GV100 includes new access counters to allow more accurate migration of memory pages to the processor that accesses the pages most frequently, improving efficiency for accessing memory ranges shared between processors. On IBM Power platforms, new Address Translation Services (ATS) support allows the GPU to access the CPU's page tables directly.
  • Cooperative Groups and New Cooperative Launch APIs: Cooperative Groups is a new programming model introduced in CUDA 9 for organizing groups of communicating threads. Cooperative Groups allows developers to express the granularity at which threads are communicating, helping them to express richer, more efficient parallel decompositions (see the sketch after this list). Basic Cooperative Groups functionality is supported on all NVIDIA GPUs since Kepler. Pascal and Volta include support for new Cooperative Launch APIs that support synchronization amongst CUDA thread blocks. Volta adds support for new synchronization patterns.
  • Maximum Performance and Maximum Efficiency Modes: In Maximum Performance mode, the Tesla V100 accelerator will operate unconstrained up to its TDP (Thermal Design Power) level of 300W to accelerate applications that require the fastest computational speed and highest data throughput. Maximum Efficiency mode allows data center managers to tune the power usage of their Tesla V100 accelerators to operate with optimal performance per watt. A not-to-exceed power cap can be set across all GPUs in a rack, reducing power consumption dramatically while still obtaining excellent rack performance.
  • Volta Optimized Software: New versions of deep learning frameworks such as Caffe2, MXNet, CNTK, TensorFlow, and others harness the performance of Volta to deliver dramatically faster training times and higher multi-node training performance. Volta-optimized versions of GPU-accelerated libraries such as cuDNN, cuBLAS, and TensorRT leverage the new features of the Volta GV100 architecture to deliver higher performance for both deep learning and High Performance Computing (HPC) applications. The NVIDIA CUDA Toolkit version 9.0 includes new APIs and support for Volta features to provide even easier programmability.
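As a flavor of the Cooperative Groups model mentioned above, here is a minimal warp-tile reduction using the CUDA 9 API; the kernel itself is our own illustration, not NVIDIA sample code.

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Warp-tile reduction with Cooperative Groups (CUDA 9). Each 32-thread
// tile reduces its values with register shuffles, no shared memory needed.
__global__ void tile_sum(const float* in, float* out)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    float v = in[block.group_index().x * block.size() + block.thread_rank()];
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);           // intra-tile shuffle

    if (tile.thread_rank() == 0)
        atomicAdd(out, v);                        // one atomic add per tile
}
```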
| GPU Family | AMD Vega | AMD Navi | NVIDIA Pascal | NVIDIA Volta |
|---|---|---|---|---|
| Flagship GPU | Vega 10 | Navi 10 | NVIDIA GP100 | NVIDIA GV100 |
| GPU Process | 14nm FinFET | 7nm FinFET | TSMC 16nm FinFET | TSMC 12nm FinFET |
| GPU Transistors | 15-18 Billion | TBC | 15.3 Billion | 21.1 Billion |
| GPU Cores (Max) | 4096 SPs | TBC | 3840 CUDA Cores | 5376 CUDA Cores |
| Peak FP32 Compute | 13.0 TFLOPs | TBC | 12.0 TFLOPs | >15.0 TFLOPs (Full Die) |
| Peak FP16 Compute | 25.0 TFLOPs | TBC | 24.0 TFLOPs | 120 Tensor TFLOPs |
| VRAM | 16 GB HBM2 | TBC | 16 GB HBM2 | 16 GB HBM2 |
| Memory (Consumer Cards) | HBM2 | HBM3 | GDDR5X | GDDR6 |
| Memory (Dual-Chip Professional/HPC) | HBM2 | HBM3 | HBM2 | HBM2 |
| HBM2 Bandwidth | 484 GB/s (Frontier Edition) | >1 TB/s? | 732 GB/s (Peak) | 900 GB/s |
| Graphics Architecture | Next Compute Unit (Vega) | Next Compute Unit (Navi) | 5th Gen Pascal CUDA | 6th Gen Volta CUDA |
| Successor of (GPU) | Radeon RX 500 Series | Radeon RX 600 Series | GM200 (Maxwell) | GP100 (Pascal) |
| Launch | 2017 | 2019 | 2016 | 2017 |

NVIDIA has stated that the Volta GV100-based Tesla V100 will start shipping in 2017. We are looking at availability in 2H 2017, so we can expect consumer variants to be ready for launch in early 2018.
