Announcement GTC 2026 Hardware PC

NVIDIA Volta GV100 12nm FinFET GPU Detailed – Tesla V100 Specifications Include 21 Billion Transistors, 5120 CUDA Cores, 16 GB HBM2 With 900 GB/s Bandwidth

Hassan Mujtaba

• May 10, 2017 at 01:29pm EDT

NVIDIA Volta has just been announced at GTC 2017 and boy it's a beast. The next-generation graphics processing unit is the world's first chip that will make use of the industry leading TSMC 12nm FinFET process, so let's cover every detail of this compute powerhouse.

NVIDIA Volta GV100 Unveiled - Tesla V100 With 5120 CUDA Cores, 16 GB HBM2 and 12nm FinFET Process

Last GTC, NVIDIA announced the Pascal based GP100 GPU. It was back then, the fastest graphics chip designed for supercomputers. This year, NVIDIA is taking the next leap in graphics performance and announced their Volta based GV100 GPU. We are going to take a very deep look at the next-generation GPU designed for AI Deep Learning.

"Artificial intelligence is driving the greatest technology advances in human history," said Jensen Huang, founder and chief executive officer of NVIDIA, who unveiled Volta at his GTC keynote. "It will automate intelligence and spur a wave of social progress unmatched since the industrial revolution.

"Deep learning, a groundbreaking AI approach that creates computer software that learns, has insatiable demand for processing power. Thousands of NVIDIA engineers spent over three years crafting Volta to help meet this need, enabling the industry to realize AI's life-changing potential," he said.

Volta, NVIDIA's seventh-generation GPU architecture, is built with 21 billion transistors and delivers the equivalent performance of 100 CPUs for deep learning.

It provides a 5x improvement over Pascal, the current-generation NVIDIA GPU architecture, in peak teraflops, and 15x over the Maxwell architecture, launched two years ago. This performance surpasses by 4x the improvements that Moore's law would have predicted.

via NVIDIA

First of all, we need to talk about the workloads this specific chip is designed to handle. The NVIDIA Volta GV100 GPU is designed to power the most computationally intensive HPC, AI, and graphics workloads.

The GV100 GPU includes 21.1 billion transistors with a die size of 815 mm2. It is fabricated on a new TSMC 12 nm FFN high performance manufacturing process customized for NVIDIA. The GPU is much bigger than the 610mm2 Pascal GP100 GPU. NVIDIA Volta GV100 delivers considerably more compute performance, and adds many new features compared to its predecessor, the Pascal GP100 GPU and its architecture family. Further simplifying GPU programming and application porting, GV100 also improves GPU resource utilization. GV100 is an extremely power-efficient processor, delivering exceptional performance per watt.

The chip itself is a behometh, featuring a brand new chip architecture that is just insane in terms of raw specifications. The NVIDIA Volta GV100 GPU is composed of six GPC (Graphics Processing Clusters). It has a total of 84 Volta streaming multiprocessor units, 42 TPCs (each including two SMs). The 84 SMs come with 64 CUDA cores per SM so we are looking at a total of 5376 CUDA cores on the complete die. All of the 5376 CUDA Cores can be used for FP32 and INT32 programming instructions while there are also a total of 2688 FP64 (Double Precision) cores. Aside from these, we are looking at 672 Tensor processors, 336 Texture Units.

The memory architecture is updated with eight 512-bit memory controllers. This rounds up to a total of 4096-bit bus interface that supports up to 16 GB of HBM2 VRAM. The bandwidth is boosted with speeds of 878 MHz, which delivers increased transfer rates of 900 GB/s compared to 720 GB/s on Pascal GP100. Each memory controller is attached to 768 KB of L2 cache which totals to 6 MB of L2 cache for the entire chip.

NVIDIA Tesla Graphics Cards Comparison:

Tesla Graphics Card Name	NVIDIA Tesla M2090	NVIDIA Tesla K40	NVIDIA Telsa K80	NVIDIA Tesla P100	NVIDIA Tesla V100
GPU Architecture	Fermi	Kepler	Maxwell	Pascal	Volta
GPU Process	40nm	28nm	28nm	16nm	12nm
GPU Name	GF110	GK110	GK210 x 2	GP100	GV100
Die Size	520mm2	561mm2	561mm2	610mm2	815mm2
Transistor Count	3.00 Billion	7.08 Billion	7.08 Billion	15 Billion	21.1 Billion
CUDA Cores	512 CCs (16 CUs)	2880 CCs (15 CUs)	2496 CCs (13 CUs) x 2	3840 CCs	5120 CCs
Core Clock	Up To 650 MHz	Up To 875 MHz	Up To 875 MHz	Up To 1480 MHz	Up To 1455 MHz
FP32 Compute	1.33 TFLOPs	4.29 TFLOPs	8.74 TFLOPs	10.6 TFLOPs	15.0 TFLOPs
FP64 Compute	0.66 TFLOPs	1.43 TFLOPs	2.91 TFLOPs	5.30 TFLOPs	7.50 TFLOPs
VRAM Size	6 GB	12 GB	12 GB x 2	16 GB	16 GB
VRAM Type	GDDR5	GDDR5	GDDR5	HBM2	HBM2
VRAM Bus	384-bit	384-bit	384-bit x 2	4096-bit	4096-bit
VRAM Speed	3.7 GHz	6 GHz	5 GHz	737 MHz	878 MHz
Memory Bandwidth	177.6 GB/s	288 GB/s	240 GB/s	720 GB/s	900 GB/s
Maximum TDP	250W	300W	235W	300W	300W

NVIDIA Volta SM (Streaming Multiprocessor)

Architected to deliver higher performance, the Volta SM has lower instruction and cache latencies than past SM designs and includes new features to accelerate deep learning applications.

Major Features include:

New mixed-precision FP16/FP32 Tensor Cores purpose-built for deep learning matrix arithmetic;
Enhanced L1 data cache for higher performance and lower latency;
Streamlined instruction set for simpler decoding and reduced instruction latencies;
Higher clocks and higher power efficiency.

Similar to Pascal GP100, the GV100 SM incorporates 64 FP32 cores and 32 FP64 cores per SM. However, the GV100 SM uses a new partitioning method to improve SM utilization and overall performance. Recall the GP100 SM is partitioned into two processing blocks, each with 32 FP32 Cores, 16 FP64 Cores, an instruction buffer, one warp scheduler, two dispatch units, and a 128 KB Register File. The GV100 SM is partitioned into four processing blocks, each with 16 FP32 Cores, 8 FP64 Cores, 16 INT32 Cores, two of the new mixed-precision Tensor Cores for deep learning matrix arithmetic, a new L0 instruction cache, one warp scheduler, one dispatch unit, and a 64 KB Register File. Note that the new L0 instruction cache is now used in each partition to provide higher efficiency than the instruction buffers used in prior NVIDIA GPUs.

While a GV100 SM has the same number of registers as a Pascal GP100 SM, the entire GV100 GPU has far more SMs, and thus many more registers overall. In aggregate, GV100 supports more threads, warps, and thread blocks in flight compared to prior GPU generations.

Overall shared memory across the entire GV100 GPU is increased due to the increased SM count and potential for up to 96 KB of Shared Memory per SM, compared to 64 KB in GP100.

Unlike Pascal GPUs, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput. Dependent instruction issue latency is also reduced for core FMA math operations, requiring only four clock cycles on Volta, compared to six cycles on Pascal.

NVIDIA Volta Tesla V100S Specs:

NVIDIA Tesla Graphics Card	Tesla K40 (PCI-Express)	Tesla M40 (PCI-Express)	Tesla P100 (PCI-Express)	Tesla P100 (SXM2)	Tesla V100 (PCI-Express)	Tesla V100 (SXM2)	Tesla V100S (PCIe)
GPU	GK110 (Kepler)	GM200 (Maxwell)	GP100 (Pascal)	GP100 (Pascal)	GV100 (Volta)	GV100 (Volta)	GV100 (Volta)
Process Node	28nm	28nm	16nm	16nm	12nm	12nm	12nm
Transistors	7.1 Billion	8 Billion	15.3 Billion	15.3 Billion	21.1 Billion	21.1 Billion	21.1 Billion
GPU Die Size	551 mm2	601 mm2	610 mm2	610 mm2	815mm2	815mm2	815mm2
SMs	15	24	56	56	80	80	80
TPCs	15	24	28	28	40	40	40
CUDA Cores Per SM	192	128	64	64	64	64	64
CUDA Cores (Total)	2880	3072	3584	3584	5120	5120	5120
Texture Units	240	192	224	224	320	320	320
FP64 CUDA Cores / SM	64	4	32	32	32	32	32
FP64 CUDA Cores / GPU	960	96	1792	1792	2560	2560	2560
Base Clock	745 MHz	948 MHz	1190 MHz	1328 MHz	1230 MHz	1297 MHz	TBD
Boost Clock	875 MHz	1114 MHz	1329MHz	1480 MHz	1380 MHz	1530 MHz	1601 MHz
FP16 Compute	N/A	N/A	18.7 TFLOPs	21.2 TFLOPs	28.0 TFLOPs	30.4 TFLOPs	32.8 TFLOPs
FP32 Compute	5.04 TFLOPs	6.8 TFLOPs	10.0 TFLOPs	10.6 TFLOPs	14.0 TFLOPs	15.7 TFLOPs	16.4 TFLOPs
FP64 Compute	1.68 TFLOPs	0.2 TFLOPs	4.7 TFLOPs	5.30 TFLOPs	7.0 TFLOPs	7.80 TFLOPs	8.2 TFLOPs
Memory Interface	384-bit GDDR5	384-bit GDDR5	4096-bit HBM2	4096-bit HBM2	4096-bit HBM2	4096-bit HBM2	4096-bit HBM
Memory Size	12 GB GDDR5 @ 288 GB/s	24 GB GDDR5 @ 288 GB/s	16 GB HBM2 @ 732 GB/s 12 GB HBM2 @ 549 GB/s	16 GB HBM2 @ 732 GB/s	16 GB HBM2 @ 900 GB/s	16 GB HBM2 @ 900 GB/s	16 GB HBM2 @ 1134 GB/s
L2 Cache Size	1536 KB	3072 KB	4096 KB	4096 KB	6144 KB	6144 KB	6144 KB
TDP	235W	250W	250W	300W	250W	300W	250W

NVIDIA VOLTA GV100 GPU WITH ADVANCED TENSOR CORES

Tesla P100 delivered considerably higher performance for training neural networks compared to the prior generation NVIDIA Maxwell and Kepler architectures, but the complexity and size of neural networks have continued to grow. New networks that have thousands of layers and millions of neurons demand even higher performance and faster training times.

New Tensor Cores are the most important feature of the Volta GV100 architecture to help deliver the performance required to train large neural networks. Tesla V100’s Tensor Cores deliver up to 120 Tensor TFLOPS for training and inference applications. Tensor Cores provide up to 12x higher peak TFLOPS on Tesla V100 for deep learning training compared to P100 FP32 operations, and for deep learning inference, up to 6x higher peak TFLOPS compared to P100 FP16 operations. The Tesla V100 GPU contains 640 Tensor Cores: 8 per SM.

Matrix-Matrix multiplication (BLAS GEMM) operations are at the core of neural network training and inferencing, and are used to multiply large matrices of input data and weights in the connected layers of the network. As Figure 6 shows, Tensor Cores in the Tesla V100 GPU boost the performance of these operations by more than 9x compared to the Pascal-based GP100 GPU.

GPU	Kepler GK110	Maxwell GM200	Pascal GP100	Volta GV100
Compute Capability	3.5	5.3	6.0	7.0
Threads / Warp	32	32	32	32
Max Warps / Multiprocessor	64	64	64	64
Max Threads / Multiprocessor	2048	2048	2048	2048
Max Thread Blocks / Multiprocessor	16	32	32	32
Max 32-bit Registers / SM	65536	65536	65536	65536
Max Registers / Block	65536	32768	65536	65536
Max Registers / Thread	255	255	255	255
Max Thread Block Size	1024	1024	1024	1024
CUDA Cores / SM	192	128	64	64
Shared Memory Size / SM Configurations (bytes)	16K/32K/48K	96K	64K	96K

NVIDIA VOLTA GV100 GPU WITH ENHANCED L1 DATA CACHE AND SHARED MEMORY

The new combined L1 data cache and shared memory subsystem of the Volta SM significantly improves performance while also simplifying programming and reducing the tuning required to attain at or near-peak application performance.

Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The combined capacity is 128 KB/SM, more than 7 times larger than the GP100 data cache, and all of it is usable as a cache by programs that do not use shared memory. Texture units also use the cache. For example, if shared memory is configured to 64 KB, texture and load/store operations can use the remaining 64 KB of L1.

Integration within the shared memory block ensures the Volta GV100 L1 cache has much lower latency and higher bandwidth than the L1 caches in past NVIDIA GPUs. The L1 In Volta functions as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data—the best of both worlds. This combination is unique to Volta and delivers more accessible performance than in the past.

NVIDIA Volta GV100 GPU Key Features:

Key compute features of the NVIDIA Volta GV100 based Tesla V100 include the following:

New Streaming Multiprocessor (SM) Architecture Optimized for Deep Learning Volta features a major new redesign of the SM processor architecture that is at the center of the GPU. The new Volta SM is 50% more energy efficient than the previous generation Pascal design, enabling major boosts in FP32 and FP64 performance in the same power envelope. New Tensor Cores designed specifically for deep learning deliver up to 12x higher peak TFLOPs for training. With independent, parallel integer and floating point datapaths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations. Volta’s new independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads. Finally, a new combined L1 Data Cache and Shared Memory subsystem significantly improves performance while also simplifying programming.
Second-Generation NVLink The second generation of NVIDIA’s NVLink high-speed interconnect delivers higher bandwidth, more links, and improved scalability for multi-GPU and multi-GPU/CPU system configurations. GV100 supports up to 6 NVLink links at 25 GB/s for a total of 300 GB/s. NVLink now supports CPU mastering and cache coherence capabilities with IBM Power 9 CPU-based servers. The new NVIDIA DGX-1 with V100 AI supercomputer uses NVLink to deliver greater scalability for ultra-fast deep learning training.
HBM2 Memory: Faster, Higher Efficiency Volta’s highly tuned 16GB HBM2 memory subsystem delivers 900 GB/sec peak memory bandwidth. The combination of both a new generation HBM2 memory from Samsung, and a new generation memory controller in Volta, provides 1.5x delivered memory bandwidth versus Pascal GP100 and greater than 95% memory bandwidth efficiency running many workloads.
Volta Multi-Process Service Volta Multi-Process Service (MPS) is a new feature of the Volta GV100 architecture providing hardware acceleration of critical components of the CUDA MPS server, enabling improved performance, isolation, and better quality of service (QoS) for multiple compute applications sharing the GPU. Volta MPS also triples the maximum number of MPS clients from 16 on Pascal to 48 on Volta.
Enhanced Unified Memory and Address Translation Services GV100 Unified Memory technology in Volta GV100 includes new access counters to allow more accurate migration of memory pages to the processor that accesses the pages most frequently, improving efficiency for accessing memory ranges shared between processors. On IBM Power platforms, new Address Translation Services (ATS) support allows the GPU to access the CPU’s page tables directly.
Cooperative Groups and New Cooperative Launch APIs Cooperative Groups is a new programming model introduced in CUDA 9 for organizing groups of communicating threads. Cooperative Groups allows developers to express the granularity at which threads are communicating, helping them to express richer, more efficient parallel decompositions. Basic Cooperative Groups functionality is supported on all NVIDIA GPUs since Kepler. Pascal and Volta include support for new Cooperative Launch APIs that support synchronization amongst CUDA thread blocks. Volta adds support for new synchronization patterns.
Maximum Performance and Maximum Efficiency Modes In Maximum Performance mode, the Tesla V100 accelerator will operate unconstrained up to its TDP (Thermal Design Power) level of 300W to accelerate applications that require the fastest computational speed and highest data throughput. Maximum Efficiency Mode allows data center managers to tune power usage of their Tesla V100 accelerators to operate with optimal performance per watt. A not-to-exceed power cap can be set across all GPUs in a rack, reducing power consumption dramatically, while still obtaining excellent rack performance.
Volta Optimized Software New versions of deep learning frameworks such as Caffe2, MXNet, CNTK, TensorFlow, and others harness the performance of Volta to deliver dramatically faster training times and higher multi-node training performance. Volta-optimized versions of GPU accelerated libraries such as cuDNN, cuBLAS, and TensorRT leverage the new features of the Volta GV100 architecture to deliver higher performance for both deep learning and High Performance Computing (HPC) applications. The NVIDIA CUDA Toolkit version 9.0 includes new APIs and support for Volta features to provide even easier programmability.

GPU Family	AMD Vega	AMD Navi	NVIDIA Pascal	NVIDIA Volta
Flagship GPU	Vega 10	Navi 10	NVIDIA GP100	NVIDIA GV100
GPU Process	14nm FinFET	7nm FinFET	TSMC 16nm FinFET	TSMC 12nm FinFET
GPU Transistors	15-18 Billion	TBC	15.3 Billion	21.1 Billion
GPU Cores (Max)	4096 SPs	TBC	3840 CUDA Cores	5376 CUDA Cores
Peak FP32 Compute	13.0 TFLOPs	TBC	12.0 TFLOPs	>15.0 TFLOPs (Full Die)
Peak FP16 Compute	25.0 TFLOPs	TBC	24.0 TFLOPs	120 Tensor TFLOPs
VRAM	16 GB HBM2	TBC	16 GB HBM2	16 GB HBM2
Memory (Consumer Cards)	HBM2	HBM3	GDDR5X	GDDR6
Memory (Dual-Chip Professional/ HPC)	HBM2	HBM3	HBM2	HBM2
HBM2 Bandwidth	484 GB/s (Frontier Edition)	>1 TB/s?	732 GB/s (Peak)	900 GB/s
Graphics Architecture	Next Compute Unit (Vega)	Next Compute Unit (Navi)	5th Gen Pascal CUDA	6th Gen Volta CUDA
Successor of (GPU)	Radeon RX 500 Series	Radeon RX 600 Series	GM200 (Maxwell)	GP100 (Pascal)
Launch	2017	2019	2016	2017

NVIDIA has stated that the NVIDIA Volta GV100 GPU based Tesla V100 will start shipping in 2017. We are looking at availability in 2H 2017 so we can expect consumer variants well and ready for launch in early 2018.

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.

Follow Wccftech on Google to get more of our news coverage in your feeds.

Read all comments on NVIDIA Volta GV100 12nm FinFET GPU Detailed – Tesla V100 Specifications Include 21 Billion Transistors, 5120 CUDA Cores, 16 GB HBM2 With 900 GB/s Bandwidth

NVIDIA Volta GV100 12nm FinFET GPU Detailed – Tesla V100 Specifications Include 21 Billion Transistors, 5120 CUDA Cores, 16 GB HBM2 With 900 GB/s Bandwidth

NVIDIA Volta GV100 Unveiled - Tesla V100 With 5120 CUDA Cores, 16 GB HBM2 and 12nm FinFET Process

NVIDIA Tesla Graphics Cards Comparison:

NVIDIA Volta SM (Streaming Multiprocessor)

NVIDIA Volta Tesla V100S Specs:

NVIDIA VOLTA GV100 GPU WITH ADVANCED TENSOR CORES

NVIDIA VOLTA GV100 GPU WITH ENHANCED L1 DATA CACHE AND SHARED MEMORY

NVIDIA Volta GV100 GPU Key Features:

Trending Stories

Valve Is “Irresponsible” To Force AI Disclosures on Steam, Epic CEO Says, While Unreal Engine 6 Doubles Down on AI

GTA 6 Physical Disc Reportedly Arrives This December, While Ray-Traced Screenshots May Not Reflect Console Reality

Qualcomm Claims Single-Core Leadership for Its First Server CPU, the Dragonfly C1000, Delivering 250+ Cores & 5 GHz By 2028

Apple Announces Massive Price Increases Across The Board, With Up To $1,300 Premium Being Charged As The Memory Crisis Claims The Trillion-Dollar Giant

Hygon’s 128-Core & 512-Thread C86 CPU Targets Intel Xeon With 15% IPC Gain, As China Races to Cut Foreign Chip Reliance

Popular Discussions

AMD Reportedly Plots Another 10-15% RX 9000 Price Hike As The RAMpocalypse Swallows The GPU Market

AMD Rolls Out FSR 4.1 For RX 7000 GPUs, Builds a Lightweight ML Model for RDNA 3.5 and RDNA 3 iGPUs

AMD’s FSR 4.1 Doubles RX 7900 XTX frame Rates In Cyberpunk 2077, Jumping From 24 To 50 FPS At 4K

AMD Reverses Course On Removing TSME From Ryzen Chips; Will Reinstate The Feature Through A New BIOS Update

YouTuber Daniel Owen And Club386 Got Their RTX 5090 Connectors Cooked; Club386 Calls It A “Flawed Design”

NVIDIA Volta GV100 12nm FinFET GPU Detailed – Tesla V100 Specifications Include 21 Billion Transistors, 5120 CUDA Cores, 16 GB HBM2 With 900 GB/s Bandwidth

NVIDIA Volta GV100 Unveiled - Tesla V100 With 5120 CUDA Cores, 16 GB HBM2 and 12nm FinFET Process

Related Story JEDEC Approves SPHBM4 to Break HBM’s Costly Packaging Bottleneck, Retaining HBM4-level Speeds With Standard Packages

NVIDIA Tesla Graphics Cards Comparison:

NVIDIA Volta SM (Streaming Multiprocessor)

NVIDIA Volta Tesla V100S Specs:

NVIDIA VOLTA GV100 GPU WITH ADVANCED TENSOR CORES

NVIDIA VOLTA GV100 GPU WITH ENHANCED L1 DATA CACHE AND SHARED MEMORY

NVIDIA Volta GV100 GPU Key Features:

Further Reading

Trending Stories

Popular Discussions