NVIDIA Volta Tesla V100 GPU Accelerator Compute Performance Revealed – Features A Monumental Increase Over Pascal Based Tesla P100

Sep 17, 2017

NVIDIA’s flagship and the fastest graphics accelerator in the world, the Volta GPU based Tesla V100, is now shipping to customers around the globe. The new GPU is a marvel of engineering, combining the latest 12nm process, NVLINK 2.0, HBM2 memory, Tensor Cores and a highly efficient architecture design that make it the most suitable chip for heavy compute or AI (Deep Learning) workloads.

NVIDIA Volta GV100 GPU Based Tesla V100 Benchmarked – A Monumental Performance Increase in Geekbench Compute Test Over The Pascal GP100 Based Tesla P100

Released just a year after the Pascal based Tesla P100, the Volta based Tesla V100 bests its predecessor in every possible way. And just like its predecessor, the flagship is aimed at the deep learning and compute markets. At GTC 2017, we got to learn almost everything about the Volta GV100 GPU, but now we have the first independent test results and they are a shocker.

Tested in Geekbench 4, the system used was an NVIDIA DGX-1. The DGX-1 is what NVIDIA calls a supercomputer inside a box. It’s a powerful machine that manages to deliver some astonishing performance results. As per official claims, the total horsepower on the DGX-1 has been boosted from 170 TFLOPs of FP16 compute to 960 TFLOPs of FP16 compute which is a direct effect of the new Tensor cores that are featured inside the Volta GV100 GPU core.

In terms of specifications, this machine rocks eight Tesla V100 GPUs with 5120 CUDA cores each. This totals 40,960 CUDA Cores and 5120 Tensor Cores. The DGX-1 houses a total of 128 GB of HBM2 memory on its eight Tesla V100 GPUs. The system features dual Intel Xeon E5-2698 V4 processors, each with 20 cores and 40 threads, clocked at 2.2 GHz. There’s 512 GB of DDR4 memory inside the system. Storage is provided in the form of four 1.92 TB SSDs configured in RAID 0, and networking is dual 10 GbE with up to four InfiniBand EDR ports. The system comes with a 3.2 kW PSU.
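As a quick sanity check on the totals quoted above, the aggregate figures can be rebuilt from per-GPU numbers. This is a minimal sketch: the 640 Tensor Cores and 16 GB of HBM2 per GPU are simply the article's totals divided by eight, and the constant and function names are illustrative only.

```python
# Per-GPU figures. Tensor Cores and HBM2 per GPU are derived from the
# article's totals (5120 / 8 and 128 GB / 8), not quoted directly.
GPUS_IN_DGX1 = 8
CUDA_CORES_PER_GPU = 5120
TENSOR_CORES_PER_GPU = 640
HBM2_GB_PER_GPU = 16

# Official DGX-1 FP16 throughput claims quoted in the article (TFLOPs).
FP16_TFLOPS_PASCAL_DGX1 = 170
FP16_TFLOPS_VOLTA_DGX1 = 960


def dgx1_totals(gpus=GPUS_IN_DGX1):
    """Aggregate the per-GPU resources across the whole system."""
    return {
        "cuda_cores": gpus * CUDA_CORES_PER_GPU,
        "tensor_cores": gpus * TENSOR_CORES_PER_GPU,
        "hbm2_gb": gpus * HBM2_GB_PER_GPU,
    }


def fp16_uplift():
    """Ratio of the claimed Volta DGX-1 FP16 rate to the Pascal DGX-1 rate."""
    return FP16_TFLOPS_VOLTA_DGX1 / FP16_TFLOPS_PASCAL_DGX1


if __name__ == "__main__":
    print(dgx1_totals())
    print(f"FP16 uplift: {fp16_uplift():.2f}x")
```

The uplift works out to a little over 5.6x, which is far more than the increase in CUDA core count alone would explain and is consistent with the claim that the Tensor Cores are responsible for it.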

Now comes the part where we unveil the results. The NVIDIA DGX-1 currently holds the fastest compute result in the Geekbench 4 database. There’s no setup in sight that can dethrone this beast. For comparison, an HP Z8 G4 Workstation, which offers a total of nine PCIe slots, scores 278706 points in the OpenCL API with the Quadro GP100, which is essentially a Tesla P100 spec’d card. Moving over to the fastest Tesla P100 listing, we see a total of 8 PCIe cards configured to reach a score of 320031 in the CUDA API. But let’s take a look at the mind boggling Tesla V100 scores. A DGX-1 system with 8 SXM2 Tesla V100 cards scores 418504 in the OpenCL API and a monumental 743537 points in the CUDA API.
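To put those scores in proportion, the ratios between them can be worked out directly. The sketch below uses only the four scores quoted above; the dictionary keys and the helper name are illustrative, and the listings differ in host system, GPU count and API, so the ratios are rough indicators rather than like-for-like comparisons.

```python
# Geekbench 4 compute scores quoted in the article, keyed by (system, API).
SCORES = {
    ("HP Z8 G4, Quadro GP100", "OpenCL"): 278706,
    ("8x Tesla P100 (PCIe)", "CUDA"): 320031,
    ("DGX-1, 8x Tesla V100 (SXM2)", "OpenCL"): 418504,
    ("DGX-1, 8x Tesla V100 (SXM2)", "CUDA"): 743537,
}


def speedup(new, old):
    """Return how many times higher the `new` score is than the `old` one."""
    return SCORES[new] / SCORES[old]


if __name__ == "__main__":
    v100_cuda = ("DGX-1, 8x Tesla V100 (SXM2)", "CUDA")
    v100_opencl = ("DGX-1, 8x Tesla V100 (SXM2)", "OpenCL")
    p100_cuda = ("8x Tesla P100 (PCIe)", "CUDA")
    gp100_opencl = ("HP Z8 G4, Quadro GP100", "OpenCL")

    print(f"V100 CUDA vs P100 CUDA:      {speedup(v100_cuda, p100_cuda):.2f}x")
    print(f"V100 OpenCL vs GP100 OpenCL: {speedup(v100_opencl, gp100_opencl):.2f}x")
    print(f"V100 CUDA vs V100 OpenCL:    {speedup(v100_cuda, v100_opencl):.2f}x")
```

On these numbers the V100 system scores roughly 2.3 times the fastest P100 listing in CUDA, and the same DGX-1 scores close to 1.8 times higher in CUDA than in OpenCL.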

The score puts the Tesla V100 in an impressive lead over its predecessor, which is something we are excited to see. It also shows that we could be looking at a generational leap in the gaming GPU segment if the performance numbers from the chip architecture carry over to the mainstream markets. Another thing worth pointing out is the incredible tuning of compute output with the new CUDA API and related libraries. Not only does the Tesla V100 see big gains with CUDA over OpenCL, but the same can be seen for the Tesla P100, which means that NVIDIA is really doing some hard work with their cuDNN library, and it’s expected to get even better in the coming generations. So there you have it, NVIDIA’s fastest GPU showing off some killer performance in the compute workloads it was built for.

NVIDIA Volta Tesla V100 Specs:

| NVIDIA Tesla Graphics Card | Tesla K40 | Tesla M40 | Tesla P100 (12 GB) | Tesla P100 (16 GB) | Tesla P100 (SXM2) | Tesla V100 (PCI-Express) | Tesla V100 (SXM2) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) |
| Process Node | 28nm | 28nm | 16nm | 16nm | 16nm | 12nm | 12nm |
| Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² | 610 mm² | 610 mm² | 815 mm² | 815 mm² |
| SMs | 15 | 24 | 56 | 56 | 56 | 80 | 80 |
| TPCs | 15 | 24 | 28 | 28 | 28 | 40 | 40 |
| CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64 |
| CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 3584 | 5120 | 5120 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 1792 | 2560 | 2560 |
| Base Clock | 745 MHz | 948 MHz | TBD | TBD | 1328 MHz | TBD | 1370 MHz |
| Boost Clock | 875 MHz | 1114 MHz | 1300 MHz | 1300 MHz | 1480 MHz | 1370 MHz | 1455 MHz |
| FP16 Compute | N/A | N/A | 18.7 TFLOPs | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.0 TFLOPs |
| FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.0 TFLOPs |
| FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.50 TFLOPs |
| Texture Units | 240 | 192 | 224 | 224 | 224 | 320 | 320 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size @ Bandwidth | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 900 GB/s |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB |
| TDP | 235W | 250W | 250W | 250W | 300W | 250W | 300W |
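
The FP32 figures in the table follow closely from core count and boost clock. As a rough sketch, assuming each CUDA core retires one fused multiply-add (counted as two floating-point operations) per clock, peak throughput is 2 x cores x clock. The function name and the selection of cards below are illustrative; only cards whose listed figures match this estimate to within rounding are included.

```python
def peak_tflops(cores, boost_mhz, flops_per_core_per_clock=2):
    """Estimate peak throughput in TFLOPs.

    Assumes every core issues one fused multiply-add (2 FLOPs) per clock
    at the boost frequency, which is an upper bound rather than a
    measured rate.
    """
    return cores * boost_mhz * flops_per_core_per_clock / 1e6


# (CUDA cores, boost clock in MHz, FP32 TFLOPs as listed in the table)
CARDS = {
    "Tesla K40": (2880, 875, 5.04),
    "Tesla M40": (3072, 1114, 6.8),
    "Tesla P100 (SXM2)": (3584, 1480, 10.6),
    "Tesla V100 (PCI-Express)": (5120, 1370, 14.0),
    "Tesla V100 (SXM2)": (5120, 1455, 15.0),
}

if __name__ == "__main__":
    for name, (cores, boost_mhz, listed) in CARDS.items():
        estimate = peak_tflops(cores, boost_mhz)
        print(f"{name}: estimated {estimate:.2f} TFLOPs, listed {listed} TFLOPs")
```

The same formula applied to the FP64 core counts gives the FP64 row; for example, 2560 FP64 cores at 1455 MHz lands just under the 7.50 TFLOPs listed for the Tesla V100 (SXM2).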