NVIDIA Pascal GP100 GPU Benchmarks Unveiled – Tesla P100 Is The Fastest Graphics Card Ever Created For Hyperscale Computing

Jun 2, 2016

The first benchmark results of NVIDIA's GP100 GPU accelerator have been revealed (via Exxact Corp). Featured on the Tesla P100 graphics board, the GP100 GPU is aimed at hyperscale servers and high-performance computing (HPC) in general. The Tesla P100 is already shipping to NVIDIA's priority customers, which include supercomputing companies, and one of those organizations has decided to show us the first performance numbers of the GP100 GPU. (Source: PCGamesHardware via Videocardz)

At GTC 2016, NVIDIA announced the Tesla P100, their most advanced hyperscale GPU to date.

NVIDIA Tesla P100 Accelerator Benched in HPC Workloads – GP100 GPU First Tests Unveiled

The benchmarks we will be looking at come from a tool known as AMBER, which stands for Assisted Model Building with Energy Refinement. This tool was co-developed by Ross Walker of the San Diego Supercomputer Center and Scott Le Grand of Amazon Web Services. AMBER covers two things: a set of molecular mechanical force fields used to simulate biomolecules, and a package of molecular simulation programs that includes source code and demos.

“Amber” refers to two things: a set of molecular mechanical force fields for the simulation of biomolecules (which are in the public domain, and are used in a variety of simulation programs); and a package of molecular simulation programs which includes source code and demos. Amber is distributed in two parts: AmberTools16 and Amber16. – via Ambermd.org

All of these benchmarks are HPC simulations and have nothing to do with performance in gaming applications. They give us an overview of how well the GP100 GPU performs in such tasks against a range of other NVIDIA GPUs such as GP104, GM200 and GK110. The following configuration was used in the benchmark run:

Exxact AMBER Certified 2U GPU Workstation:

  • CPU = Dual 8-Core Intel Xeon E5-2650 v3 (2.3 GHz), 64 GB DDR4 RAM
  • (Note: the cheaper 6-core E5-2620 v3 and v4 CPUs would give the same performance for GPU runs.)
  • MPICH v3.1.4 – GNU v4.8.5 – CentOS 7.2
  • CUDA Toolkit NVCC v7.5 (8.0RC1 for GTX-1080 and P100)
  • NVIDIA Driver Linux 64 – 361.43
  • Precision Model = SPFP (GPU), Double Precision (CPU)

Now there are a few things to note before we look at the benchmarks. The GPU tests were conducted with the SPFP (single-precision floating point with fixed-point accumulation) precision model, meaning all GPUs relied on their single-precision throughput, while the CPU runs used full double precision (FP64). It should also be mentioned that these tests were conducted before either the Tesla P100 or the GTX 1080 had publicly launched. So the question arises: how did the AMBER team manage to get these cards before their announcement?
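The choice of precision model matters when comparing GPU and CPU runs. As a rough illustration (this is not AMBER code, just a toy example), the snippet below shows how a long running sum of small contributions drifts in single precision while double precision stays accurate:

```python
import numpy as np

# Toy illustration of why the precision model (SPFP on the GPU vs. FP64
# on the CPU) is worth noting. A running sum of one million 0.1-values
# should be exactly 100000; the FP32 accumulator drifts noticeably,
# while the FP64 accumulator does not.
a = np.full(1_000_000, 0.1)

sum64 = float(np.cumsum(a)[-1])                      # double-precision running sum
sum32 = float(np.cumsum(a.astype(np.float32))[-1])   # single-precision running sum

print(f"FP64 running sum: {sum64:.2f}")   # very close to 100000
print(f"FP32 running sum: {sum32:.2f}")   # visibly off from 100000
```

This is why production MD codes like AMBER use hybrid schemes such as SPFP: most arithmetic runs at fast single precision, while accumulation is done in a way that avoids this kind of drift.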

The AMBER team works closely with NVIDIA, and since the program was developed and written with NVIDIA's help to accelerate research simulations, the team got first-hand access to these cards. This means, however, that both cards are engineering samples and should not be compared to the final retail versions, whose performance should be better optimized. The NVIDIA Pascal GPUs also ran a pre-release version of CUDA 8.0.

At the time of writing, the GTX-1080 and P100 (DGX-1) cards had not been publicly released. The benchmarks here are from pre-release hardware, so they represent a lower bound on performance. It is hoped that, with access to released hardware, Pascal-specific optimization of AMBER 16 will be possible, resulting in improved performance. (Pascal hardware benchmarks made use of a pre-release version of CUDA 8.0.) So without further ado, let's take a look at the benchmarks:

NVIDIA Tesla P100 GP100 GPU Benchmarks:

In the benchmarks provided below, we can see that a single Tesla P100 delivers enough throughput to outperform a quad Titan X configuration. We also note that in some cases the GeForce GTX 1080 comes close to the GP100 GPU, which is expected: GP104 is a roughly 9 TFLOPs (FP32) chip, not far from the 10.6 TFLOPs output of the Tesla P100 accelerator. That changes when multiple boards are used. The Tesla P100 is the fastest without a doubt, and with a proper NVLINK implementation in the final models now shipping to customers, we can expect even bigger gains.
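The peak FP32 figures quoted here follow from a standard back-of-the-envelope formula: each CUDA core can retire one fused multiply-add (two FLOPs) per clock. A minimal sketch, using NVIDIA's published core counts and boost clocks:

```python
# Peak FP32 throughput = CUDA cores x boost clock x 2
# (one fused multiply-add = 2 FLOPs per core per clock).
def peak_fp32_tflops(cuda_cores: int, boost_clock_mhz: float) -> float:
    return cuda_cores * boost_clock_mhz * 1e6 * 2 / 1e12

p100_sxm2 = peak_fp32_tflops(3584, 1480)   # Tesla P100 (SXM2): ~10.6 TFLOPs
gtx_1080 = peak_fp32_tflops(2560, 1733)    # GeForce GTX 1080:  ~8.9 TFLOPs

print(f"Tesla P100 (SXM2): {p100_sxm2:.1f} TFLOPs")
print(f"GeForce GTX 1080:  {gtx_1080:.1f} TFLOPs")
```

The gap between the two peaks is under 20 percent, which is why a single GTX 1080 can sit close to the P100 in workloads that are bound by single-precision throughput alone.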

The NVIDIA DGX-1 is a supercomputing rack capable of delivering up to 170 TFLOPs of compute performance.

The NVIDIA DGX-1 system uses up to eight Tesla P100 boards and costs $129,000 US. The system includes the following specifications:

  • Up to 170 teraflops of half-precision (FP16) peak performance
  • Eight Tesla P100 GPU accelerators, 16GB memory per GPU
  • NVLink Hybrid Cube Mesh
  • 7TB SSD DL Cache
  • Dual 10GbE, Quad InfiniBand 100Gb networking
  • 3U – 3200W
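The 170 TFLOPs half-precision figure is simply eight Tesla P100 (SXM2) boards at their published FP16 peak. A quick sanity check:

```python
# DGX-1 half-precision peak: 8 x Tesla P100 (SXM2) boards.
P100_FP16_TFLOPS = 21.2   # NVIDIA's published per-GPU FP16 peak
NUM_GPUS = 8

dgx1_peak = NUM_GPUS * P100_FP16_TFLOPS
print(f"DGX-1 FP16 peak: {dgx1_peak:.0f} TFLOPs")
```

Eight boards at 21.2 TFLOPs each comes to 169.6 TFLOPs, which NVIDIA rounds to the headline "up to 170 teraflops".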

NVIDIA Pascal GP100 With Tesla P100 Graphics Board Benchmarks (Image Credits: Ambermd)

The following tests are too small to scale effectively across multiple modern GPUs, and since we are looking at pre-release hardware, NVLINK isn't tuned yet to make full use of the Tesla P100 hardware (up to 4 boards in the benchmarks provided below).

NVIDIA Pascal GP100 With Tesla P100 Graphics Board Benchmarks (Image Credits: Ambermd)

For those expecting gaming benchmarks, we already made it clear that these results have nothing to do with general application performance. These workloads are specific to the HPC sector, and that's what the GP100 GPU has been designed to handle. We have heard rumors that NVIDIA is preparing a more cost-effective 16nm FinFET based GP102 GPU which might launch later this year as a flagship Titan product with similar specs to the Tesla P100. We don't have any confirmation, but we will update you as more news comes our way.

NVIDIA Tesla Accelerator Specifications (Kepler to Volta):

| NVIDIA Tesla Graphics Card | Tesla K40 | Tesla M40 | Tesla P100 (PCI-E, 12 GB) | Tesla P100 (PCI-E, 16 GB) | Tesla P100 (SXM2) | Tesla V100 (PCI-Express) | Tesla V100 (SXM2) |
|---|---|---|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) |
| Process Node | 28nm | 28nm | 16nm | 16nm | 16nm | 12nm | 12nm |
| Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion |
| GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 610 mm2 | 815 mm2 | 815 mm2 |
| SMs | 15 | 24 | 56 | 56 | 56 | 80 | 80 |
| TPCs | 15 | 24 | 28 | 28 | 28 | 40 | 40 |
| CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64 |
| CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 3584 | 5120 | 5120 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 1792 | 2560 | 2560 |
| Base Clock | 745 MHz | 948 MHz | TBD | TBD | 1328 MHz | TBD | 1370 MHz |
| Boost Clock | 875 MHz | 1114 MHz | 1300 MHz | 1300 MHz | 1480 MHz | 1370 MHz | 1455 MHz |
| FP16 Compute | N/A | N/A | 18.7 TFLOPs | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.0 TFLOPs |
| FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.0 TFLOPs |
| FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.50 TFLOPs |
| Texture Units | 240 | 192 | 224 | 224 | 224 | 320 | 320 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 900 GB/s |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB |
| TDP | 235W | 250W | 250W | 250W | 300W | 250W | 300W |