NVIDIA Pascal GP100 GPU Benchmarks Unveiled – Tesla P100 Is The Fastest Graphics Card Ever Created For Hyperscale Computing

Jun 2, 2016

The first benchmark results of the NVIDIA GP100 GPU accelerator have been revealed. Featured on the Tesla P100 graphics board, the GP100 GPU is aimed at hyperscale servers and high-performance computing (HPC) in general. The Tesla P100 is already shipping to NVIDIA’s priority customers, which include supercomputing organizations, and one of them has decided to show us the first performance numbers of the GP100 GPU. (Source: PCGamesHardware via Videocardz)

At GTC 2016, NVIDIA announced the Tesla P100, their most advanced hyperscale GPU to date.

NVIDIA Tesla P100 Accelerator Benched in HPC Workloads – GP100 GPU First Tests Unveiled

The benchmarks we will be looking at come from a tool known as AMBER, which stands for Assisted Model Building with Energy Refinement. This tool was co-developed by Ross Walker from San Diego Supercomputer Center and Scott Le Grand from Amazon Web Services. The name Amber covers two things: a set of molecular mechanical force fields used to simulate biomolecules, and a package of molecular simulation programs that includes source code and demos.

“Amber” refers to two things: a set of molecular mechanical force fields for the simulation of biomolecules (which are in the public domain, and are used in a variety of simulation programs); and a package of molecular simulation programs which includes source code and demos. Amber is distributed in two parts: AmberTools16 and Amber16. via Ambermd.org

All of these benchmarks are HPC simulations and have nothing to do with general performance in gaming applications. They give us an overview of how well the GP100 GPU performs in such tasks against a range of other NVIDIA GPUs such as GP104, GM200 and GK110. The following configuration was used in the benchmark run:

Exxact AMBER Certified 2U GPU Workstation:

  • CPU = Dual x 8 Core Intel E5-2650v3 (2.3GHz), 64 GB DDR4 Ram
  • (note the cheaper 6 Core E5-2620v3 and v4 CPUs would also give the same performance for GPU runs)
  • MPICH v3.1.4 – GNU v4.8.5 – Centos 7.2
  • CUDA Toolkit NVCC v7.5 (8.0RC1 for GTX-1080 and P100)
  • NVIDIA Driver Linux 64 – 361.43
  • Precision Model = SPFP (GPU), Double Precision (CPU)

There are a few things to note before we look at the benchmarks. The tests were conducted with the SPFP (GPU) precision model. This means that all GPUs relied on their single-precision throughput in these benchmarks, while the CPUs were run in double precision (FP64). It should also be mentioned that these tests were conducted at a time when neither the Tesla P100 nor the GTX 1080 had been publicly launched. So the question arises: how did Amber manage to get these cards before launch?


Amber works closely with NVIDIA, and since the program was developed and written with NVIDIA’s help to accelerate research simulations, Amber got early access to these cards. This means, however, that both cards were pre-release hardware and should not be compared to the final retail versions, whose performance should be better optimized. The NVIDIA Pascal GPUs also ran a pre-release version of CUDA 8.0.

At the time of writing GTX-1080 and P100 (DGX-1) cards had not been publicly released. The benchmarks here are from pre-release hardware. As such they represent a bottom end to the performance. It is hoped that with access to released hardware that optimization of AMBER 16 specific to the Pascal architecture will be possible resulting in improved performance. (Pascal hardware benchmarks made use of a pre-release version of CUDA 8.0). So without further ado, let’s take a look at the benchmarks:

NVIDIA Tesla P100 GP100 GPU Benchmarks:

In the benchmarks provided below, we can see that a single Tesla P100 delivers enough throughput to outperform a quad Titan X configuration. We also note that in some cases the GeForce GTX 1080 is about as fast as the GP100 GPU, because GP104 is a 9.3 TFLOPs graphics chip, which is close to the 10.6 TFLOPs single-precision output of the Tesla P100 accelerator. That changes when multiple boards are used. The Tesla P100 is the fastest without a doubt, and with a proper implementation of NVLINK in the final models, which are now shipping to customers, we could see even bigger gains.

The NVIDIA DGX-1 is a supercomputing rack capable of delivering up to 170 TFLOPs of compute performance.

The NVIDIA DGX-1 system uses up to 8 Tesla P100 boards and costs $129,000 US. The system includes the following specifications:

  • Up to 170 teraflops of half-precision (FP16) peak performance
  • Eight Tesla P100 GPU accelerators, 16GB memory per GPU
  • NVLink Hybrid Cube Mesh
  • 7TB SSD DL Cache
  • Dual 10GbE, Quad InfiniBand 100Gb networking
  • 3U – 3200W
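As a rough sanity check on the 170 TFLOPs figure, the sketch below multiplies out the peak throughput of eight Tesla P100 boards. It assumes the usual peak formula (cores × 2 operations per clock for a fused multiply-add × clock speed), takes the core count and boost clock of the mezzanine P100 from the spec table at the end of this article, and assumes GP100 runs FP16 at twice its FP32 rate. The function and variable names are illustrative only.

```python
# Rough peak-throughput estimate for the DGX-1 (a sketch, not an official formula).
# Assumptions: 2 floating-point operations per core per clock (fused multiply-add),
# and FP16 running at twice the FP32 rate on GP100.

def peak_tflops(cores: int, clock_mhz: float, ops_per_clock: int = 2) -> float:
    """Theoretical peak in TFLOPs for a given core count and clock."""
    return cores * ops_per_clock * clock_mhz * 1e6 / 1e12

P100_CUDA_CORES = 3584   # CUDA Cores (Total) from the spec table
P100_BOOST_MHZ = 1480    # boost clock of the mezzanine board
GPUS_IN_DGX1 = 8         # eight Tesla P100 accelerators per system

fp32_per_gpu = peak_tflops(P100_CUDA_CORES, P100_BOOST_MHZ)
fp16_per_gpu = 2 * fp32_per_gpu          # assumed 2x FP16 rate
dgx1_fp16 = GPUS_IN_DGX1 * fp16_per_gpu

print(f"FP32 per P100: {fp32_per_gpu:.1f} TFLOPs")
print(f"FP16 per P100: {fp16_per_gpu:.1f} TFLOPs")
print(f"DGX-1 FP16:    {dgx1_fp16:.1f} TFLOPs")
```

Under these assumptions the per-board FP32 figure comes out at about 10.6 TFLOPs and the eight-board FP16 total at just under 170 TFLOPs, consistent with the quoted numbers.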

NVIDIA Pascal GP100 With Tesla P100 Graphics Board Benchmarks (Image Credits: Ambermd)

The following tests are too small to scale effectively across multiple modern GPUs, and since we are looking at pre-release hardware, NVLINK isn’t yet fine-tuned to make use of all the Tesla P100 boards (up to four in the benchmarks provided below).

NVIDIA Pascal GP100 With Tesla P100 Graphics Board Benchmarks (Image Credits: Ambermd)

For those expecting gaming benchmarks, we have already made it clear that these results have nothing to do with general application performance. These workloads are specific to the HPC sector, and that’s what the GP100 GPU has been designed to handle. We have heard rumors that NVIDIA is preparing a more cost-effective 16nm FinFET-based GP102 GPU, which might launch later this year as a flagship Titan product with specs similar to those of the Tesla P100. We don’t have any confirmation, but we will update you as more news comes our way.

NVIDIA Pascal Tesla P100 Specs:

NVIDIA Tesla Graphics Card | Tesla K40 | Tesla M40 | Tesla P100 | Tesla P100 | Tesla P100 (Mezzanine)
GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GP100 (Pascal)
Process Node | 28nm | 28nm | 16nm | 16nm | 16nm
Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 15.3 Billion
GPU Die Size | 551 mm² | 601 mm² | 610 mm² | 610 mm² | 610 mm²
CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64
CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 3584
FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32
FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 1792
Base Clock | 745 MHz | 948 MHz | TBD | TBD | 1328 MHz
Boost Clock | 875 MHz | 1114 MHz | 1300 MHz | 1300 MHz | 1480 MHz
FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs
Texture Units | 240 | 192 | 224 | 224 | 224
Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2
Memory Size | 12 GB GDDR5 | 24 GB GDDR5 | 12 GB HBM2 | 16 GB HBM2 | 16 GB HBM2
L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 4096 KB
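
The FP64 figures in the table can be roughly reproduced from the FP64 core counts and boost clocks listed alongside them. The sketch below assumes the standard peak formula of two operations per core per clock (one fused multiply-add). The dictionary layout and names are illustrative only.

```python
# Reproduce the table's FP64 compute row from FP64 core counts and boost clocks.
# Assumption: peak = FP64 cores x 2 ops per clock (fused multiply-add) x boost clock.

CARDS = {
    # name: (FP64 CUDA cores per GPU, boost clock in MHz), taken from the table
    "Tesla K40": (960, 875),
    "Tesla M40": (96, 1114),
    "Tesla P100": (1792, 1300),
    "Tesla P100 (Mezzanine)": (1792, 1480),
}

def fp64_tflops(fp64_cores: int, boost_mhz: int) -> float:
    """Theoretical peak FP64 throughput in TFLOPs."""
    # cores * 2 ops * MHz gives mega-operations per second; divide by 1e6 for tera.
    return fp64_cores * 2 * boost_mhz / 1e6

estimates = {name: fp64_tflops(*spec) for name, spec in CARDS.items()}

for name, tflops in estimates.items():
    print(f"{name:24s} {tflops:.2f} TFLOPs")
```

The estimates land at roughly 1.68, 0.21, 4.66 and 5.30 TFLOPs, in line with the rounded values in the table.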