NVIDIA Pascal GP100 GPU Benchmarks Unveiled – Tesla P100 Is The Fastest Graphics Card Ever Created For Hyperscale Computing

The first benchmark results of the NVIDIA GP100 GPU accelerator have been revealed (via Exxact Corp). Featured on the Tesla P100 graphics board, the GP100 GPU is aimed at hyperscale servers and high-performance computing (HPC) in general. The Tesla P100 is already shipping to NVIDIA's priority customers, which include supercomputing companies, and one of these organizations has decided to show us the first performance numbers of the GP100 GPU. (Source: PCGamesHardware via Videocardz)

At GTC 2016, NVIDIA announced the Tesla P100, their most advanced hyperscale GPU to date.

NVIDIA Tesla P100 Accelerator Benched in HPC Workloads - GP100 GPU First Tests Unveiled

The benchmarks we will be looking at come from a tool known as AMBER, which stands for Assisted Model Building with Energy Refinement. The tool was co-developed by Ross Walker of the San Diego Supercomputer Center and Scott Le Grand of Amazon Web Services. The name Amber covers two things: a set of force fields used to simulate biomolecules, and a package of molecular simulation programs that includes source code and demos.

"Amber" refers to two things: a set of molecular mechanical force fields for the simulation of biomolecules (which are in the public domain, and are used in a variety of simulation programs); and a package of molecular simulation programs which includes source code and demos. Amber is distributed in two parts: AmberTools16 and Amber16. via Ambermd.org

All of these benchmarks are HPC simulations and have nothing to do with general performance in gaming applications. They give us an overview of how well the GP100 GPU performs in such tasks against a range of other NVIDIA GPUs such as GP104, GM200 and GK110. The following configuration was used in the benchmark run:

Exxact AMBER Certified 2U GPU Workstation:

  • CPU = Dual x 8 Core Intel E5-2650v3 (2.3GHz), 64 GB DDR4 RAM
  • (note the cheaper 6 Core E5-2620v3 and v4 CPUs would also give the same performance for GPU runs)
  • MPICH v3.1.4 - GNU v4.8.5 - CentOS 7.2
  • CUDA Toolkit NVCC v7.5 (8.0RC1 for GTX-1080 and P100)
  • NVIDIA Driver Linux 64 - 361.43
  • Precision Model = SPFP (GPU), Double Precision (CPU)
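To give a feel for what the precision model line means in practice, here is a minimal Python sketch (not AMBER code) that adds up the same values twice: once with every intermediate result rounded to single precision, and once in double precision. The helper names to_f32, sum_single and sum_double are made up for this illustration, and the input values are arbitrary.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_single(values):
    """Accumulate with every intermediate result rounded to single precision."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return total


def sum_double(values):
    """Accumulate in ordinary double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


N = 200_000
values = [0.1] * N      # arbitrary test data
exact = N / 10          # mathematically exact sum

single_total = sum_single(values)
double_total = sum_double(values)

err_single = abs(single_total - exact)
err_double = abs(double_total - exact)

print(f"single precision sum: {single_total!r} (error {err_single:.3g})")
print(f"double precision sum: {double_total!r} (error {err_double:.3g})")
```

The single precision total drifts visibly from the exact answer while the double precision total stays very close to it. That is the trade-off behind running GPUs in a single precision mode: more throughput, with some care needed over how results are accumulated.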

There are a few things to note before we look at the benchmarks. The tests were conducted using the SPFP (GPU) precision model. This means that all GPUs used their single-precision throughput for these benchmarks, while the CPUs were run in the double-precision (FP64) model. It should also be mentioned that these tests were conducted at a time when neither the Tesla P100 nor the GTX 1080 had been publicly launched. So the question arises: how did Amber manage to get these cards before their announcement?

Amber works very closely with NVIDIA, and since the program was developed and written with NVIDIA's help to accelerate research-based simulations, Amber managed to get first-hand access to these cards. This, however, means that both cards are in the engineering phase and should not be compared to the final retail versions, whose performance should be better optimized. The NVIDIA Pascal GPUs also ran a pre-release version of CUDA 8.0.

At the time of writing GTX-1080 and P100 (DGX-1) cards had not been publicly released. The benchmarks here are from pre-release hardware. As such they represent a bottom end to the performance. It is hoped that with access to released hardware that optimization of AMBER 16 specific to the Pascal architecture will be possible resulting in improved performance. (Pascal hardware benchmarks made use of a pre-release version of CUDA 8.0). So without further ado, let's take a look at the benchmarks:

NVIDIA Tesla P100 GP100 GPU Benchmarks:

In the benchmarks provided below, we can see that a single Tesla P100 delivers enough throughput to outperform a quad Titan X configuration. We also note that in some cases, the GeForce GTX 1080 is about as fast as the GP100 GPU. This is because GP104 is also a roughly 9 TFLOPs graphics chip, which is close to the 10 to 10.6 TFLOPs output of the Tesla P100 accelerator. That changes when multiple boards are used. The Tesla P100 is the fastest without a doubt, and with proper implementation of NVLINK in the final models which are now shipping to customers, we could see even bigger gains.

The NVIDIA DGX-1 is a supercomputing rack capable of delivering up to 170 TFLOPs of compute performance.

The NVIDIA DGX-1 system uses up to 8 Tesla P100 boards and costs $129,000 US. The system includes the following specifications:

  • Up to 170 teraflops of half-precision (FP16) peak performance
  • Eight Tesla P100 GPU accelerators, 16GB memory per GPU
  • NVLink Hybrid Cube Mesh
  • 7TB SSD DL Cache
  • Dual 10GbE, Quad InfiniBand 100Gb networking
  • 3U – 3200W
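As a quick sanity check on the headline number, the 170 TFLOPs figure lines up with eight boards running at the FP16 rate quoted for the SXM2 version of the Tesla P100 (21.2 TFLOPs). The sketch below only does that arithmetic. The function name is made up for the illustration, and real workloads will not reach peak rates.

```python
P100_SXM2_FP16_TFLOPS = 21.2   # per-board peak FP16 figure quoted for the SXM2 card
GPUS_PER_DGX1 = 8              # boards in one DGX-1
DGX1_PRICE_USD = 129_000       # list price quoted in the article


def aggregate_peak_tflops(per_gpu_tflops: float, gpu_count: int) -> float:
    """Theoretical aggregate peak: per-board peak times number of boards."""
    return per_gpu_tflops * gpu_count


total_fp16 = aggregate_peak_tflops(P100_SXM2_FP16_TFLOPS, GPUS_PER_DGX1)
usd_per_tflop = DGX1_PRICE_USD / total_fp16

print(f"Aggregate peak FP16: {total_fp16:.1f} TFLOPs")
print(f"Price per peak FP16 TFLOP: ${usd_per_tflop:,.0f}")
```

The total comes to 169.6 TFLOPs, which NVIDIA rounds up to 170.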

NVIDIA Pascal GP100 With Tesla P100 Graphics Board Benchmarks (Image Credits: Ambermd)

The following tests are too small to scale effectively across multiple modern GPUs, and since we are looking at pre-release hardware, NVLINK isn't yet fine-tuned to make use of all the Tesla P100 hardware (up to four boards in the benchmarks provided below).



For those expecting gaming benchmarks, we have already made it clear that these results have nothing to do with general application performance. These workloads are specific to the HPC sector, and that's what the GP100 GPU has been designed to handle. We have heard rumors that NVIDIA is preparing a more cost-effective 16nm FinFET-based GP102 GPU which might launch later this year as a flagship Titan product with specs similar to the Tesla P100. We don't have any confirmation, but we will update you as more news comes our way.

NVIDIA Volta Tesla V100S Specs:

NVIDIA Tesla Graphics Card | Tesla K40 | Tesla M40 | Tesla P100 | Tesla P100 (SXM2) | Tesla V100 (PCI-Express) | Tesla V100 (SXM2) | Tesla V100S (PCIe)
GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) | GV100 (Volta)
Process Node | 28nm | 28nm | 16nm | 16nm | 12nm | 12nm | 12nm
Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion | 21.1 Billion
GPU Die Size | 551 mm² | 601 mm² | 610 mm² | 610 mm² | 815 mm² | 815 mm² | 815 mm²
CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64
CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 5120 | 5120 | 5120
Texture Units | 240 | 192 | 224 | 224 | 320 | 320 | 320
FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32
FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 2560 | 2560 | 2560
Base Clock | 745 MHz | 948 MHz | 1190 MHz | 1328 MHz | 1230 MHz | 1297 MHz | TBD
Boost Clock | 875 MHz | 1114 MHz | 1329 MHz | 1480 MHz | 1380 MHz | 1530 MHz | 1601 MHz
FP16 Compute | N/A | N/A | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.4 TFLOPs | 32.8 TFLOPs
FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.7 TFLOPs | 16.4 TFLOPs
FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.80 TFLOPs | 8.2 TFLOPs
Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2
Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 16 GB HBM2 @ 732 GB/s or 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 1134 GB/s
L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB | 6144 KB
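
The FP32 figures in the spec table can be roughly reproduced from the core counts and boost clocks. The Python sketch below assumes the usual rule of thumb that each CUDA core retires one fused multiply-add (2 FLOPs) per clock. That rule is an assumption of this example rather than something stated in the table, and the function and variable names are made up for illustration.

```python
FLOPS_PER_CORE_PER_CLOCK = 2  # assumption: one fused multiply-add per core per cycle


def peak_fp32_tflops(cuda_cores: int, boost_mhz: float) -> float:
    """Theoretical peak FP32 throughput in TFLOPs."""
    return cuda_cores * FLOPS_PER_CORE_PER_CLOCK * boost_mhz * 1e6 / 1e12


# (CUDA cores, boost clock in MHz, FP32 TFLOPs as listed in the table)
CARDS = {
    "Tesla K40": (2880, 875, 5.04),
    "Tesla M40": (3072, 1114, 6.8),
    "Tesla P100": (3584, 1329, 10.0),
    "Tesla P100 (SXM2)": (3584, 1480, 10.6),
    "Tesla V100 (PCI-Express)": (5120, 1380, 14.0),
    "Tesla V100 (SXM2)": (5120, 1530, 15.7),
    "Tesla V100S (PCIe)": (5120, 1601, 16.4),
}

for name, (cores, boost_mhz, listed) in CARDS.items():
    computed = peak_fp32_tflops(cores, boost_mhz)
    print(f"{name:26s} computed {computed:5.2f} TFLOPs, listed {listed:5.2f} TFLOPs")
```

For most columns the computed value lands within about 0.15 TFLOPs of the listed figure. The plain Tesla P100 column is the exception: 3584 cores at the listed 1329 MHz works out to roughly 9.5 TFLOPs rather than 10.0, so the clock or the throughput figure for that column may be rounded or taken from a different source.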