NVIDIA Volta GV100 GPU Chip For Summit Supercomputer Twice as Fast as Pascal P100 – Speculated To Hit 9.5 TFLOPs FP64 Compute

Dec 20, 2016

NVIDIA Volta is being prepped for launch in the next generation supercomputers known as Summit and Sierra. Little is known about Volta GPU specifications, but an analysis done by The Next Platform of the Summit supercomputer's details reveals that it could be an insanely fast chip capable of delivering multi-TFLOPs compute power in the HPC market.

NVIDIA Volta GV100 GPUs – The Heart of the Summit and Sierra Supercomputers, Multi-TFLOPs Chip With Fastest HBM2 Configuration

When NVIDIA announced their Pascal GP100 GPU at GTC 2016, they called it the largest chip endeavor in the history of humanity. With an R&D budget of several billion dollars, Pascal GP100 was indeed the great chip of 2016, aimed at powering the HPC and datacenter market with performance never before seen in the graphics industry. NVIDIA also uses Pascal GP100 GPUs inside their own DGX SaturnV supercomputer, which is designed to help them build smarter cards and next generation GPUs (GPUs Designing GPUs).

Just a year after their successful Pascal launch in the HPC market, NVIDIA plans to introduce their next grand chip for the HPC market, codenamed Volta. Details of the chip first emerged back at GTC 2015, where NVIDIA showcased the estimated performance they predicted for their upcoming chips. Do note that Pascal had not launched at that time. According to the slides presented that day, Volta would have twice of everything that Pascal has: double the memory capacity, double the compute, higher efficiency and faster bandwidth.

We aren’t sure how much of that will end up being true, but what NVIDIA estimated for Pascal was close to the final product (if not entirely the same). The only thing Pascal currently lacks is the promised 32 GB capacity, and that comes down to HBM production, which has already ramped up. We can expect a full GP100 configuration with 32 GB since the chip design allows for it. In short, the VRAM limitation is due to production, not the chip design.

Summit Supercomputer Latest Details Provide First Glimpse of NVIDIA Volta GV100 GPU Specs

The latest details for the Summit Supercomputer have been confirmed and they are incredible from an HPC perspective. Summit offers a 5-10x improvement in application performance over the Titan supercomputer, which featured the Kepler GK110 GPU architecture. Titan comprised 18,688 nodes rated at 1.4 TF (per node). Summit features around 4,600 nodes with a rated compute output of over 40 TF (per node).

Specifications comparison of Titan and Summit Supercomputer. (Image Credits: The Next Platform)

Each Summit node has 512 GB of DDR4 plus additional HBM2 memory. Titan in comparison had just 32 GB of DDR3 and 6 GB of GDDR5 (on the GPU) per node, or 38 GB in total. There’s also a total of 800 GB of non-volatile (NV) memory per node. In total, the memory on the Titan supercomputer was 710 TB, while Summit peaks at over 6 Petabytes of memory (DDR4 + HBM2 + non-volatile combined).

The Power9 chips will have 48 lanes of PCI-Express 4.0 peripheral I/O per socket, for an aggregate of 192 GB/sec of duplex bandwidth, as well as 48 lanes of 25 Gb/sec “Bluelink” connectivity, with an aggregate bandwidth of 300 GB/sec for linking various kinds of accelerators. These Bluelink ports are used to run the NVLink 2.0 protocol that will be supported on the Volta GPUs from Nvidia, and which have about 56 percent more bandwidth than the PCI-Express ports. IBM could support a lot of the SMX2-style, on-motherboard Tesla cards in a system, given all of these Bluelink ports, but remember it needs to allow the Volta accelerators to link to each other over NVLink so they can share memory as well as using NVLink to share memory back with the two Power9 chips. via The Next Platform
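The bandwidth figures in the quote can be reproduced with some quick arithmetic. The sketch below is only a sanity check: the lane count and the 25 Gb/sec Bluelink rate come from the quote, while the roughly 2 GB/sec per lane per direction for PCI-Express 4.0 is our own assumption, and encoding overhead is ignored.

```python
# Lane count and Bluelink rate are taken from the quoted passage.
LANES = 48
BLUELINK_GBITS_PER_LANE = 25.0

# Assumption (not from the source): a PCIe 4.0 lane moves about
# 2 GB/s in each direction, ignoring encoding overhead.
PCIE4_GBYTES_PER_LANE_PER_DIR = 2.0


def duplex_gbytes_per_sec(lanes, gbytes_per_lane_per_dir):
    """Aggregate bandwidth counting both directions of every lane."""
    return lanes * gbytes_per_lane_per_dir * 2


pcie_total = duplex_gbytes_per_sec(LANES, PCIE4_GBYTES_PER_LANE_PER_DIR)
# Convert gigabits to gigabytes (8 bits per byte) before aggregating.
bluelink_total = duplex_gbytes_per_sec(LANES, BLUELINK_GBITS_PER_LANE / 8)
advantage_pct = (bluelink_total / pcie_total - 1) * 100

print(f"PCIe 4.0: {pcie_total:.0f} GB/s duplex")
print(f"Bluelink: {bluelink_total:.0f} GB/s duplex")
print(f"Bluelink advantage: {advantage_pct:.2f}%")
```

Under these assumptions the totals match the quoted 192 GB/sec and 300 GB/sec, and the gap works out to the "about 56 percent" mentioned by The Next Platform.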

Each node will house 2 IBM Power9 CPUs and 6 NVIDIA Volta V100 GPUs. NVIDIA’s NVLink 2.0 interconnect will connect the GPUs to each other and to the Power9 CPUs. The system would consume 13 MW of peak power, which is just 4 MW more than the Titan supercomputer (9 MW), for a 5-10x improvement in application performance.
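To put the node counts and power figures in perspective, here is a rough back-of-the-envelope estimate. It uses only the rounded numbers quoted above ("around 4,600 nodes", "over 40 TF"), so the Summit results should be read as lower bounds rather than official ratings; the function and variable names are ours.

```python
def system_petaflops(nodes, teraflops_per_node):
    """Peak system throughput in PF from node count and per-node TF."""
    return nodes * teraflops_per_node / 1000.0


# Rounded figures quoted in the article.
titan_pf = system_petaflops(18_688, 1.4)
summit_pf = system_petaflops(4_600, 40.0)   # lower bound: "over 40 TF" per node

TITAN_PEAK_MW = 9.0
SUMMIT_PEAK_MW = 13.0

speedup = summit_pf / titan_pf
titan_pf_per_mw = titan_pf / TITAN_PEAK_MW
summit_pf_per_mw = summit_pf / SUMMIT_PEAK_MW
efficiency_gain = summit_pf_per_mw / titan_pf_per_mw

print(f"Titan:  {titan_pf:.1f} PF, {titan_pf_per_mw:.2f} PF/MW")
print(f"Summit: {summit_pf:.1f} PF, {summit_pf_per_mw:.2f} PF/MW")
print(f"Speedup: {speedup:.1f}x, efficiency gain: {efficiency_gain:.1f}x")
```

With these inputs Summit lands at roughly 7x Titan's peak throughput and a little under 5x its throughput per megawatt, and both would grow if the final nodes exceed 40 TF.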

NVIDIA Volta Tesla V100 – The Next-Generation Compute Powerhouse

NVIDIA previously stated through their roadmaps that Volta GV100 GPUs will deliver SGEMM (Single precision floating General Matrix Multiply) of 72 GFLOPs/Watt compared to 42 GFLOPs/Watt on Pascal GP100. Using that ratio, a Volta GV100 based GPU with a TDP of 300W can theoretically deliver 9.5 TFLOPs of double precision performance, almost twice that of the current generation GP100 GPU. NVIDIA’s Tesla P100 cards also ship at 300W, but since the Summit nodes are expected to feature around 40 TFLOPs of compute performance, it is possible that NVIDIA may use variants with a lower configured TDP for the Summit supercomputer.

Six Volta V100 GPUs with a rated 300W TDP would go beyond the 40 TF node figure, delivering around 57.2 TFLOPs, which isn’t what the Summit specs sheet claims. A geared down version that runs with a TDP around 200W would manage 20-25% lower performance and deliver 7.6 TFLOPs at 38.2 GFLOPs/Watt, which aligns better with the Summit node specs.

Six of these Volta Tesla V100 GPUs would deliver around 45 TFLOPs of compute, which sounds more plausible. There’s a possibility that the final double precision compute of Volta V100 may end up near 8-9 TFLOPs, which would be an impressive feat for the graphics manufacturer.
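The node-level reasoning above is simple multiplication, sketched below for the two speculated configurations. The per-GPU figures (9.5 TFLOPs at 300W, 7.6 TFLOPs at 200W) are the estimates from this article, not confirmed specifications, and because they are rounded inputs the results differ slightly from the 57.2 TFLOPs and 38.2 GFLOPs/Watt quoted in the text.

```python
GPUS_PER_NODE = 6
NODE_TARGET_TFLOPS = 40.0   # approximate per-node rating quoted for Summit


def node_estimate(fp64_tflops_per_gpu, tdp_watts):
    """Return (node TFLOPs, per-GPU GFLOPs/Watt) for one GPU configuration."""
    node_tflops = GPUS_PER_NODE * fp64_tflops_per_gpu
    gflops_per_watt = fp64_tflops_per_gpu * 1000.0 / tdp_watts
    return node_tflops, gflops_per_watt


# Speculated configurations (estimates, not official specs).
full_node, full_eff = node_estimate(9.5, 300)
reduced_node, reduced_eff = node_estimate(7.6, 200)
reduction_pct = (1 - 7.6 / 9.5) * 100

for label, node_tf, eff in [("300W", full_node, full_eff),
                            ("200W", reduced_node, reduced_eff)]:
    over = node_tf - NODE_TARGET_TFLOPS
    print(f"{label}: {node_tf:.1f} TF per node "
          f"({over:+.1f} TF vs target), {eff:.1f} GFLOPs/W")
print(f"Per-GPU performance reduction: {reduction_pct:.0f}%")
```

The 200W case comes out much closer to the roughly 40 TF node rating than the 300W case, which is the basis for expecting TDP-limited parts in Summit.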

Summit Supercomputer Specifications:

| Supercomputer | Titan | Summit |
| --- | --- | --- |
| Number of Nodes | 18,688 | 4,608 |
| Processors (Per Node) | 1 Opteron, 1 Kepler K20X | 2 IBM Power9, 6 NVIDIA Tesla V100 |
| GPUs | 18,688 NVIDIA Tesla K20X | 27,648 NVIDIA Tesla V100 |
| CPUs | 18,688 Opteron CPUs | 9,216 Power9 CPUs |
| Node Performance | 1.44 TF | 49 TF |
| Peak Performance | 27 PF | 200 PF |
| Peak OPs (Tensor) | N/A | 3.3 ExaOps |
| Memory Per Node | 32 GB DDR3 + 6 GB GDDR5 | 512 GB DDR4 + HBM2 (16/32 GB) + NVDIMM |
| NV Memory Per Node | 0 | 800 GB (Flash based) |
| Total System Memory | 710 TB | 10 PB |
| System Interconnect | Gemini (6.4 GB/s), PCIe 8 GB/s | Dual Rail EDR-IB (23 GB/s) / Dual Rail HDR-IB (48 GB/s) |
| Interconnect Topology | 3D Torus | Non-Blocking Fat Tree |
| File System | 32 PB, 1 TB/s, Lustre | 250 PB, 2.5 TB/s, GPFS |
| Peak Power Input | 9 MW | 13 MW |
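Several rows of the specifications table can be cross-checked against each other. The sketch below derives system totals from the per-node rows. It assumes Titan's combined DDR3 and GDDR5 memory is 38 GB per node (the value the 710 TB total implies) and uses decimal units (1 TB = 1000 GB); variable names are ours.

```python
TITAN_NODES = 18_688
SUMMIT_NODES = 4_608

# Component counts from the per-node processor configuration.
summit_gpus = SUMMIT_NODES * 6
summit_cpus = SUMMIT_NODES * 2

# Assumption: 38 GB per Titan node counts DDR3 and GDDR5 together.
TITAN_MEM_PER_NODE_GB = 38
titan_mem_tb = TITAN_NODES * TITAN_MEM_PER_NODE_GB / 1000.0

# Peak throughput implied by the "Node Performance" row.
titan_peak_pf = TITAN_NODES * 1.44 / 1000.0
summit_peak_pf = SUMMIT_NODES * 49 / 1000.0

print(f"Summit GPUs: {summit_gpus}, CPUs: {summit_cpus}")
print(f"Titan memory: {titan_mem_tb:.0f} TB")
print(f"Titan peak: {titan_peak_pf:.1f} PF, Summit peak: {summit_peak_pf:.1f} PF")
```

The GPU and CPU counts, Titan's 710 TB of memory and its 27 PF peak all fall out of the per-node rows. The Summit product comes to about 226 PF, which is higher than the 200 PF listed, so the Summit performance rows are best read as approximate.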

Furthermore, Volta GV100 may match or exceed the promised 32 GB HBM2 capacity of Pascal GPUs and have bandwidth tuned around 1 TB/s. NVIDIA slides from GTC 2015 claim bandwidths of ~900 GB/s, while Pascal currently operates at 732 GB/s.

The Looming Memory Crisis With HBM2

While further explaining next generation GPU architectures and efficiency, Stephen W. Keckler (Senior Director of GPU Architecture) pointed out that HBM is a great memory architecture which will be implemented across Pascal and Volta chips, with bandwidth topping out at 1.2 TB/s (Volta GPU). Moving forward, there is a looming memory power crisis. HBM2 at 1.2 TB/s sure is great, but it adds 60W to the power envelope of a standard GPU.

The current implementation of HBM1 on Fiji chips adds around 25W to the chip. Moving onwards, chips with in excess of 2 TB/s of bandwidth will push overall chip power from bad to breaking point. A chip with 2.5 TB/s HBM (2nd generation) memory will reach 120W for the memory architecture alone, and a 1.5 times more efficient HBM2 architecture that outputs over 3 TB/s of bandwidth will need 160W to feed the memory alone.
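One way to read these numbers is as power per unit of bandwidth. The sketch below converts the three HBM2 data points quoted above into Watts per TB/s and an implied energy per bit. It assumes 1 TB/s = 10^12 bytes per second and treats "over 3 TB/s" as exactly 3 TB/s, so the last row is an upper bound; the names are ours.

```python
# (label, bandwidth in TB/s, memory power in W) as quoted in the article.
QUOTED_POINTS = [
    ("HBM2 at 1.2 TB/s", 1.2, 60.0),
    ("HBM2 at 2.5 TB/s", 2.5, 120.0),
    ("HBM2 at 3 TB/s", 3.0, 160.0),   # "over 3 TB/s": treated as a lower bound
]


def watts_per_tbps(bandwidth_tbps, watts):
    """Memory power spent per TB/s of bandwidth."""
    return watts / bandwidth_tbps


def picojoules_per_bit(bandwidth_tbps, watts):
    """Implied energy per transferred bit, assuming 1 TB = 1e12 bytes."""
    bits_per_second = bandwidth_tbps * 1e12 * 8
    return watts / bits_per_second * 1e12


results = [
    (label, watts_per_tbps(bw, w), picojoules_per_bit(bw, w))
    for label, bw, w in QUOTED_POINTS
]

for label, w_per_tbps, pj in results:
    print(f"{label}: {w_per_tbps:.1f} W per TB/s, {pj:.2f} pJ/bit")
```

All three points sit at roughly 50W per TB/s (about 6 to 7 pJ per bit), so memory power climbs almost in step with bandwidth.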

NVIDIA HBM Memory Crisis

Note that this is not the power of the whole chip, just the memory subsystem. Such chips would typically be considered inefficient for the consumer and HPC sectors, but NVIDIA is trying to change that and is exploring new ways to solve the memory power crisis that lies ahead with HBM and higher bandwidth. In the near future, Pascal and Volta won’t see a major increase in power consumption from HBM, but moving onward to 2020, when NVIDIA’s next gen architecture is expected to arrive, we will probably see a new memory architecture introduced to address the increased power needs.

| GPU Family | AMD Vega | AMD Navi | NVIDIA Pascal | NVIDIA Volta |
| --- | --- | --- | --- | --- |
| Flagship GPU | Vega 10 | Navi 10 | NVIDIA GP100 | NVIDIA GV100 |
| GPU Process | 14nm FinFET | 7nm FinFET | TSMC 16nm FinFET | TSMC 12nm FinFET |
| GPU Transistors | 15-18 Billion | TBC | 15.3 Billion | 21.1 Billion |
| GPU Cores (Max) | 4096 SPs | TBC | 3840 CUDA Cores | 5376 CUDA Cores |
| Peak FP32 Compute | 13.0 TFLOPs | TBC | 12.0 TFLOPs | >15.0 TFLOPs (Full Die) |
| Peak FP16 Compute | 25.0 TFLOPs | TBC | 24.0 TFLOPs | 120 Tensor TFLOPs |
| Memory (Consumer Cards) | HBM2 | HBM3 | GDDR5X | GDDR6 |
| Memory (Dual-Chip Professional/ HPC) | HBM2 | HBM3 | HBM2 | HBM2 |
| HBM2 Bandwidth | 484 GB/s (Frontier Edition) | >1 TB/s? | 732 GB/s (Peak) | 900 GB/s |
| Graphics Architecture | Next Compute Unit (Vega) | Next Compute Unit (Navi) | 5th Gen Pascal CUDA | 6th Gen Volta CUDA |
| Successor of (GPU) | Radeon RX 500 Series | Radeon RX 600 Series | GM200 (Maxwell) | GP100 (Pascal) |
| Launch | 2017 | 2019 | 2016 | 2017 |

With the final configuration of Volta V100 and IBM Power9 CPUs in place, the Summit Supercomputer would be ranked as the top performing machine in the world with performance crossing the 250 Petaflops mark.