NVIDIA Volta GV100 GPU Chip For Summit Supercomputer Twice as Fast as Pascal P100 – Speculated To Hit 9.5 TFLOPs FP64 Compute

Hassan Mujtaba

NVIDIA Volta is being prepped for launch in the next generation supercomputers known as Summit and Sierra. Little is known about the Volta GPU's specifications, but an analysis done by The Next Platform over the details of the Summit supercomputer reveals that it could be an insanely fast chip, capable of delivering multi-TFLOPs compute power in the HPC market.

NVIDIA Volta GV100 GPUs - The Heart of the Summit and Sierra Supercomputers, a Multi-TFLOPs Chip With the Fastest HBM2 Configuration

When NVIDIA announced their Pascal GP100 GPU at GTC 2016, they called it the largest chip endeavor in the history of humanity. With an R&D budget of several billion dollars, Pascal GP100 was indeed the great chip of 2016, aimed at powering the HPC and datacenter markets with performance never before seen in the graphics industry. NVIDIA also utilized Pascal GP100 GPUs inside their own DGX SaturnV supercomputer, which is designed to help them build smarter cards and next generation GPUs (GPUs designing GPUs).

Just a year after their successful Pascal launch in the HPC market, NVIDIA is planning to introduce their next grand chip for that market, codenamed Volta. Details of the chip first emerged back at GTC 2015, where NVIDIA showcased what they predicted to be the estimated performance output of their upcoming chips. Do note that Pascal had not launched at that time. According to the slides presented that day, Volta would have twice of everything Pascal has: double the memory capacity, double the compute, higher efficiency and faster bandwidth.


We aren't sure how much of that may end up being true, but what NVIDIA estimated for Pascal was close to the final product (if not entirely the same). The only thing Pascal currently lacks is the promised 32 GB capacity, but that's mostly an issue of HBM2 production, which has already ramped up, so we can expect a full GP100 configuration with 32 GB capacity since that is entirely possible with the chip design. In short, the VRAM limitation is due to production, not the chip design.

Summit Supercomputer's Latest Details Provide First Glimpse of NVIDIA Volta GV100 GPU Specs

The latest details for the Summit supercomputer have been confirmed, and they are incredible from an HPC perspective. Summit promises a 5-10x improvement in application performance over the Titan supercomputer, which featured the Kepler GK110 GPU architecture. Titan comprised 18,688 nodes rated at 1.4 TF per node, while Summit features around 4,600 nodes with a rated compute output of over 40 TF per node.

Specifications comparison of Titan and Summit Supercomputer. (Image Credits: The Next Platform)

There's 512 GB of DDR4 plus additional HBM2 memory on each node. Titan, in comparison, had just 38 GB of DDR3 and 6 GB of GDDR5 (per GPU) memory on each node. There's also a total of 800 GB of non-volatile (NV) memory per node. In total, the Titan supercomputer held 710 TB of memory, while Summit peaks at over 6 Petabytes (all DDR4 + HBM2 + non-volatile combined).

The Power9 chips will have 48 lanes of PCI-Express 4.0 peripheral I/O per socket, for an aggregate of 192 GB/sec of duplex bandwidth, as well as 48 lanes of 25 Gb/sec “Bluelink” connectivity, with an aggregate bandwidth of 300 GB/sec for linking various kinds of accelerators. These Bluelink ports are used to run the NVLink 2.0 protocol that will be supported on the Volta GPUs from Nvidia, and which have about 56 percent more bandwidth than the PCI-Express ports. IBM could support a lot of the SMX2-style, on-motherboard Tesla cards in a system, given all of these Bluelink ports, but remember it needs to allow the Volta accelerators to link to each other over NVLink so they can share memory as well as using NVLink to share memory back with the two Power9 chips. via The Next Platform

Each node will house 2 IBM Power9 CPUs and 6 NVIDIA Volta V100 GPUs, with NVIDIA's NVLink 2.0 interconnect fully integrated within these nodes. The system would consume 13 MW of peak power, which is just 4 MW more than the Titan supercomputer (9 MW) for over a 10x performance improvement.
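The link-bandwidth figures in the quote above check out arithmetically. A quick sketch, using only the lane counts and per-lane signalling rates The Next Platform cites (no official IBM or NVIDIA figures beyond those):

```python
# Sanity check of the Power9 link-bandwidth figures quoted above.
# Lane counts and per-lane rates come from The Next Platform's numbers;
# the arithmetic simply converts lanes x Gb/s into aggregate duplex GB/s.

# PCI-Express 4.0: 48 lanes at ~16 Gb/s per lane
pcie4_duplex_gb_s = 48 * 16 / 8 * 2        # -> 192 GB/s aggregate duplex

# "Bluelink": 48 lanes at 25 Gb/s per lane
bluelink_duplex_gb_s = 48 * 25 / 8 * 2     # -> 300 GB/s aggregate duplex

# NVLink 2.0 over Bluelink vs. PCI-Express 4.0
advantage = bluelink_duplex_gb_s / pcie4_duplex_gb_s - 1   # 0.5625, i.e. ~56% more

print(pcie4_duplex_gb_s, bluelink_duplex_gb_s, advantage)
```

The ~56 percent bandwidth advantage quoted for the Bluelink ports falls directly out of the 25 Gb/s vs. 16 Gb/s per-lane signalling rates.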

NVIDIA Volta Tesla V100 - The Next-Generation Compute Powerhouse

NVIDIA previously stated through their roadmaps that NVIDIA Volta GV100 GPUs will deliver SGEMM (single precision general matrix multiply) efficiency of 72 GFLOPs/Watt, compared to 42 GFLOPs/Watt on Pascal GP100. Using the mentioned ratio, a Volta GV100 based GPU with a TDP of 300W can theoretically deliver 9.5 TFLOPs of double precision performance, almost twice that of the current generation GP100 GPU. NVIDIA's Tesla P100 cards also ship at 300W, but the Summit nodes are expected to feature around 40 TFLOPs of compute performance, so it is possible that NVIDIA may use TDP-configured variants for the Summit supercomputer.

Six Volta V100 GPUs with the rated 300W TDP would go well beyond the 40 TF node barrier, delivering around 57.2 TFLOPs, which isn't what the Summit spec sheet claims. A geared-down version running with a TDP around 200W would manage 20-25% lower performance, delivering around 7.6 TFLOPs at 38.2 GFLOPs/Watt, which aligns with the Summit node specs.

Six such geared-down Volta Tesla V100 GPUs would deliver around 45 TFLOPs of compute, which sounds more plausible. There's a possibility that the final double precision compute of Volta V100 may end up near 8-9 TFLOPs, which would be an impressive feat for the graphics manufacturer.
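The node-level reasoning above is simple arithmetic, sketched below as a back-of-envelope check. All input figures (9.5 TFLOPs FP64 at 300W, a ~200W geared-down variant at ~20% lower performance, six GPUs per node) are the article's own estimates, not official NVIDIA specifications:

```python
# Back-of-envelope check of the Volta FP64 node estimates. The 9.5 TFLOPs
# and 20% derate figures are the article's speculation, not NVIDIA specs.

gpus_per_node = 6
full_fp64_tflops = 9.5                        # speculated FP64 at a 300W TDP

node_full = gpus_per_node * full_fp64_tflops            # 57 TFLOPs -> overshoots the ~40 TF target
derated_fp64_tflops = full_fp64_tflops * 0.8            # ~7.6 TFLOPs at ~200W
efficiency_gflops_w = derated_fp64_tflops * 1000 / 200  # ~38 GFLOPs/Watt
node_derated = gpus_per_node * derated_fp64_tflops      # ~45.6 TFLOPs -> close to the node spec

print(node_full, derated_fp64_tflops, efficiency_gflops_w, node_derated)
```

The geared-down scenario lands at roughly 45-46 TFLOPs per node and ~38 GFLOPs/Watt, matching the figures cited above.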

Titan vs. Summit Supercomputer Specifications:

| Specification | Titan | Summit |
|---|---|---|
| Number of Nodes | 18,688 | 4,608 |
| Processors (Per Node) | 1 Opteron + 1 Kepler K20X | 2 IBM Power9 + 6 NVIDIA Tesla V100 |
| GPUs (Total) | 18,688 NVIDIA Tesla K20X | 27,648 NVIDIA Tesla V100 |
| CPUs (Total) | 18,688 Opteron CPUs | 9,216 Power9 CPUs |
| Node Performance | 1.44 TF | 49 TF |
| Peak Performance | 27 PF | 200 PF |
| Peak OPs (Tensor) | N/A | 3.3 ExaOps |
| Memory Per Node | 38 GB DDR3 + 6 GB GDDR5 | 512 GB DDR4 + HBM2 (16/32 GB) + NVDIMM |
| NV Memory Per Node | 0 | 800 GB (Flash based) |
| Total System Memory | 710 TB | 10 PB |
| System Interconnect | Gemini (6.4 GB/s), PCIe 8 GB/s | Dual Rail EDR-IB (23 GB/s) / Dual Rail HDR-IB (48 GB/s) |
| Interconnect Topology | 3D Torus | Non-Blocking Fat Tree |
| File System | 32 PB, 1 TB/s, Lustre | 250 PB, 2.5 TB/s, GPFS |
| Peak Power Input | 9 MW | 13 MW |

Furthermore, Volta GV100 may ship with or exceed the 32 GB HBM2 capacity promised for Pascal GPUs, and have bandwidth tuned around 1 TB/s. NVIDIA slides from GTC 2015 claim bandwidths of ~900 GB/s, while Pascal currently operates at 732 GB/s.

The Looming Memory Crisis With HBM2

Further explaining next generation GPU architectures and efficiency, Stephen W. Keckler (Senior Director of GPU Architecture) pointed out that HBM is a great memory architecture that will be implemented across Pascal and Volta chips, but those chips top out at 1.2 TB/s of bandwidth (on the Volta GPU). Moving forward, there exists a looming memory power crisis: HBM2 at 1.2 TB/s sure is great, but it adds 60W to the power envelope of a standard GPU.

The current implementation of HBM1 on Fiji chips adds around 25W to the chip. Moving onwards, chips with in excess of 2 TB/s of bandwidth will push the overall power limit from bad to breaking point. A chip with 2.5 TB/s of HBM (2nd generation) memory would need 120W for the memory architecture alone, and even a 1.5 times more efficient HBM2 architecture that outputs over 3 TB/s of bandwidth would need 160W to feed the memory alone.
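The underlying problem is that the energy cost per bit moved stays roughly flat, so memory power grows almost linearly with bandwidth. A small sketch converting the article's watt/bandwidth pairs into the standard pJ/bit efficiency metric (the power and bandwidth figures are the article's; the pJ/bit conversion is simply power divided by bits moved per second):

```python
# Implied memory-interface energy per bit for the HBM scenarios above.
# Power/bandwidth pairs are the article's figures, not vendor specs.

def pj_per_bit(power_w, bandwidth_tb_s):
    bits_per_second = bandwidth_tb_s * 1e12 * 8    # TB/s -> bits/s
    return power_w / bits_per_second * 1e12        # joules/bit -> picojoules/bit

hbm2_now    = pj_per_bit(60, 1.2)    # ~6.25 pJ/bit (Volta-class HBM2)
hbm2_future = pj_per_bit(120, 2.5)   # ~6.0 pJ/bit at 2.5 TB/s
hbm_next    = pj_per_bit(160, 3.0)   # ~6.7 pJ/bit at 3.0 TB/s

print(hbm2_now, hbm2_future, hbm_next)
```

All three scenarios sit in the same ~6-7 pJ/bit band, which is exactly why tripling bandwidth roughly triples memory power unless the energy per bit comes down.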

NVIDIA HBM Memory Crisis

To be clear, these figures cover only the memory subsystem, not the whole chip. Chips drawing that much power for memory alone would typically be considered inefficient for the consumer and HPC sectors, but NVIDIA is trying to change that and is exploring new means to solve the memory power crisis that lies ahead with HBM and higher bandwidths. In the near term, Pascal and Volta don't see a major consumption increase from HBM, but moving onward to 2020, when NVIDIA's next-gen architecture is expected to arrive, we will probably see a new memory architecture introduced to meet the increased power needs.

| Flagship GPU | Vega 10 | Navi 10 | NVIDIA GP100 | NVIDIA GV100 |
|---|---|---|---|---|
| GPU Process | 14nm FinFET | 7nm FinFET | TSMC 16nm FinFET | TSMC 12nm FinFET |
| GPU Transistors | 15-18 Billion | TBC | 15.3 Billion | 21.1 Billion |
| GPU Cores (Max) | 4096 SPs | TBC | 3840 CUDA Cores | 5376 CUDA Cores |
| Peak FP32 Compute | 13.0 TFLOPs | TBC | 12.0 TFLOPs | >15.0 TFLOPs (Full Die) |
| Peak FP16 Compute | 25.0 TFLOPs | TBC | 24.0 TFLOPs | 120 Tensor TFLOPs |
| Memory (Consumer Cards) | HBM2 | HBM3 | GDDR5X | GDDR6 |
| Memory (Dual-Chip Professional/HPC) | HBM2 | HBM3 | HBM2 | HBM2 |
| HBM2 Bandwidth | 484 GB/s (Frontier Edition) | >1 TB/s? | 732 GB/s (Peak) | 900 GB/s |
| Graphics Architecture | Next Compute Unit (Vega) | Next Compute Unit (Navi) | 5th Gen Pascal CUDA | 6th Gen Volta CUDA |
| Successor of (GPU) | Radeon RX 500 Series | Radeon RX 600 Series | GM200 (Maxwell) | GP100 (Pascal) |

With the final configuration of Volta V100 and IBM Power9 CPUs in place, the Summit Supercomputer would be ranked as the top performing machine in the world with performance crossing the 250 Petaflops mark.
