Nvidia Volta GPU Launch Confirmed For 2017 – Coming First To Supercomputers, Features HBM2 & GDDR6 Memory
Nvidia has confirmed that its “Pascal” GPU architecture will launch in 2016 and that its “Volta” GPU architecture will succeed it in 2017. Volta-powered supercomputers are expected to be operational by the middle of 2017. Volta is Nvidia’s sixth generation of general-purpose GPU architectures since the introduction of the company’s first unified shader graphics architecture, code named Tesla, which debuted with the company’s highly successful GeForce 8 (8000) series back in 2006.
Volta was originally intended to succeed Nvidia’s 900 series Maxwell GPU architecture in 2016 and to be the company’s first generation to feature stacked memory. However, Volta was designed with HMC, the Hybrid Memory Cube, in mind, and HMC hadn’t matured as quickly as Nvidia had hoped. So a replacement was put in place that makes use of the other major stacked memory standard available in the market, High Bandwidth Memory, or HBM for short. And thus Pascal was born.
|GPU Architecture||NVIDIA Fermi||NVIDIA Kepler||NVIDIA Maxwell||NVIDIA Pascal|
|GPU Process||40nm||28nm||28nm||16nm (TSMC FinFET)|
|GPU Design||SM (Streaming Multiprocessor)||SMX (Streaming Multiprocessor)||SMM (Streaming Multiprocessor Maxwell)||SMP (Streaming Multiprocessor Pascal)|
|Maximum Transistors||3.00 Billion||7.08 Billion||8.00 Billion||15.3 Billion|
|Maximum Die Size||520mm²||561mm²||601mm²||610mm²|
|Stream Processors Per Compute Unit||32 SPs||192 SPs||128 SPs||64 SPs|
|Maximum CUDA Cores||512 CCs (16 CUs)||2880 CCs (15 CUs)||3072 CCs (24 CUs)||3840 CCs (60 CUs)|
|FP32 Compute||1.33 TFLOPs (Tesla)||5.10 TFLOPs (Tesla)||6.10 TFLOPs (Tesla)||~12 TFLOPs (Tesla)|
|FP64 Compute||0.66 TFLOPs (Tesla)||1.43 TFLOPs (Tesla)||0.20 TFLOPs (Tesla)||~6 TFLOPs (Tesla)|
|Maximum VRAM||1.5 GB GDDR5||6 GB GDDR5||12 GB GDDR5||16 / 32 GB HBM2|
|Maximum Bandwidth||192 GB/s||336 GB/s||336 GB/s||720 GB/s - 1 TB/s|
|Launch Year||2010 (GTX 580)||2014 (GTX Titan Black)||2015 (GTX Titan X)||2016|
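As a rough sanity check on the table, the sketch below recomputes each CUDA core count from the per-unit figures and backs out the clock speed that each FP32 number would imply. It assumes peak FP32 throughput equals 2 FLOPs (one fused multiply-add) per core per clock. The names `TABLE`, `implied_clock_ghz` and `summarize` are illustrative, and the implied clocks are derived values, not figures quoted by Nvidia.

```python
# Figures copied from the comparison table (Tesla-class parts).
# name: (SPs per compute unit, compute units, listed CUDA cores, listed FP32 TFLOPs)
TABLE = {
    "Fermi":   (32, 16, 512, 1.33),
    "Kepler":  (192, 15, 2880, 5.10),
    "Maxwell": (128, 24, 3072, 6.10),
    "Pascal":  (64, 60, 3840, 12.0),  # the table lists "~12"
}


def implied_clock_ghz(cores: int, tflops: float) -> float:
    """Clock implied by the ASSUMED model: peak FP32 = 2 FLOPs x cores x clock."""
    return tflops * 1e12 / (2 * cores) / 1e9


def summarize(table=TABLE):
    """Check cores = SPs per unit x units, and derive the implied clock per row."""
    rows = {}
    for name, (sps, units, cores, tflops) in table.items():
        rows[name] = {
            "cores_match": sps * units == cores,
            "clock_ghz": implied_clock_ghz(cores, tflops),
        }
    return rows


for name, row in summarize().items():
    print(f"{name}: cores consistent={row['cores_match']}, "
          f"implied clock ~{row['clock_ghz']:.2f} GHz")
```

Under that assumption every core count in the table is consistent with its per-unit figures, and the ~12 TFLOPs Pascal entry would correspond to a clock of roughly 1.56 GHz.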
Nvidia Volta GPUs To Feature GDDR6 Memory & HBM2
One of the major architectural overhauls in Volta will be to its memory sub-system. Gaming GeForce GTX 1100 series Volta graphics cards will feature sixth generation graphics DDR memory. GDDR6 will run at per-pin data rates of 14-16 Gbps, which is double that of GDDR5 and a good chunk ahead of GDDR5X.
Professional Volta SKUs under the Tesla brand will continue to use High Bandwidth Memory, although notably they will be upgraded to the second generation of the technology. HBM2 uses less power and doubles the per-pin data rate of HBM1, which translates to double the memory bandwidth at less power. Apart from an updated SM design, Volta will still be manufactured on TSMC’s 16nm FinFET process, although it will debut on a much more mature flavor of the process with even higher frequencies and better power efficiency. All of that will be of paramount importance to delivering more performance and better efficiency in the gaming graphics market as well as the high performance computing and server markets.
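To put the 14-16 Gbps figure in context, here is a minimal sketch of how a per-pin data rate turns into card-level memory bandwidth. The 256-bit bus width is purely an assumption for illustration, since the article gives no bus width for any Volta card, and the GDDR5 rates of 7-8 Gbps are simply half of the quoted GDDR6 range.

```python
def bandwidth_gb_s(pin_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s = per-pin data rate (Gbps) x bus width (bits) / 8."""
    return pin_rate_gbps * bus_width_bits / 8


# ASSUMPTION: a 256-bit bus, chosen only to make the comparison concrete.
BUS_WIDTH_BITS = 256

RATES_GBPS = [
    ("GDDR5 @ 7 Gbps", 7),    # half of the quoted 14 Gbps
    ("GDDR5 @ 8 Gbps", 8),    # half of the quoted 16 Gbps
    ("GDDR6 @ 14 Gbps", 14),
    ("GDDR6 @ 16 Gbps", 16),
]

for label, rate in RATES_GBPS:
    print(f"{label}: {bandwidth_gb_s(rate, BUS_WIDTH_BITS):.0f} GB/s")
```

On that hypothetical 256-bit bus, GDDR6 would land at 448-512 GB/s against 224-256 GB/s for GDDR5. The doubling holds for any bus width, since bandwidth scales linearly with the per-pin rate.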
Nvidia’s flagship Pascal GP100 GPU:
- Pascal graphics architecture.
- Estimated 2x performance-per-watt improvement over Maxwell.
- To launch in 2016, purportedly the second half of the year.
- DirectX 12 feature level 12_1.
- Successor to the GM200 GPU found in the GTX Titan X and GTX 980 Ti.
- Built on the 16nm FinFET manufacturing process from TSMC.
- Has a total of 15.3 billion transistors, nearly twice that of GM200.
- Features four 4-Hi HBM2 stacks, for a total of 16GB of VRAM.
- Features a 4096-bit memory bus interface, same as AMD’s Fiji GPU powering the Fury series.
- Features NVLink (only compatible with next generation IBM POWER server processors)
- Supports half precision FP16 compute at twice the rate of full precision FP32.
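The memory and compute figures in the list above can be cross-checked with a little arithmetic. The sketch below splits the capacity and bus width across the four stacks, estimates peak bandwidth, and doubles the FP32 rate to get an FP16 figure. The per-pin rates of 1.4 and 2.0 Gbps are assumptions picked for illustration (the article does not state them), and all names are hypothetical.

```python
STACKS = 4              # "four 4-Hi HBM2 stacks"
TOTAL_VRAM_GB = 16      # "a total of 16GB of VRAM"
BUS_WIDTH_BITS = 4096   # "4096-bit memory bus interface"
FP32_TFLOPS = 12.0      # "~12 TFLOPs" from the comparison table

# ASSUMPTION: illustrative HBM2 per-pin data rates, not taken from the article.
ASSUMED_PIN_RATES_GBPS = (1.4, 2.0)

gb_per_stack = TOTAL_VRAM_GB // STACKS
bits_per_stack = BUS_WIDTH_BITS // STACKS


def peak_bandwidth_gb_s(pin_rate_gbps: float,
                        bus_width_bits: int = BUS_WIDTH_BITS) -> float:
    """Peak bandwidth in GB/s = per-pin rate (Gbps) x bus width (bits) / 8."""
    return pin_rate_gbps * bus_width_bits / 8


# "FP16 compute at twice the rate of full precision FP32"
fp16_tflops = 2 * FP32_TFLOPS

print(f"{gb_per_stack} GB and {bits_per_stack} bits of bus per stack")
for rate in ASSUMED_PIN_RATES_GBPS:
    print(f"{rate} Gbps per pin -> {peak_bandwidth_gb_s(rate):.1f} GB/s")
print(f"FP16 peak: ~{fp16_tflops:.0f} TFLOPs")
```

With those assumed pin rates the result is roughly 717 GB/s to 1 TB/s, which brackets the "720 GB/s - 1 TB/s" range in the comparison table.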
Nvidia Confirms Volta Coming In 2017
While HMC has admittedly shown much slower progress than HBM, which is already being used in AMD’s latest GPU, code named Fiji, HMC still offers some substantial benefits for the server and HPC markets. And that’s where Volta is set to shine.
Nvidia plans to introduce Volta in a range of consumer graphics cards by 2018 and to use Volta GPUs to power some really exciting and highly power efficient next generation supercomputers in 2017.
The Summit supercomputer at Oak Ridge National Laboratory and the Sierra supercomputer at Lawrence Livermore National Laboratory will be major headliners in 2017. Both have one thing in common: they will be powered by next generation IBM POWER9 CPUs and NVIDIA Volta GPUs.
Summit is rated at a peak single precision floating point performance of 150-300 PFLOPS, which will be delivered by more than 3400 compute nodes, each powered by several next generation IBM POWER9 CPUs and NVIDIA Volta based Tesla accelerators. Each node will deliver around 40 teraflops of compute and is touted as a more performant solution than an entire rack of flagship Haswell based server chips.
There’s one technology that will be pivotal to delivering the promise of Volta GPGPUs in servers and supercomputers, and that’s NVLink. This technology is aimed at GPU accelerated servers and supercomputers where inter-chip communication is extremely bandwidth limited and a major system bottleneck. Nvidia states that NVLink will be 5 to 12 times faster than traditional PCIe 3.0, making it a major step forward in platform atomics. Earlier this year Nvidia announced that IBM will be integrating this new interconnect into its upcoming POWER server CPUs.
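The node count and per-node figure can be multiplied out to see how they relate to the quoted system peak. This is a back-of-the-envelope sketch that treats "more than 3400" and "around 40 teraflops" as exact values, which they are not; the function names are illustrative.

```python
import math

NODES = 3400            # "more than 3400 compute nodes"
TFLOPS_PER_NODE = 40    # "around 40 teraflops" per node
PEAK_RANGE_PFLOPS = (150, 300)


def aggregate_pflops(nodes: int, tflops_per_node: float) -> float:
    """Total peak in PFLOPS (1 PFLOPS = 1000 TFLOPS)."""
    return nodes * tflops_per_node / 1000


def nodes_needed(target_pflops: float, tflops_per_node: float) -> int:
    """Smallest node count that reaches the target at a given per-node rate."""
    return math.ceil(target_pflops * 1000 / tflops_per_node)


print(f"{NODES} nodes x {TFLOPS_PER_NODE} TFLOPS = "
      f"{aggregate_pflops(NODES, TFLOPS_PER_NODE):.0f} PFLOPS")
for target in PEAK_RANGE_PFLOPS:
    print(f"{target} PFLOPS would need {nodes_needed(target, TFLOPS_PER_NODE)} "
          f"nodes at {TFLOPS_PER_NODE} TFLOPS each")
```

Taken literally, 3400 nodes at 40 teraflops give 136 PFLOPS, just under the low end of the 150-300 PFLOPS range, so the final machine would need more nodes, faster nodes, or both to reach the upper figures.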
Nvidia NVLink Technology
NVLink is an energy-efficient, high-bandwidth communications channel that uses up to three times less energy to move data on the node at speeds 5-12 times conventional PCIe Gen3 x16. First available in the NVIDIA Pascal GPU architecture, NVLink enables fast communication between the CPU and the GPU, or between multiple GPUs.
Figure 3: NVLink is a key building block in the compute node of Summit and Sierra supercomputers.
[Slide: Volta GPU featuring NVLink and stacked memory. NVLink GPU high-speed interconnect: 80-200 GB/s. 3D stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit.]
NVLink is a key technology in Summit’s and Sierra’s server node architecture, enabling IBM POWER CPUs and NVIDIA GPUs to access each other’s memory fast and seamlessly. From a programmer’s perspective, NVLink erases the visible distinctions of data separately attached to the CPU and the GPU by “merging” the memory systems of the CPU and the GPU with a high-speed interconnect. Because both CPU and GPU have their own memory controllers, the underlying memory systems can be optimized differently (the GPU’s for bandwidth, the CPU’s for latency) while still presenting as a unified memory system to both processors. NVLink offers two distinct benefits for HPC customers. First, it delivers improved application performance, simply by virtue of greatly increased bandwidth between elements of the node. Second, NVLink with Unified Memory technology allows developers to write code much more seamlessly and still achieve high performance. via NVIDIA News
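The "5-12 times PCIe Gen3 x16" claim can be turned into rough bandwidth numbers. The sketch below assumes the standard PCIe 3.0 parameters of 8 GT/s per lane with 128b/130b encoding, counted in one direction only. How Nvidia counts its own 80-200 GB/s figure is not stated in the article, so the comparison is indicative rather than exact.

```python
def pcie3_x16_gb_s() -> float:
    """Usable PCIe 3.0 x16 bandwidth per direction, in GB/s."""
    transfers_per_s = 8e9   # 8 GT/s per lane
    encoding = 128 / 130    # 128b/130b line encoding overhead
    lanes = 16
    bits_per_s = transfers_per_s * encoding * lanes
    return bits_per_s / 8 / 1e9


def nvlink_range_gb_s(low_factor: float = 5, high_factor: float = 12):
    """Apply the quoted 5x-12x multipliers to the PCIe 3.0 x16 baseline."""
    base = pcie3_x16_gb_s()
    return base * low_factor, base * high_factor


low, high = nvlink_range_gb_s()
print(f"PCIe 3.0 x16: ~{pcie3_x16_gb_s():.2f} GB/s per direction")
print(f"5x to 12x of that: ~{low:.0f} to ~{high:.0f} GB/s")
```

That works out to roughly 79-189 GB/s, in the same ballpark as the 80-200 GB/s quoted on Nvidia's Volta slide.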
NVLink will debut with Nvidia’s Pascal in 2016 before it makes its way to Volta in 2017. Unlike with Maxwell, Nvidia has put a major focus on compute and GPGPU acceleration with Pascal. The slew of features and new technologies that Nvidia will debut with Pascal emphasizes this focus, including next generation stacked High Bandwidth Memory, the high-speed NVLink GPU interconnect and support for mixed precision, which accelerates mobile applications and pushes mobile perf/watt. We expect that Volta will carry all of these forward.
Back to the Summit supercomputer: perhaps the most impressive thing about it is that it will consume only 10% more power than the Titan supercomputer while delivering up to 10 times the computational performance. While Titan is rated at 25-30 PFLOPS, Sierra will deliver more than 100 PFLOPS of compute and Summit an even more impressive 150-300 PFLOPS.
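Those figures imply a sizeable jump in performance per watt, which the sketch below estimates. It takes the quoted ranges at face value and assumes the "10% more power" claim means a power ratio of exactly 1.10; the names are illustrative.

```python
TITAN_PFLOPS = (25, 30)      # quoted range for Titan
SUMMIT_PFLOPS = (150, 300)   # quoted range for Summit
POWER_RATIO = 1.10           # ASSUMPTION: "10% more power" read as exactly 1.10x


def speedup_range(old, new):
    """Most pessimistic and most optimistic speedups from two (low, high) ranges."""
    return min(new) / max(old), max(new) / min(old)


def perf_per_watt_gain(speedup: float, power_ratio: float = POWER_RATIO) -> float:
    """Efficiency gain = performance ratio / power ratio."""
    return speedup / power_ratio


low, high = speedup_range(TITAN_PFLOPS, SUMMIT_PFLOPS)
print(f"Speedup over Titan: {low:.0f}x to {high:.0f}x")
print(f"Perf/watt gain: {perf_per_watt_gain(low):.1f}x to "
      f"{perf_per_watt_gain(high):.1f}x")
```

By the article's own numbers the speedup spans 5x to 12x, so the headline "up to 10 times" sits inside that range, and the implied efficiency gain is roughly 4.5x to 10.9x.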