NVIDIA Pascal GPU Still Aimed At 2016 Launch – 10x More Performance Compared To Maxwell With FP16, NVLINK and 3D Memory


NVIDIA updated its next-generation graphics roadmap at the GTC 2015 conference, covering the upcoming Pascal and Volta GPUs. While we are still a few years away from knowing what Volta will look like or how it will perform, NVIDIA CEO Jen-Hsun Huang did confirm three key aspects of the Pascal GPU that he says will make it 10 times faster than current-generation Maxwell-based chips when it launches in 2016.

NVIDIA Pascal Gets FP16 Mixed Precision, NVLINK and 1 TB/s 3D Memory in 2016

The details NVIDIA showed on the Pascal GPU are much the same as what we heard last year at GTC 2014, with a few updates on performance and efficiency. We know that Pascal will replace Maxwell in 2016 and will feature NVIDIA’s latest core architecture together with 3D stacked memory, which places stacked memory inside the same package as the GPU and enables bandwidth of up to 1 TB/s. This 3D chip-on-wafer integration will not only provide much more bandwidth but will also deliver up to 4 times the energy efficiency and 2.5 times more VRAM capacity, which should help performance on higher-resolution screens. AMD is already going for 2.5D memory stacking with its upcoming cards, which will have up to 640 GB/s of bandwidth, while NVIDIA will use 3D HBM integration that allows many memory chips to be stacked for more than 1 TB/s of bandwidth.

Compared to the GeForce GTX Titan X, 3D HBM memory will allow three times more bandwidth, since the Titan X already uses the fastest standard GDDR5 memory chips, which run at an effective 7 GHz. This limitation will end once HBM becomes common on discrete graphics cards. NVIDIA also mentioned Pascal having 2.7 times more memory available, which points to 32 GB of VRAM for users with higher demands. By comparison, the Titan X has 12 GB of GDDR5 memory, which users already consider a lot.
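As a rough sanity check, the figures above can be worked through in a few lines of Python. This is only a back-of-envelope sketch: it assumes the Titan X’s 384-bit memory bus, which is not stated in this article, and it reads "three times more" and "2.7 times more" as simple multipliers.

```python
# Back-of-envelope check of the memory figures quoted above.
# ASSUMPTION: the Titan X uses a 384-bit memory bus (not stated in the article).

TITAN_X_BUS_WIDTH_BITS = 384   # assumed bus width
GDDR5_DATA_RATE_GBPS = 7.0     # effective per-pin data rate (the "7 GHz" figure)
TITAN_X_VRAM_GB = 12           # Titan X frame buffer size from the article


def peak_bandwidth_gb_per_s(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s: pins * gigabits per second per pin / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


titan_x_bw = peak_bandwidth_gb_per_s(TITAN_X_BUS_WIDTH_BITS, GDDR5_DATA_RATE_GBPS)
pascal_bw = 3 * titan_x_bw             # "three times more bandwidth"
pascal_vram = 2.7 * TITAN_X_VRAM_GB    # "2.7 times more memory"

print(f"Titan X peak bandwidth: {titan_x_bw:.0f} GB/s")
print(f"3x that figure:         {pascal_bw:.0f} GB/s")
print(f"2.7x of 12 GB:          {pascal_vram:.1f} GB")
```

Under those assumptions the Titan X comes out at 336 GB/s, tripling it lands at roughly 1 TB/s, and 2.7 times 12 GB is about 32 GB, which is consistent with the numbers quoted for Pascal.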

The Pascal GPU will also introduce NVLINK, the next-generation Unified Virtual Memory link, with Gen 2.0 cache coherency features and 5 – 12 times the bandwidth of a regular PCIe connection. This should ease many of the bandwidth issues that high-performance GPUs currently face. One of the latest things we learned about NVLINK is that it will allow several GPUs to be connected in parallel, whether in SLI for gaming or for professional use. Jen-Hsun specifically mentioned that instead of 4 cards, users will be able to use 8 GPUs in their PCs for gaming and professional purposes.

NVLink is an energy-efficient, high-bandwidth communications channel that uses up to three times less energy to move data on the node at speeds 5-12 times conventional PCIe Gen3 x16. First available in the NVIDIA Pascal GPU architecture, NVLink enables fast communication between the CPU and the GPU, or between multiple GPUs.

Figure 3: NVLink is a key building block in the compute node of Summit and Sierra supercomputers.

VOLTA GPU Featuring NVLINK and Stacked Memory

  • NVLINK: GPU high speed interconnect, 80-200 GB/s
  • 3D Stacked Memory: 4x Higher Bandwidth (~1 TB/s), 3x Larger Capacity, 4x More Energy Efficient per bit

NVLink is a key technology in Summit’s and Sierra’s server node architecture, enabling IBM POWER CPUs and NVIDIA GPUs to access each other’s memory fast and seamlessly. From a programmer’s perspective, NVLink erases the visible distinctions of data separately attached to the CPU and the GPU by “merging” the memory systems of the CPU and the GPU with a high-speed interconnect. Because both CPU and GPU have their own memory controllers, the underlying memory systems can be optimized differently (the GPU’s for bandwidth, the CPU’s for latency) while still presenting as a unified memory system to both processors.

NVLink offers two distinct benefits for HPC customers. First, it delivers improved application performance, simply by virtue of greatly increased bandwidth between elements of the node. Second, NVLink with Unified Memory technology allows developers to write code much more seamlessly and still achieve high performance.

via NVIDIA News

The third thing Jen-Hsun mentioned is how he believes the Pascal GPU will achieve 10x better performance compared to Maxwell. The key to this improvement is mixed-precision, or FP16, compute, which NVIDIA recently introduced in its Tegra X1 SoC.

NVIDIA GTC 2015 Pascal GPU Slides:

  • 3D Memory: Stacks DRAM chips into dense modules with wide interfaces, and brings them inside the same package as the GPU. This lets GPUs get data from memory more quickly – boosting throughput and efficiency – allowing us to build more compact GPUs that put more power into smaller devices. The result: several times greater bandwidth, more than twice the memory capacity and quadrupled energy efficiency.
  • Unified Memory: This will make building applications that take advantage of what both GPUs and CPUs can do quicker and easier by allowing the CPU to access the GPU’s memory, and the GPU to access the CPU’s memory, so developers don’t have to allocate resources between the two.
  • NVLink: Today’s computers are constrained by the speed at which data can move between the CPU and GPU. NVLink puts a fatter pipe between the CPU and GPU, allowing data to flow at more than 80GB per second, compared to the 16GB per second available now.
  • Pascal Module: NVIDIA has designed a module to house Pascal GPUs with NVLink. At one-third the size of the standard boards used today, they’ll put the power of GPUs into more compact form factors than ever before.
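To put the NVLink bandwidth figures in perspective, here is a small idealized calculation. It is only a sketch: it assumes every link sustains its full quoted bandwidth with no latency or protocol overhead, and the 12 GB payload is a hypothetical example size chosen for illustration.

```python
# Idealized transfer-time comparison using the bandwidth figures quoted above.
# ASSUMPTION: links run at full quoted bandwidth with zero overhead, and the
# 12 GB payload is a hypothetical example, not a figure from NVIDIA.

PCIE_GB_PER_S = 16.0          # "16GB per second available now"
NVLINK_LOW_GB_PER_S = 80.0    # "more than 80GB per second"
NVLINK_HIGH_GB_PER_S = 200.0  # upper end of the 80-200 GB/s range on the slide


def transfer_seconds(payload_gb, bandwidth_gb_per_s):
    """Time to move payload_gb gigabytes over a link at the given bandwidth."""
    return payload_gb / bandwidth_gb_per_s


payload_gb = 12.0
pcie_time = transfer_seconds(payload_gb, PCIE_GB_PER_S)
nvlink_time = transfer_seconds(payload_gb, NVLINK_LOW_GB_PER_S)

speedup_low = NVLINK_LOW_GB_PER_S / PCIE_GB_PER_S
speedup_high = NVLINK_HIGH_GB_PER_S / PCIE_GB_PER_S

print(f"PCIe:   {pcie_time:.2f} s to move {payload_gb:.0f} GB")
print(f"NVLink: {nvlink_time:.2f} s to move {payload_gb:.0f} GB")
print(f"Speedup range: {speedup_low:.1f}x to {speedup_high:.1f}x")
```

With these inputs the speedup works out to between 5x and 12.5x, which lines up with the "5 to 12 times" claim made for NVLink over PCIe.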

Mixed-Precision Computing for Greater Accuracy

Mixed-precision computing enables Pascal architecture-based GPUs to compute at 16-bit floating point accuracy at twice the rate of 32-bit floating point accuracy.

Increased floating point performance particularly benefits classification and convolution – two key activities in deep learning – while achieving needed accuracy.
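The trade-off behind FP16 can be illustrated with a short Python sketch. It says nothing about how Pascal hardware works internally; it only uses the standard library’s `struct` module to round values to IEEE 754 half and single precision, and the helper names `to_fp16` and `to_fp32` are made up for this example. Keeping values in FP16 while accumulating in FP32 is shown here as one common reading of "mixed precision", which is an assumption on our part.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# FP16 keeps only about three decimal digits.
print(to_fp16(0.1))      # 0.0999755859375
print(to_fp16(2049.0))   # 2048.0, because 2049 is not representable in FP16

# Accumulate 4096 ones, rounding the running sum after every addition.
fp16_sum = 0.0
fp32_sum = 0.0
for _ in range(4096):
    fp16_sum = to_fp16(fp16_sum + 1.0)   # pure FP16 accumulator
    fp32_sum = to_fp32(fp32_sum + 1.0)   # FP32 accumulator

print(fp16_sum)   # 2048.0: the FP16 sum stalls once 1.0 falls below its resolution
print(fp32_sum)   # 4096.0
```

The FP16 accumulator stops growing at 2048 because adding 1.0 no longer changes the rounded result, while the FP32 accumulator reaches the exact total. This is why lower precision tends to be paired with a wider format where accuracy matters.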

3D Memory for Faster Communication Speed and Power Efficiency

Memory bandwidth constraints limit the speed at which data can be delivered to the GPU. The introduction of 3D memory will provide 3X the bandwidth and nearly 3X the frame buffer capacity of Maxwell. This will let developers build even larger neural networks and accelerate the bandwidth-intensive portions of deep learning training.

Pascal will have its memory chips stacked on top of each other, and placed adjacent to the GPU, rather than further down the processor boards. This reduces from inches to millimeters the distance that bits need to travel as they traverse from memory to GPU and back. The result is dramatically accelerated communication and improved power efficiency.

NVLink – for Faster Data Movement

The addition of NVLink to Pascal will let data move between GPUs and CPUs five to 12 times faster than they can with today’s standard, PCI-Express. This greatly benefits applications, such as deep learning, that have high inter-GPU communication needs.

NVLink allows for double the number of GPUs in a system to work together in deep learning computations. In addition, CPUs and GPUs can connect in new ways to enable more flexibility and energy efficiency in server design compared to PCI-E.
