
NVIDIA Pascal GPU’s Double Precision Performance Rated at Over 4 TFLOPs, 16nm FinFET Architecture Confirmed – Volta GPU Peaks at Over 7 TFLOPs, 1.2 TB/s HBM2


At this year's SC15, NVIDIA revealed and confirmed two major details about its next generation Pascal GPUs: the process node and the peak compute performance. NVIDIA also shared the same numbers for its Volta GPUs, which are expected to hit the market in 2018 (2017 for HPC). The details confirm the rumors we have been hearing for a few months that Pascal might reach the market early next year.

NVIDIA's Pascal and Volta GPUs Peak Compute Performance Revealed - Volta To Push Memory Bandwidth To 1.2 TB/s

For some time now, we have been hearing that NVIDIA's next generation Pascal GPUs will be based on a 16nm process. During its SC15 presentation, NVIDIA finally confirmed that the chip is based on a 16nm FinFET process. NVIDIA didn't reveal the name of the semiconductor foundry, but it has already been confirmed that TSMC will be supplying the new GPUs. This might not be significant news, as it has been known for months, and we know that NVIDIA’s Pascal GP100 chip has already been taped out on TSMC’s 16nm FinFET process. This means that we could see a launch of these chips as early as 1H 2016. Doubling the transistor density would put Pascal at somewhere around 16-17 billion transistors, since the flagship Maxwell GM200 GPU core already features 8 billion transistors.

TSMC’s 16FF+ (FinFET Plus) technology can provide above 65 percent higher speed, around 2 times the density, or 70 percent less power than its 28HPM technology. Comparing with 20SoC technology, 16FF+ provides extra 40% higher speed and 60% power saving. By leveraging the experience of 20SoC technology, TSMC 16FF+ shares the same metal backend process in order to quickly improve yield and demonstrate process maturity for time-to-market value.

Nvidia decided to let TSMC mass produce the Pascal GPU, which is scheduled to be released next year, using the production process of 16-nm FinFETs. Some in the industry predicted that both Samsung and TSMC would mass produce the Pascal GPU, but the U.S. firm chose only the Taiwanese firm in the end. Since the two foundries have different manufacturing process of 16-nm FinFETs, the U.S. tech company selected the world’s largest foundry (TSMC) for product consistency. (This quote was originally posted at BusinessKorea; however, the article has since been removed for confidentiality reasons.)

What we know so far about the GP100 chip:

  • Pascal microarchitecture.
  • DirectX 12 feature level 12_1 or higher.
  • Successor to the GM200 GPU found in the GTX Titan X and GTX 980 Ti.
  • Built on the 16FF+ manufacturing process from TSMC.
  • Allegedly has a total of 17 billion transistors, more than twice that of GM200.
  • Taped out in June 2015.
  • Will feature four 4-Hi HBM2 stacks, for a total of 16GB of VRAM for the consumer variant and 32GB for the professional variant.
  • Features a 4096-bit memory interface.
  • Features NVLink and support for mixed precision FP16 compute tasks at twice the rate of FP32, plus full FP64 support.
  • Scheduled for a 2016 release.

Back at GTC 2015, NVIDIA's CEO Jen-Hsun Huang talked about mixed precision, which allows users to get twice the compute throughput in FP16 workloads compared to FP32 by computing at 16-bit, with half the bit width (and therefore less precision) than FP32. Pascal allows more than just that: it is capable of FP16, FP32 and FP64 compute, and we have just learned the peak compute performance of Pascal in double precision workloads. With the Pascal GPU, NVIDIA will return to the HPC market with new Tesla products. Maxwell, although great in other regards, was deprived of the necessary FP64 hardware and focused only on FP32 performance. This meant that the chip stayed away from HPC markets while NVIDIA offered its older Kepler based cards as the only Tesla options. AMD, which is NVIDIA's only competitor in the HPC GPU department, took a similar approach: its Fiji GPU is an FP32 focused gaming part, while the Hawaii GPU serves the HPC space, offering double precision compute.
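The precision side of that trade-off can be illustrated with a minimal sketch. It uses only Python's standard `struct` module to store the same value in IEEE 754 half, single and double precision and read it back. The helper names `round_trip` and `relative_error` are made up for this illustration, and the sketch says nothing about how Pascal implements FP16 in hardware.

```python
import struct

# struct format codes: "<e" = IEEE 754 binary16 (FP16),
# "<f" = binary32 (FP32), "<d" = binary64 (FP64), all little-endian.
FORMATS = {"FP16": "<e", "FP32": "<f", "FP64": "<d"}


def round_trip(value: float, fmt: str) -> float:
    """Store value in the given format and read it back as a Python float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def relative_error(value: float, fmt: str) -> float:
    """Relative rounding error introduced by storing value in the given format."""
    return abs(round_trip(value, fmt) - value) / abs(value)


if __name__ == "__main__":
    x = 0.1
    for name, fmt in FORMATS.items():
        print(
            f"{name}: {struct.calcsize(fmt)} bytes, "
            f"stored as {round_trip(x, fmt)!r}, "
            f"relative error {relative_error(x, fmt):.2e}"
        )
```

Each step down in width halves the storage and memory traffic per value, at the cost of a larger rounding error, which is why FP16 only suits workloads that can tolerate it.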

Spending a lot of energy in the computation units on double precision arithmetic is great when you need it, but when you don't, a lot is left on the table, such as the unnecessary power envelope that goes underutilized, reducing the efficiency of the overall system. If you can survive with single precision or even half precision, you can gain significant improvements in energy efficiency, and that is why mixed precision matters most, as told by Stephen W. Keckler, Senior Director of Architecture at NVIDIA.

Pascal is designed to be NVIDIA's greatest HPC offering, incorporating the latest NVLINK standard and offering UVM (Unified Virtual Memory) addressing inside a heterogeneous node. The Pascal GPU will be the first to introduce NVLINK, the next generation Unified Virtual Memory link with Gen 2.0 cache coherency features and 5 to 12 times the bandwidth of a regular PCIe connection. This should solve many of the bandwidth issues that high performance GPUs currently face.

First technology we’ll announce today is an important invention called NVLink. It’s a chip-to-chip communication channel. The programming model is PCI Express but enables unified memory and moves 5-12 times faster than PCIe. “This is a big leap in solving this bottleneck,” Jen-Hsun says. NVIDIA
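To put the 5 to 12 times figure into absolute terms, here is a rough sketch. It assumes the baseline is the 16 GB/s PCI-E figure quoted at the end of this article. The keynote does not say which PCIe configuration it compares against, so the resulting range is an estimate, not an NVIDIA specification.

```python
# Assumption: baseline is the 16 GB/s PCI-E figure quoted in this article.
PCIE_GB_PER_S = 16.0


def nvlink_bandwidth_range(pcie_gb_per_s=PCIE_GB_PER_S, low=5, high=12):
    """Return the (min, max) bandwidth in GB/s implied by a low-to-high multiple of PCIe."""
    return (low * pcie_gb_per_s, high * pcie_gb_per_s)


if __name__ == "__main__":
    lo, hi = nvlink_bandwidth_range()
    print(f"Implied NVLink bandwidth: {lo:.0f} to {hi:.0f} GB/s")
```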

According to official NVIDIA slides, we are looking at a peak double precision compute performance of over 4 TFLOPs, along with 1 TB/s HBM2 memory that will amount to 32 GB of VRAM in HPC parts. NVIDIA's current flagship, the Tesla K80 accelerator, which features two GK210 GPUs, has a peak performance rated at 2.91 TFLOPs when running at boost clocks and just a little over 2 TFLOPs when running at standard clock speeds. The single-chip, GK180 based Tesla K40 has a double precision compute performance rated at 1.43 TFLOPs, and AMD's best single-chip FirePro card, the FirePro S9170 with 32 GB VRAM, has a peak double precision (FP64) performance rated at 2.62 TFLOPs.

Both the Kepler and Hawaii chips were built for compute, including double precision general matrix multiplication workloads. Their successors (Maxwell and Fiji) kept things pretty quiet on the FP64 end, but they did come with better FP32 performance. On the compute side, Pascal is going to take the next step with double precision performance rated at over 4 TFLOPs, roughly double what the last generation of FP64 enabled GPUs offers. As for single precision performance, we will see the Pascal GPUs breaking past the 10 TFLOPs barrier with ease.
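The generational jump can be checked directly from the figures quoted above. The sketch below treats Pascal's "over 4 TFLOPs" as exactly 4.0, so every ratio is a lower bound. The dictionary and function names are only illustrative.

```python
# Peak FP64 figures quoted in this article, in TFLOPs.
PASCAL_FP64_TFLOPS = 4.0  # "over 4 TFLOPs", so this is a lower bound

CURRENT_FP64_TFLOPS = {
    "Tesla K80 (two GK210 GPUs, boost clocks)": 2.91,
    "Tesla K40 (single chip)": 1.43,
    "FirePro S9170 (single chip)": 2.62,
}


def speedup_over(baseline_tflops, new_tflops=PASCAL_FP64_TFLOPS):
    """Ratio of the new part's peak FP64 rate to a baseline part's rate."""
    return new_tflops / baseline_tflops


if __name__ == "__main__":
    for name, tflops in CURRENT_FP64_TFLOPS.items():
        print(f"{name}: {tflops:.2f} TFLOPs, Pascal is at least {speedup_over(tflops):.2f}x")
```

On these numbers, the "double" claim is comfortable against the single-chip Tesla K40, while the gap to the dual-GPU K80 and the FirePro S9170 is smaller.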

NVIDIA also shared numbers for its Volta GPUs, which will be rated at 7 TFLOPs of FP64 compute performance. This will be an incremental step toward building the multi-PFLOPs systems that will be installed at Oak Ridge National Laboratory (the Summit supercomputer) and Lawrence Livermore National Laboratory (the Sierra supercomputer). Both computers are rated at over 100 PFLOPs (peak performance) and will integrate several thousand nodes with over 40 TFLOPs of performance per node. While talking about exascale computing, NVIDIA's Chief Scientist and SVP of Research, Bill Dally, gave a detailed explanation of why energy efficiency is the main focus in HPC:

NVIDIA Pascal Reference Chip With FPUs

So let me talk about the first gap, the energy efficiency gap. Now lots of people say, don't you need more efficient floating point units? That's completely wrong. It's not about the flops. If I wanted to build an exascale machine today, I could take the same process technology we are using to build our Pascal chip, 16nm foundry process, 10mm on a side, which is about a third the linear size and about a 9th the area of Pascal, so the Pascal chip is way bigger than this, believing this is a 1cm on a side chip, if I pack it with floating point units, which if I drew it to scale you wouldn't see it, that little red dot is a little bigger than scale, a double precision fused multiply add (DFMA) unit, and that's about 10 pJ/OP and can run at 2 GFLOPs. So if I fill this chip with floating point units and it consumed 200W, I get 20 TFLOPs on this one chip (100 mm² die). I put 50,000 of these inside racks, I have an exascale machine.

Of course, it's an exascale machine and it's completely worthless, because much like children or pets, floating point units are easy to get and hard to take care of. What's hard about floating point units is feeding them and taking care of what they produce, you know, the results. It's moving the data back and forth that's hard, not building the arithmetic unit. via NVIDIA@SC15 Conference
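Dally's back-of-the-envelope numbers can be reproduced in a few lines. The sketch below uses only the figures from the quote (10 pJ per operation, 2 GFLOPs per DFMA unit, 200W per chip, 50,000 chips). The number of units per chip and the total machine power are derived here and were not stated in the talk. Like the quote, it ignores all data movement, which is exactly his point.

```python
# Figures taken from Bill Dally's quote.
ENERGY_PER_OP_J = 10e-12   # about 10 pJ per double precision FMA operation
UNIT_RATE_FLOPS = 2e9      # each DFMA unit runs at 2 GFLOPs
CHIP_POWER_W = 200.0       # power budget of the hypothetical 100 mm^2 chip
CHIP_COUNT = 50_000        # chips "inside racks"


def chip_flops(power_w=CHIP_POWER_W, energy_per_op_j=ENERGY_PER_OP_J):
    """Peak rate if the whole power budget is spent on arithmetic: watts / (joules per op)."""
    return power_w / energy_per_op_j


def units_per_chip(power_w=CHIP_POWER_W):
    """Derived value: how many 2 GFLOPs DFMA units the power budget can feed."""
    return chip_flops(power_w) / UNIT_RATE_FLOPS


def machine_flops(chips=CHIP_COUNT):
    """Aggregate peak rate of the whole machine."""
    return chips * chip_flops()


def machine_power_w(chips=CHIP_COUNT):
    """Derived value: arithmetic-only power of the whole machine."""
    return chips * CHIP_POWER_W


if __name__ == "__main__":
    print(f"Per chip: {chip_flops() / 1e12:.0f} TFLOPs from {units_per_chip():.0f} DFMA units")
    print(f"Machine:  {machine_flops() / 1e18:.0f} EFLOPs at {machine_power_w() / 1e6:.0f} MW")
```

The arithmetic works out to 20 TFLOPs per chip and one exaflop across 50,000 chips, with nothing budgeted for memory, caches or interconnect.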

The talks detailed that an exascale system, expected to be deployed around 2023, will consist of several heterogeneous nodes. Each node will be made up of several throughput optimized cores (TOCs, aka GPUs) and latency optimized cores (LOCs, aka CPUs), with tight communication between them, the memory and the caches to enable good programming models. The GPUs will do the bulk of the heavy lifting while the CPUs will focus on sequential processing. The reason given is that CPUs have great vector core performance, but when those vector cores aren't utilized, the scalar mode turns out to be pretty useless in HPC uses. The entire system will include large DRAM banks connected inside a heterogeneous DRAM environment, which will help solve two crucial problems on current generation systems: first, exploiting all available bandwidth on the system/node, and second, maximizing locality for frequently accessed data.

CPUs waste a lot of their energy deciding what order to execute instructions in, which usually involves reordering and reorganizing instructions and renaming registers. Only a small fraction of the energy is actually used for the execution itself.

GPUs don't care about the latency of an individual instruction; they execute instructions through pipelines as quickly as possible. They don't have out-of-order execution or branch prediction and spend a lot more of the power budget on the actual execution. Some of today's systems have half of their energy go to actual execution, as opposed to the very small share seen in past generations. The next generation GPUs will be able to put even more of that energy into executing instructions.

While further explaining the next generation GPU architectures and their efficiency, Stephen pointed out that HBM is a great memory architecture which will be implemented across Pascal and Volta chips, but those chips top out at a bandwidth of 1.2 TB/s (Volta GPU). Moving forward, a memory power crisis looms. HBM2 at 1.2 TB/s sure is great, but it adds 60W to the power envelope of a standard GPU. The current implementation of HBM1 on Fiji chips adds around 25W to the chip. Moving onwards, chips with access to 2 TB/s or more of bandwidth will push memory power from bad to the breaking point. A chip with 2.5 TB/s of HBM (2nd generation) memory will reach 120W for the memory alone, and even a 1.5 times more efficient HBM2 architecture that delivers over 3 TB/s of bandwidth will need 160W to feed the memory alone.
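One way to read these figures is as energy per bit moved. The sketch below derives that from the power and bandwidth pairs quoted above, assuming 1 TB/s means 10^12 bytes per second. For the last data point the article only says "over 3 TB/s", so the value computed from 3.0 TB/s is an upper bound on the energy per bit. The function name is illustrative.

```python
# Assumption: 1 TB/s = 10**12 bytes per second.
BYTES_PER_TB = 1e12

# (label, bandwidth in TB/s, memory power in W), as quoted in this article.
HBM_POINTS = [
    ("HBM2 at 1.2 TB/s", 1.2, 60.0),
    ("HBM2 at 2.5 TB/s", 2.5, 120.0),
    ("more efficient HBM2 at over 3 TB/s", 3.0, 160.0),
]


def picojoules_per_bit(bandwidth_tb_s, power_w):
    """Energy per transferred bit: power divided by bits per second, in pJ."""
    bits_per_second = bandwidth_tb_s * BYTES_PER_TB * 8
    return power_w / bits_per_second * 1e12


if __name__ == "__main__":
    for label, bandwidth, power in HBM_POINTS:
        print(f"{label}: {power:.0f} W, about {picojoules_per_bit(bandwidth, power):.2f} pJ/bit")
```

The first two points land at roughly 6 pJ per bit, so memory power grows almost linearly with bandwidth unless the energy per bit itself comes down.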

Note that this is not the power of the whole chip but just of the memory. Chips like these would typically be considered inefficient for the consumer and HPC sectors, but NVIDIA is trying to change that and is exploring new means to solve the memory power crisis that lies ahead with HBM at higher bandwidths. In the near term, Pascal and Volta won't see a major consumption increase from HBM, but moving toward 2020, when NVIDIA's next generation architecture is expected to arrive, we will probably see a new memory architecture introduced to address the increased power needs.

We will have more of these technical talks on upcoming GPU architectures as their launch approaches in 2016. To finish this post, NVIDIA confirmed that Pascal will be available in 2016 (as originally confirmed) on a choice of CPU platforms including x86, ARM64 and Power (IBM). On the HPC front, NVIDIA will introduce NVLINK, while the consumer and server side will rely on PCI-E (16 GB/s) for communication between chips.

NVIDIA Pascal GPU Slides (GTC Taiwan 2015):

Next Generation FinFET Based GPUs Comparison (AMD/NVIDIA):

| Flagship GPU | Vega 10 | Navi 10 | NVIDIA GP100 | NVIDIA GV100 |
|---|---|---|---|---|
| GPU Process | 14nm FinFET | 7nm FinFET | TSMC 16nm FinFET | TSMC 12nm FinFET |
| GPU Transistors | 15-18 Billion | TBC | 15.3 Billion | 21.1 Billion |
| GPU Cores (Max) | 4096 SPs | TBC | 3840 CUDA Cores | 5376 CUDA Cores |
| Peak FP32 Compute | 13.0 TFLOPs | TBC | 12.0 TFLOPs | >15.0 TFLOPs (Full Die) |
| Peak FP16 Compute | 25.0 TFLOPs | TBC | 24.0 TFLOPs | 120 Tensor TFLOPs |
| Memory (Consumer Cards) | HBM2 | HBM3 | GDDR5X | GDDR6 |
| Memory (Dual-Chip Professional/HPC) | HBM2 | HBM3 | HBM2 | HBM2 |
| HBM2 Bandwidth | 484 GB/s (Frontier Edition) | >1 TB/s? | 732 GB/s (Peak) | 900 GB/s |
| Graphics Architecture | Next Compute Unit (Vega) | Next Compute Unit (Navi) | 5th Gen Pascal CUDA | 6th Gen Volta CUDA |
| Successor of (GPU) | Radeon RX 500 Series | Radeon RX 600 Series | GM200 (Maxwell) | GP100 (Pascal) |