Nvidia : Pascal Is 10x Faster Than Maxwell, Launching in 2016 On 16nm – Features 3D Memory, NV-Link and Mixed Precision
Pascal will feature 4X the mixed precision performance, 2X the performance per watt, 2.7X memory capacity & 3X the bandwidth of Maxwell. Nvidia’s CEO went on to state that all in all Pascal is Maxwell times ten. All of this has just been revealed here at GTC. There’s a lot to digest here, so let’s break it down.
Nvidia states that Pascal will be the company’s first high performance GPU to feature mixed precision floating point compute (FP16), which is essential for low power devices such as tablets and mobile phones. Mixed precision is also very beneficial from a power efficiency standpoint for the many compute applications that don’t strictly require higher precision FP32 or FP64 compute.
Nvidia : Pascal Is Maxwell Times 10 – Features Mixed Precision, 3D Memory and NV-Link Coming in 2016
Nvidia’s CEO went on to state that Pascal has 10x Maxwell’s performance, a conclusion he arrived at via what he calls “CEO math”. Obviously this was just a humorous way to impress the crowd at GTC 2015 and is based on what was described as “very rough estimates”.
The idea is that if we look at all the improvements coming up with Pascal compared to Maxwell, they will collectively add up to make it “roughly” 10 times more efficient at deep learning compute tasks. Pascal will feature 3x the memory bandwidth of Maxwell, 2x peak single precision compute performance and 2x the performance per watt.
Besides providing a very catchy claim that the press can use in their headlines for today’s announcement, these improvements should enable the architecture to theoretically be significantly faster than its predecessor, Maxwell, at deep-learning / artificial intelligence workloads.
Nvidia concedes that it’s unrealistic to expect anything like a 10X speed-up in the real world, except in select high performance computing and supercomputing scenarios, where getting rid of the massive communication overhead between the various processors and the Nvidia GPU accelerators may contribute greatly to reducing the total time and energy needed to complete the necessary work.
There are four hallmark technologies for the Pascal generation of GPUs: HBM, mixed precision compute, NV-Link and the smaller, more power efficient TSMC 16nm FinFET manufacturing process. Each is very important in its own right, so we’re going to break down every one of the four separately.
Pascal To Be Nvidia’s First Graphics Architecture To Feature High Bandwidth Memory HBM
Stacked memory will debut on the green side with Pascal, more precisely HBM Gen2, the second generation of the high bandwidth JEDEC memory standard co-developed by SK Hynix and AMD. The new memory will enable memory bandwidth to exceed 1 Terabyte/s, which is 3X the bandwidth of the Titan X. The new memory standard will also allow for a huge increase in memory capacity, 2.7X the memory capacity of Maxwell to be precise, which indicates that the new Pascal flagship will feature 32GB of video memory, a mind-bogglingly huge number.
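As a quick sanity check on these multipliers, the sketch below applies them to the Titan X figures listed in this article’s own comparison table (12 GB of VRAM and 336 GB/s). The variable names are ours and purely illustrative.

```python
# Titan X (Maxwell, GM200) figures as listed in this article's comparison table.
TITAN_X_VRAM_GB = 12
TITAN_X_BANDWIDTH_GB_S = 336

# Multipliers Nvidia quoted for Pascal relative to Maxwell.
CAPACITY_FACTOR = 2.7
BANDWIDTH_FACTOR = 3.0

pascal_vram_gb = TITAN_X_VRAM_GB * CAPACITY_FACTOR
pascal_bandwidth_gb_s = TITAN_X_BANDWIDTH_GB_S * BANDWIDTH_FACTOR

print(f"Implied Pascal VRAM: {pascal_vram_gb:.1f} GB")
print(f"Implied Pascal bandwidth: {pascal_bandwidth_gb_s:.0f} GB/s")
```

The results, roughly 32.4 GB and 1008 GB/s, line up with the 32GB and 1 Terabyte/s figures quoted above.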
We’ve already seen AMD take advantage of HBM memory technology with its Fiji XT GPU, which will feature 512GB/s of memory bandwidth, more than twice that of the GTX 980. AMD has also stated that it plans to use the second generation of this new memory technology in its Arctic Islands family of GPUs in 2016. So we’re likely to see both red and green rocking second generation stacked HBM next year.
HBM achieves this amazing improvement in memory bandwidth and capacity by employing a very wide through-silicon-via memory interface. Each HBM cube is connected to the GPU with a 1024bit wide memory bus. HBM modules actually operate at low frequencies compared to GDDR5 but thanks to the significantly wider memory interface they manage to be up to 9 times faster than standard GDDR5 memory modules.
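To see how a wide, slow interface adds up, here is a minimal sketch of the peak bandwidth arithmetic. The 1024-bit bus per stack comes from the paragraph above. The four-stack configuration and the per-pin data rates (about 1 Gbit/s for first generation HBM and about 2 Gbit/s for HBM2) are our assumptions and are not stated in this article. The function name is hypothetical.

```python
def hbm_bandwidth_gb_s(stacks: int, pin_rate_gbit_s: float,
                       bus_bits_per_stack: int = 1024) -> float:
    """Peak bandwidth in GB/s: stacks * bus width (bits) * per-pin rate (Gbit/s) / 8."""
    return stacks * bus_bits_per_stack * pin_rate_gbit_s / 8


# Assumed per-pin rates: ~1 Gbit/s (first generation HBM), ~2 Gbit/s (HBM2).
hbm1_total = hbm_bandwidth_gb_s(stacks=4, pin_rate_gbit_s=1.0)
hbm2_total = hbm_bandwidth_gb_s(stacks=4, pin_rate_gbit_s=2.0)

print(f"4 stacks of first generation HBM: {hbm1_total:.0f} GB/s")
print(f"4 stacks of HBM2: {hbm2_total:.0f} GB/s")
```

Under these assumptions the totals come out at 512 GB/s and 1024 GB/s, consistent with the Fiji and Pascal bandwidth figures quoted in this article.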
We’ve already covered this revolutionary new memory technology exclusively and in-depth last year. HBM will quickly replace GDDR5 as the standard memory technology for high performance graphics solutions. It’s fair to say that HBM is the future.
Pascal Is Nvidia’s First Graphics Architecture To Deliver Half Precision Compute FP16 At Double The Rate Of Full Precision FP32
One of the more significant features revealed for Pascal was the addition of FP16 compute support, otherwise known as mixed precision compute or half precision compute. In this mode the accuracy of the result of any computation is significantly lower than with the standard FP32 method, which has been required by all major graphics programming interfaces in games for more than a decade. This includes DirectX 12, 11, 10 and DX9 Shader Model 3.0, which debuted almost a decade ago. This makes mixed precision mode unusable for any modern gaming application.
However, due to its very attractive power efficiency advantages over FP32 and FP64, it can be used in scenarios where a high degree of computational precision isn’t necessary, which makes mixed precision computing especially useful on power limited mobile devices. Nvidia’s Maxwell GPU architecture, featured in the GTX 900 series of GPUs, is limited to FP32 operations, which in turn means that FP16 and FP32 operations are processed at the same rate by the GPU. Adding the mixed precision capability in Pascal means that the architecture will now be able to process FP16 operations twice as quickly as FP32 operations. As mentioned above, this can be of great benefit in power limited, light compute scenarios.
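To give a feel for what the lower precision means in practice, the sketch below rounds a few values to half precision (FP16) and single precision (FP32) on the CPU. It is not GPU code and says nothing about Pascal’s throughput. It only uses the half precision `e` format of Python’s standard `struct` module to show the rounding error. The helper names are ours.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_fp32(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


for x in (0.1, 1.0, 2049.0):
    print(f"value={x!r}  fp16={to_fp16(x)!r}  fp32={to_fp32(x)!r}")
```

FP16 carries only 11 significant bits, so an integer like 2049 can no longer be stored exactly, while FP32 handles it without trouble. That is acceptable for workloads that tolerate small errors and a poor fit for those that don’t.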
Nvidia’s Proprietary High-Speed Platform Atomics Interconnect For Servers And Supercomputers – NV-Link
Pascal will also be the first Nvidia GPU to feature the company’s new NV-Link technology, which Nvidia states is 5 to 12 times faster than PCIe 3.0.
NVLink is an energy-efficient, high-bandwidth communications channel that uses up to three times less energy to move data on the node at speeds 5-12 times conventional PCIe Gen3 x16. First available in the NVIDIA Pascal GPU architecture, NVLink enables fast communication between the CPU and the GPU, or between multiple GPUs.
Figure 3: NVLink is a key building block in the compute node of Summit and Sierra supercomputers.
Slide: Volta GPU featuring NVLink and stacked memory. NVLink GPU high speed interconnect: 80-200 GB/s. 3D stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit.
NVLink is a key technology in Summit’s and Sierra’s server node architecture, enabling IBM POWER CPUs and NVIDIA GPUs to access each other’s memory fast and seamlessly. From a programmer’s perspective, NVLink erases the visible distinctions of data separately attached to the CPU and the GPU by “merging” the memory systems of the CPU and the GPU with a high-speed interconnect. Because both CPU and GPU have their own memory controllers, the underlying memory systems can be optimized differently (the GPU’s for bandwidth, the CPU’s for latency) while still presenting as a unified memory system to both processors. NVLink offers two distinct benefits for HPC customers. First, it delivers improved application performance, simply by virtue of greatly increased bandwidth between elements of the node. Second, NVLink with Unified Memory technology allows developers to write code much more seamlessly and still achieve high performance. via NVIDIA News
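For a sense of scale, the sketch below works out what “5 to 12 times PCIe Gen3 x16” means in GB/s. The PCIe 3.0 parameters used here (8 GT/s per lane with 128b/130b encoding) are standard figures that we are supplying ourselves, since they are not stated in this article, and the result is a per-direction peak rather than a measured number.

```python
# PCIe 3.0 parameters (our assumption, not from the article).
PCIE3_TRANSFERS_PER_S = 8e9            # 8 GT/s per lane
PCIE3_ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding
LANES = 16

# Peak payload bandwidth of a x16 link in one direction, in GB/s.
pcie3_x16_gb_s = PCIE3_TRANSFERS_PER_S * PCIE3_ENCODING_EFFICIENCY * LANES / 8 / 1e9

nvlink_low_gb_s = 5 * pcie3_x16_gb_s
nvlink_high_gb_s = 12 * pcie3_x16_gb_s

print(f"PCIe 3.0 x16: {pcie3_x16_gb_s:.2f} GB/s per direction")
print(f"5x to 12x of that: {nvlink_low_gb_s:.0f} to {nvlink_high_gb_s:.0f} GB/s")
```

The resulting range of roughly 79 to 189 GB/s is in the same ballpark as the 80-200 GB/s interconnect figure that appears on Nvidia’s Volta slide.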
#4 16nm manufacturing process : Pascal will be the first Nvidia GPU to be built on TSMC’s 16nm FinFET manufacturing process. The new process promises to be significantly more power efficient and significantly denser than 28nm, which would enable Nvidia to build considerably more complex and powerful GPUs while also improving power efficiency.
TSMC’s 16FF+ (FinFET Plus) technology can provide above 65 percent higher speed, around 2 times the density, or 70 percent less power than its 28HPM technology. Comparing with 20SoC technology, 16FF+ provides extra 40% higher speed and 60% power saving. By leveraging the experience of 20SoC technology, TSMC 16FF+ shares the same metal backend process in order to quickly improve yield and demonstrate process maturity for time-to-market value.
Pascal is still scheduled for a 2016 release with Volta coming along sometime after that.
[2016 UPDATE] Nvidia’s Pascal : Everything We Know Right Now
We found out in 2015 that Nvidia’s flagship Pascal GPU, code named GP100, may have taped out on TSMC’s 16nm FinFET manufacturing process in June. Funnily enough, very soon afterwards AMD announced that it had taped out two FinFET chips as well. It’s no coincidence that Nvidia and AMD taped out their FinFET designs in the same time period. They’re trying to meet a very aggressive time to market schedule with Pascal and Polaris, and are zeroing in on a Q3-Q4 2016 introduction of their respective 16nm and 14nm FinFET GPUs.
What we know so far about Nvidia’s flagship Pascal GP100 GPU :
- Pascal graphics architecture.
- 2x performance per watt estimated improvement over Maxwell.
- To launch in 2016, purportedly the second half of the year.
- DirectX 12 feature level 12_1 or higher.
- Successor to the GM200 GPU found in the GTX Titan X and GTX 980 Ti.
- Built on the 16nm FinFET manufacturing process from TSMC.
- Allegedly has a total of 17 billion transistors, more than twice that of GM200.
- Will feature four 4-Hi HBM2 stacks for a total of 16GB of VRAM, with 8-Hi stacks for up to 32GB on the professional compute SKUs.
- Features a 4096-bit memory bus interface, same as AMD’s Fiji GPU powering the Fury series.
- Features NVLink (only compatible with next generation IBM POWER server processors)
- Supports half precision FP16 compute at twice the rate of full precision FP32.
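The memory figures in the list above follow from simple stack arithmetic, sketched below. The four stacks, the 4-Hi and 8-Hi heights and the 4096-bit bus come from the list. The 8 Gbit (1 GB) capacity per DRAM die and the 1024-bit interface per stack are our assumptions about HBM2, and the names are illustrative.

```python
STACKS = 4
BUS_BITS_PER_STACK = 1024  # assumption: one 1024-bit interface per HBM stack
GBIT_PER_DIE = 8           # assumption: 8 Gbit (1 GB) per HBM2 DRAM die


def vram_gb(dies_per_stack: int) -> int:
    """Total capacity in GB for a given stack height (4 for 4-Hi, 8 for 8-Hi)."""
    return STACKS * dies_per_stack * GBIT_PER_DIE // 8


total_bus_bits = STACKS * BUS_BITS_PER_STACK

print(f"4-Hi stacks: {vram_gb(4)} GB")
print(f"8-Hi stacks: {vram_gb(8)} GB")
print(f"Total bus width: {total_bus_bits}-bit")
```

Under these assumptions the 16GB and 32GB configurations and the 4096-bit bus all fall out of the same four-stack layout.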
|GPU Architecture|NVIDIA Fermi|NVIDIA Kepler|NVIDIA Maxwell|NVIDIA Pascal|
|---|---|---|---|---|
|GPU Process|40nm|28nm|28nm|16nm (TSMC FinFET)|
|GPU Design|SM (Streaming Multiprocessor)|SMX (Streaming Multiprocessor)|SMM (Streaming Multiprocessor Maxwell)|SMP (Streaming Multiprocessor Pascal)|
|Maximum Transistors|3.00 Billion|7.08 Billion|8.00 Billion|15.3 Billion|
|Maximum Die Size|520 mm²|561 mm²|601 mm²|610 mm²|
|Stream Processors Per Compute Unit|32 SPs|192 SPs|128 SPs|64 SPs|
|Maximum CUDA Cores|512 CCs (16 CUs)|2880 CCs (15 CUs)|3072 CCs (24 CUs)|3840 CCs (60 CUs)|
|FP32 Compute|1.33 TFLOPs (Tesla)|5.10 TFLOPs (Tesla)|6.10 TFLOPs (Tesla)|~12 TFLOPs (Tesla)|
|FP64 Compute|0.66 TFLOPs (Tesla)|1.43 TFLOPs (Tesla)|0.20 TFLOPs (Tesla)|5.5 TFLOPs (Tesla)|
|Maximum VRAM|1.5 GB GDDR5|6 GB GDDR5|12 GB GDDR5|16 / 32 GB HBM2|
|Maximum Bandwidth|192 GB/s|336 GB/s|336 GB/s|1 TB/s|
|Launch Year|2010 (GTX 580)|2014 (GTX Titan Black)|2015 (GTX Titan X)|2016|
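One way to read the table is in terms of transistor density, which ties back to TSMC’s claim of around 2 times the density versus 28nm. The sketch below divides the transistor counts by the die sizes exactly as listed in the table. It is a rough comparison only, since both are “maximum” figures per architecture.

```python
# (transistors in billions, die size in mm^2), taken from the table above.
GPUS = {
    "Fermi": (3.00, 520),
    "Kepler": (7.08, 561),
    "Maxwell": (8.00, 601),
    "Pascal": (15.3, 610),
}

# Density in millions of transistors per mm^2.
density = {
    name: transistors_bn * 1000 / die_mm2
    for name, (transistors_bn, die_mm2) in GPUS.items()
}

for name, value in density.items():
    print(f"{name}: {value:.1f} M transistors/mm^2")

pascal_vs_maxwell = density["Pascal"] / density["Maxwell"]
print(f"Pascal vs Maxwell density: {pascal_vs_maxwell:.2f}x")
```

With these numbers Pascal packs a little under twice as many transistors per square millimetre as Maxwell, broadly in line with the quoted density improvement of the 16nm process.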
NVIDIA Volta GPUs, successors to Pascal, will arrive with IBM Power9 CPU enabled supercomputers in 2017
NV-Link targets GPU accelerated servers, where cross-chip communication is extremely bandwidth limited and a major system bottleneck. Nvidia states that NV-Link will be 5 to 12 times faster than traditional PCIe 3.0, making it a major step forward in platform atomics. Earlier this year Nvidia announced that IBM will be integrating this new interconnect into its upcoming POWER server CPUs. NVLink will debut with Nvidia’s Pascal in 2016 before it makes its way to Volta in 2018.
Pascal brings many new improvements to the table both in terms of hardware and software. However, the focus is crystal clear and is 100% about pushing power efficiency and compute performance higher than ever before. The plethora of new updates to the architecture and the ecosystem underline this focus.
Pascal will be the company’s first graphics architecture to use next generation stacked memory technology, HBM. It will also be the first ever to feature a brand new from the ground-up high-speed proprietary interconnect, NV-Link. Mixed precision support is also going to play a major role in introducing a step function improvement in perf/watt in mobile applications.
|GPU Family|Vega|NVIDIA Pascal|
|---|---|---|
|Flagship GPU|Vega 10|GP102|
|GPU Process|14nm FinFET|16nm FinFET|
|GPU Transistors|Up to 18 Billion|12 Billion|
|Memory|Up to 32 GB HBM2|12 GB GDDR5X|
|Bandwidth|1 TB/s|480 GB/s|
|Graphics Architecture|Polaris (GCN 4.0)|Pascal|
|Predecessor|Fiji (Fury Series)|GM200 (900 Series)|