Nvidia Pascal Architecture Detailed – DX12 Async Compute & Scheduling Improved, CUDA Core Clusters Entirely Redesigned
Today at the 2016 GPU Technology Conference Nvidia announced the Tesla P100, the company’s most ambitious graphics card to date. The P100 features the most powerful and most complex GPU Nvidia has ever built, code named GP100. This flagship Pascal GPU is an engineering marvel, and in this piece we’ll provide an overview of the Pascal architecture and all the details that Nvidia has revealed about this spectacular graphics chip.
This overview is derived from an excellent talk given by Nvidia’s Senior Architect, Lars Nyland, and Chief Technologist for GPU Computing Software, Mark Harris.
So let’s get straight to it!
The Five “Miracles” Of Nvidia’s GP100 GPU & Tesla P100 Accelerator
At his keynote earlier today, Jen-Hsun Huang, Nvidia’s Co-Founder & CEO, jokingly said that Nvidia never relies on more than one technical miracle with a given architecture. Despite that, with the GP100 GPU the company succeeded in creating its most ambitious graphics chip to date by relying on not one but five technological “miracles”.
Jen-Hsun summarized these miracles on a single slide. They are:
– Next generation Pascal graphics architecture.
– TSMC’s 16nm FinFET manufacturing process technology.
– Next generation, vertically stacked High Bandwidth Memory (HBM2).
– The company’s brand new revolution in platform atomics, the high speed NVLink GPU interconnect.
– And finally, the workload that GP100 was designed for and excels at, AI.
Nvidia’s Pascal Architecture & The GP100 GPU, Opening The Taps
It has been a long tradition at Nvidia to introduce major performance and power efficiency advancements with each of its next generation graphics architectures, and Pascal is no exception. The basic building block of every Pascal GPU is called the SM, short for streaming multiprocessor. Maxwell before Pascal had the SMM, Streaming Maxwell Multiprocessor, as its building block and Kepler before both had the SMX. The streaming multiprocessor is the engine that “creates, manages, schedules and executes instructions from many threads in parallel.”
The full GP100 GPU comprises 3840 CUDA cores, 240 texture units and a 4096-bit memory interface arranged in eight 512-bit segments. The 3840 CUDA cores are spread across six Graphics Processing Clusters, or GPCs for short, each of which has 10 Pascal streaming multiprocessors. The Tesla P100 ships with 56 of those 60 SMs enabled, which is why the spec table further down lists 3584 CUDA cores.
Nvidia Pascal GP100 GPU Block Diagram
Each Pascal streaming multiprocessor includes 64 FP32 CUDA cores, half that of Maxwell. Each Pascal SM is split into two partitions of 32 CUDA cores, and each partition has its own warp scheduler, two dispatch units and a fairly large instruction buffer, matching that of Maxwell.
The GP100 GPU is enormous, coming in at roughly 610mm² and 15 billion transistors, nearly double the transistor count of the GM200 GPU powering Nvidia’s GTX Titan X and GTX 980 Ti graphics cards. GP100 also has significantly more streaming multiprocessors, or CUDA core blocks, than GM200, because each Pascal SM comprises only 64 CUDA cores as opposed to 128 in Maxwell.
Additionally, each Pascal SM has the same number of registers as Maxwell’s 128 CUDA core SMM. This translates to each Pascal CUDA core having access to twice the registers. This in turn means that not only does GP100 have more threads than Nvidia’s prior large GPUs, but each thread also has access to more registers and thus a lot more throughput.
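The layout arithmetic described above can be checked with a few lines of Python. This is only a sketch using the figures quoted in this article (six GPCs, ten SMs per GPC, 64 or 128 cores per SM, and the 65536 32-bit registers per SM listed in the comparison table later on). The function and variable names are made up for illustration.

```python
# Figures below are the ones quoted in the article for the full GP100 die.
GPCS = 6                      # Graphics Processing Clusters
SMS_PER_GPC = 10              # Pascal SMs per GPC
REGISTERS_PER_SM = 65536      # 32-bit registers per SM (same on Maxwell and Pascal)
CORES_PER_SM = {"Maxwell SMM": 128, "Pascal SM": 64}


def total_cuda_cores(gpcs: int, sms_per_gpc: int, cores_per_sm: int) -> int:
    """Total FP32 CUDA cores on a die built from identical SMs."""
    return gpcs * sms_per_gpc * cores_per_sm


def registers_per_core(registers_per_sm: int, cores_per_sm: int) -> int:
    """32-bit registers available per CUDA core within one SM."""
    return registers_per_sm // cores_per_sm


sm_count = GPCS * SMS_PER_GPC
gp100_cores = total_cuda_cores(GPCS, SMS_PER_GPC, CORES_PER_SM["Pascal SM"])
regs = {name: registers_per_core(REGISTERS_PER_SM, n) for name, n in CORES_PER_SM.items()}

print(f"SMs: {sm_count}, CUDA cores: {gp100_cores}")
for name, value in regs.items():
    print(f"{name}: {value} registers per core")
```

Halving the cores per SM while keeping the register file the same size is what doubles the registers available to each core.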
As always the goal was to deliver higher performance and improved power efficiency. As such Pascal builds on the changes that were implemented into Maxwell after Kepler.
The Pascal Streaming Multiprocessor
The combined 14MB of register files and 4MB of overall shared memory across the GP100 GPU result in a twofold increase in overall bandwidth inside the chip compared to GM200.
Chief Technologist, GPU Computing Software Mark Harris
A higher ratio of shared memory, registers, and warps per SM in GP100 allows the SM to more efficiently execute code. There are more warps for the instruction scheduler to choose from, more loads to initiate, and more per-thread bandwidth to shared memory (per thread).
According to Nvidia the end result is that each Pascal SM actually requires less power and area to manage data transfers, even compared to a Kepler SMX, which improves both performance and power efficiency. Pascal also includes an updated scheduler that not only improves SM utilization (editorial note: better async compute performance, anyone?) but is also more intelligent and power efficient. Finally, each warp scheduler can dispatch two instructions per clock.
Nvidia’s Senior Architect, Lars Nyland, admits that the 16nm FinFET process played an important role in realizing the team’s power efficiency goals, but maintains that numerous architectural improvements aided in further reducing the energy footprint of the architecture.
The table below is a high-level comparison of the Tesla P100’s specifications with those of previous generation Tesla accelerators.
| Tesla Products | Tesla K40 | Tesla M40 | Tesla P100 |
| --- | --- | --- | --- |
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) |
| FP32 CUDA Cores / SM | 192 | 128 | 64 |
| FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 |
| Base Clock | 745 MHz | 948 MHz | 1328 MHz |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz |
| Compute Performance - FP32 | 5.04 TFLOPS | 6.82 TFLOPS | 10.6 TFLOPS |
| Compute Performance - FP64 | 1.68 TFLOPS | 0.21 TFLOPS | 5.3 TFLOPS |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB |
| Register File Size / SM | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB |
| TDP | 235 Watts | 250 Watts | 300 Watts |
| Transistors | 7.1 billion | 8 billion | 15.3 billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² |
A quick look at the table above shows one of the wonderful advantages of FinFET besides the area and power improvements: much faster transistor switching speeds. This has clearly translated to significantly higher clock speeds for Nvidia with the Pascal GP100 GPU compared to its 28nm predecessors. The Tesla P100 features a boost frequency of 1480 MHz, very nearly touching 1.5 GHz.
That’s a whopping 33% gain in boost clock speed over the Maxwell-based Tesla M40. Considering how easily the GeForce GTX 900 series graphics cards can be overclocked to 1.5 GHz and beyond, I have very little doubt that we’ll see enthusiasts pushing their GeForce Pascal graphics cards to 2 GHz and beyond with little effort.
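For readers who want to see where the headline FP32 numbers come from, here is a rough sketch of the arithmetic. It assumes the usual convention that each CUDA core retires one fused multiply-add per clock, counted as two floating point operations. That convention is our assumption, not something taken from the slides, and the helper names are illustrative only.

```python
# Core counts and boost clocks are taken from the table above.
TESLA_K40 = {"fp32_cores": 2880, "boost_mhz": 875}
TESLA_M40 = {"fp32_cores": 3072, "boost_mhz": 1114}
TESLA_P100 = {"fp32_cores": 3584, "boost_mhz": 1480}


def peak_tflops(cores: int, boost_mhz: float, flops_per_core_per_clock: int = 2) -> float:
    """Peak throughput in TFLOPS, assuming one FMA (2 FLOPs) per core per clock."""
    return cores * flops_per_core_per_clock * boost_mhz / 1e6


def clock_gain_percent(new_mhz: float, old_mhz: float) -> float:
    """Relative clock speed increase, in percent."""
    return (new_mhz / old_mhz - 1.0) * 100.0


p100_fp32 = peak_tflops(TESLA_P100["fp32_cores"], TESLA_P100["boost_mhz"])
k40_fp32 = peak_tflops(TESLA_K40["fp32_cores"], TESLA_K40["boost_mhz"])
gain = clock_gain_percent(TESLA_P100["boost_mhz"], TESLA_M40["boost_mhz"])

print(f"Tesla P100 peak FP32: {p100_fp32:.2f} TFLOPS")
print(f"Tesla K40 peak FP32:  {k40_fp32:.2f} TFLOPS")
print(f"P100 boost clock gain over M40: {gain:.1f}%")
```

Under that assumption the P100 and K40 results land on the table's 10.6 and 5.04 TFLOPS figures, and the boost clock gain over the M40 rounds to the 33% quoted above.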
Serious Compute Is Back! GP100 Features A 1:2 Ratio Of FP64 to FP32
GP100 delivers double precision compute at half the rate of single precision, a ratio Nvidia has not offered since its Fermi-based Tesla parts. The Kepler based GK110 featured an FP32 to FP64 ratio of 3:1, and Maxwell was almost completely stripped of double precision with a ratio of 32:1. That is, for every block of 32 FP32 CUDA cores there was only 1 FP64 CUDA core. Pascal brings Nvidia back to the HPC (High Performance Computing) space, where double precision rules the roost.
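The ratios above fall straight out of the per-SM unit counts in the spec table. The short sketch below derives them with Python's `fractions` module and then scales the table's FP32 figures to estimate FP64 throughput. The dictionary layout and function names are ours, purely for illustration.

```python
from fractions import Fraction

# (FP64 units per SM, FP32 units per SM, FP32 TFLOPS) from the spec table.
SPECS = {
    "Tesla K40 (GK110)": (64, 192, 5.04),
    "Tesla M40 (GM200)": (4, 128, 6.82),
    "Tesla P100 (GP100)": (32, 64, 10.6),
}


def fp64_ratio(fp64_units: int, fp32_units: int) -> Fraction:
    """FP64 rate as a fraction of the FP32 rate, reduced to lowest terms."""
    return Fraction(fp64_units, fp32_units)


def estimated_fp64_tflops(fp32_tflops: float, ratio: Fraction) -> float:
    """Scale the FP32 peak by the FP64:FP32 unit ratio."""
    return fp32_tflops * float(ratio)


ratios = {name: fp64_ratio(f64, f32) for name, (f64, f32, _) in SPECS.items()}
fp64 = {name: estimated_fp64_tflops(SPECS[name][2], r) for name, r in ratios.items()}

for name in SPECS:
    r = ratios[name]
    print(f"{name}: FP64:FP32 = {r.numerator}:{r.denominator}, ~{fp64[name]:.2f} TFLOPS FP64")
```

The estimates line up with the FP64 row of the table, which suggests the published double precision figures are simply the single precision peak scaled by the unit ratio.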
This is an area where AMD’s Hawaii GPU was simply uncontested since it launched in late 2013, being the only GPU from either company on the market to sport a 2:1 ratio of FP32 to FP64.
Interestingly, the changes that Nvidia has been implementing in its streaming multiprocessors over the past several years, starting with the 192 CUDA core Kepler SMX in 2012, then the 128 CUDA core Maxwell SMM and finally Pascal, have been morphing the company’s graphics architecture into something much closer to AMD’s GCN, whose basic building block, the Compute Unit, has 64 GCN cores.
The similarities don’t end there either. With Pascal, Nvidia is renewing its focus on double precision compute, an area that GCN has traditionally excelled at. The updates that Nvidia has made to Pascal’s scheduler are also a clear indicator that it’s moving its architecture in the same direction that AMD has taken with GCN, which supports advanced hardware scheduling implementations and unique asynchronous compute engines.
Nvidia has also confirmed that Pascal is compliant with IEEE 754-2008 single and double precision arithmetic, and supports FMA (Fused Multiply Add) operations as well as denormalized values at full speed.
FP16 At Double The Rate of FP32 Is A BIG Deal For Deep Learning
If you watched Jen-Hsun’s GTC keynote earlier today you will know that it was all about deep learning. This field is already reshaping the future and what we as humans perceive of our own intelligence and the limits of AI. I’m not going to deep dive on deep learning, no pun intended, in this particular overview. If you’re interested in learning more about this subject Nvidia’s Tim Dettmers has written a great piece about it titled “Deep Learning in a Nutshell: Core Concepts” that you should check out.
Deep learning workloads represent a perfect scenario where mixed precision can be leveraged to pretty much double the performance. These workloads inherently require less precision, and using FP16 instructions would result in very significant reductions in memory usage that will allow deep learning to occur in considerably larger networks, essentially allowing machines to learn much more effectively.
Because each Pascal CUDA core can run two FP16 operations at once and each 32-bit register can store two FP16 values at once, the GP100 GPU can effectively do FP16 compute work at twice the speed of FP32, and this is where that doubling in performance comes from.
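To make the "two FP16 values per 32-bit register" idea concrete, here is a small sketch that packs two half precision numbers into a single 32-bit word using Python's standard `struct` module. It only illustrates the storage layout, not how the GPU executes paired FP16 math. The function names and the little-endian layout are our own choices.

```python
import struct


def pack_half2(lo: float, hi: float) -> int:
    """Pack two FP16 values into one 32-bit word (lo in the low 16 bits)."""
    raw = struct.pack("<2e", lo, hi)       # two IEEE 754 half floats, 4 bytes total
    return struct.unpack("<I", raw)[0]     # reinterpret the same bytes as a uint32


def unpack_half2(word: int) -> tuple:
    """Recover the two FP16 values stored in a 32-bit word."""
    raw = struct.pack("<I", word)
    return struct.unpack("<2e", raw)


word = pack_half2(1.5, -2.25)
values = unpack_half2(word)

FP32_TFLOPS = 10.6                 # Tesla P100 figure from the spec table
fp16_tflops = 2 * FP32_TFLOPS      # two FP16 operations per core per clock

print(f"packed register: 0x{word:08X}")
print(f"unpacked values: {values}")
print(f"implied peak FP16 throughput: {fp16_tflops:.1f} TFLOPS")
```

Because both values travel through one register and one instruction, the same register file and memory bandwidth carry twice as many operands, which is where the doubling comes from.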
Nvidia Bringing Improved Memory Coherency With Pascal
Memory coherency is an essential attribute of modern accelerators. It allows data to flow freely and to be shared without unnecessary copies or wasteful, energy burning protocols. AMD, Nvidia’s principal rival, has pushed the development of memory coherency in its IP much more aggressively, even compared to Intel, the much larger rival of both companies, primarily because AMD has built its future around heterogeneous computing. Memory coherency was essential to creating a truly heterogeneous APU, or accelerated processing unit (a processor that includes both a CPU and a GPU).
In fact some of the very early work of the HSA (Heterogeneous System Architecture) Foundation, which AMD founded early in the decade, was aimed at realizing the goal of truly coherent shared memory. As such AMD’s GCN (Graphics Core Next) architecture was designed with this in mind. Intel, for very similar reasons, quickly caught up and successfully introduced memory coherency to its chips. Nvidia’s introduction of Maxwell marked its entrance to the party, and naturally it’s been improved upon even further with Pascal.
Pascal now supports native atomic FP64 add instructions in global memory, something that Maxwell could only achieve with compare-and-swap loops. Enabling this functionality via a native instruction inherently improves performance. The addition is a logical one: Pascal’s double precision compute capability far outstrips Maxwell’s, so extending this support to FP64 instructions makes perfect sense.
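For anyone unfamiliar with the compare-and-swap loop mentioned above, the following is a CPU-side Python emulation of the pattern. It is not GPU code. The `Word64` class is a hypothetical stand-in for a 64-bit location in global memory, and a lock plays the role of the hardware's atomic compare-and-swap. The point is only to show why the software path needs a retry loop that a native instruction avoids.

```python
import struct
import threading


def double_to_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_double(b: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", b))[0]


class Word64:
    """Hypothetical 64-bit memory cell offering only an atomic compare-and-swap."""

    def __init__(self, value: float = 0.0):
        self._bits = double_to_bits(value)
        self._lock = threading.Lock()      # stands in for hardware atomicity

    def load(self) -> int:
        return self._bits

    def compare_and_swap(self, expected: int, new: int) -> int:
        """Store `new` only if the cell still holds `expected`; return the old bits."""
        with self._lock:
            old = self._bits
            if old == expected:
                self._bits = new
            return old


def atomic_add_double(cell: Word64, value: float) -> float:
    """FP64 add built from a CAS loop: retry until no other thread interfered."""
    old = cell.load()
    while True:
        assumed = old
        new = double_to_bits(bits_to_double(assumed) + value)
        old = cell.compare_and_swap(assumed, new)
        if old == assumed:                 # compare bit patterns, not floats (NaN-safe)
            return bits_to_double(old)


cell = Word64(0.0)
workers = [
    threading.Thread(target=lambda: [atomic_add_double(cell, 1.0) for _ in range(1000)])
    for _ in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

total = bits_to_double(cell.load())
print(f"total after 4 x 1000 adds: {total}")
```

Every failed compare-and-swap costs another round trip to memory, so a single native instruction that never retries is inherently cheaper under contention.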
| GPU | Kepler GK110 | Maxwell GM200 | Pascal GP100 | Volta GV100 |
| --- | --- | --- | --- | --- |
| Threads / Warp | 32 | 32 | 32 | 32 |
| Max Warps / Multiprocessor | 64 | 64 | 64 | 64 |
| Max Threads / Multiprocessor | 2048 | 2048 | 2048 | 2048 |
| Max Thread Blocks / Multiprocessor | 16 | 32 | 32 | 32 |
| Max 32-bit Registers / SM | 65536 | 65536 | 65536 | 65536 |
| Max Registers / Block | 65536 | 32768 | 65536 | 65536 |
| Max Registers / Thread | 255 | 255 | 255 | 255 |
| Max Thread Block Size | 1024 | 1024 | 1024 | 1024 |
| CUDA Cores / SM | 192 | 128 | 64 | 64 |
| Shared Memory Size / SM Configurations (bytes) | 16K/32K/48K | 96K | 64K | 96K |
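Several rows in the table above are tied together by simple arithmetic, which the sketch below spells out. The `resident_warps` helper is a deliberately simplified model of our own: it considers only the register file and ignores allocation granularity, shared memory and thread block limits, so treat its output as an upper bound rather than a real occupancy figure.

```python
# Per-SM limits for Pascal GP100, taken from the table above.
THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 64
REGISTERS_PER_SM = 65536
BYTES_PER_REGISTER = 4        # each register is 32 bits wide


def max_threads_per_sm() -> int:
    return MAX_WARPS_PER_SM * THREADS_PER_WARP


def register_file_kb() -> int:
    return REGISTERS_PER_SM * BYTES_PER_REGISTER // 1024


def resident_warps(registers_per_thread: int) -> int:
    """Upper bound on warps an SM can hold if registers are the only limit."""
    registers_per_warp = registers_per_thread * THREADS_PER_WARP
    return min(MAX_WARPS_PER_SM, REGISTERS_PER_SM // registers_per_warp)


full_occupancy_budget = REGISTERS_PER_SM // max_threads_per_sm()

print(f"max threads per SM: {max_threads_per_sm()}")
print(f"register file per SM: {register_file_kb()} KB")
print(f"registers per thread at full occupancy: {full_occupancy_budget}")
for regs in (32, 64, 255):
    print(f"{regs} registers/thread -> at most {resident_warps(regs)} warps resident")
```

The 256 KB result matches the "Register File Size / SM" row in the earlier spec table, and the last loop shows how quickly a register-hungry kernel eats into the number of warps the scheduler has to choose from.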
Next Generation Memory Technology- HBM2 Is Key
Both GPU makers have talked at great length about how detrimental the slow progression of memory standards has been to the steady growth of the performance of modern parallel processors. GPU design had reached a point where any additional compute performance was offset by the energy spent on the memory ecosystem necessary to feed the GPU with the bandwidth it needs to deliver its intended performance.
HotChips 2012 – Die Stacking & The System by Bryan Black, AMD’s head of the die stacking program
The issue of memory bandwidth and HBM’s role in tackling this challenge actually came to the forefront four years ago, when AMD’s Bryan Black publicly spoke about it for the first time. Fast forward to today and we have HBM products on the market from AMD, with second generation HBM to be deployed in the not too distant future on GPUs from both vendors. So HBM2 should see much more widespread use in the industry as adoption and volume pick up.
Second generation stacked High Bandwidth Memory plays an instrumental role in allowing the GP100 GPU to reach its performance potential. Simply put, without HBM, GP100 wouldn’t exist, and high performance GPU design would fall demonstrably behind Moore’s law.
HBM2 allows Nvidia to tackle two challenges with the Tesla P100: having enough bandwidth to keep the execution engines fed, and having enough memory capacity overall to do the actual work. This matters especially, again, in deep learning workloads, where massive data sets eat through the capacity of the frame buffer.
In addition to delivering significantly higher density and bandwidth compared to GDDR5, HBM2 is also considerably more power efficient. The Tesla P100 package includes four 4-Hi HBM2 stacks, for a total of 16 GB of memory, and 720 GB/s peak bandwidth. That’s three times as much bandwidth as Nvidia’s previous flagship accelerator the Tesla M40.
Interestingly, the Tesla P100’s bandwidth figure is below what the JEDEC HBM2 spec that SK Hynix and Samsung both adhere to allows. The spec lets every 4-Hi HBM2 stack run at an effective data rate of up to 2 Gbps per pin, delivering 256 GB/s per stack for a total of 1 TB/s across four stacks. The HBM2 modules on the Tesla P100 operate well below that, at only about 1.4 Gbps. Considering that the Tesla P100 is rated at a surprisingly high TDP of 300W, 50 watts more than the Tesla M40, this could perhaps be a conscious decision on Nvidia’s part to reduce the overall power of the package.
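The bandwidth figures in this section can be reproduced with one line of arithmetic: bus width times per-pin data rate, divided by eight to convert bits to bytes. The sketch below treats the 2 GHz and 1.4 GHz figures as effective per-pin rates in Gbit/s and takes the per-stack width as the 4096-bit interface divided by four stacks. Both are our reading of the numbers above rather than anything Nvidia spelled out, and the names are illustrative.

```python
STACKS = 4
TOTAL_BUS_BITS = 4096                          # Tesla P100 memory interface
BITS_PER_STACK = TOTAL_BUS_BITS // STACKS      # 1024 bits per HBM2 stack


def bandwidth_gb_per_s(bus_bits: int, gbit_per_pin: float) -> float:
    """Peak bandwidth in GB/s for a bus of `bus_bits` pins at the given per-pin rate."""
    return bus_bits * gbit_per_pin / 8


spec_per_stack = bandwidth_gb_per_s(BITS_PER_STACK, 2.0)
spec_total = spec_per_stack * STACKS
p100_total = bandwidth_gb_per_s(TOTAL_BUS_BITS, 1.4)

print(f"HBM2 at 2.0 Gbit/s per pin: {spec_per_stack:.0f} GB/s per stack, {spec_total:.0f} GB/s total")
print(f"Tesla P100 at 1.4 Gbit/s per pin: {p100_total:.1f} GB/s")
print(f"fraction of spec bandwidth used: {p100_total / spec_total:.0%}")
```

The 1.4 figure is rounded, which is why the result lands slightly under the 720 GB/s Nvidia quotes. Either way the P100 is running its memory at roughly 70% of what the spec allows.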
SC15 ( Super Computing 2015 ) – Dr. Stephen W. Keckler, Senior Director of Architectural Research
Well, there you have it folks. This year’s GTC definitely did not disappoint. It’s been a double whammy for enthusiasts. First came Nvidia’s announcement of its most powerful GPU yet, the Pascal flagship that everyone has been eager to hear more about. Second was the surprisingly deep level of detail that the company revealed about its next generation Pascal architecture. The detailed specs that Nvidia released for its GP100 GPU have also been a pleasant treat.
We can’t wait to see what Nvidia has in store for us with its Pascal powered, next generation GeForce GPUs. We’re certainly hoping that Nvidia’s preparing just as strong of a double whammy this summer with its GP104 based GTX 980 and GTX 970 successors.
Full slide deck [Nvidia GTC 2016 – Jen-Hsun Huang Keynote]