NVIDIA Pascal GP100 GPU Expected To Feature 12 TFLOPs of Single Precision Compute, 4 TFLOPs of Double Precision Compute Performance
New details on NVIDIA’s Pascal GPU have been dug up by 3DCenter (via Beyond3D), showing the total compute performance of the upcoming FinFET based chip. At CES 2016, NVIDIA announced their Pascal based Drive PX 2 module, an automotive supercomputer that uses the processing power of GPUs to drive cars autonomously. The presentation didn’t mention the flagship chip, but we expect to hear a more detailed session on it at GTC in April 2016.
NVIDIA Pascal GP100 Flagship GPU Might Come With 12 TFLOPs of Single Precision, 4 TFLOPs of Double Precision Compute
The one GPU that everyone has their eyes on right now, whether enterprise or mainstream audience, is the flagship Pascal GPU known as GP100 (the naming scheme for the chip is not confirmed yet). This is going to be the flagship chip of the lineup and will be featured on Tesla, Quadro and GeForce graphics cards. The chip is based on the 16nm FinFET process, which brings efficiency improvements and better performance per watt, but with Pascal, double precision compute also returns with a bang. Maxwell, NVIDIA’s current gen architecture, made some serious gains in the performance per watt department, and Pascal is expected to carry that tradition forward.
The information today comes from slides which have long existed but most people haven’t had access to. The slides were found by iMacmatican, a Beyond3D forum member who has compiled a good list of details over at the forum. Since most of these slides date back to 2014-2015, there are bound to be some changes to the GPU design which we will also explain in a bit. First of all, a slide from a presentation in March 2014 detailed GFLOPs per watt for various NVIDIA GPUs. The approximate values for NVIDIA’s CUDA generation of GPUs have been listed below:
- Tesla: 0.5
- Fermi: 2
- Kepler: 5.5
- Pascal: 14
- Volta: 22
Slide Credits: Beyond3D Forum
The slide clearly shows that Pascal is rated at 14 GFLOPs per watt while Volta is rated at 22 GFLOPs per watt. The slide states that these approximations are for Double Precision or DGEMM (Double precision floating General Matrix Multiply) GFLOPs/W and not single precision, which is why Maxwell has been left off the later slides, since it features very little FP64 hardware under the hood. The fastest Kepler based Tesla K40X comes in at 6.1 GFLOPs/W and the dual-chip Tesla K80X at 6.2 GFLOPs/W. Pascal is expected to take this to around 14 GFLOPs/W, which is more than twice the Double Precision GFLOPs/W of the Kepler parts.
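To put the "more than twice" claim in numbers, here is a minimal Python sketch that computes the generation-over-generation gains from the approximate slide values quoted above. The dictionary and function names are made up for illustration, and the inputs are only as accurate as the slide readings.

```python
# Approximate DGEMM GFLOPs/W per generation, as read off the slide above.
DGEMM_GFLOPS_PER_WATT = {
    "Tesla": 0.5,
    "Fermi": 2.0,
    "Kepler": 5.5,
    "Pascal": 14.0,
    "Volta": 22.0,
}


def generational_gains(values):
    """Return the efficiency ratio between each generation and the one before it."""
    names = list(values)  # dicts keep insertion order (Python 3.7+)
    return {
        f"{prev} -> {cur}": values[cur] / values[prev]
        for prev, cur in zip(names, names[1:])
    }


gains = generational_gains(DGEMM_GFLOPS_PER_WATT)
for step, ratio in gains.items():
    print(f"{step}: {ratio:.2f}x")

# Pascal against the Kepler based Tesla parts quoted in the text.
pascal_vs_k40 = DGEMM_GFLOPS_PER_WATT["Pascal"] / 6.1
pascal_vs_k80 = DGEMM_GFLOPS_PER_WATT["Pascal"] / 6.2
print(f"Pascal vs K40X: {pascal_vs_k40:.2f}x, vs K80X: {pascal_vs_k80:.2f}x")
```

On these figures the Kepler to Pascal step works out to roughly 2.5x on the roadmap values and about 2.3x against the quoted Tesla cards.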
Coming to Single Precision or SGEMM (Single precision floating General Matrix Multiply), Pascal is rated at 42 GFLOPs/W. Maxwell is rated at 23 GFLOPs/W, with the dual-chip offering pushing that up to 25 GFLOPs/W, while Volta is rated at 73 GFLOPs/W. There’s also a slide that details HGEMM (Half precision floating General Matrix Multiply). We know that Pascal and later generations of GPUs will come with mixed precision compute, which allows users to get twice the compute performance in FP16 workloads compared to FP32 by computing at 16-bit, at the cost of roughly half the precision of FP32. Compared to Maxwell, which manages just 26 half precision GFLOPs/W, Pascal will take that up to 85 GFLOPs/W while Volta will do up to 145 GFLOPs/W.
Coming to the meatier part, the 2014 slides are full of useful data on Pascal GPUs. Of course, these slides predate the time frame when Pascal GPUs actually taped out and entered NVIDIA’s labs for testing, which is what NVIDIA stated themselves during SC15 and several months before that at a GTC session in Japan. It is known that at some point NVIDIA changed their designs from HMC (Hybrid Memory Cube) based solutions to HBM2 based solutions, and they presented the updated design to the audience at GTC 2015 in Japan.
The prototype Pascal board that was showcased back at GTC 2014 was actually based on an HMC implementation, and that changed in 2015. From details mentioned in the slides, NVIDIA is claiming that they have integrated the memory (HBM2) to be part of the actual GPU die. This could mean one of two things: either NVIDIA has actually managed to integrate HBM2 and a 16nm GPU on the same die, or they are using a design similar to the Fury cards from AMD, which fuse the GPU and HBM chips on a single interposer and make them a single chip solution, sort of like an SoC.
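The FP16 trade-off can be illustrated with a short Python sketch. It does two things: it compares the HGEMM and SGEMM GFLOPs/W figures quoted above to see how close each generation gets to a 2x FP16 rate, and it uses the standard library `struct` module to show what storing a value in 16 bits costs in precision. The helper names are hypothetical, and this runs on the CPU; it says nothing about how Pascal implements FP16 in hardware.

```python
import struct


def round_trip(value, fmt):
    """Store a Python float in the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def to_half(value):
    return round_trip(value, "<e")  # binary16: 11 significant bits


def to_single(value):
    return round_trip(value, "<f")  # binary32: 24 significant bits


# GFLOPs/W figures quoted from the slides in the text above.
HGEMM = {"Maxwell": 26, "Pascal": 85, "Volta": 145}
SGEMM = {"Maxwell": 23, "Pascal": 42, "Volta": 73}
fp16_speedup = {gpu: HGEMM[gpu] / SGEMM[gpu] for gpu in HGEMM}
for gpu, ratio in fp16_speedup.items():
    print(f"{gpu}: FP16 is {ratio:.2f}x FP32 per watt")

# Rounding error when 0.1 is stored in each format.
half_error = abs(to_half(0.1) - 0.1)
single_error = abs(to_single(0.1) - 0.1)
print(f"FP16 error: {half_error:.2e}, FP32 error: {single_error:.2e}")

# Above 2048, FP16 can no longer represent every integer; FP32 still can.
print(to_half(2049.0), to_single(2049.0))
```

On the quoted figures, Pascal and Volta land at about 2x while Maxwell gains little from FP16, which matches the text's point that the doubled FP16 rate arrives with Pascal.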
What we know so far about NVIDIA’s flagship Pascal GP100 GPU:
- Pascal graphics architecture.
- 2x performance per watt estimated improvement over Maxwell.
- To launch in 2016, purportedly the second half of the year.
- DirectX 12 feature level 12_1 or higher.
- Successor to the GM200 GPU found in the GTX Titan X and GTX 980 Ti.
- Built on the 16nm FinFET manufacturing process from TSMC.
- Allegedly has a total of 17 billion transistors, more than twice that of GM200.
- Will feature four 4-Hi HBM2 stacks for a total of 16GB of VRAM, with 8-Hi stacks for up to 32GB on the professional compute SKUs.
- Features a 4096-bit memory bus interface, same as AMD’s Fiji GPU powering the Fury series.
- Features NVLink (only compatible with next generation IBM PowerPC server processors)
- Supports half precision FP16 compute at twice the rate of full precision FP32.
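The memory figures in the list above can be sanity checked with some quick arithmetic. The sketch below is a rough estimate, not NVIDIA's specification: the 1 GB per die capacity is inferred from four 4-Hi stacks adding up to 16GB, and the per-pin data rates (2 Gbps for HBM2, 1 Gbps for the first generation HBM on Fiji) are assumptions that do not come from the slides. The function names are hypothetical.

```python
def hbm_capacity_gb(stacks, dies_per_stack, gb_per_die=1):
    """Total capacity, assuming 1 GB (8 Gb) DRAM dies unless stated otherwise."""
    return stacks * dies_per_stack * gb_per_die


def peak_bandwidth_gb_per_s(bus_width_bits, gbps_per_pin):
    """Peak bandwidth = bus width x per-pin data rate, converted from bits to bytes."""
    return bus_width_bits * gbps_per_pin / 8


# Four 4-Hi stacks for consumer parts, four 8-Hi stacks for compute SKUs.
print(hbm_capacity_gb(stacks=4, dies_per_stack=4))  # 16
print(hbm_capacity_gb(stacks=4, dies_per_stack=8))  # 32

# 4096-bit bus at an assumed 2 Gbps per pin (HBM2) and 1 Gbps per pin (HBM1).
hbm2_bandwidth = peak_bandwidth_gb_per_s(4096, 2.0)
hbm1_bandwidth = peak_bandwidth_gb_per_s(4096, 1.0)
print(hbm2_bandwidth, hbm1_bandwidth)  # 1024.0 512.0
```

Under those assumptions a 4096-bit HBM2 interface tops out at about 1 TB/s, double what the same bus width delivers at the assumed 1 Gbps HBM1 rate.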
| GPU Architecture | NVIDIA Fermi | NVIDIA Kepler | NVIDIA Maxwell | NVIDIA Pascal |
|---|---|---|---|---|
| GPU Process | 40nm | 28nm | 28nm | 16nm (TSMC FinFET) |
| GPU Design | SM (Streaming Multiprocessor) | SMX (Streaming Multiprocessor) | SMM (Streaming Multiprocessor Maxwell) | SMP (Streaming Multiprocessor Pascal) |
| Maximum Transistors | 3.00 Billion | 7.08 Billion | 8.00 Billion | 15.3 Billion |
| Maximum Die Size | 520 mm² | 561 mm² | 601 mm² | 610 mm² |
| Stream Processors Per Compute Unit | 32 SPs | 192 SPs | 128 SPs | 64 SPs |
| Maximum CUDA Cores | 512 CCs (16 CUs) | 2880 CCs (15 CUs) | 3072 CCs (24 CUs) | 3840 CCs (60 CUs) |
| FP32 Compute | 1.33 TFLOPs (Tesla) | 5.10 TFLOPs (Tesla) | 6.10 TFLOPs (Tesla) | ~12 TFLOPs (Tesla) |
| FP64 Compute | 0.66 TFLOPs (Tesla) | 1.43 TFLOPs (Tesla) | 0.20 TFLOPs (Tesla) | ~6 TFLOPs (Tesla) |
| Maximum VRAM | 1.5 GB GDDR5 | 6 GB GDDR5 | 12 GB GDDR5 | 16 / 32 GB HBM2 |
| Maximum Bandwidth | 192 GB/s | 336 GB/s | 336 GB/s | 720 GB/s - 1 TB/s |
| Launch Year | 2010 (GTX 580) | 2014 (GTX Titan Black) | 2015 (GTX Titan X) | 2016 |
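The core counts and FP32 figures in the table can be tied together with the usual peak throughput formula, peak FLOPs = cores x FLOPs per core per clock x clock speed. The sketch below turns that around to estimate the clock each entry implies. It assumes every CUDA core retires one fused multiply-add (2 FLOPs) per clock, which is a modelling assumption rather than something the slides state, and the names are made up for illustration.

```python
# (CUDA cores, peak FP32 TFLOPs) taken from the table above.
TABLE = {
    "Fermi": (512, 1.33),
    "Kepler": (2880, 5.10),
    "Maxwell": (3072, 6.10),
    "Pascal": (3840, 12.0),
}

FLOPS_PER_CORE_PER_CLOCK = 2  # assumption: one fused multiply-add per clock


def implied_clock_ghz(cores, tflops):
    """Clock speed needed for `cores` to reach `tflops` of peak FP32 compute."""
    return tflops * 1e12 / (cores * FLOPS_PER_CORE_PER_CLOCK) / 1e9


implied = {gpu: implied_clock_ghz(*spec) for gpu, spec in TABLE.items()}
for gpu, clock in implied.items():
    print(f"{gpu}: ~{clock:.2f} GHz")
```

If the assumption holds, the ~12 TFLOPs figure would need the 3840-core Pascal part to run at roughly 1.56 GHz, well above the roughly 0.9 to 1.0 GHz implied for the Kepler and Maxwell entries.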
We have seen several slides, but there’s one from an independent researcher, who’s also a CUDA Fellow, listing the compute performance for several platforms in his presentation. The slide puts the NVIDIA Pascal GPU with Stacked DRAM (1 TB/s) at up to 4 TFLOPs of Double Precision (FP64) and 12 TFLOPs of Single Precision (FP32) compute performance. In a slide from 2014, after the launch of the second generation Maxwell GPUs, an NVIDIA presentation also mentioned a GPU known as Pascal-Solo (not to be mistaken with Han-Solo) in the slide showcasing their Tesla GPU accelerator roadmap. The Pascal-Solo part features just 1 GPU and has a 235W TDP. It comes in both active and passive cooled PCIe options and is expected to launch in 2016. The Beyond3D Forum member approximated that the Tesla GPU could launch in Q2 of 2016.
There’s no doubt that Pascal GPUs will feature a lot of compute performance aimed at the Tesla and Quadro markets. The next generation FinFET based graphics cards will have a lot of muscle to flex on the complex tasks found in HPC workloads. Expect GTC 2016 to bring a lot of new information on Pascal based Tesla solutions.
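As a rough cross-check, the peak figures and the TDP above can be combined into a peak GFLOPs/W number and compared with the GEMM efficiency values from the earlier slides (42 GFLOPs/W for SGEMM and 14 GFLOPs/W for DGEMM). This sketch assumes that the 235W Pascal-Solo part is the same GPU that delivers 12 and 4 TFLOPs, which none of the slides confirm, so treat the result as a back-of-the-envelope estimate. The names are hypothetical.

```python
PASCAL_SOLO_TDP_WATTS = 235  # from the Tesla roadmap slide
PEAK_TFLOPS = {"FP32": 12.0, "FP64": 4.0}  # from the researcher's slide
SLIDE_GEMM_GFLOPS_PER_WATT = {"FP32": 42.0, "FP64": 14.0}  # SGEMM / DGEMM slides


def peak_gflops_per_watt(tflops, tdp_watts):
    """Theoretical peak efficiency if the chip hit its peak rate at full TDP."""
    return tflops * 1000 / tdp_watts


gemm_fraction_of_peak = {}
for precision, tflops in PEAK_TFLOPS.items():
    peak = peak_gflops_per_watt(tflops, PASCAL_SOLO_TDP_WATTS)
    gemm_fraction_of_peak[precision] = SLIDE_GEMM_GFLOPS_PER_WATT[precision] / peak
    print(
        f"{precision}: peak {peak:.1f} GFLOPs/W, "
        f"slide GEMM figure is {gemm_fraction_of_peak[precision]:.0%} of that"
    )
```

Under that assumption the GEMM figures come out at about 82% of theoretical peak for both precisions, so the separate slides are at least consistent with one another.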
| GPU Family | AMD Vega | AMD Navi | NVIDIA Pascal | NVIDIA Volta |
|---|---|---|---|---|
| Flagship GPU | Vega 10 | Navi 10? | NVIDIA GP100 | NVIDIA GV100 |
| GPU Process | 14nm FinFET | 7nm FinFET? | TSMC 16nm FinFET | TSMC 12nm FinFET |
| GPU Transistors | 15-18 Billion | TBC | 15.3 Billion | 21.1 Billion |
| GPU Cores (Max) | 4096 SPs | TBC | 3840 CUDA Cores | 5376 CUDA Cores |
| Peak FP32 Compute | 12.5 TFLOPs | TBC | 12.0 TFLOPs | 15.0 TFLOPs |
| Peak FP16 Compute | 25.0 TFLOPs | TBC | 24.0 TFLOPs | 120 Tensor TFLOPs |
| VRAM | 16 GB HBM2 | TBC | 16 GB HBM2 | 16 GB HBM2 |
| Memory (Consumer Cards) | HBM2 | HBM3 | GDDR5X | GDDR6 |
| Memory (Dual-Chip Professional / HPC) | HBM2 | HBM3 | HBM2 | HBM2 |
| HBM2 Bandwidth | 480 GB/s (Instinct MI25) | >1 TB/s? | 732 GB/s (Peak) | 900 GB/s |
| Graphics Architecture | Next Compute Unit (Vega) | Next Compute Unit (Navi) | 5th Gen Pascal CUDA | 6th Gen Volta CUDA |
| Successor of (GPU) | Radeon RX 500 Series? | Radeon RX 600 Series? | GM200 (Maxwell) | GP100 (Pascal) |