Nvidia Unveils Pascal Tesla P100 With Over 20 TFLOPS Of FP16 Performance – Powered By GP100 GPU With 15 Billion Transistors & 16GB Of HBM2

Khalid Moammer
Posted Apr 5, 2016

Nvidia has just unveiled its fastest GPU yet here at GTC 2016, a brand new graphics chip based on the company’s next generation Pascal architecture. The GP100 is NVIDIA’s most advanced GPU to date, powering the company’s next generation compute monster, the Tesla P100.

Nvidia claims that GP100 is the largest FinFET GPU that has ever been made, measuring 600mm² and packing over 15 billion transistors. The Tesla P100 features a slightly cut back GP100 GPU and delivers 5.3 teraflops of double precision compute, 10.6 TFLOPS of single precision compute and 21.2 TFLOPS of half precision FP16 compute. Keeping this massive GPU fed are 4MB of L2 cache and a whopping 14MB worth of register files.

The entire Tesla P100 package comprises many chips, not just the GPU, that collectively add up to over 150 billion transistors, and it features 16GB of stacked HBM2 VRAM for a total of 720GB/s of bandwidth. Nvidia’s CEO & Co-Founder Jen-Hsun Huang confirmed that this behemoth of a graphics card is already in volume production, with samples already delivered to customers, who will begin announcing their products in Q4 and shipping them in Q1 2017.

NVIDIA Tesla P100 Quotes

Pascal GP100 Architecture & Specs

Nvidia Press Release

Five Architectural Breakthroughs
The Tesla P100 delivers its unprecedented performance, scalability and programming efficiency based on five breakthroughs:

  • NVIDIA Pascal architecture for exponential performance leap — A Pascal-based Tesla P100 solution delivers over a 12x increase in neural network training performance compared with a previous-generation NVIDIA Maxwell™-based solution.

  • NVIDIA NVLink for maximum application scalability — The NVIDIA NVLink™ high-speed GPU interconnect scales applications across multiple GPUs, delivering a 5x acceleration in bandwidth compared to today’s best-in-class solution. Up to eight Tesla P100 GPUs can be interconnected with NVLink to maximize application performance in a single node, and IBM has implemented NVLink on its POWER8 CPUs for fast CPU-to-GPU communication.

  • 16nm FinFET for unprecedented energy efficiency — With 15.3 billion transistors built on 16 nanometer FinFET fabrication technology, the Pascal GPU is the world’s largest FinFET chip ever built. It is engineered to deliver the fastest performance and best energy efficiency for workloads with near-infinite computing needs.

  • CoWoS with HBM2 for big data workloads — The Pascal architecture unifies processor and data into a single package to deliver unprecedented compute efficiency. An innovative approach to memory design, Chip on Wafer on Substrate (CoWoS) with HBM2, provides a 3x boost in memory bandwidth performance, or 720GB/sec, compared to the Maxwell architecture.

  • New AI algorithms for peak performance — New half-precision instructions deliver more than 21 teraflops of peak performance for deep learning.

The GP100 GPU is comprised of 3840 CUDA cores, 240 texture units and a 4096-bit memory interface. The 3840 CUDA cores are arranged in six Graphics Processing Clusters, or GPCs for short. Each of these has 10 Pascal Streaming Multiprocessors. As mentioned earlier in the article, the Tesla P100 features a cut down GP100 GPU. This cut back version has 3584 CUDA cores and 224 texture mapping units.
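The figures in this paragraph are enough to work out how much of the chip is switched off on the Tesla P100. The following is a small sketch of that arithmetic; the variable names are made up for illustration, and the only inputs are the core, texture unit, GPC and SM counts quoted above.

```python
# Full GP100 layout as described in the article.
GPCS = 6
SMS_PER_GPC = 10
FULL_CUDA_CORES = 3840
FULL_TEXTURE_UNITS = 240

# Tesla P100 (cut-down GP100) figures from the article.
P100_CUDA_CORES = 3584
P100_TEXTURE_UNITS = 224

full_sms = GPCS * SMS_PER_GPC                    # SMs on a full GP100
cores_per_sm = FULL_CUDA_CORES // full_sms       # FP32 CUDA cores per SM
tmus_per_sm = FULL_TEXTURE_UNITS // full_sms     # texture units per SM
p100_sms = P100_CUDA_CORES // cores_per_sm       # SMs enabled on Tesla P100
disabled_sms = full_sms - p100_sms               # SMs switched off

print(f"Full GP100: {full_sms} SMs x {cores_per_sm} cores")
print(f"Tesla P100: {p100_sms} SMs enabled, {disabled_sms} disabled")
print(f"Texture units check: {p100_sms * tmus_per_sm} vs quoted {P100_TEXTURE_UNITS}")
```

The texture unit count gives the same answer as the core count, so both quoted figures are consistent with the same number of enabled SMs.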

Pascal Tesla P100 GPU Board

Each Pascal streaming multiprocessor includes 64 FP32 CUDA cores, half that of Maxwell. Within each Pascal streaming multiprocessor there are two 32 CUDA core partitions, two dispatch units, a warp scheduler and a fairly large instruction buffer, matching that of Maxwell.

Pascal GP100

The massive GP100 GPU has significantly more Pascal streaming multiprocessors, or CUDA core blocks. Each of these has access to a register file that’s the same size as that of Maxwell’s 128 CUDA core SMM, which means that each Pascal CUDA core has access to twice the register file capacity. In turn we should expect even more performance out of each Pascal CUDA core compared to Maxwell.
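To put numbers on the register file claim, here is a rough sketch using the 256 KB per SM figure and the 56 SM count from the specification table later in this article. The helper name is hypothetical, and "per core" here is simply the register file divided evenly across the FP32 cores of one SM.

```python
REGISTER_FILE_KB_PER_SM = 256   # same for Maxwell and Pascal per the spec table
MAXWELL_CORES_PER_SM = 128
PASCAL_CORES_PER_SM = 64
P100_SMS = 56


def register_kb_per_core(cores_per_sm, rf_kb=REGISTER_FILE_KB_PER_SM):
    """Register file capacity per FP32 core, assuming an even split."""
    return rf_kb / cores_per_sm


maxwell_kb = register_kb_per_core(MAXWELL_CORES_PER_SM)
pascal_kb = register_kb_per_core(PASCAL_CORES_PER_SM)
total_kb = P100_SMS * REGISTER_FILE_KB_PER_SM

print(f"Maxwell: {maxwell_kb} KB per core, Pascal: {pascal_kb} KB per core")
print(f"Tesla P100 total: {total_kb} KB ({total_kb / 1024} MB)")
```

The total matches the 14MB of register files mentioned at the top of the article.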

NVIDIA GP100 Block Diagram

Nvidia Press Release

Tesla P100 Specifications
Specifications of the Tesla P100 GPU accelerator include:

  • 5.3 teraflops double-precision performance, 10.6 teraflops single-precision performance and 21.2 teraflops half-precision performance with NVIDIA GPU BOOST™ technology

  • 160GB/sec bi-directional interconnect bandwidth with NVIDIA NVLink

  • 16GB of CoWoS HBM2 stacked memory

  • 720GB/sec memory bandwidth with CoWoS HBM2 stacked memory

  • Enhanced programmability with page migration engine and unified memory

  • ECC protection for increased reliability

  • Server-optimized for highest data center throughput and reliability
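As a back-of-the-envelope check on the memory figures in this list, the sketch below combines them with the 4096-bit interface width quoted elsewhere in the article. The 1024-bit width per HBM2 stack is an assumption that comes from the HBM standard rather than from this article, and the variable names are illustrative only.

```python
BANDWIDTH_GB_S = 720        # from the spec list
BUS_WIDTH_BITS = 4096       # from the article's GP100 description
CAPACITY_GB = 16            # from the spec list
BITS_PER_HBM2_STACK = 1024  # assumption: per-stack interface width in the HBM standard

# Data rate each pin must sustain to reach the quoted bandwidth.
per_pin_gbit_s = BANDWIDTH_GB_S * 8 / BUS_WIDTH_BITS

stacks = BUS_WIDTH_BITS // BITS_PER_HBM2_STACK
per_stack_gb_s = BANDWIDTH_GB_S / stacks
per_stack_capacity_gb = CAPACITY_GB / stacks

print(f"Implied per-pin data rate: {per_pin_gbit_s} Gbit/s")
print(f"{stacks} stacks, {per_stack_gb_s} GB/s and {per_stack_capacity_gb} GB each")
```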

Tesla P100 Boosts To Nearly 1.5GHz

Perhaps one of the most exciting, yet perhaps predictable, revelations about the GP100 Pascal flagship GPU is that it can achieve clocks even higher than Maxwell. Despite Nvidia opting for very conservative clock speeds on its professional GPUs like the Tesla & Quadro products, the P100 has a base clock speed of 1328MHz and a boost clock speed of 1480MHz. On top of that, GPU Boost 2.0 allows these cards to operate at even higher clock speeds than the nominal boost clock.

We’re looking at actual frequencies of upwards of 1500MHz on the GeForce equivalent of the P100, which is inevitably going to launch as the next GTX Titan. This means boost clocks of even upwards of 1600MHz on factory overclocked models, and perhaps 2GHz+ manual overclocks. This should be extremely exciting news to all GeForce fans.

Tesla Products             | Tesla K40      | Tesla M40       | Tesla P100
GPU                        | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal)
SMs                        | 15             | 24              | 56
TPCs                       | 15             | 24              | 28
FP32 CUDA Cores / SM       | 192            | 128             | 64
FP32 CUDA Cores / GPU      | 2880           | 3072            | 3584
FP64 CUDA Cores / SM       | 64             | 4               | 32
FP64 CUDA Cores / GPU      | 960            | 96              | 1792
Base Clock                 | 745 MHz        | 948 MHz         | 1328 MHz
GPU Boost Clock            | 810/875 MHz    | 1114 MHz        | 1480 MHz
Compute Performance - FP32 | 5.04 TFLOPS    | 6.82 TFLOPS     | 10.6 TFLOPS
Compute Performance - FP64 | 1.68 TFLOPS    | 0.21 TFLOPS     | 5.3 TFLOPS
Texture Units              | 240            | 192             | 224
Memory Interface           | 384-bit GDDR5  | 384-bit GDDR5   | 4096-bit HBM2
Memory Size                | Up to 12 GB    | Up to 24 GB     | 16 GB
L2 Cache Size              | 1536 KB        | 3072 KB         | 4096 KB
Register File Size / SM    | 256 KB         | 256 KB          | 256 KB
Register File Size / GPU   | 3840 KB        | 6144 KB         | 14336 KB
TDP                        | 235 Watts      | 250 Watts       | 300 Watts
Transistors                | 7.1 billion    | 8 billion       | 15.3 billion
GPU Die Size               | 551 mm²        | 601 mm²         | 610 mm²
Manufacturing Process      | 28-nm          | 28-nm           | 16-nm
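The compute figures in the table can be roughly reproduced from the core counts and boost clocks. The sketch below assumes the usual convention that each core retires one fused multiply-add per clock, counted as 2 FLOPs; that convention and the function name are assumptions, not something stated in the article. The FP16 figure uses the article's statement that Pascal runs FP16 at twice the FP32 rate.

```python
def peak_tflops(cores, boost_mhz, flops_per_core_per_clock=2):
    """Theoretical peak throughput in TFLOPS (assumes 1 FMA = 2 FLOPs per clock)."""
    return cores * flops_per_core_per_clock * boost_mhz * 1e6 / 1e12


# (FP32 cores, FP64 cores, boost clock in MHz) taken from the table.
GPUS = {
    "Tesla K40": (2880, 960, 875),
    "Tesla M40": (3072, 96, 1114),
    "Tesla P100": (3584, 1792, 1480),
}

for name, (fp32_cores, fp64_cores, boost_mhz) in GPUS.items():
    fp32 = peak_tflops(fp32_cores, boost_mhz)
    fp64 = peak_tflops(fp64_cores, boost_mhz)
    print(f"{name}: FP32 {fp32:.2f} TFLOPS, FP64 {fp64:.2f} TFLOPS")

# Pascal processes FP16 at twice the FP32 rate, per the article.
p100_fp16 = 2 * peak_tflops(3584, 1480)
print(f"Tesla P100 FP16: {p100_fp16:.1f} TFLOPS")
```

Under this assumption the K40 and P100 rows match the table. The M40 comes out slightly above the quoted 6.82 TFLOPS, which may simply reflect rounding of the boost clock.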

Nvidia Pascal – 2X Perf/Watt With 16nm FinFET, Stacked Memory ( HBM2 ), NV-Link And Mixed Precision Compute

There are four hallmark technologies for the Pascal generation of GPUs, namely HBM, mixed precision compute, NV-Link and the smaller, more power efficient TSMC 16nm FinFET manufacturing process. Each is very important in its own right and as such we’re going to break down every one of these four separately.

Pascal To Be Nvidia’s First Graphics Architecture To Feature High Bandwidth Memory HBM

Stacked memory will debut on the green side with Pascal. More precisely, it will be HBM Gen2, the second generation of the high bandwidth JEDEC memory standard co-developed by SK Hynix and AMD. The new memory will enable memory bandwidth to exceed 1 Terabyte/s, which is 3X the bandwidth of the Titan X. The new memory standard will also allow for a huge increase in memory capacities, 2.7X the memory capacity of Maxwell to be precise. This indicates that the new Pascal flagship will feature 32GB of video memory, a mind-bogglingly huge number.
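The multipliers in this paragraph can be sanity-checked against the Maxwell-based GTX Titan X. The Titan X figures used below (12 GB and 336.5 GB/s) are assumptions brought in from outside this article, and the variable names are illustrative; the Tesla P100 figures are the ones quoted earlier in the article.

```python
# Assumed GTX Titan X (Maxwell) baseline, not stated in the article.
TITAN_X_CAPACITY_GB = 12
TITAN_X_BANDWIDTH_GB_S = 336.5

# What the quoted multipliers imply for a Pascal flagship.
implied_capacity_gb = 2.7 * TITAN_X_CAPACITY_GB
implied_bandwidth_gb_s = 3 * TITAN_X_BANDWIDTH_GB_S

# What the announced Tesla P100 actually ships with, per the article.
P100_CAPACITY_GB = 16
P100_BANDWIDTH_GB_S = 720
p100_capacity_ratio = P100_CAPACITY_GB / TITAN_X_CAPACITY_GB
p100_bandwidth_ratio = P100_BANDWIDTH_GB_S / TITAN_X_BANDWIDTH_GB_S

print(f"Implied: {implied_capacity_gb:.1f} GB, {implied_bandwidth_gb_s:.1f} GB/s")
print(f"Tesla P100: {p100_capacity_ratio:.2f}x capacity, {p100_bandwidth_ratio:.2f}x bandwidth")
```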

We’ve already seen AMD take advantage of HBM memory technology with its Fiji XT GPU last year, which features 512GB/s of memory bandwidth, twice that of the GTX 980. AMD also announced last month at its Capsaicin event that it will be bringing HBM2 with its next generation Vega architecture, succeeding its 14nm FinFET Polaris architecture launching this summer with GDDR5 memory.

Pascal Is Nvidia’s First Graphics Architecture To Deliver Half Precision Compute FP16 At Double The Rate Of Full Precision FP32

One of the more significant features revealed for Pascal was the addition of FP16 compute support, otherwise known as mixed precision compute or half precision compute. In this mode the accuracy of the result of any computation is significantly lower than with the standard FP32 method, which is required by all major graphics programming interfaces in games and has been for more than a decade. This includes DirectX 12, 11, 10 and DX9 Shader Model 3.0, which debuted almost a decade ago. This makes mixed precision mode unusable for any modern gaming application.
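To give a feel for how much accuracy is lost at half precision, here is a minimal CPU-side sketch that rounds a few values through the IEEE 754 FP16 and FP32 formats using Python's standard `struct` module. It only illustrates the number formats; it does not run on a GPU and says nothing about how Pascal implements FP16. The helper names are hypothetical.

```python
import struct


def round_trip(value, fmt):
    """Round a Python float (FP64) through a narrower IEEE 754 format."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def to_fp16(value):
    return round_trip(value, "<e")   # "e" is IEEE 754 half precision


def to_fp32(value):
    return round_trip(value, "<f")   # "f" is IEEE 754 single precision


for x in (0.1, 2049.0, 1000.001):
    h, s = to_fp16(x), to_fp32(x)
    print(f"{x}: FP16 -> {h!r} (error {abs(h - x):.3g}), "
          f"FP32 -> {s!r} (error {abs(s - x):.3g})")
```

FP16 carries an 11-bit significand, so even a whole number such as 2049 cannot be stored exactly, whereas FP32 holds it without error.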

However, due to its very attractive power efficiency advantages over FP32 and FP64, it can be used in scenarios where a high degree of computational precision isn’t necessary, which makes mixed precision computing especially useful on power limited mobile devices. Nvidia’s Maxwell GPU architecture, featured in the GTX 900 series of GPUs, is limited to FP32 operations, which in turn means that FP16 and FP32 operations are processed at the same rate by the GPU. However, adding the mixed precision capability in Pascal means that the architecture will now be able to process FP16 operations twice as quickly as FP32 operations. As mentioned above, this can be of great benefit in power limited, light compute scenarios.

16nm FinFET Manufacturing Process Technology

TSMC’s new 16nm FinFET process promises to be significantly more power efficient than planar 28nm. It also promises to bring about a considerable improvement in transistor density, which would enable Nvidia to build faster, significantly more complex and more power efficient GPUs.

TSMC’s 16FF+ (FinFET Plus) technology can provide above 65 percent higher speed, around 2 times the density, or 70 percent less power than its 28HPM technology. Compared with 20SoC technology, 16FF+ provides an extra 40% higher speed and 60% power saving. By leveraging the experience of 20SoC technology, TSMC 16FF+ shares the same metal backend process in order to quickly improve yield and demonstrate process maturity for time-to-market value.
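The density claim can be compared against the transistor counts and die sizes in the Tesla specification table earlier in the article. This is a rough sketch only: whole-die averages mix logic, cache and I/O, and the 28nm parts are not necessarily built on the exact 28HPM variant that TSMC is comparing against.

```python
# (transistors, die area in mm^2, process) from the spec table.
DIES = {
    "GK110": (7.1e9, 551, "28-nm"),
    "GM200": (8.0e9, 601, "28-nm"),
    "GP100": (15.3e9, 610, "16-nm"),
}


def density_mtr_per_mm2(transistors, area_mm2):
    """Average density in millions of transistors per square millimetre."""
    return transistors / area_mm2 / 1e6


densities = {name: density_mtr_per_mm2(t, a) for name, (t, a, _) in DIES.items()}
for name, d in densities.items():
    print(f"{name} ({DIES[name][2]}): {d:.1f} M transistors/mm^2")

gain = densities["GP100"] / densities["GM200"]
print(f"GP100 vs GM200 density: {gain:.2f}x")
```

On these figures GP100 packs a little under twice as many transistors per square millimetre as GM200, which is in line with the "around 2 times" claim.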

Nvidia’s Proprietary High-Speed Platform Atomics Interconnect For Servers And Supercomputers – NV-Link

Pascal will also be the first Nvidia GPU to feature the company’s new NV-Link technology which Nvidia states is 5 to 12 times faster than PCIE 3.0.

The technology targets GPU accelerated servers where cross-chip communication is extremely bandwidth limited and a major system bottleneck. Nvidia states that NV-Link will be 5 to 12 times faster than traditional PCIe 3.0, making it a major step forward in platform atomics. Earlier this year Nvidia announced that IBM will be integrating this new interconnect into its upcoming PowerPC server CPUs. NVLink will debut with Nvidia’s Pascal in 2016 before it makes its way to Volta in 2018.

NVLink is an energy-efficient, high-bandwidth communications channel that uses up to three times less energy to move data on the node at speeds 5-12 times conventional PCIe Gen3 x16. First available in the NVIDIA Pascal GPU architecture, NVLink enables fast communication between the CPU and the GPU, or between multiple GPUs.

Figure 3: NVLink is a key building block in the compute node of Summit and Sierra supercomputers.

Slide: Volta GPU featuring NVLink and stacked memory. NVLink GPU high-speed interconnect: 80-200 GB/s. 3D stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit.
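The "5 to 12 times" figure lines up with the 80-200 GB/s range on the slide if PCIe 3.0 x16 is taken as roughly 16 GB/s per direction. The sketch below works that out; the PCIe link parameters (8 GT/s per lane with 128b/130b encoding) are assumptions from the PCIe 3.0 specification rather than from this article, and comparing the 160 GB/s bi-directional figure from the Tesla P100 spec list against both PCIe directions is likewise an assumption about how Nvidia counts.

```python
def pcie3_gb_per_s(lanes=16):
    """One-direction PCIe 3.0 bandwidth in GB/s (assumed link parameters)."""
    transfers_per_s = 8e9     # 8 GT/s per lane
    encoding = 128 / 130      # 128b/130b line encoding overhead
    return transfers_per_s * encoding * lanes / 8 / 1e9


PCIE3_X16 = pcie3_gb_per_s(16)
NVLINK_RANGE_GB_S = (80, 200)      # from the slide text
P100_NVLINK_BIDIR_GB_S = 160       # from the Tesla P100 spec list

low_ratio = NVLINK_RANGE_GB_S[0] / PCIE3_X16
high_ratio = NVLINK_RANGE_GB_S[1] / PCIE3_X16
p100_ratio = P100_NVLINK_BIDIR_GB_S / (2 * PCIE3_X16)

print(f"PCIe 3.0 x16: {PCIE3_X16:.2f} GB/s per direction")
print(f"NVLink range: {low_ratio:.1f}x to {high_ratio:.1f}x")
print(f"Tesla P100 NVLink vs bi-directional PCIe 3.0 x16: {p100_ratio:.1f}x")
```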

NVLink is a key technology in Summit’s and Sierra’s server node architecture, enabling IBM POWER CPUs and NVIDIA GPUs to access each other’s memory fast and seamlessly. From a programmer’s perspective, NVLink erases the visible distinctions of data separately attached to the CPU and the GPU by “merging” the memory systems of the CPU and the GPU with a high-speed interconnect. Because both CPU and GPU have their own memory controllers, the underlying memory systems can be optimized differently (the GPU’s for bandwidth, the CPU’s for latency) while still presenting as a unified memory system to both processors. NVLink offers two distinct benefits for HPC customers. First, it delivers improved application performance, simply by virtue of greatly increased bandwidth between elements of the node. Second, NVLink with Unified Memory technology allows developers to write code much more seamlessly and still achieve high performance. via NVIDIA News


Pascal brings many new improvements to the table both in terms of hardware and software. However, the focus is crystal clear and is 100% about pushing power efficiency and compute performance higher than ever before. The plethora of new updates to the architecture and the ecosystem underline this focus.
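For a rough sense of the efficiency gain, the peak compute and TDP figures from the Tesla specification table earlier in the article can be turned into GFLOPS per watt. This is a sketch on paper specs only: peak throughput divided by TDP is not measured efficiency, and the function name is hypothetical. The M40's FP16 rate is set equal to its FP32 rate, following the article's statement that Maxwell processes both at the same speed.

```python
def gflops_per_watt(tflops, tdp_watts):
    """Peak GFLOPS per watt of TDP (paper specs, not measured efficiency)."""
    return tflops * 1000 / tdp_watts


# Peak TFLOPS and TDP from the spec table and the Tesla P100 spec list.
m40_fp32 = gflops_per_watt(6.82, 250)
p100_fp32 = gflops_per_watt(10.6, 300)
k40_fp64 = gflops_per_watt(1.68, 235)
p100_fp64 = gflops_per_watt(5.3, 300)
m40_fp16 = m40_fp32                      # Maxwell: FP16 at the FP32 rate
p100_fp16 = gflops_per_watt(21.2, 300)

fp32_gain = p100_fp32 / m40_fp32
fp16_gain = p100_fp16 / m40_fp16
fp64_gain = p100_fp64 / k40_fp64

print(f"FP32: {m40_fp32:.1f} -> {p100_fp32:.1f} GFLOPS/W ({fp32_gain:.2f}x vs M40)")
print(f"FP16: {m40_fp16:.1f} -> {p100_fp16:.1f} GFLOPS/W ({fp16_gain:.2f}x vs M40)")
print(f"FP64: {k40_fp64:.1f} -> {p100_fp64:.1f} GFLOPS/W ({fp64_gain:.2f}x vs K40)")
```

On these paper figures the largest gains come from FP16 and FP64 rather than from FP32.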

Pascal will be the company’s first graphics architecture to use next generation stacked memory technology, HBM. It will also be the first ever to feature a brand new from the ground-up high-speed proprietary interconnect, NV-Link. Mixed precision support is also going to play a major role in introducing a step function improvement in perf/watt in mobile applications.

GPU Family            | Vega               | NVIDIA Pascal
Flagship GPU          | Vega 10            | GP102
GPU Process           | 14nm FinFET        | 16nm FinFET
GPU Transistors       | Up To 18 Billion   | 12 Billion
Memory                | Up to 32 GB HBM2   | 12GB GDDR5X
Bandwidth             | 1 TB/s             | 480 GB/s
Graphics Architecture | Polaris (GCN 4.0)  | Pascal
Predecessor           | Fiji (Fury Series) | GM200 (900 Series)
