Nvidia Unveils Pascal Tesla P100 With Over 20 TFLOPS Of FP16 Performance – Powered By GP100 GPU With 15 Billion Transistors & 16GB Of HBM2

Khalid Moammer

• Apr 5, 2016 at 01:52pm EDT

Nvidia has just unveiled its fastest GPU yet here at GTC 2016, a brand new graphics chip based on the company's next generation Pascal architecture. The GP100 is NVIDIA's most advanced GPU to date, powering the company's next generation compute monster, the Tesla P100.

Nvidia claims that GP100 is the largest FinFET GPU that has ever been made, measuring 610mm² and packing over 15 billion transistors. The Tesla P100 features a slightly cut back GP100 GPU and delivers 5.3 teraflops of double precision compute, 10.6 TFLOPS of single precision compute and 21.2 TFLOPS of half precision FP16 compute. Keeping this massive GPU fed is 4MB of L2 cache and a whopping 14MB worth of register files.

The entire Tesla P100 package comprises many chips, not just the GPU, which collectively add up to over 150 billion transistors, and features 16GB of stacked HBM2 VRAM for a total of 720GB/s of bandwidth. Nvidia's CEO & Co-Founder Jen-Hsun Huang confirmed that this behemoth of a graphics card is already in volume production, with samples already delivered to customers, who will begin announcing their products in Q4 and shipping them in Q1 2017.

Pascal GP100 Architecture & Specs

Nvidia Press Release

Five Architectural Breakthroughs
The Tesla P100 delivers its unprecedented performance, scalability and programming efficiency based on five breakthroughs:

NVIDIA Pascal architecture for exponential performance leap -- A Pascal-based Tesla P100 solution delivers over a 12x increase in neural network training performance compared with a previous-generation NVIDIA Maxwell™-based solution.
NVIDIA NVLink for maximum application scalability -- The NVIDIA NVLink™ high-speed GPU interconnect scales applications across multiple GPUs, delivering a 5x acceleration in bandwidth compared to today's best-in-class solution¹. Up to eight Tesla P100 GPUs can be interconnected with NVLink to maximize application performance in a single node, and IBM has implemented NVLink on its POWER8 CPUs for fast CPU-to-GPU communication.
16nm FinFET for unprecedented energy efficiency -- With 15.3 billion transistors built on 16 nanometer FinFET fabrication technology, the Pascal GPU is the world's largest FinFET chip ever built². It is engineered to deliver the fastest performance and best energy efficiency for workloads with near-infinite computing needs.
CoWoS with HBM2 for big data workloads -- The Pascal architecture unifies processor and data into a single package to deliver unprecedented compute efficiency. An innovative approach to memory design, Chip on Wafer on Substrate (CoWoS) with HBM2, provides a 3x boost in memory bandwidth performance, or 720GB/sec, compared to the Maxwell architecture.
New AI algorithms for peak performance -- New half-precision instructions deliver more than 21 teraflops of peak performance for deep learning.

The GP100 GPU is comprised of 3840 CUDA cores, 240 texture units and a 4096bit memory interface. The 3840 CUDA cores are arranged in six Graphics Processing Clusters, or GPCs for short. Each of these has 10 Pascal Streaming Multiprocessors. As mentioned earlier in the article the Tesla P100 features a cut down GP100 GPU. This cut back version has 3584 CUDA cores and 224 texture mapping units.
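The core and texture unit counts above follow directly from the SM arrangement. A minimal sketch of that arithmetic, using only figures quoted in this article (the variable names are illustrative, and the four texture units per SM is an assumption inferred from 240 units across 60 SMs):

```python
# Illustrative arithmetic only; inputs are figures quoted in the article.
GPCS = 6                # Graphics Processing Clusters in a full GP100
SMS_PER_GPC = 10        # Pascal streaming multiprocessors per GPC
CORES_PER_SM = 64       # FP32 CUDA cores per Pascal SM
TEX_UNITS_PER_SM = 4    # assumption: 240 texture units / 60 SMs

full_sms = GPCS * SMS_PER_GPC
full_cores = full_sms * CORES_PER_SM
full_tex = full_sms * TEX_UNITS_PER_SM

# The Tesla P100 ships with 56 of the 60 SMs enabled.
p100_sms = 56
p100_cores = p100_sms * CORES_PER_SM
p100_tex = p100_sms * TEX_UNITS_PER_SM

print(full_sms, full_cores, full_tex)   # 60 3840 240
print(p100_sms, p100_cores, p100_tex)   # 56 3584 224
```

In other words, the cut down part loses four SMs, which accounts for the missing 256 CUDA cores and 16 texture units.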

Each Pascal streaming multiprocessor includes 64 FP32 CUDA cores, half that of Maxwell. Within each Pascal streaming multiprocessor there are two 32 CUDA core partitions, two dispatch units, a warp scheduler and a fairly large instruction buffer, matching that of Maxwell.

The massive GP100 GPU has significantly more Pascal streaming multiprocessors, or CUDA core blocks, than its predecessor. Each of these has access to a register file that's the same size as that of Maxwell's 128 CUDA core SMM, which means that each Pascal CUDA core has access to twice the register file capacity. In turn, we should expect even more performance out of each Pascal CUDA core compared to Maxwell.
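To put a number on the register file claim, here is a rough sketch that divides the 256 KB per-SM register file (the figure listed for both Maxwell and Pascal in the spec table) by the core count of each SM. The names are illustrative, and an even split per core is a simplification: registers are actually allocated per thread, not per core.

```python
# Illustrative arithmetic only; an even per-core split is a simplification.
REGISTER_FILE_PER_SM_KB = 256   # same size for the Maxwell SMM and the Pascal SM
MAXWELL_CORES_PER_SM = 128
PASCAL_CORES_PER_SM = 64
P100_SMS = 56

maxwell_kb_per_core = REGISTER_FILE_PER_SM_KB / MAXWELL_CORES_PER_SM
pascal_kb_per_core = REGISTER_FILE_PER_SM_KB / PASCAL_CORES_PER_SM
p100_total_kb = P100_SMS * REGISTER_FILE_PER_SM_KB

print(maxwell_kb_per_core)   # 2.0
print(pascal_kb_per_core)    # 4.0
print(p100_total_kb)         # 14336
```

The 14336 KB total is the "14MB worth of register files" quoted for the Tesla P100.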

Nvidia Press Release

Tesla P100 Specifications
Specifications of the Tesla P100 GPU accelerator include:

5.3 teraflops double-precision performance, 10.6 teraflops single-precision performance and 21.2 teraflops half-precision performance with NVIDIA GPU BOOST™ technology
160GB/sec bi-directional interconnect bandwidth with NVIDIA NVLink
16GB of CoWoS HBM2 stacked memory
720GB/sec memory bandwidth with CoWoS HBM2 stacked memory
Enhanced programmability with page migration engine and unified memory
ECC protection for increased reliability
Server-optimized for highest data center throughput and reliability
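The 720GB/sec figure can be related to the 4096-bit memory interface with some simple arithmetic. The sketch below backs out the per-pin data rate that those two numbers imply. That rate is derived here, not quoted by Nvidia, and the calculation assumes 1 GB = 10^9 bytes; the function names are illustrative.

```python
# Illustrative arithmetic only; the per-pin rate is derived, not an official spec.
BUS_WIDTH_BITS = 4096            # HBM2 memory interface width
BANDWIDTH_BYTES_PER_S = 720e9    # 720 GB/s, assuming 1 GB = 10**9 bytes

def implied_pin_rate_gbps(bandwidth_bytes_per_s: float, bus_width_bits: int) -> float:
    """Per-pin data rate (Gbit/s) needed to reach a given total bandwidth."""
    return bandwidth_bytes_per_s * 8 / bus_width_bits / 1e9

def bandwidth_gb_per_s(pin_rate_gbps: float, bus_width_bits: int) -> float:
    """Total bandwidth (GB/s) for a given per-pin data rate and bus width."""
    return pin_rate_gbps * bus_width_bits / 8

pin_rate = implied_pin_rate_gbps(BANDWIDTH_BYTES_PER_S, BUS_WIDTH_BITS)
print(round(pin_rate, 3))                          # 1.406
print(bandwidth_gb_per_s(pin_rate, BUS_WIDTH_BITS))
```

The very wide bus is what lets HBM2 reach this bandwidth at a comparatively low per-pin rate.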

Tesla P100 Boosts To Nearly 1.5GHz

Perhaps one of the most exciting, yet perhaps predictable, revelations about the GP100 Pascal flagship GPU is that it can achieve clocks even higher than Maxwell. Despite Nvidia opting for very conservative clock speeds on its professional GPUs like the Tesla & Quadro products, the P100 actually has a base clock speed of 1328MHz and a boost clock speed of 1480MHz.

Considering that GPU Boost 2.0 actually allows these cards to operate at even higher clock speeds than the nominal boost clock, we're looking at actual frequencies of upwards of 1500MHz on the GeForce equivalent of the P100, which is inevitably going to launch as the next GTX Titan. This could mean boost clocks of upwards of 1600MHz on factory overclocked models, and perhaps 2GHz+ manual overclocks. This should be extremely exciting news to all GeForce fans.

Tesla Products	Tesla K40	Tesla M40	Tesla P100
GPU	GK110 (Kepler)	GM200 (Maxwell)	GP100 (Pascal)
SMs	15	24	56
TPCs	15	24	28
FP32 CUDA Cores / SM	192	128	64
FP32 CUDA Cores / GPU	2880	3072	3584
FP64 CUDA Cores / SM	64	4	32
FP64 CUDA Cores / GPU	960	96	1792
Base Clock	745 MHz	948 MHz	1328 MHz
GPU Boost Clock	810/875 MHz	1114 MHz	1480 MHz
Compute Performance - FP32	5.04 TFLOPS	6.82 TFLOPS	10.6 TFLOPS
Compute Performance - FP64	1.68 TFLOPS	0.21 TFLOPS	5.3 TFLOPS
Texture Units	240	192	224
Memory Interface	384-bit GDDR5	384-bit GDDR5	4096-bit HBM2
Memory Size	Up to 12 GB	Up to 24 GB	16 GB
L2 Cache Size	1536 KB	3072 KB	4096 KB
Register File Size / SM	256 KB	256 KB	256 KB
Register File Size / GPU	3840 KB	6144 KB	14336 KB
TDP	235 Watts	250 Watts	300 Watts
Transistors	7.1 billion	8 billion	15.3 billion
GPU Die Size	551 mm²	601 mm²	610 mm²
Manufacturing Process	28-nm	28-nm	16-nm
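The compute figures in the table can be reproduced from the core counts and boost clocks. The sketch below assumes the usual convention of counting one fused multiply-add per core per clock as two floating point operations; that convention is not spelled out in the article, and the function name is illustrative.

```python
# Illustrative arithmetic only; assumes 2 FLOPs (one FMA) per core per clock.
def peak_tflops(cores: int, clock_mhz: float, flops_per_core_per_clock: int = 2) -> float:
    """Theoretical peak throughput in TFLOPS."""
    return cores * flops_per_core_per_clock * clock_mhz * 1e6 / 1e12

# Tesla P100 at its 1480 MHz boost clock
p100_fp32 = peak_tflops(3584, 1480)
p100_fp64 = peak_tflops(1792, 1480)
p100_fp16 = 2 * p100_fp32          # FP16 runs at twice the FP32 rate on GP100

# Tesla K40 at its 875 MHz boost clock, as a cross-check against the table
k40_fp32 = peak_tflops(2880, 875)

print(round(p100_fp32, 1))   # 10.6
print(round(p100_fp64, 1))   # 5.3
print(round(p100_fp16, 1))   # 21.2
print(round(k40_fp32, 2))    # 5.04
```

These are theoretical peaks at the boost clock; sustained throughput depends on the workload.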

Nvidia Pascal - 2X Perf/Watt With 16nm FinFET, Stacked Memory (HBM2), NV-Link And Mixed Precision Compute

There are four hallmark technologies for the Pascal generation of GPUs: HBM, mixed precision compute, NV-Link and the smaller, more power efficient TSMC 16nm FinFET manufacturing process. Each is very important in its own right, and as such we're going to break down every one of these four separately.

Pascal To Be Nvidia's First Graphics Architecture To Feature High Bandwidth Memory HBM

Stacked memory will debut on the green side with Pascal. More precisely, it is HBM Gen2, the second generation of the high bandwidth JEDEC memory standard co-developed by SK Hynix and AMD. The new memory will enable memory bandwidth to exceed 1 Terabyte/s, which is 3X the bandwidth of the Titan X. The new memory standard will also allow for a huge increase in memory capacities, 2.7X the memory capacity of Maxwell to be precise, which indicates that a Pascal flagship could feature up to 32GB of video memory, a mind-bogglingly huge number.

We've already seen AMD take advantage of HBM memory technology with its Fiji XT GPU last year, which features 512GB/s of memory bandwidth, more than twice that of the GTX 980. AMD also announced last month at its Capsaicin event that it will be bringing HBM2 with its next generation Vega architecture, succeeding its 14nm FinFET Polaris architecture launching this summer with GDDR5 memory.

TSMC’s new 16nm FinFET process promises to be significantly more power efficient than planar 28nm. It also promises to bring about a considerable improvement in transistor density, which would enable Nvidia to build faster, significantly more complex and more power efficient GPUs.

Pascal Is Nvidia's First Graphics Architecture To Deliver Half Precision Compute FP16 At Double The Rate Of Full Precision FP32

One of the more significant features that was revealed for Pascal was the addition of FP16 compute support, otherwise known as mixed precision compute or half precision compute. In this mode the accuracy of the result of any computation is significantly lower than with the standard FP32 method, which is required by all major graphics programming interfaces in games and has been for more than a decade. This includes DirectX 12, 11, 10 and DX9 Shader Model 3.0, which debuted more than a decade ago. This makes mixed precision mode unusable for any modern gaming application.

However, due to its very attractive power efficiency advantages over FP32 and FP64, it can be used in scenarios where a high degree of computational precision isn't necessary, which makes mixed precision computing especially useful on power limited mobile devices. Nvidia's Maxwell GPU architecture, featured in the GTX 900 series of GPUs, is limited to FP32 operations, which in turn means that FP16 and FP32 operations are processed at the same rate by the GPU. However, adding the mixed precision capability in Pascal means that the architecture will now be able to process FP16 operations twice as quickly as FP32 operations. And as mentioned above, this can be of great benefit in power limited, light compute scenarios.
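To give a feel for how much accuracy is traded away at half precision, here is a small sketch that rounds values to the IEEE 754 half (16-bit) and single (32-bit) formats using Python's standard `struct` module. It only illustrates storage precision on the CPU and says nothing about how Pascal executes FP16 math; the helper names are illustrative.

```python
import struct

def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

value = 0.1
half_error = abs(to_half(value) - value)
single_error = abs(to_single(value) - value)

print(to_half(value), half_error)      # FP16 keeps roughly 3 decimal digits
print(to_single(value), single_error)  # FP32 keeps roughly 7 decimal digits

# FP16 has an 11-bit significand, so above 2048 not every integer is representable.
print(to_half(2049.0))   # 2048.0
```

Errors of this size are typically tolerable for workloads such as neural network training, which is the use case Nvidia highlights for the P100's half precision rate.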

16nm FinFET Manufacturing Process Technology

TSMC’s 16FF+ (FinFET Plus) technology can provide above 65 percent higher speed, around 2 times the density, or 70 percent less power than its 28HPM technology. Comparing with 20SoC technology, 16FF+ provides extra 40% higher speed and 60% power saving. By leveraging the experience of 20SoC technology, TSMC 16FF+ shares the same metal backend process in order to quickly improve yield and demonstrate process maturity for time-to-market value.

Nvidia's Proprietary High-Speed Platform Atomics Interconnect For Servers And Supercomputers - NV-Link

Pascal will also be the first Nvidia GPU to feature the company's new NV-Link technology, which Nvidia states is 5 to 12 times faster than PCIe 3.0.

The technology targets GPU accelerated servers where cross-chip communication is extremely bandwidth limited and a major system bottleneck. Nvidia states that NV-Link will be 5 to 12 times faster than traditional PCIe 3.0, making it a major step forward in platform atomics. Earlier this year Nvidia announced that IBM will be integrating this new interconnect into its upcoming POWER server CPUs. NVLink will debut with Nvidia’s Pascal in 2016 before it makes its way to Volta in 2018.
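As a rough sanity check on the lower end of that range, the sketch below compares the 160GB/sec bi-directional NVLink figure from the Tesla P100 specifications with a PCIe 3.0 x16 slot. The PCIe parameters (8 GT/s per lane, 128b/130b encoding) come from the PCIe 3.0 standard, not from this article, and the function name is illustrative.

```python
# Illustrative arithmetic only; PCIe parameters are taken from the PCIe 3.0
# standard, not from the article.
def pcie_gb_per_s_per_direction(gt_per_s: float = 8.0, lanes: int = 16,
                                encoding: float = 128 / 130) -> float:
    """Usable PCIe bandwidth in GB/s for one direction."""
    return gt_per_s * lanes * encoding / 8

pcie3_x16_bidir = 2 * pcie_gb_per_s_per_direction()
nvlink_bidir = 160.0   # GB/s, Tesla P100 bi-directional NVLink bandwidth

ratio = nvlink_bidir / pcie3_x16_bidir
print(round(pcie3_x16_bidir, 1))   # 31.5
print(round(ratio, 1))             # 5.1
```

This lines up with the 5x end of Nvidia's claim for the P100; the 12x end of the range is not reproduced by these numbers.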

NVLink is an energy-efficient, high-bandwidth communications channel that uses up to three times less energy to move data on the node at speeds 5-12 times conventional PCIe Gen3 x16. First available in the NVIDIA Pascal GPU architecture, NVLink enables fast communication between the CPU and the GPU, or between multiple GPUs.

Figure 3: NVLink is a key building block in the compute node of Summit and Sierra supercomputers.

Slide: Volta GPU featuring NVLink and stacked memory. NVLink GPU high speed interconnect: 80-200 GB/s. 3D stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit.

NVLink is a key technology in Summit’s and Sierra’s server node architecture, enabling IBM POWER CPUs and NVIDIA GPUs to access each other’s memory fast and seamlessly. From a programmer’s perspective, NVLink erases the visible distinctions of data separately attached to the CPU and the GPU by “merging” the memory systems of the CPU and the GPU with a high-speed interconnect. Because both CPU and GPU have their own memory controllers, the underlying memory systems can be optimized differently (the GPU’s for bandwidth, the CPU’s for latency) while still presenting as a unified memory system to both processors. NVLink offers two distinct benefits for HPC customers. First, it delivers improved application performance, simply by virtue of greatly increased bandwidth between elements of the node. Second, NVLink with Unified Memory technology allows developers to write code much more seamlessly and still achieve high performance. via NVIDIA News

Pascal brings many new improvements to the table both in terms of hardware and software. However, the focus is crystal clear and is 100% about pushing power efficiency and compute performance higher than ever before. The plethora of new updates to the architecture and the ecosystem underline this focus.

Pascal will be the company's first graphics architecture to use next generation stacked memory technology, HBM. It will also be the first ever to feature a brand new from the ground-up high-speed proprietary interconnect, NV-Link. Mixed precision support is also going to play a major role in introducing a step function improvement in perf/watt in mobile applications.

GPU Family	Vega	NVIDIA Pascal
Flagship GPU	Vega 10	GP102
GPU Process	14nm FinFET	16nm FinFET
GPU Transistors	Up To 18 Billion	12 Billion
Memory	Up to 16 GB HBM2	12GB GDDR5X
Bandwidth	512 GB/s	480 GB/s
Graphics Architecture	Vega (NCU)	Pascal
Predecessor	Fiji (Fury Series)	GM200 (900 Series)

About the author: PC hardware & tech evangelist. Been building PCs for over a decade & following the industry for just as long. Also a doctor specializing in Preventive Medicine.
