Intel Details Ponte Vecchio GPU & Sapphire Rapids HBM Performance, Up To 2.5x Faster Than NVIDIA A100

Hassan Mujtaba • Aug 22, 2022 at 07:50am EDT

Intel representative teases the new Ponte Vecchio compute GPU for AI & HPC applications of the future

During Hot Chips 34, Intel once again detailed its Ponte Vecchio GPUs running on a Sapphire Rapids HBM server platform.

Intel Shows off Ponte Vecchio 2-Stack GPU & Sapphire Rapids HBM CPU Performance Against NVIDIA's A100

In the presentation by Intel Fellow & Chief GPU Compute Architect, Hong Jiang, we get some more details regarding the upcoming server powerhouses from the blue team. The Ponte Vecchio GPU comes in three configurations, starting with a single OAM and ranging up to an x4 Subsystem with Xe Links, running either on its own or paired with a dual-socket Sapphire Rapids platform.

The OAM supports all-to-all topologies for both 4 GPU and 8 GPU platforms. Complementing the entire platform is Intel's oneAPI software stack, which includes the Level Zero API, a low-level hardware interface that supports cross-architecture programming. Some of the main features of Level Zero include:

Interface for oneAPI and other tools to accelerator devices
Fine-grain control over, and low-latency access to, accelerator capabilities
Multi-Threaded Design
For GPUs, ships as a part of the driver

Coming to the performance metrics, a 2-Stack Ponte Vecchio GPU configuration, like the one featured on a single OAM, is capable of delivering up to 52 TFLOPs of FP64/FP32 compute, 419 TFLOPs of TF32 (XMX Float 32), 839 TFLOPs of BF16/FP16, and 1678 TOPs of INT8 horsepower.
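
To put the quoted peak figures in perspective, here is a minimal Python sketch that expresses each format's rate as a multiple of the FP64/FP32 rate. The numbers are the ones quoted for the 2-Stack configuration; the dictionary and function names are made up for illustration and are not part of any Intel tool.

```python
# Peak rates quoted for a 2-Stack Ponte Vecchio OAM
# (TFLOPs for floating point formats, TOPs for INT8).
PEAK = {
    "FP64/FP32": 52,
    "TF32": 419,
    "BF16/FP16": 839,
    "INT8": 1678,
}


def speedup_over_baseline(peak, baseline="FP64/FP32"):
    """Return each format's peak rate as a multiple of the baseline rate."""
    base = peak[baseline]
    return {name: rate / base for name, rate in peak.items()}


ratios = speedup_over_baseline(PEAK)
for name, ratio in ratios.items():
    print(f"{name:>10}: {ratio:5.1f}x")
```

On these figures, the matrix (XMX) formats land at roughly 8x, 16x and 32x the FP64/FP32 rate, with each halving of precision doubling the peak.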

Intel also detailed the maximum cache sizes and the peak bandwidth offered at each level. The Register File on the Ponte Vecchio GPU is 64 MB and offers 419 TB/s of bandwidth, the L1 cache also comes in at 64 MB and offers 105 TB/s (4:1), the L2 cache comes in at 408 MB and offers 13 TB/s (8:1), and the HBM memory pools up to 128 GB and offers 4.2 TB/s (4:1). Ponte Vecchio also uses a range of compute efficiency techniques, such as:

Register File:

Register Caching
Accumulators

L1/L2 Cache:

Write Through
Write Back
Write Streaming
Uncached

Prefetch:

Software (instruction) prefetch to L1 and/or L2
Command Streamer prefetch to L2 for instruction and data
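
The cache and memory bandwidth figures quoted earlier can be cross-checked with a few lines of Python. This is only a sketch using the rounded numbers from the article, and the names are hypothetical. It divides the bandwidth of each level by the bandwidth of the level below it.

```python
# (level name, capacity, peak bandwidth in TB/s) as quoted in the article.
HIERARCHY = [
    ("Register File", "64 MB", 419.0),
    ("L1 cache", "64 MB", 105.0),
    ("L2 cache", "408 MB", 13.0),
    ("HBM", "128 GB", 4.2),
]


def step_down_ratios(levels):
    """Bandwidth of each level divided by the bandwidth of the next level down."""
    return [
        (upper[0], lower[0], upper[2] / lower[2])
        for upper, lower in zip(levels, levels[1:])
    ]


for upper, lower, ratio in step_down_ratios(HIERARCHY):
    print(f"{upper} -> {lower}: {ratio:.1f}:1")
```

With these rounded inputs, the first two steps come out very close to the quoted 4:1 and 8:1. The L2 to HBM step computes to about 3.1:1 rather than the quoted 4:1, which suggests the published bandwidth figures are rounded or derived differently.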

Intel explains that the larger L2 cache can deliver some huge gains in workloads such as 2D FFT and DNN. The company showed performance comparisons between a full Ponte Vecchio GPU and one with its L2 cache down-configured to 80 MB and 32 MB.
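
One way to see why L2 capacity matters for a workload like a 2D FFT is to check whether the data grid fits in the cache at all. The sketch below is purely illustrative: the grid sizes are hypothetical and not taken from Intel's slides, each element is assumed to be a single-precision complex value (8 bytes), and "MB" is treated as MiB.

```python
MIB = 1024 * 1024

# L2 capacities from the comparison: the full chip and the two
# down-configured modes. Treating "MB" as MiB is an assumption.
L2_CONFIGS_MIB = [408, 80, 32]


def fft2d_working_set_bytes(n, bytes_per_element=8):
    """Bytes in one n x n grid of complex values (8 bytes = complex64)."""
    return n * n * bytes_per_element


def fits_in_l2(n, l2_mib):
    """True if a single n x n complex64 grid fits in an L2 of l2_mib MiB."""
    return fft2d_working_set_bytes(n) <= l2_mib * MIB


for n in (1024, 4096, 8192):  # hypothetical grid sizes
    size_mib = fft2d_working_set_bytes(n) / MIB
    fits = {cfg: fits_in_l2(n, cfg) for cfg in L2_CONFIGS_MIB}
    print(f"{n}x{n}: {size_mib:.0f} MiB, fits: {fits}")
```

Under these assumptions, a 4096 x 4096 grid (128 MiB) fits only in the full 408 MB configuration, so the smaller configurations would have to go out to HBM between passes.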

But that's not all. Intel also showed performance comparisons between the NVIDIA Ampere A100 running CUDA and SYCL and its own Ponte Vecchio GPUs running SYCL. In miniBUDE, a computational workload that predicts the binding energy of a ligand with its target, the Ponte Vecchio GPU completes the test 2 times faster than the Ampere A100. Another performance metric comes from ExaSMR (a simulation workload for small modular nuclear reactor designs). Here, the Intel GPU is shown to offer a 1.5x performance lead over the NVIDIA GPU.

It is a bit interesting that Intel is still comparing its Ponte Vecchio GPUs to the Ampere A100, because the green team has since launched its next-gen Hopper H100, which is already shipping to customers. If Chipzilla feels so confident in its 2-2.5x performance figures, then I don't think it will have much trouble competing with Hopper.

Here's Everything We Know About The Intel 7 Powered Ponte Vecchio GPUs

Moving over to the Ponte Vecchio specs, Intel outlined some key features of its flagship data center GPU, such as 128 Xe cores, 128 RT units, HBM2e memory, and a total of 8 Xe-HPC GPUs that will be connected together. The chip will feature up to 408 MB of L2 cache in two separate stacks that connect via the EMIB interconnect. The chip will feature multiple dies based on Intel's own 'Intel 7' process and TSMC's N7 / N5 process nodes.

Intel also previously detailed the package and die size of its flagship Ponte Vecchio GPU based on the Xe-HPC architecture. The chip will consist of 2 tiles with 16 active dies per stack. The maximum active top die size is going to be 41 mm² while the base die size, which is also referred to as the 'Compute Tile', sits at 650 mm². We have all the chiplets and process nodes that the Ponte Vecchio GPUs will utilize, listed below:

Intel 7nm
TSMC 7nm
Foveros 3D Packaging
EMIB
10nm Enhanced Super Fin
Rambo Cache
HBM2

Following is how Intel gets to 47 tiles on the Ponte Vecchio chip:

16 Xe HPC (internal/external)
8 Rambo (internal)
2 Xe Base (internal)
11 EMIB (internal)
2 Xe Link (external)
8 HBM (external)
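
The breakdown above can be tallied in a couple of lines of Python to confirm it adds up to the quoted total. The dictionary name is only for illustration.

```python
# Tile counts as listed in the article.
TILES = {
    "Xe HPC": 16,
    "Rambo": 8,
    "Xe Base": 2,
    "EMIB": 11,
    "Xe Link": 2,
    "HBM": 8,
}

total_tiles = sum(TILES.values())
print(f"Total tiles: {total_tiles}")
```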

The Ponte Vecchio GPU makes use of 8 HBM 8-Hi stacks and contains a total of 11 EMIB interconnects. The whole Intel Ponte Vecchio package measures 4843.75 mm². It is also mentioned that the bump pitch for Meteor Lake CPUs using High-Density 3D Foveros packaging will be 36 µm.

The Ponte Vecchio GPU is not 1 chip but a combination of several chips. It's a chiplet powerhouse, packing the most chiplets on any GPU/CPU out there, 47 to be precise. And these are not based on just one process node but several process nodes as we had detailed just a few days back.

Although the Aurora Supercomputer, in which the Ponte Vecchio GPUs and Sapphire Rapids CPUs were to be used, has been pushed back due to several delays by the blue team, it is still good to see the company offering more details. Intel has since teased its next-generation Rialto Bridge GPU as the successor to the Ponte Vecchio GPUs, and it is said to begin sampling in 2023.

Next-Gen Data Center GPU Accelerators

GPU Name	AMD Instinct MI250X	NVIDIA Hopper GH100	Intel Ponte Vecchio	Intel Rialto Bridge
Packaging Design	MCM (Infinity Fabric)	Monolithic	MCM (EMIB + Foveros)	MCM (EMIB + Foveros)
GPU Architecture	Aldebaran (CDNA 2)	Hopper GH100	Xe-HPC	Xe-HPC
GPU Process Node	6nm	4N	7nm (Intel 4)	5nm (Intel 3)?
GPU Cores	14,080	16,896	16,384 ALUs (128 Xe Cores)	20,480 ALUs (160 Xe Cores)
GPU Clock Speed	1700 MHz	~1780 MHz	TBA	TBA
L2 / L3 Cache	2 x 8 MB	50 MB	2 x 204 MB	TBA
FP16 Compute	383 TFLOPs	2000 TFLOPs	TBA	TBA
FP32 Compute	95.7 TFLOPs	1000 TFLOPs	~45 TFLOPs (A0 Silicon)	TBA
FP64 Compute	47.9 TFLOPs	60 TFLOPs	TBA	TBA
Memory Capacity	128 GB HBM2E	80 GB HBM3	128 GB HBM2e	128 GB HBM3?
Memory Clock	3.2 Gbps	3.2 Gbps	TBA	TBA
Memory Bus	8192-bit	5120-bit	8192-bit	8192-bit
Memory Bandwidth	3.2 TB/s	3.0 TB/s	~3 TB/s	~3 TB/s
Form Factor	OAM	OAM	OAM	OAM v2
Cooling	Passive Cooling / Liquid Cooling	Passive Cooling / Liquid Cooling	Passive Cooling / Liquid Cooling	Passive Cooling / Liquid Cooling
TDP	560W	700W	600W	800W
Launch	Q4 2021	2H 2022	2022?	2024?
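
For rows where the table gives both a memory clock and a bus width, the memory bandwidth entry can be sanity-checked with the usual peak bandwidth formula (bus width times per-pin data rate, divided by 8 to convert bits to bytes). A minimal sketch, with a hypothetical function name, using the MI250X column:

```python
def memory_bandwidth_tbps(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in TB/s from bus width (bits) and per-pin rate (Gbps)."""
    gigabytes_per_second = bus_width_bits * data_rate_gbps / 8
    return gigabytes_per_second / 1000


# AMD Instinct MI250X column: 8192-bit bus at 3.2 Gbps per pin.
mi250x = memory_bandwidth_tbps(8192, 3.2)
print(f"MI250X: {mi250x:.2f} TB/s")
```

This works out to about 3.28 TB/s, in line with the 3.2 TB/s listed for that part.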

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for the hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
