Intel Showcases Its Packaging Prowess With 7nm Ponte Vecchio Xe-HPC GPU, Over 100 Billion Transistors & 47 XPU Compute Tiles
Yesterday, during the Intel Unleashed webcast, CEO Pat Gelsinger unveiled new details of the 7nm Xe-HPC-based Ponte Vecchio GPU, which is planned to be the largest and most complex chip Intel has designed to date. The Ponte Vecchio GPU will make use of several key technologies highlighted during the event, powering 47 different compute tiles based on different process nodes and architectures.
Intel 7nm Xe-HPC Powered Ponte Vecchio GPU Further Detailed - Over 100 Billion Transistors, 47 XPU Tiles & Mix-Match of Various Process Nodes
The Intel Ponte Vecchio GPU is first and foremost based on the Xe-HPC graphics architecture which is the flagship product leveraging Intel's 7nm EUV process node. But aside from that, the chip has a ton of other compute tiles that are based on different process nodes, all of which merge into one singular exascale graphics processing unit known as Ponte Vecchio. We already gave a run-down of what the complete Ponte Vecchio GPU has to offer and you can read a more detailed post on that here.
So for starters, while the GPU primarily makes use of Intel's 7nm EUV process node, Intel will also be producing some Xe-HPC compute dies through external fabs such as TSMC. There are other tiles that are essential for the Ponte Vecchio GPU to work, and those are fabricated on TSMC's 7nm process node. We cannot confirm yet whether Intel will be leveraging TSMC's standard 7nm or 7nm+ EUV process node, but it is likely that Intel could go the more standard route since the Xe Link I/O tile that will be using TSMC's process can do its job on the non-EUV 7nm process.
Raja teased that there are 7 advanced technologies at play here, and by our calculation, these would include:
- Intel 7nm
- TSMC 7nm
- Foveros 3D Packaging
- 10nm Enhanced Super Fin
- Rambo Cache
Following is how Intel gets to 47 tiles on the Ponte Vecchio chip:
- 16 Xe HPC (internal/external)
- 8 Rambo (internal)
- 2 Xe Base (internal)
- 11 EMIB (internal)
- 2 Xe Link (external)
- 8 HBM (external)
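As a quick sanity check, the tile counts listed above do add up to Intel's stated total of 47. A trivial tally, with the names and counts taken straight from the list:

```python
# Tally of the Ponte Vecchio tile breakdown listed above.
tiles = {
    "Xe HPC compute": 16,  # internal/external
    "Rambo cache": 8,      # internal
    "Xe Base": 2,          # internal
    "EMIB": 11,            # internal
    "Xe Link": 2,          # external
    "HBM": 8,              # external
}
print(sum(tiles.values()))  # 47
```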
The Ponte Vecchio chip is actually composed of two separate GPU dies, each consisting of eight Xe-HPC compute tiles. Each pair of these compute tiles is directly attached to a Rambo Cache, which utilizes the Intel 10nm Enhanced SuperFin process node. Each GPU block is also attached to four HBM2 stacks, which could be either 4-hi or 8-hi; there are eight HBM2 stacks in total, offering multiple gigabytes of memory capacity with loads of bandwidth. There are also 8 passive die stiffeners on each GPU. The main GPU makes use of Foveros 3D packaging to connect the GPU compute tiles with the cache, while EMIB interconnects the HBM2 and Xe Link I/O tiles with the main GPU. In total, the GPU makes use of 11 EMIB dies, featured underneath the HBM2 and I/O link chips.
In general, Foveros offers inter-GPU connectivity (GPU + cache) within the same tiles while EMIB offers connectivity for off-die tiles (HBM2 with GPU). This all culminates in the Ponte Vecchio Xe-HPC GPU, which is composed of over 100 billion transistors. An interesting lego block diagram was posted by Raja Koduri which shows the various blocks/tiles of the Ponte Vecchio GPU, but we also have the more detailed block diagram posted above which provides an exact illustration of what each tile is.
Andreas, in Jan we didn't account the HBM's as individual tiles. That's the main difference.
16 Xe HPC (internal/external)
8 Rambo (internal)
2 Xe Base (internal)
11 EMIB (internal)
2 Xe Link (external)
8 HBM (external) https://t.co/uA0jAs8QDo
— Raja Koduri (@Rajaontheedge) March 24, 2021
Intel Xe HPC 'Ponte Vecchio' GPU - What We Know So Far
So rounding up the details, the Intel Xe HPC 'Ponte Vecchio' GPUs will be the lead 7nm product arriving in 2021. It will feature an MCM package design based on the Foveros 3D packaging technology. Each MCM GPU will be connected to high-density HBM DRAM packages through EMIB and will additionally feature a faster Rambo Cache close to them, connected through Foveros. Finally, while Slingshot provides the interconnect between nodes, Intel's Xe Link will interconnect the 6 Xe HPC GPUs together.
Intel has previously detailed that its Xe HPC GPUs will feature thousands of EUs. So far, we have only seen Xe LP with 96 EUs, which works out to a total of 768 cores. A subslice within a Gen 12 GPU is similar to an NVIDIA SM unit inside the GPC or an AMD CU within the Shader Engine. Intel features 8 EUs per subslice on its Gen 9.5 and Gen 11 GPUs, so if the same hierarchy is kept, we can expect a significant number of super-slices consisting of many subslices. Each Gen 11 and Gen 9.5 EU also contains 8 ALUs, which will remain the same on Gen 12 too from the looks of it.
Rounding it up, a 1000 EU chip would work out to 8000 cores, but it has been confirmed that 1000 is just the base value and the actual core count is much bigger than that. A 4-tile Xe HP GPU with 2048 EUs or 16,384 cores has already been detailed, so it's likely that HPC parts will be much bigger than that. Here are the actual EU counts of Intel's various MCM-based Xe HP GPUs along with estimated core counts and TFLOPs:
- Intel Xe HP (12.5) 1-Tile GPU: 512 EU [Est: 4096 Cores, 12.2 TFLOPs assuming 1.5GHz, 150W]
- Intel Xe HP (12.5) 2-Tile GPU: 1024 EUs [Est: 8192 Cores, 20.48 TFLOPs assuming 1.25 GHz, 300W]
- Intel Xe HP (12.5) 4-Tile GPU: 2048 EUs [Est: 16,384 Cores, 36 TFLOPs assuming 1.1 GHz, 400W/500W]
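The TFLOPs estimates above follow the usual back-of-envelope math: cores = EUs × 8 ALUs, and FP32 throughput = cores × 2 FLOPs per clock (for a fused multiply-add) × clock speed. A small sketch reproducing the figures, assuming those per-EU and per-clock values:

```python
# Estimate FP32 throughput from EU count and clock speed,
# assuming 8 ALUs per EU and 2 FLOPs per ALU per clock (FMA).
def est_tflops(eus: int, ghz: float) -> float:
    cores = eus * 8
    return cores * 2 * ghz / 1000  # GFLOPs -> TFLOPs

for eus, ghz in [(512, 1.5), (1024, 1.25), (2048, 1.1)]:
    print(f"{eus} EUs -> {eus * 8} cores, ~{est_tflops(eus, ghz):.2f} TFLOPs")
# 512 EUs -> 4096 cores, ~12.29 TFLOPs
# 1024 EUs -> 8192 cores, ~20.48 TFLOPs
# 2048 EUs -> 16384 cores, ~36.04 TFLOPs
```

These line up with the 12.2, 20.48, and 36 TFLOPs figures listed above within rounding.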
Intel Xe class GPUs would feature variable vector width as mentioned below:
- SIMT (GPU Style)
- SIMD (CPU Style)
- SIMT + SIMD (Max Performance)
Raja specifically talked about the Xe HPC class GPUs since that's what the developer conference was entirely about. Intel's Xe HPC GPUs would be able to scale to thousands of EUs, and each execution unit has been upgraded to deliver 40 times better double-precision floating-point compute horsepower.
The EUs would be connected to several high-bandwidth memory channels through a new scalable memory fabric known as XEMF (short for XE Memory Fabric). The Xe HPC architecture would also include a very large unified cache, known as Rambo cache, which would connect several GPUs together. This Rambo cache would sustain peak FP64 compute throughput in double-precision workloads by delivering huge memory bandwidth.
Just in terms of process optimizations, the following are a few key improvements that Intel has announced for its 7nm process node over 10nm:
- 2x density scaling vs 10nm
- Planned intra-node optimizations
- 4x reduction in design rules
- Next-Gen Foveros & EMIB Packaging
The Xe HPC GPUs would be using Foveros technology to interconnect with the Rambo cache, which would be shared across several other Xe HPC GPUs on the same interposer. Just like their Xeon brethren, Intel's Xe HPC GPUs would come with ECC memory/cache correction and Xeon-class RAS. Intel's Ponte Vecchio GPUs will be heading out first to the Aurora supercomputer, with shipments beginning later this year. The GPU will compete against NVIDIA's Ada Lovelace and AMD's CDNA 2 graphics architectures in the HPC segment, which are also expected to utilize a multi-die design approach.