Intel Xe GPU Architecture Detailed – Ponte Vecchio Xe HPC Exascale GPU With 1000s of EUs, Massive HBM Memory, Rambo Cache
Intel has just unveiled the latest details of its Xe GPU architecture and the products based on it at its HPC Developer Conference. Speaking on stage, Intel's SVP, Chief Architect and General Manager of Architecture, Raja Koduri, revealed the very first roadmap for Xe, Intel's first in-house graphics architecture, and the product lines it would be embedded in.
Intel Details Xe GPU Architecture - Ponte Vecchio For Exascale Compute Scalable To 1000s of EUs, XEMF Scalable Memory Fabric, Rambo Cache, Foveros Packaging, 40X Increase In FP64 Compute Per EU & A Lot More!
There's much to cover here so let's talk about the first aspect of the Xe GPU architecture, the lineup itself. The Intel Xe GPU architecture is one scalable architecture powering various products. Intel is planning to offer three microarchitectures derived from Xe. These include:
- Intel Xe LP (Integrated + Entry)
- Intel Xe HP (Mid-Range, Enthusiasts, Datacenter / AI)
- Intel Xe HPC (HPC Exascale)
Just from the naming scheme, you can tell where these GPUs would be featured. The 'LP' keyword stands for Low-Power whereas the 'HP' keyword stands for High-Performance. The 'HPC' keyword denotes the architecture aimed at High-Performance Computing, which would use a range of new Intel technologies that we are going to talk about. It is stated that Xe LP targets around 5W-20W but can scale up to 50W. Intel's Xe HP is one tier above that and should cover the 75W-250W segment, while the Xe HPC class architecture should aim even higher, delivering even more compute performance than the rest.
“Architecture is a software compatibility contract. We originally were planning for two microarchitectures within Xe, our architecture (LP and HP), but we saw an opportunity for a third within HPC.” - Raja Koduri
Intel Xe class GPUs would feature variable vector width as mentioned below:
- SIMT (GPU Style)
- SIMD (CPU Style)
- SIMT + SIMD (Max Performance)
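To make the distinction between the two styles concrete, here is a minimal sketch in plain Python. It only illustrates the programming models: SIMD applies one operation to a whole vector of lanes, while SIMT runs the same scalar kernel once per thread index. The function names (`simd_add`, `simt_run`, `add_kernel`) are hypothetical and are not part of any Intel API, and nothing here reflects how Xe actually implements either mode.

```python
# Illustrative only: plain Python standing in for hardware behaviour.
# All names here are hypothetical, not Intel APIs.

def simd_add(a, b):
    """SIMD (CPU style): one operation is applied to a whole vector of lanes."""
    return [x + y for x, y in zip(a, b)]


def simt_run(kernel, n_threads, *arrays):
    """SIMT (GPU style): the same scalar kernel runs once per thread index."""
    return [kernel(tid, *arrays) for tid in range(n_threads)]


def add_kernel(tid, a, b):
    # Each "thread" touches only the element selected by its own thread id.
    return a[tid] + b[tid]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

simd_result = simd_add(a, b)                      # vector-at-a-time view
simt_result = simt_run(add_kernel, len(a), a, b)  # thread-at-a-time view
```

Both calls compute the same element-wise sum; the difference is whether the programmer writes the operation over the whole vector or over a single element per thread.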
Raja specifically talked about the Xe HPC class GPUs since that's what the developer conference is entirely about. Intel's Xe HPC GPUs would be able to scale to 1000s of EUs, and each execution unit (EU) has been upgraded to deliver 40 times the double-precision (FP64) floating-point compute.
The EUs would be connected to several high-bandwidth memory channels through a new scalable memory fabric known as XEMF (short for Xe Memory Fabric). The Xe HPC architecture would also include a very large unified cache known as Rambo Cache, which would connect several GPUs together. By delivering huge memory bandwidth, the Rambo Cache is meant to sustain peak FP64 compute performance throughout double-precision workloads.
“At the heart of Xe architecture we have a new fabric called XEMF. It is the heart of the performance of these machines. We called it the Rambo Cache. It is a unified cache that is accessible to CPU and GPU memory.” - Raja Koduri
Intel will be manufacturing its Xe HPC class GPUs on the latest 7nm process node. This is also the lead 7nm product that Intel has talked about previously. Intel would make full use of its new and enhanced packaging technologies, such as Foveros and EMIB interconnects, to develop the next exascale GPUs. In terms of process optimizations, the following are a few key improvements that Intel has announced for its 7nm process node over 10nm:
- 2x density scaling vs 10nm
- Planned intra-node optimizations
- 4x reduction in design rules
- Next-Gen Foveros & EMIB Packaging
The Xe HPC GPUs would use Foveros technology to connect to the Rambo Cache, which would be shared across several other Xe HPC GPUs on the same interposer. Similarly, EMIB would be used to connect the HBM memory to the GPUs. Both technologies would deliver a huge leap in bandwidth efficiency and density. Just like their Xeon brethren, Intel's Xe HPC GPUs would come with ECC memory/cache correction and Xeon-class RAS.
Blue Team's First HPC GPU, The 7nm Ponte Vecchio - Landing in The Aurora Supercomputer in 2021
With all the key technologies detailed, let's get straight to the first 7nm product in which Intel's Xe HPC architecture is going to be featured. It is called Ponte Vecchio, a supermassive GPU that aims to be the next single-chip exascale design for supercomputers. The Ponte Vecchio GPU would come with 16 compute chiplets which are based on the Xe HPC GPU architecture.
There seem to be massive amounts of HBM DRAM connected to each GPU. A single node of the Aurora supercomputer is also detailed here. We are looking at six Ponte Vecchio GPUs connected via Intel's Xe Link, which is based on CXL (Compute Express Link), with a oneAPI software stack. The node would also feature two Intel Sapphire Rapids processors, which are based on the next-gen 10nm++ Willow Cove CPU architecture. The first confirmed product to feature the 7nm datacenter Xe based Ponte Vecchio GPUs will be the Aurora supercomputer, as detailed above. Some key features of a single Aurora supercomputer node include:
- Leadership Performance (For HPC, Data Analytics, AI)
- Unified Memory Architecture (Across CPU & GPU)
- All-To-All Connectivity Within Node (Low Latency, High Bandwidth)
- Unparalleled I/O Scalability Across Nodes (8 Fabric Endpoints per node, DAOS)
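The node figures quoted above lend themselves to some quick arithmetic. The sketch below models a node with the numbers from this article (six GPUs, 16 compute chiplets per GPU, two CPUs) and counts the compute chiplets per node and the distinct GPU pairs that all-to-all connectivity has to cover. The `AuroraNode` class is hypothetical, and treating all-to-all as one link per GPU pair is an assumption made only for illustration; Intel has not described the link topology in that detail here.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class AuroraNode:
    """Hypothetical model of one node, using the counts quoted in the article."""
    gpus: int = 6               # Ponte Vecchio GPUs per node
    chiplets_per_gpu: int = 16  # Xe HPC compute chiplets per GPU
    cpus: int = 2               # Sapphire Rapids processors per node

    def compute_chiplets(self) -> int:
        # Total Xe HPC compute chiplets in the node.
        return self.gpus * self.chiplets_per_gpu

    def gpu_pairs(self) -> list:
        # Every distinct pair of GPUs. ASSUMPTION: all-to-all is pictured
        # as one point-to-point link per pair, which the source does not state.
        return list(combinations(range(self.gpus), 2))


node = AuroraNode()
total_chiplets = node.compute_chiplets()  # 6 * 16 = 96
pair_count = len(node.gpu_pairs())        # 6 * 5 / 2 = 15
```

Under these assumptions a single node carries 96 compute chiplets, and 15 GPU-to-GPU pairs need a path to each other for the all-to-all claim to hold.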
The approach is very similar to what NVIDIA did with its DGX-2, stacking 16 Volta GPUs inside a single node and connecting them through NVSwitch. The difference is in the naming: NVIDIA termed the entire node a GPU, while Intel is terming the 16 chiplets on a single interposer a GPU, with six of these GPUs in a single node. It is likely that NVIDIA will also follow the MCM (Multi-Chip-Module) chiplet design on its future HPC products such as Ampere, which is expected to make its debut in 2020, a year before Intel's Ponte Vecchio lands in the HPC market.
While the datacenter would be the first to use 7nm Xe GPUs, Intel's 10nm Xe GPU lineup, which would utilize the more consumer-tuned Xe LP and Xe HP GPU architectures, would be making its way to the mainstream and enthusiast gaming market in 2020.