NVIDIA has posted a deep dive into its next-generation Grace CPU Superchip, which it claims offers up to a 2.5x performance gain over AMD EPYC CPUs.
NVIDIA Shows Up To 2.5x Performance & 3.5x Efficiency Gain With Grace CPU Superchip Versus AMD EPYC Milan
NVIDIA first announced its Grace CPU and the respective Superchip design at GTC 2022. The Grace CPU is NVIDIA's first processor based on a custom Arm architecture that will be aiming at the server / HPC segment. The CPU comes in two Superchip configurations, a Grace Superchip module with two Grace CPUs and a Grace+Hopper Superchip with one Grace CPU connected to a Hopper H100 GPU.
Some of the main highlights of Grace include:
- High-performance CPU for HPC and cloud computing
- Superchip design with up to 144 Arm v9 CPU cores
- World’s first LPDDR5x with ECC Memory, 1TB/s total bandwidth
- SPECrate2017_int_base over 740 (estimated)
- 900 GB/s coherent interface, 7X faster than PCIe Gen 5
- 2X the packaging density of DIMM-based solutions
- 2X the performance per watt of today’s leading CPU
- Runs all NVIDIA software stacks and platforms, including RTX, HPC, AI, and Omniverse
NVIDIA Grace CPU Superchip architecture features:

| Feature | Specification |
|---|---|
| Core architecture | Neoverse V2 cores: Armv9 with 4x128b SVE2 |
| Cache | L1: 64 KB I-cache + 64 KB D-cache per core; L2: 1 MB per core; L3: 234 MB per Superchip |
| Memory technology | LPDDR5X with ECC, co-packaged |
| Raw memory BW | Up to 1 TB/s |
| Memory size | Up to 960 GB |
| FP64 peak | 7.1 TFLOPS |
| PCI Express | 8x PCIe Gen 5 x16 interfaces with option to bifurcate; total 1 TB/s PCIe bandwidth. Additional low-speed PCIe connectivity for management. |
| Power | 500 W TDP with memory, 12 V supply |
As NVIDIA's first server CPU, Grace features 72 Arm v9.0 cores with support for SVE2 and virtualization extensions such as nested virtualization and S-EL2. The CPU is fabricated on TSMC's 4N process node, an optimized version of the 5nm node made exclusively for NVIDIA. The new architecture can provide up to 7.1 TFLOPS of peak FP64 performance.
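As a rough sanity check on the 7.1 TFLOPS figure, the sketch below works backwards to the clock speed it would imply. It assumes the figure applies to the 144-core Superchip (as the feature table suggests), that each of the four 128-bit SVE2 units can issue one FP64 fused multiply-add per lane per cycle, and that an FMA counts as two FLOPs. These are assumptions for illustration, not NVIDIA-published details.

```python
# Back-of-the-envelope estimate of the clock implied by the quoted FP64 peak.
CORES = 144              # cores per Grace CPU Superchip (from the article)
SVE2_UNITS = 4           # 4x128b SVE2 per core (from the feature table)
VECTOR_BITS = 128
FP64_BITS = 64
FLOPS_PER_FMA = 2        # assumption: one fused multiply-add = 2 FLOPs
PEAK_FP64_TFLOPS = 7.1   # quoted peak, assumed to be for the full Superchip


def flops_per_cycle_per_core():
    """FP64 FLOPs one core could retire per cycle under the assumptions above."""
    lanes = SVE2_UNITS * (VECTOR_BITS // FP64_BITS)
    return lanes * FLOPS_PER_FMA


def implied_clock_ghz():
    """Clock frequency at which the assumed pipeline reaches the quoted peak."""
    flops_per_cycle = CORES * flops_per_cycle_per_core()
    return PEAK_FP64_TFLOPS * 1e12 / flops_per_cycle / 1e9


if __name__ == "__main__":
    print(f"FLOPs/cycle/core: {flops_per_cycle_per_core()}")
    print(f"Implied clock: {implied_clock_ghz():.2f} GHz")
```

Under these assumptions the quoted peak corresponds to a clock of a little over 3 GHz, which is a plausible range for a server CPU but should not be read as a confirmed specification.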
Grace is designed to be paired with a second chip, so one of the most crucial aspects of the design is its C2C (chip-to-chip) interconnect. NVIDIA uses NVLINK to join the chips in each Superchip, which it says removes the bottlenecks associated with a typical cross-socket configuration.
The C2C NVLINK interconnect provides 900 GB/s of raw bi-directional bandwidth (the same bandwidth as a GPU-to-GPU NVLINK switch on Hopper) while consuming just 1.3 pJ/bit, which makes it 5 times more efficient than the PCIe protocol.
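To put the interconnect numbers in perspective, here is a small sketch that converts the quoted energy per bit into link power at full bandwidth and compares the link with the 128 GB/s PCIe Gen 5 x16 figure quoted elsewhere in this article. It assumes 1 GB = 10^9 bytes and that the link is fully saturated; the function names are illustrative.

```python
# Rough arithmetic on the NVLINK C2C figures quoted in the article.
C2C_BANDWIDTH_GB_S = 900       # raw bi-directional bandwidth
C2C_ENERGY_PJ_PER_BIT = 1.3    # quoted interface energy
PCIE5_X16_GB_S = 128           # PCIe Gen 5 x16 figure used in the article
PCIE_EFFICIENCY_FACTOR = 5     # "5 times more efficient than PCIe"


def link_power_watts(bandwidth_gb_s, energy_pj_per_bit):
    """Power drawn by a fully saturated link (assumes 1 GB = 1e9 bytes)."""
    bits_per_second = bandwidth_gb_s * 1e9 * 8
    return bits_per_second * energy_pj_per_bit * 1e-12


def bandwidth_ratio_vs_pcie():
    """How many PCIe Gen 5 x16 links match one C2C link in raw bandwidth."""
    return C2C_BANDWIDTH_GB_S / PCIE5_X16_GB_S


def implied_pcie_energy_pj_per_bit():
    """PCIe energy per bit implied by the quoted 5x efficiency claim."""
    return C2C_ENERGY_PJ_PER_BIT * PCIE_EFFICIENCY_FACTOR


if __name__ == "__main__":
    print(f"C2C power at full rate: {link_power_watts(900, 1.3):.2f} W")
    print(f"Bandwidth vs PCIe Gen 5 x16: {bandwidth_ratio_vs_pcie():.2f}x")
    print(f"Implied PCIe energy: {implied_pcie_energy_pj_per_bit():.1f} pJ/bit")
```

The bandwidth ratio comes out at roughly 7x, which matches the "7X faster than PCIe Gen 5" highlight, and the saturated link would draw under 10 W by this estimate.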
The NVIDIA Grace CPU features a scalable coherency fabric with a distributed cache design. The chip has up to 3.225 TB/s of bi-section bandwidth, is scalable beyond 72 cores (144 on a Superchip), integrates 117 MB of L3 cache per CPU or 234 MB per Superchip, and supports Arm memory partitioning and monitoring (MPAM). Grace also allows for a unified memory architecture with shared page tables. Two NVIDIA Grace+Hopper Superchips can be interconnected through an NVSwitch, and a Grace CPU on one Superchip can directly communicate with the GPU on the other Superchip or even access its VRAM at native NVLINK speeds.
Taking a closer look at the memory design of Grace, NVIDIA is utilizing up to 960 GB of LPDDR5X (ECC) across 32 channels, delivering up to 1 TB/s of memory bandwidth. NVIDIA states that LPDDR5X provides the best value when weighing overall bandwidth, cost, and power requirements. For example, versus DDR5, the LPDDR5X subsystem provides 53% more bandwidth at one-eighth the power per gigabyte per second and at a similar cost. HBM2e memory could have provided more bandwidth and efficiency, but at 3x the cost.
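The memory figures above can be broken down per channel and per core with some simple arithmetic. The sketch below assumes that the 960 GB, 32 channels, and 1 TB/s all refer to the same 144-core Superchip configuration and treats 1 TB/s as 1000 GB/s; the DDR5 baseline is only what the quoted 53% advantage would imply, not a published number.

```python
# Per-channel and per-core breakdown of the quoted LPDDR5X figures.
MEMORY_BW_GB_S = 1000        # "up to 1 TB/s", taken as 1000 GB/s
CHANNELS = 32
CAPACITY_GB = 960
CORES = 144                  # assumption: figures are for the full Superchip
BW_GAIN_OVER_DDR5 = 0.53     # "53% more bandwidth" versus DDR5


def bandwidth_per_channel():
    return MEMORY_BW_GB_S / CHANNELS


def bandwidth_per_core():
    return MEMORY_BW_GB_S / CORES


def capacity_per_core():
    return CAPACITY_GB / CORES


def implied_ddr5_bandwidth():
    """DDR5 bandwidth that a 53% advantage would imply for the comparison."""
    return MEMORY_BW_GB_S / (1 + BW_GAIN_OVER_DDR5)


if __name__ == "__main__":
    print(f"Per channel: {bandwidth_per_channel():.2f} GB/s")
    print(f"Per core:    {bandwidth_per_core():.2f} GB/s, "
          f"{capacity_per_core():.2f} GB")
    print(f"Implied DDR5 baseline: {implied_ddr5_bandwidth():.1f} GB/s")
```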
For I/O, you get 68 PCIe Gen 5.0 lanes, which can be configured as four x16 links at 128 GB/s each, with the remaining lanes used for miscellaneous connectivity. There are also 12 coherent NVLINK lanes, which are shared with two of the Gen 5 PCIe x16 links.
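The lane counts here can be reconciled with the feature table's "8x PCIe Gen 5 x16 interfaces" and "1 TB/s" total. The sketch below assumes the 68 lanes are per Grace CPU and that a Superchip simply doubles them; that reading is an inference from the numbers rather than something the article states outright.

```python
# Reconciling the per-CPU lane count with the Superchip totals in the table.
LANES_PER_CPU = 68
X16_LINKS_PER_CPU = 4
LANES_PER_X16 = 16
X16_BANDWIDTH_GB_S = 128     # per x16 link, as quoted
CPUS_PER_SUPERCHIP = 2       # assumption: lane counts are per Grace CPU


def leftover_lanes_per_cpu():
    """Lanes remaining after the four x16 links are allocated."""
    return LANES_PER_CPU - X16_LINKS_PER_CPU * LANES_PER_X16


def superchip_x16_links():
    return X16_LINKS_PER_CPU * CPUS_PER_SUPERCHIP


def superchip_pcie_bandwidth_gb_s():
    return superchip_x16_links() * X16_BANDWIDTH_GB_S


if __name__ == "__main__":
    print(f"Leftover lanes per CPU: {leftover_lanes_per_cpu()}")
    print(f"x16 links per Superchip: {superchip_x16_links()}")
    print(f"Total x16 bandwidth: {superchip_pcie_bandwidth_gb_s()} GB/s")
```

Eight x16 links at 128 GB/s each add up to 1024 GB/s, which lines up with the roughly 1 TB/s total PCIe bandwidth listed in the table.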
As for power, the CPU-only NVIDIA Grace Superchip is optimized for single-core performance, offers up to 1 TB/s of memory bandwidth, and has a TDP of 500 W for the 144-core dual-chip configuration.
The performance figures showcased by NVIDIA put the Grace CPU Superchip up against dual-socket (2P) AMD EPYC 7763 "Milan" CPUs across various HPC workloads such as OpenFOAM, WRF, NEMO, and BWA. In OpenFOAM, the Grace CPU Superchip delivers an incredible 2.5x performance increase with up to 3.5x efficiency. On average, NVIDIA's new Grace CPU Superchip should be able to deliver a 1.9x increase in performance and a 2.57x increase in performance per watt compared to AMD's EPYC Milan CPUs. This should also lead to competitive performance against the latest server chips from AMD & Intel.
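Since efficiency is performance divided by power, the quoted performance and performance-per-watt gains together imply how much power Grace draws relative to the EPYC system in these tests. The sketch below derives that ratio; it assumes both gains were measured on the same runs, which the article does not confirm, and `relative_power` is an illustrative helper name.

```python
# Relative power draw implied by the quoted performance and efficiency gains.
def relative_power(perf_gain, perf_per_watt_gain):
    """Power of Grace relative to the baseline.

    perf_per_watt_gain = perf_gain / power_ratio, so
    power_ratio = perf_gain / perf_per_watt_gain.
    """
    return perf_gain / perf_per_watt_gain


if __name__ == "__main__":
    openfoam = relative_power(2.5, 3.5)    # OpenFOAM figures from the article
    average = relative_power(1.9, 2.57)    # average figures from the article
    print(f"OpenFOAM: Grace uses about {openfoam:.0%} of the baseline power")
    print(f"Average:  Grace uses about {average:.0%} of the baseline power")
```

Both cases work out to the Grace Superchip drawing a bit over 70% of the power of the dual-socket EPYC 7763 setup while delivering higher performance, if the quoted gains are taken at face value.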
NVIDIA Grace CPU Superchip vs AMD EPYC 7763 Milan CPUs:
We have already put the numbers into perspective in a previous article which can be seen below:
NVIDIA states that its Grace is a highly specialized processor targeting workloads such as training next-generation NLP models that have more than 1 trillion parameters. When tightly coupled with NVIDIA GPUs, a Grace CPU-based system will deliver 10x faster performance than today’s state-of-the-art NVIDIA DGX-based systems, which run on x86 CPUs.
It will definitely be interesting to see how the Grace CPUs stack up against x86 chips, but by the time they are released, they will be competing against AMD's Genoa and Intel's Sapphire Rapids CPUs. The NVIDIA Grace CPUs are also planned to be used in an Atos supercomputer.