NVIDIA Grace CPU Superchip Benchmarks Show 2.5x Performance & 3.5x Efficiency Gain Over AMD EPYC Milan CPUs

Hassan Mujtaba • Jan 20, 2023 at 06:00am EST

NVIDIA has recently posted a deep dive into its next-generation Grace CPU Superchip, which the company says offers up to a 2.5x performance gain over AMD's EPYC Milan CPUs.

NVIDIA Shows Up To 2.5x Performance & 3.5x Efficiency Gain With Grace CPU Superchip Versus AMD EPYC Milan

NVIDIA first announced its Grace CPU at GTC 2021 and followed up with the Superchip design at GTC 2022. The Grace CPU is NVIDIA's first processor based on a custom Arm architecture that is aimed at the server / HPC segment. The CPU comes in two Superchip configurations: a Grace Superchip module with two Grace CPUs, and a Grace+Hopper Superchip with one Grace CPU connected to a Hopper H100 GPU.

Some of the main highlights of Grace include:

High-performance CPU for HPC and cloud computing
Superchip design with up to 144 Arm v9 CPU cores
World’s first LPDDR5X with ECC Memory, 1TB/s total bandwidth
SPECrate2017_int_base over 740 (estimated)
900 GB/s coherent interface, 7X faster than PCIe Gen 5
2X the packaging density of DIMM-based solutions
2X the performance per watt of today’s leading CPU
Runs all NVIDIA software stacks and platforms, including RTX, HPC, AI, and Omniverse

	NVIDIA Grace CPU Superchip architecture features
Core architecture	Neoverse V2 Cores: Armv9 with 4x128b SVE2
Core count	144
Cache	L1: 64 KB I-cache + 64 KB D-cache per core; L2: 1 MB per core; L3: 234 MB per Superchip
Memory technology	LPDDR5X with ECC, co-packaged
Raw memory BW	Up to 1 TB/s
Memory size	Up to 960 GB
FP64 peak	7.1 TFLOPS
PCI Express	8x PCIe Gen 5 x16 interfaces, with option to bifurcate; total 1 TB/s PCIe bandwidth. Additional low-speed PCIe connectivity for management.
Power	500 W TDP with memory, 12 V supply
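
The FP64 peak in the table can be roughly sanity-checked from the core count and the SVE2 configuration. The sketch below is only a back-of-the-envelope estimate: it assumes every one of the four 128-bit SVE2 pipes can issue a fused multiply-add each cycle and that an FMA counts as two FLOPs, neither of which is stated in NVIDIA's material. The implied clock it returns is therefore an inference, not a published specification.

```python
# Rough check of the 7.1 TFLOPS FP64 peak listed in the table above.
# Assumptions (not from NVIDIA's material): all four SVE2 pipes issue one
# FMA per cycle, and one FMA is counted as two floating-point operations.

CORES = 144                 # Superchip core count (from the table)
SVE2_PIPES_PER_CORE = 4     # "4x128b SVE2" (from the table)
VECTOR_WIDTH_BITS = 128
FP64_BITS = 64
FLOPS_PER_FMA = 2           # assumption: multiply + add


def fp64_flops_per_cycle_per_core() -> int:
    """FP64 operations one core could retire per cycle under the assumptions."""
    lanes_per_pipe = VECTOR_WIDTH_BITS // FP64_BITS
    return SVE2_PIPES_PER_CORE * lanes_per_pipe * FLOPS_PER_FMA


def implied_clock_ghz(peak_tflops: float) -> float:
    """Clock speed that would be needed to reach the quoted peak."""
    flops_per_cycle = CORES * fp64_flops_per_cycle_per_core()
    return peak_tflops * 1e12 / flops_per_cycle / 1e9


if __name__ == "__main__":
    print(fp64_flops_per_cycle_per_core())    # FLOPs per cycle per core
    print(round(implied_clock_ghz(7.1), 2))   # implied clock in GHz
```

Under these assumptions the quoted peak works out to a clock of roughly 3.1 GHz, which is at least plausible for a server part on a 5nm-class node.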

As NVIDIA's first server CPU, each Grace chip features 72 Arm v9.0 cores that support SVE2 and virtualization extensions such as nested virtualization and S-EL2. The CPU is fabricated on TSMC's 4N process node, an optimized version of the 5nm node that is made exclusively for NVIDIA. In its 144-core Superchip configuration, the new architecture can provide up to 7.1 TFLOPS of peak FP64 performance.

Grace is designed to be paired with a second chip, so one of the most crucial aspects of the design is its C2C (Chip-To-Chip) interconnect. NVIDIA uses NVLINK to join the chips into a Superchip, which it says removes the bottlenecks associated with a typical cross-socket configuration.

The C2C NVLINK interconnect provides 900 GB/s of raw bidirectional bandwidth (the same bandwidth as a GPU-to-GPU NVLINK switch on Hopper) while consuming just 1.3 pJ/bit, which NVIDIA says makes it 5 times more efficient than the PCIe protocol.
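
To put the energy figure into perspective, the 1.3 pJ/bit number can be turned into an approximate power draw for the link. The sketch below assumes 1 GB equals 10^9 bytes, that the per-bit energy applies at the full 900 GB/s rate, and that "5 times more efficient" means PCIe would spend five times the energy per bit. These are our assumptions for illustration, not figures from NVIDIA.

```python
# Approximate power drawn by a link from its bandwidth and energy per bit.
# Assumptions: 1 GB = 1e9 bytes, the pJ/bit figure holds at full bandwidth,
# and PCIe is modelled as 5x the energy per bit of NVLINK C2C.

NVLINK_C2C_GB_S = 900        # raw bidirectional bandwidth (from the article)
NVLINK_C2C_PJ_PER_BIT = 1.3  # energy per bit (from the article)
PCIE_ENERGY_FACTOR = 5       # "5 times more efficient" read as 5x pJ/bit


def link_power_watts(bandwidth_gb_s: float, pj_per_bit: float) -> float:
    """Power in watts = bits per second * joules per bit."""
    bits_per_second = bandwidth_gb_s * 1e9 * 8
    return bits_per_second * pj_per_bit * 1e-12


if __name__ == "__main__":
    c2c = link_power_watts(NVLINK_C2C_GB_S, NVLINK_C2C_PJ_PER_BIT)
    pcie_like = link_power_watts(
        NVLINK_C2C_GB_S, NVLINK_C2C_PJ_PER_BIT * PCIE_ENERGY_FACTOR
    )
    print(f"NVLINK C2C at full rate: ~{c2c:.1f} W")
    print(f"Same traffic at 5x pJ/bit: ~{pcie_like:.1f} W")
```

On these assumptions, moving 900 GB/s costs under 10 W over NVLINK C2C, versus several tens of watts at a PCIe-like energy per bit.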

The NVIDIA Grace CPU features a scalable coherency fabric with a distributed cache design. The chip has up to 3.225 TB/s of bisection bandwidth, is scalable beyond 72 cores (144 on the Superchip), integrates 117 MB of L3 cache per CPU or 234 MB per Superchip, and supports Arm memory partitioning and monitoring (MPAM). Grace also allows for a unified memory architecture with shared page tables. Two NVIDIA Grace+Hopper Superchips can be interconnected through an NVSwitch, so that a Grace CPU on one Superchip can communicate directly with the GPU on the other Superchip or even access its VRAM at native NVLINK speeds.

Taking a closer look at the memory design of Grace, NVIDIA is utilizing up to 960 GB of LPDDR5X (ECC) across 32 channels, delivering up to 1 TB/s of memory bandwidth. NVIDIA states that LPDDR5X provides the best value when weighing overall bandwidth, cost, and power requirements. Versus DDR5, for example, the LPDDR5X subsystem provides 53% more bandwidth at one-eighth the power per gigabyte per second and at a similar cost. HBM2e memory could have provided more bandwidth and efficiency, but at 3x the cost.

For I/O, you get 68 PCIe Gen 5.0 lanes per Grace CPU. Sixty-four of them can be used as four x16 links at 128 GB/s each, and the remaining four lanes are used for miscellaneous connectivity. There are also 12 coherent NVLINK lanes, which are shared with two of the Gen 5 PCIe x16 links.
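
The lane and bandwidth figures quoted for the I/O subsystem can be cross-checked with some simple arithmetic. The sketch below uses only numbers from this article (68 lanes, x16 links at 128 GB/s, 8 x16 interfaces per Superchip, 900 GB/s NVLINK C2C); the helper names are ours, and 128 GB/s is taken as the bidirectional figure for a Gen 5 x16 link.

```python
# Cross-check of the PCIe lane budget and bandwidth claims in the article.
# All inputs come from the article; the helper names are illustrative only.

TOTAL_PCIE_LANES = 68        # per Grace CPU
X16_LINKS_PER_CPU = 4
X16_LINKS_PER_SUPERCHIP = 8  # from the architecture table
X16_LINK_GB_S = 128          # bidirectional bandwidth of one Gen 5 x16 link
NVLINK_C2C_GB_S = 900


def remaining_lanes(total: int, links: int, width: int = 16) -> int:
    """Lanes left over after carving out `links` links of `width` lanes."""
    return total - links * width


def aggregate_gb_s(links: int, per_link_gb_s: int = X16_LINK_GB_S) -> int:
    """Total bandwidth across a number of identical links."""
    return links * per_link_gb_s


if __name__ == "__main__":
    print(remaining_lanes(TOTAL_PCIE_LANES, X16_LINKS_PER_CPU))  # misc lanes
    print(aggregate_gb_s(X16_LINKS_PER_SUPERCHIP))               # GB/s total
    print(round(NVLINK_C2C_GB_S / X16_LINK_GB_S, 2))             # C2C vs x16
```

The eight x16 links add up to 1024 GB/s, which matches the "1 TB/s" total in the table, and the ratio of 900 GB/s to a single 128 GB/s link comes out at about 7, in line with the "7X faster than PCIe Gen 5" highlight.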

As for TDP, the NVIDIA Grace (CPU-only) Superchip is optimized for single-core performance, offers up to 1 TB/s of memory bandwidth, and has a TDP of 500W for the 144-core dual-chip configuration, a figure that includes the memory.

The performance figures showcased by NVIDIA put the Grace CPU Superchip up against dual-socket (2P) AMD EPYC 7763 "Milan" CPUs across various HPC workloads such as OpenFOAM, WRF, NEMO, and BWA. In OpenFOAM, the Grace CPU Superchip delivers an incredible 2.5x performance increase along with up to 3.5x the efficiency. On average, NVIDIA's figures point to a 1.9x increase in performance and a 2.57x increase in performance per watt compared to AMD's EPYC Milan CPUs. This should also lead to competitive performance against the latest server chips from AMD & Intel.
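
Since efficiency here is performance per watt, the two ratios NVIDIA quotes also imply how much power Grace draws relative to the EPYC system in each case. The sketch below assumes both ratios were measured on the same workload and the same pair of systems, which the slides suggest but do not spell out; the function name is ours.

```python
# Implied relative power draw from a performance gain and an efficiency gain.
# Assumption: efficiency means performance per watt, measured on the same
# workload and systems as the performance figure.


def implied_power_ratio(perf_gain: float, efficiency_gain: float) -> float:
    """Grace power divided by EPYC power, given the two quoted ratios.

    perf_gain       = perf_grace / perf_epyc
    efficiency_gain = (perf_grace / watts_grace) / (perf_epyc / watts_epyc)
    so watts_grace / watts_epyc = perf_gain / efficiency_gain.
    """
    return perf_gain / efficiency_gain


if __name__ == "__main__":
    openfoam = implied_power_ratio(2.5, 3.5)
    average = implied_power_ratio(1.9, 2.57)
    print(f"OpenFOAM: Grace at ~{openfoam:.0%} of the EPYC system's power")
    print(f"Average:  Grace at ~{average:.0%} of the EPYC system's power")
```

In other words, if the figures hold, the Grace Superchip would be delivering its gains while drawing somewhere around 70 to 75 percent of the power of the 2P EPYC 7763 setup.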

NVIDIA Grace CPU Superchip vs AMD EPYC 7763 Milan CPUs:

We have already put the numbers into perspective in a previous article which can be seen below:

[Chart: SPEC Integer Performance (NVIDIA Grace vs AMD EPYC). Metric: SPECrate_int_base, with axis ticks from 200 to 1200. Entries: EPYC 7763 (128 Core), Grace (144 Core), EPYC 7742 (128 Core), Grace (72 Core).]

NVIDIA states that Grace is a highly specialized processor targeting workloads such as training next-generation NLP models that have more than 1 trillion parameters. According to the company, a Grace CPU-based system that is tightly coupled with NVIDIA GPUs will deliver 10x the performance of today’s state-of-the-art NVIDIA DGX-based systems, which run on x86 CPUs.

It will definitely be interesting to see how the Grace CPUs stack up against x86 chips, but by the time they are released, they will be competing against AMD's Genoa and Intel's Sapphire Rapids CPUs. The NVIDIA Grace CPUs are also planned to be used in an Atos supercomputer.

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.
