Intel Unveils Habana Gaudi2 & Greco 7nm Deep Learning Accelerators: Gaudi2 With 24 TPCs, 96 GB HBM2e, 600W TDP Offering Faster Training Performance Than NVIDIA Ampere A100


Intel has today officially unveiled its 7nm Habana Gaudi2 and Greco Deep Learning accelerators, offering up to 2x the throughput performance versus NVIDIA's Ampere A100 GPU.

Intel Unveils 7nm Habana Gaudi2 & Greco Deep Learning Accelerators, Up To 2x The Throughput Performance Versus NVIDIA's Ampere A100

The latest Deep Learning accelerators for data centers were designed at Intel's Habana Labs. They are dedicated Deep Learning platforms, with the Gaudi2 aimed at training and the Greco at inference. Starting with the details, both the Habana Gaudi2 and the Greco are based on a 7nm process node. Unfortunately, this detail doesn't tell us much, because 7nm could refer to TSMC's N7 process, Intel 7 (formerly Intel 10nm), or Intel 4 (formerly Intel 7nm, and the least likely).

The original Habana Gaudi processors were built on TSMC's 16nm process, which makes it more likely for this chip to be on N7 or Intel 7. Whatever the case, the Gaudi2 platform is clearly on a far smaller node than 16nm, which in itself gives a density increase of roughly 50%.

As for the specifications, the Gaudi2 features 24 TPCs (versus 8 on the original Gaudi) with support for the FP8 format, along with media decode and processing. The memory configuration includes 96 GB of HBM2e memory offering 2.45 TB/s of bandwidth, plus an additional 48 MB of SRAM. Networking is provided through 24 100GbE ports. Such a big jump in performance also means that the TDP has gone up dramatically: the Gaudi2 operates at a 600W TDP, versus 350W for the original Gaudi.
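The generational changes quoted above reduce to a few simple ratios. The following is a minimal Python sketch that uses only the figures in this article; the variable names are illustrative, and the aggregate networking number assumes every one of the 24 links runs at its full 100 Gb/s line rate, which real workloads may not reach.

```python
# Figures quoted in the article; dictionary keys are illustrative names only.
gaudi1 = {"tpcs": 8, "tdp_w": 350}
gaudi2 = {"tpcs": 24, "tdp_w": 600, "eth_links": 24, "eth_gbps_per_link": 100}

# Generation-over-generation ratios.
tpc_ratio = gaudi2["tpcs"] / gaudi1["tpcs"]
tdp_ratio = gaudi2["tdp_w"] / gaudi1["tdp_w"]

# Assumption: all links sustain the full 100 Gb/s line rate simultaneously.
total_eth_gbps = gaudi2["eth_links"] * gaudi2["eth_gbps_per_link"]
total_eth_gigabytes_per_s = total_eth_gbps / 8  # 8 bits per byte

print(f"TPC count: {tpc_ratio:.1f}x the original Gaudi")
print(f"TDP: {tdp_ratio:.2f}x the original Gaudi")
print(f"Aggregate Ethernet: {total_eth_gbps} Gb/s ({total_eth_gigabytes_per_s:.0f} GB/s)")
```

In other words, the core count triples while the power budget grows by a bit over 1.7x, and the Ethernet links add up to 2.4 Tb/s of peak networking bandwidth under the stated assumption.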

In terms of performance, Intel's numbers show a 1.9x gain in ResNet-50 training throughput for the Habana Gaudi2 accelerator versus a single A100 80 GB GPU. In NLP BERT Phase-1 training, the chip delivers 1.7x the throughput, rising to 2.8x in Phase-2 training. Lastly, Intel also put together a BERT training throughput comparison which shows a 2.0x gain for the Gaudi2 over its competitor, the NVIDIA A100. Overall, Intel says the new accelerator offers training cost savings of up to 75% versus NVIDIA solutions.
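To set these throughput claims against the higher power budget, here is a rough sketch that normalizes the quoted speedups by TDP. The 400W figure for the A100 80 GB is an assumption that does not come from this article (it is NVIDIA's rating for the SXM variant; the PCIe card is rated lower), and TDP is a ceiling rather than measured power draw, so treat the output as a back-of-the-envelope estimate and not a benchmark. The function name is hypothetical.

```python
GAUDI2_TDP_W = 600  # from the article
A100_TDP_W = 400    # ASSUMPTION: A100 80 GB SXM rating, not stated in the article

# Gaudi2 training speedups over a single A100 80 GB, as quoted in the article.
speedups = {"ResNet-50": 1.9, "BERT Phase-1": 1.7, "BERT Phase-2": 2.8}

def relative_perf_per_watt(speedup, tdp_w=GAUDI2_TDP_W, baseline_tdp_w=A100_TDP_W):
    """Divide a throughput speedup by the TDP ratio of the two parts."""
    return speedup / (tdp_w / baseline_tdp_w)

for workload, speedup in speedups.items():
    ratio = relative_perf_per_watt(speedup)
    print(f"{workload}: {ratio:.2f}x throughput per TDP watt")
```

Under that assumption the Gaudi2 carries a 1.5x higher TDP, so any speedup above 1.5x translates into better throughput per rated watt.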

There's also the Intel Habana Greco, a deep learning inference accelerator designed for peak efficiency and based on the same 7nm process node. The accelerator offers 16 GB of LPDDR5 memory with 240 GB/s of bandwidth and an additional 128 MB of on-chip SRAM. The compute capabilities include BF16, FP16, and INT4 formats, along with media decode and processing.

The TDP is rated at just 75W, so there's no need for external power connectors on the card. Whereas the Gaudi2 comes as an OAM module, the Greco comes in a single-slot HHHL form factor.

Intel has also announced that the 7nm Gaudi2 processor is available to customers starting now while the Greco will be sampling to select customers in the second half of 2022.