AMD has just announced a major win in the HPC sector: its next-generation EPYC processors and Radeon Instinct accelerators will power the 2 Exaflop El Capitan supercomputer of the U.S. Department of Energy (DOE), which should be operational by 2023.
AMD's EPYC Genoa & Radeon Instinct HPC Accelerators To Drive 2 Exaflop 'El Capitan' Supercomputer
Intel, AMD & NVIDIA were all competing to win the contract for the DOE's latest supercomputer, but it looks like AMD won on both the CPU & GPU front. The El Capitan supercomputer will be built by HPE's Cray supercomputing division, which will utilize the next-generation processors and accelerators from AMD to bring this exaflop monster to life by 2023. The supercomputer will be deployed at the Lawrence Livermore National Laboratory (LLNL) and will be able to perform up to 2 quintillion calculations per second.
"We expect when it's delivered to the laboratory in 2023, it will be the fastest supercomputer in the world," said Bill Goldstein, director of the Livermore lab.
The entire system will cost $600 million to assemble and will be at least 16 times faster than the Sierra (IBM POWER9 + NVIDIA Volta) supercomputer that is currently deployed at LLNL. As for the specifications of the system itself, we know that EPYC Genoa will be powering the CPU side while a next-generation Radeon Instinct accelerator will power the GPU side of things. The whole system will consume less than 40 MW when it becomes operational.
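To put those headline numbers in perspective, here is a rough back-of-the-envelope sketch. It only uses the figures quoted above; the function names are made up for illustration, and because the 40 MW and 16x values are stated as bounds, the results should be read as floors or ceilings rather than exact specifications.

```python
# Rough arithmetic on the figures quoted in the article.
# All names here are illustrative; the inputs are the article's numbers.

PEAK_FLOPS = 2e18           # 2 exaflops = 2 quintillion calculations per second
SPEEDUP_OVER_SIERRA = 16    # "at least 16 times faster" (lower bound)
POWER_WATTS = 40e6          # "less than 40 MW" (upper bound)
COST_USD = 600e6            # $600 million


def implied_sierra_petaflops(peak_flops, speedup):
    """Performance the older system would need for the quoted speedup to hold."""
    return peak_flops / speedup / 1e15


def min_gflops_per_watt(peak_flops, power_watts):
    """Efficiency floor: peak performance divided by the power ceiling."""
    return peak_flops / power_watts / 1e9


def usd_per_petaflop(cost_usd, peak_flops):
    """System cost spread over each petaflop of peak performance."""
    return cost_usd / (peak_flops / 1e15)


print(implied_sierra_petaflops(PEAK_FLOPS, SPEEDUP_OVER_SIERRA))  # petaflops
print(min_gflops_per_watt(PEAK_FLOPS, POWER_WATTS))               # GFLOPS per watt
print(usd_per_petaflop(COST_USD, PEAK_FLOPS))                     # USD per petaflop
```

With the quoted inputs, this works out to about 125 petaflops for the implied Sierra baseline, at least 50 GFLOPS per watt, and roughly $300,000 per petaflop of peak performance.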
The following is a list of AMD technologies to be included in the El Capitan supercomputer:
- Next-generation AMD EPYC processors, codenamed “Genoa” featuring the “Zen 4” processor core. These processors will support next-generation memory and I/O subsystems for AI and HPC workloads,
- Next-generation Radeon Instinct GPUs based on a new compute-optimized architecture for workloads including HPC and AI. These GPUs will use the next-generation high bandwidth memory and are designed for optimum deep learning performance,
- The 3rd Gen AMD Infinity Architecture, which will provide a high-bandwidth, low latency connection between the four Radeon Instinct GPUs and one AMD EPYC CPU included in each node of El Capitan. As well, the 3rd Gen AMD Infinity Architecture includes unified memory across the CPU and GPU, easing programmer access to accelerated computing,
- An enhanced version of the open-source ROCm heterogeneous programming environment, being developed to tap into the combined performance of AMD CPUs and GPUs, unlocking maximum performance.
AMD EPYC Genoa - Post-7nm Zen 4 Cores, SP5 Socket Platform, DDR5 Memory, PCIe 5.0 Protocol
The AMD EPYC Genoa processors based on the Zen 4 core architecture were a mystery until AMD officially unveiled them in its latest roadmap during the EPYC Rome launch. Currently in design with a launch planned for 2021, the Genoa lineup will bring a brand new set of features to the server landscape.
AMD announced that EPYC Genoa will be compatible with the new SP5 platform, which brings a new socket, so SP3 compatibility will only last up to EPYC Milan. The EPYC Genoa processors will also feature support for new memory and new capabilities. It looks like AMD will be jumping on board the DDR5 bandwagon in 2021. The new capabilities mentioned for EPYC Genoa sound like a hint at the PCIe 5.0 protocol, which will double the bandwidth of PCIe 4.0, offering roughly 64 GB/s in each direction (about 128 GB/s combined) across an x16 interface.
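The doubling from PCIe 4.0 to PCIe 5.0 can be checked with a quick calculation. The sketch below assumes the standard per-lane transfer rates (16 GT/s for Gen 4, 32 GT/s for Gen 5) and the 128b/130b line encoding used since PCIe 3.0; the function and constant names are hypothetical, and the result is a theoretical link rate, not measured throughput.

```python
# Theoretical PCIe link bandwidth, assuming standard per-lane rates
# and 128b/130b encoding (used by PCIe 3.0, 4.0 and 5.0).

TRANSFER_RATE_GT_S = {3: 8.0, 4: 16.0, 5: 32.0}  # giga-transfers/s per lane
ENCODING_EFFICIENCY = 128 / 130                   # payload bits per line bit


def pcie_bandwidth_gb_s(gen, lanes=16, bidirectional=False):
    """Return the theoretical bandwidth of a PCIe link in gigabytes per second."""
    gbit_per_s = TRANSFER_RATE_GT_S[gen] * ENCODING_EFFICIENCY * lanes
    gbyte_per_s = gbit_per_s / 8  # 8 bits per byte
    return gbyte_per_s * 2 if bidirectional else gbyte_per_s


for gen in (4, 5):
    one_way = pcie_bandwidth_gb_s(gen)
    both_ways = pcie_bandwidth_gb_s(gen, bidirectional=True)
    print(f"PCIe {gen}.0 x16: {one_way:.1f} GB/s per direction, {both_ways:.1f} GB/s combined")
```

Under these assumptions an x16 PCIe 5.0 link comes out at roughly 63 GB/s per direction. The commonly quoted 128 figure for x16 is in gigabytes per second, counted across both directions and before encoding overhead.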
Summing everything up for EPYC Genoa, we are looking at the following main features:
- Post-7nm Zen 4 cores
- SP5 Platform With New Socket
- PCIe 5.0 Support
- DDR5 Memory Support
- Launch in 2021
On the Radeon Instinct side, we are definitely looking at a much more powerful and possibly sub-7nm GPU-based accelerator. AMD is currently prepping to launch its Radeon Instinct MI100 accelerator, which is codenamed 'Arcturus' and reportedly features up to 8192 stream processors and 32 GB of HBM2e memory.
That GPU is a beast in its own right, but it's planned for a 2020 launch, and to remain future-proof, the El Capitan supercomputer will be deploying something newer than the Radeon Instinct MI100. The exact graphics card or accelerator has not been mentioned, but it is stated that the new GPU is built on a brand new compute architecture with the following highlights:
- Optimized for HPC and AI
- Extensive Mixed Precision Ops for Optimized Deep Learning Performance
- Next-Generation High Bandwidth Memory
- Maximize Performance With Multi-GPU Scaling
Based on the feature set, we are looking at something beyond HBM2e and PCIe Gen 4, both of which will be readily available by 2021, while El Capitan will only become operational in 2023. The GPU is also said to be specifically designed for Compute / AI / HPC workloads, which means that it would be a custom design for that segment and not a chip that you would get to see in the consumer space, much like NVIDIA's own HPC accelerators.
The third major feature of El Capitan is that each AMD CPU and GPU accelerator will be connected through the 3rd Generation Infinity Fabric interconnect. Referred to as Infinity Fabric 3.0, the new interconnect will provide a high-bandwidth & low-latency connection between the CPU & GPU and enable unified memory across both, while the coherent nature of the platform should improve overall performance and simplify programming.
In the slide posted by AMD, it looks like each node would have four Radeon accelerators directly linked to an AMD EPYC Genoa processor through Infinity Fabric 3.0. It will be used in addition to Cray's own Slingshot fabric which currently pushes up to 200 Gb/s of bandwidth, but future versions could offer even more interconnect bandwidth to the El Capitan infrastructure. The difference here is that Slingshot is more of a node-to-node channel while Infinity Fabric is closer to a CPU-GPU interconnect.
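As a small illustration of the node layout described above, the sketch below models one node with a single CPU and four GPUs and converts Slingshot's quoted 200 Gb/s into bytes per second. The `Node` class and helper names are hypothetical, and only the direct CPU-to-GPU links mentioned in the slide are counted; any GPU-to-GPU links are not modeled because the article does not describe them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical model of one El Capitan node as described in the slide."""
    cpus: int = 1   # one AMD EPYC CPU per node
    gpus: int = 4   # four Radeon Instinct GPUs per node

    def cpu_gpu_links(self):
        """Number of direct CPU-to-GPU Infinity Fabric links in the node."""
        return self.cpus * self.gpus


def gbit_to_gbyte(gbit_per_s):
    """Convert a rate in gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


SLINGSHOT_GBIT_S = 200  # current Slingshot figure quoted in the article

node = Node()
print(node.cpu_gpu_links())             # direct CPU-GPU links inside the node
print(gbit_to_gbyte(SLINGSHOT_GBIT_S))  # node-to-node rate in GB/s
```

With the quoted figures, that is four CPU-GPU links inside each node and 25 GB/s for the current Slingshot rate between nodes.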
At #oghpc, @BradleyMccredie of @AMD (presenting virtually) makes the argument not only for heterogeneous computing, but additionally cache coherency between CPU and GPU. #HPC pic.twitter.com/Nuut23grk4
— Addison Snell (@addisonsnell) March 2, 2020
It's also mentioned that there will be cache coherency between the CPU & GPU, beyond just unified memory, which will be a big deal for future HPC platforms. A slide highlighting the advantages of a heterogeneous platform was presented by AMD and shared by Addison Snell on his Twitter feed, which gives us a good idea of what to expect from future compute & acceleration platforms. With that said, AMD has its Financial Analyst Day tomorrow, so we can expect more details on Zen 4 and the upcoming Radeon Instinct accelerators during the presentation.