Oak Ridge National Laboratory (ORNL) is home to the Frontier supercomputer. Frontier is billed as the first exascale system built on AMD's EPYC Trento CPUs and Instinct MI250X compute accelerators, and the entire system uses HPE's Slingshot interconnect. It is also slated to be the world's fastest supercomputer and is currently the world's only operational exascale design.
AMD MI250X compute GPUs and HPE's Slingshot interconnects could be behind the Frontier supercomputer's performance shortfall and hardware conflicts
HPE's Cray EX architecture was designed for large-scale applications, and researchers are expected to gain access to the system for scientific work starting in 2023. However, the supercomputer reportedly cannot run for a full day without several hardware failures.
Frontier boots up but can only reach a maximum of 1 FP64 ExaFLOPS, whereas the system was designed to deliver 1.685 FP64 ExaFLOPS. No official word has been given on the specific issues, but a few rumors are coming to light.
First, the Slingshot interconnect, the network created for HPE Cray supercomputers, is rumored to conflict with the HPE clusters. Unfortunately, the exact nature of the issue is unknown. Second, the AMD Instinct MI250X compute GPUs and the EPYC Trento CPUs are rumored to conflict with the Slingshot interconnect. Again, no official word has come from the project leads or researchers working on Frontier.
In an insideHPC article from December 2021, Mike Bernhardt, Communication Lead for the Department of Energy's (DOE) Exascale Computing Project, stated that the fully integrated Frontier would be available to researchers the following year. He is not quoted as raising any concerns or issues about the full launch of the supercomputer.
ORNL's partners in the exascale effort, HPE and AMD, have delivered the new Frontier system to ORNL ahead of the schedule for this fall. The installation and integration of Frontier, a massive, complex effort, is now underway, and the current progress indicates everything is on track to have Frontier available to users for open science next year, as anticipated.
Mike Bernhardt (Communication Lead for DOE's Exascale Computing Project) via InsideHPC
Bernhardt's description of the integration as a "complex effort" may help explain why rumors about the project abound. It is also worth noting that AMD's MI250X compute GPUs are only available to select customers, which is why there are few benchmarks to support or refute the rumored claims. The DOE has worked closely with the Oak Ridge Leadership Computing Facility on Frontier. The supercomputer is slated to become fully operational by January 1, 2023, after missing an initial 2022 deadline.