AMD-Powered Frontier Supercomputer Uses 3K of Its 37K MI250X GPUs To Achieve a Whopping 1 Trillion Parameter LLM Run, Comparable To ChatGPT-4

Muhammad Zuhair

The AMD-powered Frontier Supercomputer with Instinct MI250X GPUs has achieved a 1 Trillion Parameter LLM run, rivaling ChatGPT-4.

The Frontier Supercomputer Sets New Records In The Space of LLM Training, Courtesy of AMD's EPYC CPUs & Instinct GPUs

Frontier is the world's leading supercomputer and the only exascale machine currently in operation. It is powered by AMD's EPYC and Instinct hardware, which not only delivers top HPC performance but also makes it the second most efficient supercomputer on the planet. A paper submitted to arXiv by a team of researchers reveals that Frontier has been used to train a model with one trillion parameters, with "hyperparameter tuning" helping to set a new industry benchmark.


Before we get to the crux, here is a quick recap of what the Frontier supercomputer offers. It has been designed from the ground up with AMD's 3rd Gen EPYC Trento CPUs and Instinct MI250X GPU accelerators. It is installed at the Oak Ridge National Laboratory (ORNL) in Tennessee, USA, where it is operated by the Department of Energy (DOE). It has achieved 1.194 Exaflop/s using 8,699,904 cores. The HPE Cray EX architecture combines 3rd Gen AMD EPYC CPUs optimized for HPC and AI with AMD Instinct MI250X accelerators and a Slingshot-11 interconnect. Frontier has held on to the number one spot on the Top500.org list of supercomputers, showing its dominance.

The new records achieved by Frontier are the result of implementing effective strategies to train LLMs and to use the onboard hardware as efficiently as possible. The team tested models with 22 billion, 175 billion, and 1 trillion parameters, and the figures obtained come from optimizing and fine-tuning the model training process. The results were achieved using up to roughly 3,000 (3,072, per the paper's own figures) of AMD's MI250X AI accelerators, which have shown their prowess despite being relatively dated hardware.

What's more interesting is that Frontier as a whole houses around 37,000 MI250X GPUs, so one can imagine the kind of performance on offer when the entire GPU pool is used to power LLMs. AMD is also on the verge of implementing its MI300 GPU accelerators in brand-new supercomputers with a robust ROCm 6.0 ecosystem that further accelerates AI performance.
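To put the headline numbers in perspective, here is a quick back-of-the-envelope calculation of how much of Frontier's GPU pool the largest run occupied. It uses the 3,072 GPU count from the paper's quoted results and the rounded 37,000 total cited in this article, so treat the result as approximate. The variable names are ours, not the paper's.

```python
# Figures taken from the article; the 37,000 total is a rounded number.
gpus_used = 3072       # MI250X GPUs in the largest reported training run
gpus_total = 37_000    # approximate MI250X count across all of Frontier

# Share of the machine's GPUs that the 1 trillion parameter run occupied.
fraction_used = gpus_used / gpus_total
print(f"Share of Frontier's GPUs used: {fraction_used:.1%}")
```

In other words, the trillion-parameter run used well under a tenth of the machine's GPUs.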

For 22 Billion, 175 Billion, and 1 Trillion parameters, we achieved GPU throughputs of 38.38%, 36.14%, and 31.96%, respectively. For the training of the 175 Billion parameter model and the 1 Trillion parameter model, we achieved 100% weak scaling efficiency on 1024 and 3072 MI250X GPUs, respectively. We also achieved strong scaling efficiencies of 89% and 87% for these two models.

- arXiv
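For readers unfamiliar with the scaling terms in the quote above, the sketch below shows one common way these efficiencies are computed. It assumes the textbook definitions (the paper may measure them differently, for example via throughput), and the timings are hypothetical values picked purely for illustration, not measurements from Frontier.

```python
def strong_scaling_efficiency(base_gpus, base_time, gpus, time):
    """Fixed total problem size: ideally, time drops in proportion to GPU count."""
    speedup = base_time / time          # how much faster the larger run is
    ideal_speedup = gpus / base_gpus    # speedup if scaling were perfect
    return speedup / ideal_speedup


def weak_scaling_efficiency(base_time, time):
    """Problem size grows with GPU count: ideally, time stays constant."""
    return base_time / time


# Hypothetical step times in seconds (assumed values, not from the paper).
strong = strong_scaling_efficiency(base_gpus=1024, base_time=100.0,
                                   gpus=3072, time=38.3)
weak = weak_scaling_efficiency(base_time=100.0, time=100.0)

print(f"strong scaling efficiency: {strong:.1%}")
print(f"weak scaling efficiency:   {weak:.1%}")
```

Under these definitions, 100% weak scaling means the time per step did not grow as both the workload and the GPU count were increased together, while a strong scaling figure below 100% reflects the overhead of splitting a fixed workload across more GPUs.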

The future holds plenty for the server and data center segment, and it is worth noting that Frontier currently relies on hardware that is no longer new by industry standards. With generative AI continuing to advance, it is evident that the market will need more computing power moving forward, which is why advances in hardware designed for this segment are vital for next-gen progress.

News Source: arXiv


About the author: Muhammad Zuhair is a hardware and technology reporter for Wccftech, specializing in the semiconductor industry and the complex interplay between technology, manufacturing, and geopolitics. His coverage focuses on the corporate strategies and technological roadmaps of industry giants like TSMC, NVIDIA, Samsung, and Intel. Zuhair's expertise lies in deconstructing complex topics such as fabrication nodes (e.g., 2nm process), the economic impact of policies like the CHIPS Act, and the strategic development of AI infrastructure from NVIDIA, AMD and Intel.
