It's one thing to partner with either one or two major names in the AI segment, but OpenAI got AMD, NVIDIA, Intel, Microsoft, & Broadcom to accelerate large-scale AI training.
OpenAI Announces Mega-Partnership With AMD, NVIDIA, Intel, Microsoft & Broadcom To Accelerate AI
The new announcement from OpenAI comes as a supercomputer network partnership that aims to accelerate large-scale AI training. For this purpose, AMD, Broadcom, Intel, Microsoft, and NVIDIA are working with the firm to develop a new protocol called MRC (Multipath Reliable Connection) with the goal of improving GPU networking performance and resilience in large training clusters.
OpenAI has released MRC today through the OCP (Open Compute Project) to facilitate the broader use of the protocol across AI firms.
The problem that pushed the need for MRC is the transfer of data when training large AI models. It is stated that even a single transfer arriving late can disrupt the entire process, causing GPUs to sit idle. The main sources of this delay are linked to Network Congestion, link and device failures. The larger the size of the cluster, the more common this problem occurs.
MRC is the fundamental approach for next-gen and large-scale supercomputer platforms for AI. OpenAI states that it has worked with AMD, Broadcom, Intel, Microsoft, and NVIDIA over the past two years to develop the protocol, which is built into the latest 800 Gb/s network interfaces, allowing AI firms to spread a single transfer across hundreds of uninterrupted paths, route around failures in microseconds, and also run simpler network control planes.
Instead of treating each network interface as one 800Gb/s link, we split it into multiple smaller links. For example, one interface can connect to eight different switches. You can then build eight separate parallel networks, or planes, each operating at 100Gb/s, rather than a single 800Gb/s network.
That change has a large effect on the shape of the cluster. A switch that can connect 64 ports at 800Gb/s can instead connect 512 ports at 100Gb/s. This lets you build a network fully connecting about 131,000 GPUs with only two tiers of switches. A conventional 800Gb/s network would require three or four tiers.
OpenAI
The MRC standard will extend the existing RDMA over RoCE (Converged Ethernet). This enables hardware-accelerated remote direct memory access for GPUs and CPUs. OpenAI has already deployed MRC across its supercomputers housing the NVIDIA GB200 "Blackwell" GPUs that are used to train Frontier models. These include Oracle Cloud Infrastructure (OCI) in Abilene, Texas, & Microsoft’s Fairwater supercomputers.
Currently, MRC has been used to train multiple OpenAI models across NVIDIA and Broadcom hardware. The RCP protocol will be fundamental to OpenAI's Stargate supercomputer, built by Oracle Cloud Infrastructure at Abilene, Texas. The supercomputer aims to deploy 10GW of AI compute by 2029 and has already deployed over 3GW in the past 3 months. With RCP available and open to the entire AI industry, it paves the way for cross-industry collaboration in solving the hardest problems within AI and advances the segment further.
Follow Wccftech on Google to get more of our news coverage in your feeds.
