Meta Unveils The AI Research SuperCluster Supercomputer, Powered By NVIDIA’s A100 GPUs & Packing 220 PFLOPs Of Horsepower
Meta revealed today that it has not only designed but also built the new AI Research SuperCluster (RSC), possibly one of the most efficient AI supercomputers in the industry, powered by NVIDIA's latest Ampere A100 GPUs. Keep in mind that although the system is already up and running, it is not yet fully built out; Meta expects to complete it in mid-2022.
The Metaverse starts today with the development of the AI Research SuperCluster, a supercomputer with NVIDIA's A100 GPUs focusing on AI to assist with Meta's future in technology & virtual reality
Meta's researchers are already using the RSC to train models in natural language processing (NLP) and computer vision for research purposes, with the goal of eventually training models with trillions of parameters.
Developing the next generation of advanced AI will require powerful new computers capable of quintillions of operations per second.
— Kevin Lee and Shubho Sengupta, Technical Program Manager and Software Engineer, respectively
The new Research SuperCluster will help Meta's AI researchers develop better models that can learn from trillions of examples; process several languages simultaneously; analyze text, images, and video together; and create new augmented reality devices and tools, along with several other projects still in the design stage.
Meta wants AI-powered applications to take the lead in creating the virtual universe that much of society currently treats as a mere buzzword. One example of what the new AI supercomputer could enable is real-time voice translation for large groups of people, without human translators slowing the conversation, so that many people can collaborate on a project or play a multiplayer game at once. But the underlying purpose of the new Research SuperCluster is to help build new technologies for the metaverse.
Facebook created its AI Research lab in 2013 as part of a long-term investment in artificial intelligence. Several AI advances have since made their way into everyday use, and Meta points to its progress in two areas: transformers, which let AI models process information more effectively by focusing on specific areas of their input, and self-supervised learning, in which algorithms learn from vast numbers of unlabeled examples.
To fully realize the benefits of self-supervised learning and transformer-based models, various domains, whether vision, speech, language, or for critical use cases like identifying harmful content, will require training increasingly large, complex, and adaptable models. Computer vision, for example, needs to process larger, longer videos with higher data sampling rates. Speech recognition needs to work well even in challenging scenarios with lots of background noise, such as parties or concerts. NLP needs to understand more languages, dialects, and accents. And advances in other areas, including robotics, embodied AI, and multimodal AI will help people accomplish useful tasks in the real world.
Since high-performance computing infrastructure is crucial to AI training, Meta says it has been designing and building such systems for many years. The first generation, designed in 2017, uses 22,000 NVIDIA V100 Tensor Core GPUs in a single cluster and completes 35,000 training jobs a day. That design has allowed Meta's research teams to achieve high levels of productivity, performance, and reliability.
Two years ago, the company realized that to keep advancing, it would need a new platform for the scale of computing involved. It designed the new infrastructure from the ground up around newer GPUs and network fabric technology. The goal? To train AI models with trillions of parameters on exabyte-sized data sets; one exabyte is comparable to 36,000 years of high-quality video.
Meta also cites the need to identify harmful content on social media platforms, including its own. With research advances in embodied AI and multimodal AI, the company plans to improve the user experience across its family of applications.
Meta explains what is currently powering the AI Research SuperCluster in 2022:
RSC today comprises a total of 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs — with each A100 GPU being more powerful than the V100 used in our previous system. Each DGX communicates via an NVIDIA Quantum 1600 Gb/s InfiniBand two-level Clos fabric that has no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
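The figures in Meta's description can be cross-checked with some quick arithmetic. The short Python sketch below is only an illustration: the variable and function names are our own, and the numbers are the ones quoted above. It confirms that 6,080 GPUs across 760 nodes works out to eight GPUs per DGX A100 system, and adds up the three storage tiers.

```python
# Figures quoted by Meta for the RSC as it stands today.
DGX_NODES = 760
TOTAL_GPUS = 6_080
STORAGE_PB = {"FlashArray": 175, "Altus cache": 46, "FlashBlade": 10}


def gpus_per_node(total_gpus: int, nodes: int) -> int:
    """Return GPUs per node, insisting that the division is exact."""
    per_node, remainder = divmod(total_gpus, nodes)
    if remainder:
        raise ValueError("GPU count is not a whole multiple of the node count")
    return per_node


def total_storage_pb(tiers: dict) -> int:
    """Sum the capacity of all storage tiers, in petabytes."""
    return sum(tiers.values())


if __name__ == "__main__":
    print(gpus_per_node(TOTAL_GPUS, DGX_NODES))  # 8
    print(total_storage_pb(STORAGE_PB))          # 231
```

Taken together, the three tiers come to 231 petabytes of storage in the current build.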
Early benchmarks show that the RSC runs computer vision workflows as much as 20 times faster, runs the NVIDIA Collective Communication Library, or NCCL, as much as nine times faster, and trains large-scale NLP models as much as three times faster than Meta's previous research systems. In practical terms, Meta says a model with tens of billions of parameters can "finish training in three weeks, compared with nine weeks before."
Once the RSC is complete, its InfiniBand network fabric will link 16,000 GPUs as endpoints, working alongside 4,000 AMD EPYC processors, making it one of the largest such networks deployed. Meta has also developed a caching and storage system that can serve 16 TB/s of training data, with plans to scale its capacity to one exabyte.
Meta's partners on the RSC project include Penguin Computing, an SGH company, which worked closely with Meta's operations team on hardware integration to deploy the cluster and helped with the supercomputer's control plane. Pure Storage provided a unique, customizable storage solution. Finally, NVIDIA supplied its AI technologies, including GPUs, next-gen systems, and the InfiniBand fabric, as well as NCCL, to work in tandem with the cluster.
Meta notes that the RSC is already up and running even though it is still under development. The company says it is currently in phase two of the project and reminds readers that it sees this system as the ground floor of the metaverse.
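As a rough consistency check, the three-weeks-versus-nine-weeks example lines up with the threefold NLP training speedup. The snippet below is a minimal sketch with a hypothetical helper name, not anything published by Meta.

```python
def speedup(old_duration: float, new_duration: float) -> float:
    """Return how many times faster the new run is than the old one."""
    if new_duration <= 0:
        raise ValueError("new_duration must be positive")
    return old_duration / new_duration


if __name__ == "__main__":
    weeks_before, weeks_now = 9, 3
    print(speedup(weeks_before, weeks_now))  # 3.0
    print(weeks_before - weeks_now)          # 6 weeks saved per training run
```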
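To put the planned build-out in perspective, here is a back-of-the-envelope Python sketch. It rests on two assumptions of ours that Meta does not state here: that the finished system keeps eight GPUs per node, as in today's DGX A100 systems, and that storage figures use decimal units (one exabyte equals one million terabytes).

```python
GPUS_NOW = 6_080
GPUS_PLANNED = 16_000
CPUS_PLANNED = 4_000
GPUS_PER_NODE = 8            # assumption: same DGX A100 layout as today
STORAGE_RATE_TB_S = 16       # planned training-data bandwidth
TB_PER_EXABYTE = 1_000_000   # assumption: decimal units

# How much larger the finished cluster is than the current one.
growth = GPUS_PLANNED / GPUS_NOW

# Implied node count and CPUs per node under the assumption above.
nodes = GPUS_PLANNED // GPUS_PER_NODE
cpus_per_node = CPUS_PLANNED / nodes

# Time to stream a full exabyte once at the quoted rate.
hours_per_exabyte = TB_PER_EXABYTE / STORAGE_RATE_TB_S / 3600

if __name__ == "__main__":
    print(round(growth, 2))             # 2.63
    print(nodes)                        # 2000
    print(cpus_per_node)                # 2.0
    print(round(hours_per_exabyte, 1))  # 17.4
```

Under those assumptions, the completed RSC would be roughly 2.6 times its current size, and a single pass over one exabyte of data at 16 TB/s would take a little over 17 hours.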
Source: Meta AI