NVIDIA & IBM Working To Connect GPUs Directly to SSDs For Major Performance Boost Instead of Relying on CPUs


NVIDIA, IBM, and university collaborators have developed an architecture that gives GPU-accelerated applications fast, "fine-grain access" to large amounts of data storage. The technology is aimed at workloads such as artificial intelligence, analytics, and machine-learning training.

NVIDIA, IBM, and universities aim to boost GPU performance by connecting GPUs directly to SSDs instead of relying on the CPU

Big Accelerator Memory, or BaM, is an attempt to reduce the dependence of NVIDIA GPUs and comparable hardware accelerators on a general-purpose CPU for tasks such as accessing storage, which should improve both performance and capacity.


The goal of BaM is to extend GPU memory capacity and enhance the effective storage access bandwidth while providing high-level abstractions for the GPU threads to easily make on-demand, fine-grain access to massive data structures in the extended memory hierarchy.

— BaM design paper written by the researchers

NVIDIA is the most prominent member of the BaM team, and it has been putting its extensive resources into projects that move routine CPU-centric tasks onto the GPU. Instead of depending on virtual address translation, page-fault-based on-demand data loading, and other standard CPU-centric mechanisms for managing large amounts of data, BaM provides a software and hardware architecture that lets NVIDIA graphics processors fetch data directly from memory and storage and process it without relying on the CPU cores.

BaM has two prominent features. The first is a software-managed cache in GPU memory. The work of moving data between storage and the graphics card is handled by threads running on the GPU's cores, using RDMA, PCI Express interfaces, and a custom Linux kernel driver that allows the SSDs to read and write GPU memory when required. The second is a software library that lets GPU threads request data directly from NVMe SSDs by communicating with the drives themselves. GPU threads prepare drive commands only when the requested data is not already in the software-managed cache.
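The hit-or-miss behavior of a software-managed cache can be illustrated with a small host-side simulation. This is only a sketch, not BaM's code: the names `FakeSSD`, `SoftwareCache`, and `BLOCK_SIZE`, the 4 KB block size, and the least-recently-used eviction policy are all assumptions made for illustration, and plain Python objects stand in for GPU threads and NVMe drives.

```python
from collections import OrderedDict

BLOCK_SIZE = 4096  # hypothetical cache-line size in bytes (an assumption)


class FakeSSD:
    """Stand-in for an NVMe drive: returns deterministic bytes per block."""

    def __init__(self):
        self.reads = 0  # number of read commands the "drive" has served

    def read_block(self, block_id):
        self.reads += 1
        return bytes([block_id % 256]) * BLOCK_SIZE


class SoftwareCache:
    """Cache managed entirely in software: a lookup table, no page faults."""

    def __init__(self, ssd, capacity_blocks):
        self.ssd = ssd
        self.capacity = capacity_blocks
        self.lines = OrderedDict()  # block_id -> data, oldest entry first

    def read(self, offset, length):
        """Return `length` bytes at `offset` (must stay within one block)."""
        block_id, start = divmod(offset, BLOCK_SIZE)
        assert start + length <= BLOCK_SIZE
        if block_id in self.lines:
            # Hit: no storage command is issued.
            self.lines.move_to_end(block_id)
        else:
            # Miss: only now is a read command sent to the drive.
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[block_id] = self.ssd.read_block(block_id)
        return self.lines[block_id][start:start + length]


ssd = FakeSSD()
cache = SoftwareCache(ssd, capacity_blocks=2)
cache.read(0, 8)      # miss on block 0
cache.read(16, 8)     # hit on block 0
cache.read(4096, 8)   # miss on block 1
print(ssd.reads)      # 2
```

Three accesses produce only two drive reads, because the second access is served from the cache.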

Algorithms running on the GPU to handle heavy workloads can thus access the data they need efficiently and, most importantly, in a way that is optimized for their specific access patterns.

From L-R: Comparison of traditional CPU-centric method to accessing storage, the GPU-directed BaM method, and how the GPU would physically be connected to the SSDs. Image source: Qureshi et al. via The Register

"A CPU-centric strategy causes excessive CPU-GPU synchronization overhead and/or I/O traffic amplification, diminishing the effective storage bandwidth for emerging applications with fine-grain data-dependent access patterns like graph and data analytics, recommender systems, and graph neural networks," the researchers stated in their paper this month.
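To make "I/O traffic amplification" concrete, the short calculation below compares the bytes an application needs with the bytes actually moved when data can only be transferred in fixed-size units. The function name `io_amplification` and the 64-byte, 512-byte, and 4,096-byte figures are hypothetical values chosen for illustration; they are not measurements from the paper.

```python
def io_amplification(bytes_needed, transfer_granularity):
    """Ratio of bytes moved to bytes the application actually uses."""
    # Round up to a whole number of transfer units (ceiling division).
    units = -(-bytes_needed // transfer_granularity)
    return units * transfer_granularity / bytes_needed


# A fine-grain access that needs 64 bytes:
print(io_amplification(64, 4096))  # 64.0 when moved in 4 KB units
print(io_amplification(64, 512))   # 8.0 when moved in 512-byte units
```

Under these assumed sizes, the coarser the transfer unit is relative to the data actually used, the more of the storage bandwidth is spent on bytes the application never reads.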

"BaM provides a user-level library of highly concurrent NVMe submission/completion queues in GPU memory that enables GPU threads whose on-demand accesses miss from the software cache to make storage accesses in a high-throughput manner," they continued. "This user-level approach incurs little software overhead for each storage access and supports a high-degree of thread-level parallelism."
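The idea of many threads sharing a set of submission/completion queues can be sketched with a toy model. This is a simplified illustration, not the real NVMe queue layout or BaM's library: the `QueuePair` class, the queue depth, the number of queues, and the rule of picking a queue by thread id are all assumptions, and an ordinary Python loop stands in for concurrent GPU threads.

```python
class QueuePair:
    """Toy submission/completion queue pair (not the real NVMe format)."""

    def __init__(self, depth):
        self.depth = depth
        self.sq = [None] * depth  # submission ring
        self.sq_head = 0          # next entry the drive will consume
        self.sq_tail = 0          # next free slot for a submitting thread
        self.cq = []              # completions posted by the drive

    def submit(self, command):
        """Thread side: enqueue a command; return False if the ring is full."""
        if self.sq_tail - self.sq_head == self.depth:
            return False
        self.sq[self.sq_tail % self.depth] = command
        self.sq_tail += 1
        return True

    def device_process(self):
        """Drive side: consume every pending command and post a completion."""
        while self.sq_head < self.sq_tail:
            command = self.sq[self.sq_head % self.depth]
            self.sq_head += 1
            self.cq.append(("done", command))


NUM_QUEUES = 4    # hypothetical number of queue pairs
NUM_THREADS = 16  # hypothetical number of threads that missed the cache

queues = [QueuePair(depth=8) for _ in range(NUM_QUEUES)]

# Each "thread" submits one read; spreading threads over several queues
# means they do not all contend for the same ring.
for tid in range(NUM_THREADS):
    queues[tid % NUM_QUEUES].submit(("read", tid))

for q in queues:
    q.device_process()

print([len(q.cq) for q in queues])  # [4, 4, 4, 4]
```

In this sketch every thread builds and submits its own command without going through a central coordinator, which is the property the quoted passage attributes to the user-level approach.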

Researchers from the three groups tested a prototype Linux-based system using BaM with standard GPUs and NVMe SSDs to show that the design is a viable alternative to the current approach, in which the CPU directs everything. The researchers explain that storage accesses can be parallelized, synchronization limitations are removed, and I/O bandwidth is used much more efficiently to boost application performance.

"With the software cache, BaM does not rely on virtual memory address translation and thus does not suffer from serialization events like TLB misses," NVIDIA's chief scientist Bill Dally, who previously led Stanford's computer science department, and the other authors note in the paper.

The team plans to open-source the details of its hardware and software optimizations so that other companies can build similar designs of their own. AMD's Radeon Solid State Graphics (SSG) card offered similar functionality by placing flash storage next to the graphics processor.

News Source: The Register