
Microsoft Patch Enables Hotswapping AMD GPUs In Linux Systems


Microsoft is a major name in cloud computing, with its Azure server technology deployed in enterprises around the world. The company currently uses AMD data center GPUs and Linux in its servers. However, replacing a GPU or installing a new one requires shutting the server down.

Microsoft engineers contribute a patch to the AMD graphics driver for GPU disaggregation technology

Microsoft has created a driver patch that enables "hot-plugging" of AMD GPUs on its Linux servers, so these replacements can happen without downtime. Hot-plugging means a graphics card can be removed from its PCIe slot and replaced with another while the system is running.
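For readers who want to see what a software-initiated remove and rescan looks like on Linux, here is a minimal Python sketch built on the kernel's generic PCI sysfs files (`/sys/bus/pci/devices/<address>/remove` and `/sys/bus/pci/rescan`). It is not Microsoft's patch and says nothing about how that patch is implemented. Whether the amdgpu driver handles removal cleanly depends on driver-side support, which is what the patch is about. The function names and the device address in the usage note are made up for illustration, and writing to these files requires root.

```python
from pathlib import Path

# Standard Linux sysfs location for the PCI bus.
SYSFS_PCI = Path("/sys/bus/pci")

AMD_VENDOR_ID = "0x1002"      # PCI vendor ID assigned to AMD/ATI
DISPLAY_CLASS_PREFIX = "0x03"  # PCI base class 0x03 = display controller


def list_amd_gpus(pci_root: Path = SYSFS_PCI) -> list[str]:
    """Return PCI addresses of AMD display-class devices found in sysfs."""
    found = []
    for dev in sorted((pci_root / "devices").iterdir()):
        vendor = (dev / "vendor").read_text().strip()
        pci_class = (dev / "class").read_text().strip()
        if vendor == AMD_VENDOR_ID and pci_class.startswith(DISPLAY_CLASS_PREFIX):
            found.append(dev.name)
    return found


def remove_device(address: str, pci_root: Path = SYSFS_PCI) -> Path:
    """Ask the kernel to detach the PCI device at `address` (needs root)."""
    node = pci_root / "devices" / address / "remove"
    node.write_text("1")
    return node


def rescan_bus(pci_root: Path = SYSFS_PCI) -> Path:
    """Ask the kernel to re-enumerate the PCI bus (needs root)."""
    node = pci_root / "rescan"
    node.write_text("1")
    return node
```

On a real system, one might call `list_amd_gpus()`, pass one of the returned addresses (for example a hypothetical `0000:03:00.0`) to `remove_device()`, swap the card, and then call `rescan_bus()`. The `pci_root` parameter exists only so the functions can be tried against a fake directory tree without touching real hardware.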


Shuotao Xu, an engineer from the Microsoft Research group, posted the request below for a code review of AMDGPU hot-plug support. The patch targets the Linux kernel and is aimed at Microsoft Azure systems, where it would allow GPU-based accelerators to be hot-plugged when needed. The Microsoft Research group also opened a corresponding pull request on GitHub.

Dear AMD Colleagues,

We are from Microsoft Research and are working on GPU disaggregation technology.

We have created a patch against https://gitlab.freedesktop.org/agd5f/linux.git against drm-staging-drm-next, which will enable PCIe hot-plug support for amdgpu

We have also created a pull request, "Add PCIe hotplug support for amdgpu" by xushuotao (Pull Request #131, RadeonOpenCompute/ROCK-Kernel-Driver: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131), in ROCK-Kernel-Driver, against rocm-5.0.x.

We believe the support of hot-plug of GPU devices can open doors for many advanced applications in data center in the next few years, and we would like to have some reviewers on this PR so we can continue further technical discussions around this feature.

Would you please help review this patch?

Thank you very much!

Best regards,

Shuotao Xu

— The code review request for AMDGPU Hotplug Support

Microsoft has shared little information about the new GPU disaggregation technology. However, since the patch comes from Microsoft, it is likely intended to let Azure add GPU acceleration to running servers that do not yet have a graphics card installed. With servers handling heavier, more continuous workloads than consumer machines, hot-plug support for GPUs would be a significant help.

Hot-plugging graphics cards and accelerators over PCIe is still new to servers. It already exists in some consumer systems, such as eGFX enclosures, which let an AMD card be hot-plugged over a Thunderbolt 3 connection, but servers have yet to see this functionality. As data centers become more prevalent, this technology would benefit Microsoft's Azure systems as well as AMD and its GPU lines.

Source: Freedesktop.org, GitHub