With just single-bit flips in DRAM banks, GPUHammer can drag AI model accuracy down to less than 1% on high-end GPUs equipped with GDDR6 VRAM.
Toronto Researchers Demonstrate RowHammer-Style Attack on NVIDIA RTX A6000, Which Can Silently Corrupt AI Model Accuracy
Researchers at the University of Toronto have demonstrated that RowHammer attacks can sharply reduce the accuracy of AI models running on GPUs by inducing bit flips in the GPU's memory banks. RowHammer, a vulnerability that lets attackers corrupt the data in memory cells by repeatedly accessing neighboring rows, also affects GPU memory, as the researchers have shown.
By inducing bit flips in the tested DRAM banks of the video memory, in this case the GDDR6 VRAM of the NVIDIA RTX A6000, the researchers were able to significantly degrade the accuracy of AI models running on the GPU. The attack succeeded even with hardware-level defences such as Target Row Refresh (TRR) in place. With a single bit flip in an FP16 value, DNN prediction accuracy dropped from 80% to just 0.1% across major ImageNet models.
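To see why one flipped bit can be so destructive, here is a minimal Python sketch of a single bit flip in an FP16 value. The helper name flip_fp16_bit, the example weight of 0.5, and the choice of bit 14 (the most significant exponent bit) are assumptions for illustration only and are not taken from the researchers' code.

```python
import struct

def flip_fp16_bit(value: float, bit: int) -> float:
    """Return `value` (stored as FP16) with one bit of its 16-bit encoding flipped.

    Bit 15 is the sign, bits 14..10 the exponent, bits 9..0 the mantissa.
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit must be in 0..15")
    # Reinterpret the half-precision float as a raw 16-bit integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    # Flip the chosen bit and reinterpret the result as FP16 again.
    (flipped,) = struct.unpack("<e", struct.pack("<H", raw ^ (1 << bit)))
    return flipped

weight = 0.5  # hypothetical model weight, chosen for illustration
print(weight, "->", flip_fp16_bit(weight, 14))  # exponent MSB: 0.5 -> 32768.0
print(weight, "->", flip_fp16_bit(weight, 0))   # mantissa LSB: 0.5 -> 0.50048828125
```

A flip in a low mantissa bit barely moves the value, while a flip in the top exponent bit scales a small weight by several orders of magnitude, which is enough to swamp the rest of a network's computation.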
GPUHammer works in three steps: reverse-engineering the DRAM bank mappings, maximizing hammering efficiency, and synchronizing with DRAM refresh cycles. The researchers explain each step in detail on their website. Together, these steps let them trigger single-bit flips across four DRAM banks using roughly 12K activations per flip. In short, the GDDR6 memory on the RTX A6000 proved vulnerable, whereas other GPUs with GDDR6 memory, such as the RTX 3080, did not show the same results.
This may be due to differences in the GDDR6 memory on the two GPUs, as NVIDIA uses memory chips from different vendors such as Samsung, SK Hynix, and Micron. Similarly, no bit flips were seen on the NVIDIA RTX 5090, or on data center cards like the A100 and H100, which use HBM (High Bandwidth Memory). Thankfully, there is no need to worry even if you own an RTX A6000, since GPUHammer can be mitigated by enabling ECC (Error-Correcting Code), which can detect and correct single-bit flips.
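As a rough illustration of how an error-correcting code can locate and repair a single flipped bit, here is a minimal Python sketch using a textbook Hamming(7,4) code. This is an assumption made for clarity: it is not the ECC scheme NVIDIA GPUs actually implement, which protects much wider memory words, and the function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions are 1..7, with parity bits at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # checks positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based position of the flipped bit or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4  # equals the position of a single error
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return c, syndrome

stored = hamming74_encode([1, 0, 1, 1])
hammered = list(stored)
hammered[4] ^= 1  # simulate a RowHammer-style flip at position 5
repaired, position = hamming74_correct(hammered)
print(position, repaired == stored)  # 5 True
```

The extra parity bits are what make the repair possible, and storing them is one reason enabling ECC reduces the memory available to applications.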
That said, enabling ECC comes at a cost on the RTX A6000: up to 10% slower ML inference workloads and up to a 6.25% loss of usable VRAM capacity. Meanwhile, NVIDIA has issued a security notice regarding this vulnerability and advises enabling System-Level ECC on affected GPUs. Thankfully, many modern GPUs, such as those based on Hopper and Blackwell, have ECC enabled by default.
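To put the capacity figure in concrete terms, here is a small Python calculation. The 48 GB total is the RTX A6000's advertised VRAM capacity and is an assumption not stated in the article; the 6.25% figure is the upper bound quoted above.

```python
TOTAL_VRAM_GB = 48          # RTX A6000 capacity; assumption, not from the article
ECC_CAPACITY_LOSS = 0.0625  # "up to 6.25%" of VRAM, i.e. 1/16

reserved_gb = TOTAL_VRAM_GB * ECC_CAPACITY_LOSS  # memory given up for ECC
usable_gb = TOTAL_VRAM_GB - reserved_gb          # memory left for workloads

print(f"reserved: {reserved_gb:.2f} GB, usable: {usable_gb:.2f} GB")
# reserved: 3.00 GB, usable: 45.00 GB
```

In the worst case, then, turning on ECC would cost about 3 GB of the card's memory.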
News Sources: GPUHammer, Tom's Hardware
