AMD RDNA 2 GPUs Have Much Lower Memory Latency Than NVIDIA’s Ampere GPU Architecture
Chips and Cheese has tested the memory latency of AMD's RDNA 2 and NVIDIA's Ampere GPU architectures. The tech outlet measured GPU memory latency on the latest architectures from team red and team green and found some interesting results.
AMD's RDNA 2 GPUs Feature Superior Memory Latency Performance Compared To NVIDIA's Ampere GPU Architecture
On the CPU side, measuring cache and memory latency has become crucial with the growing use of multi-chiplet designs, where IO that once sat on the same die now lives, in recent cases such as AMD's Zen chiplets, on a separate die. GPUs likewise rely on several cache levels to bridge the gap between compute throughput and memory performance. The source used OpenCL-based pointer-chasing benchmarks to measure cache and memory latency on current-generation GPUs such as NVIDIA's Ampere and AMD's RDNA 2 architectures.
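For readers unfamiliar with the method: a pointer-chasing test builds a chain in memory where each element stores the location of the next one, so every load has to wait for the previous load to finish. Timing a long walk therefore exposes access latency rather than bandwidth, and growing the chain past each cache's capacity reveals the step from one cache level to the next. The Python sketch below is a hypothetical, CPU-side illustration of that idea only, not Chips and Cheese's OpenCL code. The names `build_chain`, `chase` and `ns_per_access` are made up for this example, and Python's interpreter overhead means its absolute numbers say little about real hardware latency.

```python
import random
import time


def build_chain(n: int, seed: int = 0) -> list[int]:
    """Return a random single-cycle permutation using Sattolo's algorithm.

    chain[i] holds the index of the next element to visit, so walking the
    chain from any start touches all n slots before repeating. The random
    order defeats simple prefetching, as in real pointer-chasing tests.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i guarantees one single n-long cycle
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain: list[int], steps: int) -> int:
    """Follow the chain; each lookup depends on the result of the previous one."""
    idx = 0
    for _ in range(steps):
        idx = chain[idx]
    return idx  # returned so the walk cannot be treated as dead work


def ns_per_access(n: int, steps: int = 1_000_000) -> float:
    """Average time per dependent access for a chain of n elements (hypothetical helper)."""
    chain = build_chain(n)
    start = time.perf_counter_ns()
    chase(chain, steps)
    return (time.perf_counter_ns() - start) / steps


if __name__ == "__main__":
    # Sweep the working-set size; a real test looks for jumps as it
    # outgrows each cache level.
    for n in (1 << 10, 1 << 14, 1 << 18, 1 << 22):
        print(f"{n:>8} elements: {ns_per_access(n):.1f} ns per access")
```

A GPU version follows the same structure, typically with a single thread walking the chain so that the measurement captures latency rather than the GPU's ability to overlap many requests.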
In the benchmarks, the AMD Radeon RX 6800 XT (RDNA 2) and the NVIDIA GeForce RTX 3090 (Ampere) were pitted against each other. The cache and memory results show AMD's RDNA 2 architecture faring far better than NVIDIA's Ampere, with lower latency at every cache level and roughly the same VRAM latency, even though RDNA 2 checks two more levels of cache on the way to memory. Infinity Cache adds only about 20 ns over an L2 hit and is still faster than Ampere's L2.
According to Chips and Cheese, the likely reason is that NVIDIA's Ampere-based GA102 is simply a much larger GPU. It uses a more conventional memory subsystem with only two cache levels, but getting around such a large die seems to take many cycles: going from an SM's private L1 to L2 takes over 100 ns. RDNA 2's L2, by contrast, is only about 66 ns away from L0, even with an L1 cache in between. Do note that AMD's Navi 21 GPU is smaller and has a 4 MB L2 cache, while NVIDIA's GA102 has a 6 MB L2 cache for the whole chip. For comparison, the NVIDIA A100 Ampere GPU for HPC features a massive 40 MB L2 cache.
Here is how Chips and Cheese sums up the results:
RDNA 2’s cache is fast and there’s a lot of it. Compared to Ampere, latency is low at all levels. Infinity Cache only adds about 20 ns over a L2 hit and has lower latency than Ampere’s L2. Amazingly, RDNA 2’s VRAM latency is about the same as Ampere’s, even though RDNA 2 is checking two more levels of cache on the way to memory.
In contrast, Nvidia sticks with a more conventional GPU memory subsystem with only two levels of cache and high L2 latency. Going from Ampere’s SM-private L1 to L2 takes over 100 ns. RDNA’s L2 is ~66 ns away from L0, even with a L1 cache between them. Getting around GA102’s massive die seems to take a lot of cycles.
This could explain AMD’s excellent performance at lower resolutions. RDNA 2’s low latency L2 and L3 caches may give it an advantage with smaller workloads, where occupancy is too low to hide latency. Nvidia’s Ampere chips in comparison require more parallelism to shine.
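The occupancy argument can be made concrete with Little's law, a general queueing result rather than anything stated by the source: to keep a memory system busy, the number of requests in flight must equal the access latency multiplied by the request rate. Plugging in the L2 latencies quoted above, with a purely hypothetical target of one request per nanosecond:

$$
N_{\text{in flight}} = L \times \lambda
$$

$$
N_{\text{RDNA 2}} \approx 66\ \text{ns} \times 1\ \tfrac{\text{req}}{\text{ns}} = 66,
\qquad
N_{\text{Ampere}} \gtrsim 100\ \text{ns} \times 1\ \tfrac{\text{req}}{\text{ns}} = 100
$$

Under these assumptions, Ampere needs roughly 50% more outstanding requests, and therefore more resident warps, to sustain the same throughput from L2. When a small workload cannot supply that many, the lower-latency cache hierarchy wins.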
Compared to older Pascal and Maxwell chips, Ampere delivers considerably better latency despite being a much larger GPU. AMD, on the other hand, shows impressive gains over its older GCN and VLIW-based chips. These numbers will make for an interesting comparison once chiplet-based GPUs reach the gaming segment in the coming years.