China has reportedly found a way to get more out of NVIDIA's "cut-down" AI accelerators, as DeepSeek's newest project is said to deliver roughly eight times the TFLOPS on Hopper H800 GPUs.
DeepSeek's FlashMLA Will Help China's AI Industry To Squeeze Out Maximum Power From NVIDIA's Cut-Down Hopper GPUs
It seems like China isn't depending on anyone to scale up in terms of hardware capabilities, as in-house companies, notably DeepSeek, are using software to work around the limits of the equipment they have available. The latest developments by DeepSeek are some of the wildest we have seen in the markets, as, according to the firm, it has managed to squeeze significant performance out of NVIDIA's "cut-down" Hopper H800 GPUs by essentially optimizing memory consumption and the allocation of resources across inference requests.
🚀 Day 1 of #OpenSourceWeek: FlashMLA
Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.
✅ BF16 support
✅ Paged KV cache (block size 64)
⚡ 3000 GB/s memory-bound & 580 TFLOPS…— DeepSeek (@deepseek_ai) February 24, 2025
Just a quick background: DeepSeek is holding an "OpenSource" week, during which it plans to unveil technologies and tools that will be freely available to the general public through GitHub repositories. The first day looks to be a great start, since the firm unveiled FlashMLA, a "decoding kernel" designed specifically for NVIDIA's Hopper GPUs. Before we go into how it works, let's take a quick look at the performance figures DeepSeek is claiming, and they are certainly striking.
DeepSeek claims that it has managed to reach 580 TFLOPS in BF16 on the Hopper H800, which a widely shared post puts at approximately eight times an "industry average" of 73.5 TFLOPS. Not only this, but with efficient memory utilization, FlashMLA is said to reach memory bandwidth of up to 3000 GB/s in memory-bound workloads, which the same post sets against a 1681 GB/s peak figure for the H800, or almost two times as much. Both baselines come from that post rather than from DeepSeek or NVIDIA, so the comparisons should be read with that in mind. The important point here is that all of this comes from software optimization rather than hardware enhancements.
This is crazy.
-> Blazing fast: 580 TFLOPS on H800, ~8x industry avg (73.5 TFLOPS).
-> Memory wizardry: Hits 3000 GB/s, surpassing H800’s 1681 GB/s peak.— Visionary x AI (@VisionaryxAI) February 24, 2025
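For readers who want to check the multiples, the arithmetic behind "about eight times" and "almost two times" can be reproduced in a few lines. This is only a sketch of the division: every number is copied from the posts quoted above, the variable names are our own, and the two baseline figures are the ones given in the second post, not measurements or official specifications.

```python
# All figures are quoted from the posts above; nothing here is measured.
flashmla_tflops = 580.0          # DeepSeek's claimed BF16 throughput on the H800
baseline_tflops = 73.5           # "industry avg" baseline from the quoted post
flashmla_bandwidth_gbs = 3000.0  # DeepSeek's claimed memory-bound bandwidth
quoted_peak_gbs = 1681.0         # H800 peak figure as stated in the quoted post

# Ratio of claimed throughput to the quoted baseline.
compute_ratio = flashmla_tflops / baseline_tflops
# Ratio of claimed bandwidth to the quoted peak figure.
bandwidth_ratio = flashmla_bandwidth_gbs / quoted_peak_gbs

print(f"Compute:   {compute_ratio:.2f}x the quoted baseline")
print(f"Bandwidth: {bandwidth_ratio:.2f}x the quoted peak figure")
```

The ratios depend entirely on which baseline is chosen, so a different reference figure for the H800 would change both multiples.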
DeepSeek's FlashMLA is built around "low-rank key-value compression", which, in simple terms, projects the keys and values stored for each token into a much smaller latent representation, allowing for faster processing and reportedly cutting memory consumption by 40% to 60%. Another interesting inclusion is a block-based paging system, which allocates cache memory in fixed-size blocks as a sequence grows, instead of reserving a single fixed amount up front. This helps models process variable-length sequences much more effectively, ultimately improving performance.
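To make the two ideas above more concrete, here is a minimal pure-Python sketch that combines them: each token's hidden state is compressed into a small latent vector before it is cached, and the cache hands out memory in 64-token blocks only as a sequence grows. This is an illustration under our own assumptions, not DeepSeek's kernel: the class name PagedLatentCache, the helper matvec, the random weights, and all dimensions are hypothetical, and only the block size of 64 comes from DeepSeek's announcement.

```python
import random

BLOCK_SIZE = 64  # tokens per block, the figure given in DeepSeek's announcement


def matvec(matrix, vec):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class PagedLatentCache:
    """Toy paged cache storing one compressed latent vector per token (hypothetical)."""

    def __init__(self, num_blocks, w_down):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.blocks = {}        # physical block id -> list of latent vectors
        self.block_tables = {}  # sequence id -> ordered physical block ids
        self.lengths = {}       # sequence id -> number of tokens cached
        self.w_down = w_down    # (d_latent x d_model) down-projection

    def append(self, seq_id, hidden):
        """Compress one token's hidden state; allocate a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free cache blocks")
            block_id = self.free_blocks.pop()
            self.blocks[block_id] = []
            table.append(block_id)
        self.blocks[table[-1]].append(matvec(self.w_down, hidden))
        self.lengths[seq_id] = length + 1

    def latents(self, seq_id):
        """Gather a sequence's latent vectors in token order via its block table."""
        return [v for b in self.block_tables[seq_id] for v in self.blocks[b]]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        for block_id in self.block_tables.pop(seq_id, []):
            del self.blocks[block_id]
            self.free_blocks.append(block_id)
        self.lengths.pop(seq_id, None)


# Illustrative sizes only; real models are far larger.
D_MODEL, D_LATENT, N_HEADS, D_HEAD = 32, 8, 4, 8
rng = random.Random(0)
w_down = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(D_LATENT)]
w_up_k = [[rng.uniform(-1, 1) for _ in range(D_LATENT)] for _ in range(N_HEADS * D_HEAD)]

cache = PagedLatentCache(num_blocks=8, w_down=w_down)
for seq_id, n_tokens in (("short", 10), ("long", 130)):
    for _ in range(n_tokens):
        cache.append(seq_id, [rng.uniform(-1, 1) for _ in range(D_MODEL)])

# The short sequence needs 1 block, the long one 3 (64 + 64 + 2 tokens).
blocks_used = {s: len(t) for s, t in cache.block_tables.items()}

# Values cached per token: one latent vector vs. full keys and values for every head.
latent_per_token = D_LATENT
full_per_token = 2 * N_HEADS * D_HEAD

# Keys are rebuilt from the cached latents with an up-projection when needed.
keys = [matvec(w_up_k, c) for c in cache.latents("short")]

print(blocks_used, latent_per_token, full_per_token, len(keys))
```

Because blocks are taken from a shared pool and returned with release, a short request never holds memory sized for the longest possible sequence, which is the property that makes paging attractive for variable-length inference. The per-token saving in this toy setup is a product of the chosen dimensions and says nothing about the savings FlashMLA achieves in practice.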
DeepSeek's development shows that the world of AI computing isn't dependent upon a single factor; rather, it is much more diverse, and this is clearly evident with FlashMLA. For now, it appears that the tool is specific to Hopper GPUs, and it will be interesting to see what sort of performance FlashMLA can deliver on the H100.
