China Doesn’t Need “Cutting-Edge” Accelerators To Progress With AI; DeepSeek’s Newest “FlashMLA” Project Now Brings In 8x TFLOPS Power Boost With NVIDIA’s H800 GPUs

Feb 24, 2025 at 10:28am EST

China has reportedly found a way to get more out of NVIDIA's "cut-down" AI accelerators, as DeepSeek's newest project is claimed to deliver eight times the TFLOPS on the Hopper H800 AI accelerators.

DeepSeek's FlashMLA Will Help China's AI Industry To Squeeze Out Maximum Power From NVIDIA's Cut-Down Hopper GPUs

It seems like China isn't depending on anyone to scale up its hardware capabilities, as domestic companies, notably DeepSeek, are using software to find workarounds with the equipment they have available. The latest developments by DeepSeek are some of the wildest we have seen in the markets: according to the firm, it has squeezed significant performance out of NVIDIA's "cut-down" Hopper H800 GPUs, essentially by optimizing memory consumption and the allocation of resources across inference requests.

🚀 Day 1 of #OpenSourceWeek: FlashMLA

Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.

✅ BF16 support
✅ Paged KV cache (block size 64)
⚡ 3000 GB/s memory-bound & 580 TFLOPS…

— DeepSeek (@deepseek_ai) February 24, 2025

Just a quick background: DeepSeek is holding an "OpenSource" week, during which it plans to unveil technologies and tools that will be freely available to the general public through GitHub repositories. The first day looks to be a great start, since the firm unveiled FlashMLA, a "decoding kernel" designed specifically for NVIDIA's Hopper GPUs. Before we go into how it works, let's take a quick look at the enhancements it claims to bring, which are certainly striking.

DeepSeek claims that FlashMLA reaches 580 TFLOPS for BF16 matrix multiplication on the Hopper H800, which commentators on social media put at approximately eight times an "industry average" of 73.5 TFLOPS. Alongside this, FlashMLA is said to reach memory bandwidth of up to 3000 GB/s in memory-bound workloads, roughly 1.8 times the 1681 GB/s peak figure quoted for the H800 in those same comparisons. The important point here is that all of this comes from software optimization rather than hardware enhancements.

This is crazy.
-> Blazing fast: 580 TFLOPS on H800, ~8x industry avg (73.5 TFLOPS).
-> Memory wizardry: Hits 3000 GB/s, surpassing H800’s 1681 GB/s peak.

— Visionary x AI (@VisionaryxAI) February 24, 2025
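The ratios behind the headline figures can be checked with a few lines of arithmetic. The sketch below simply divides the numbers as quoted in the posts above; the baseline values (73.5 TFLOPS and 1681 GB/s) come from the quoted post and are not independently verified here, and the variable names are purely illustrative.

```python
# Figures exactly as quoted in the posts above (assumed, not independently verified).
flashmla_tflops = 580.0            # BF16 throughput claimed for FlashMLA on H800
quoted_baseline_tflops = 73.5      # "industry avg" figure from the quoted post
flashmla_bandwidth_gbs = 3000.0    # memory-bound bandwidth claimed for FlashMLA
quoted_peak_bandwidth_gbs = 1681.0 # H800 peak figure from the quoted post

# Simple ratios of claimed value to quoted baseline.
compute_ratio = flashmla_tflops / quoted_baseline_tflops
bandwidth_ratio = flashmla_bandwidth_gbs / quoted_peak_bandwidth_gbs

print(f"Compute ratio:   {compute_ratio:.2f}x")
print(f"Bandwidth ratio: {bandwidth_ratio:.2f}x")
```

On these inputs the compute ratio lands just under eight and the bandwidth ratio just under two, which is where the "8x" and "almost two times" descriptions come from.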

DeepSeek's FlashMLA implements "low-rank key-value compression", which, in simple terms, factorizes the key-value data into smaller components, allowing for faster processing along with a reported 40% to 60% reduction in memory consumption. Another interesting inclusion is a block-based paging system, which allocates memory dynamically as each sequence requires it instead of reserving one fixed amount. This helps models process variable-length sequences much more effectively, ultimately enhancing performance.
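To make the block-based paging idea more concrete, here is a minimal sketch of a paged KV cache allocator in plain Python. It is not DeepSeek's implementation: the class `PagedKVCache` and its methods are hypothetical names invented for illustration, and only the block size of 64 is taken from the announcement. The sketch tracks block bookkeeping only and does not model the low-rank compression or any GPU memory.

```python
import math

BLOCK_SIZE = 64  # tokens per block, the block size named in DeepSeek's announcement


class PagedKVCache:
    """Toy allocator (hypothetical): hands out fixed-size blocks per sequence on demand."""

    def __init__(self, num_blocks, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> number of cached tokens

    def append_tokens(self, seq_id, num_tokens):
        """Grow a sequence, allocating new blocks only when the last one is full."""
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        needed = math.ceil(new_len / self.block_size)
        table = self.block_tables.setdefault(seq_id, [])
        extra = needed - len(table)
        if extra > len(self.free_blocks):
            raise MemoryError("out of KV cache blocks")
        for _ in range(extra):
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = new_len

    def locate(self, seq_id, token_idx):
        """Map a logical token index to (physical block id, offset inside block)."""
        if not 0 <= token_idx < self.seq_lens.get(seq_id, 0):
            raise IndexError("token index out of range")
        table = self.block_tables[seq_id]
        return table[token_idx // self.block_size], token_idx % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8)
cache.append_tokens(seq_id=0, num_tokens=100)  # 100 tokens -> 2 blocks
cache.append_tokens(seq_id=1, num_tokens=10)   # 10 tokens  -> 1 block
cache.append_tokens(seq_id=0, num_tokens=30)   # 130 tokens -> 3 blocks in total
print(len(cache.block_tables[0]), len(cache.block_tables[1]), len(cache.free_blocks))
```

Because a short sequence holds only the blocks it actually needs, sequences of very different lengths can share one pool without each reserving space for the longest possible request.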

DeepSeek's development shows that progress in AI computing doesn't depend on a single factor such as hardware; software plays a large role too, and this is clearly evident with FlashMLA. For now, it appears that the tool is specific to Hopper GPUs, and it will be interesting to see what sort of performance FlashMLA can bring to the H100.

About the author: Muhammad Zuhair is a hardware and technology reporter for Wccftech, specializing in the semiconductor industry and the complex interplay between technology, manufacturing, and geopolitics. His coverage focuses on the corporate strategies and technological roadmaps of industry giants like TSMC, NVIDIA, Samsung, and Intel. Zuhair's expertise lies in deconstructing complex topics such as fabrication nodes (e.g., 2nm process), the economic impact of policies like the CHIPS Act, and the strategic development of AI infrastructure from NVIDIA, AMD and Intel.
