iPhone 17 Pro Successfully Demonstrated Running A 400B Large Language Model, A Feat That Requires Minimum Of 200GB Memory Even When Compressed

Mar 23, 2026 at 06:21am EDT
An iPhone 17 Pro shown to run a 400B LLM

Large Language Models with 400 billion parameters normally need capable hardware with a very large memory pool, as even a quantized (compressed) version requires a minimum of 200GB of RAM. Given these requirements, the iPhone 17 Pro would never be the first choice to run a 400B LLM, yet video evidence shows that one person has managed exactly that on Apple’s current-generation flagship. The feat could not have been achieved without some clever tricks, though, so let us go through the details.

As one would expect, the iPhone 17 Pro can only generate 0.6 tokens per second, but getting a model this large to run at all is impressive

An open-source project called Flash-MoE was shown running on an iPhone 17 Pro by @anemll, demonstrating that while the flagship can run the extremely demanding model, it is not without its disadvantages. For one thing, as the video shows, generation is dreadfully slow at 0.6 tokens per second, which works out to roughly one word every 1.5 to 2 seconds.
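To put the 0.6 tokens per second figure in perspective, the short Python sketch below converts a token rate into seconds per token and into total waiting time for a reply. The 100-token reply length is a hypothetical value chosen for illustration, and a token is only roughly equivalent to a word.

```python
def seconds_per_token(tokens_per_second):
    """Time taken to produce a single token at a given generation rate."""
    return 1.0 / tokens_per_second


def generation_time_seconds(num_tokens, tokens_per_second):
    """Total time to produce num_tokens at a given generation rate."""
    return num_tokens / tokens_per_second


RATE = 0.6           # tokens per second, as reported in the demonstration
REPLY_TOKENS = 100   # hypothetical reply length, for illustration only

if __name__ == "__main__":
    print(f"{seconds_per_token(RATE):.2f} s per token")
    minutes = generation_time_seconds(REPLY_TOKENS, RATE) / 60
    print(f"{REPLY_TOKENS} tokens take about {minutes:.1f} minutes")
```

At this rate, each token takes about 1.67 seconds, which falls inside the 1.5 to 2 second range quoted above.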

Unless you have plenty of patience or can keep yourself busy with other tasks while the iPhone 17 Pro generates a response, we think this sluggish performance will have many users pulling their hair out. Then again, the fact that a 400B LLM was running on a smartphone at all, regardless of the speed, indicates that with a few more optimizations, it is more than possible to run Large Language Models on-device on handsets.

As for how this was accomplished, loading the whole LLM into memory would be impossible, as the iPhone 17 Pro only ships with 12GB of LPDDR5X RAM. Instead, Flash-MoE leverages the device’s SSD to stream the model’s weights directly to the GPU. Also, ‘MoE’ stands for Mixture of Experts, a model design that only requires a fraction of those 400B parameters for each token it generates.
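The general idea of combining Mixture of Experts routing with on-demand reads from storage can be illustrated with a toy Python sketch. This is not Flash-MoE's actual code: the expert count, the top-2 selection, the tiny matrix size, and every function name below are hypothetical, a memory-mapped temporary file stands in for the phone's storage, and the selected experts' outputs are simply averaged rather than weighted by a learned gate.

```python
import mmap
import os
import random
import struct
import tempfile

NUM_EXPERTS = 8               # hypothetical; chosen only for illustration
TOP_K = 2                     # hypothetical number of experts used per token
DIM = 4                       # toy hidden size
EXPERT_BYTES = DIM * DIM * 4  # one DIM x DIM float32 matrix per expert


def write_expert_file(path, seed=0):
    """Write every expert's weights to one flat file (stand-in for storage)."""
    rng = random.Random(seed)
    with open(path, "wb") as f:
        for _ in range(NUM_EXPERTS):
            weights = [rng.uniform(-1.0, 1.0) for _ in range(DIM * DIM)]
            f.write(struct.pack(f"<{DIM * DIM}f", *weights))


def top_k_experts(router_scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts."""
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    return order[:k]


def load_expert(mm, index):
    """Read only one expert's matrix from the memory-mapped file."""
    start = index * EXPERT_BYTES
    flat = struct.unpack(f"<{DIM * DIM}f", mm[start:start + EXPERT_BYTES])
    return [list(flat[r * DIM:(r + 1) * DIM]) for r in range(DIM)]


def moe_forward(mm, x, router_scores):
    """Apply only the selected experts to x and average their outputs."""
    chosen = top_k_experts(router_scores)
    out = [0.0] * DIM
    for idx in chosen:
        w = load_expert(mm, idx)
        for r in range(DIM):
            out[r] += sum(w[r][c] * x[c] for c in range(DIM)) / len(chosen)
    bytes_read = len(chosen) * EXPERT_BYTES
    return out, chosen, bytes_read


def run_demo():
    """Build the toy expert file, route one token, and report bytes touched."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_expert_file(path)
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            scores = [0.1, 0.9, 0.3, 0.05, 0.7, 0.2, 0.0, 0.4]
            out, chosen, bytes_read = moe_forward(mm, [1.0, 0.0, 0.0, 0.0], scores)
            total = len(mm)
        return out, chosen, bytes_read, total
    finally:
        os.remove(path)


if __name__ == "__main__":
    out, chosen, bytes_read, total = run_demo()
    print(f"experts used: {chosen}, read {bytes_read} of {total} bytes")
```

In this toy setup, only two of the eight experts are read for the token, so a quarter of the file is touched. The same principle, at a vastly larger scale, is what lets a model far bigger than the available RAM produce output, at the cost of waiting on storage reads for every token.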

Another benefit is that you’re getting 100 percent privacy when using a local LLM while obtaining responses without the use of an active internet connection, though the iPhone 17 Pro’s battery will be heavily taxed. Developers also resort to compressed, or ‘quantized’, versions of these Large Language Models, but one with 400 billion parameters would still require a minimum of 200GB of RAM, making it impossible to load in full into the iPhone 17 Pro’s memory.
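The 200GB figure is consistent with storing each of the 400 billion parameters in roughly 4 bits, although the exact bit width is an assumption here and is not stated in the demonstration. A minimal Python sketch of the arithmetic, counting weights only and using decimal gigabytes:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 400e9       # 400 billion parameters
IPHONE_RAM_GB = 12   # RAM in the iPhone 17 Pro, per the article

if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed bit widths, for comparison
        size = weight_memory_gb(PARAMS, bits)
        print(f"{bits}-bit weights: {size:.0f} GB "
              f"({size / IPHONE_RAM_GB:.0f}x the phone's RAM)")
```

Even at the assumed 4 bits per parameter, the weights alone come to about 200GB, far more than the 12GB of RAM available on the phone.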

In short, the latest demonstration shows that if you’re willing to sit through the painstaking process of generating responses at 0.6 tokens per second, you can run a 400B LLM on a smartphone. Then again, there’s a huge difference between merely running a Large Language Model and running it in a usable fashion.

News Source: @anemll

About the author: Omar Sohail is a reporter and analyst for Wccftech's mobile section, specializing in the technology and business of the mobile industry. His expertise lies in the intricate hardware supply chain, covering developments in semiconductor manufacturing, chip lithography, and camera sensor technology.
