Apple has finally debuted CoreAI, the successor to its CoreML engine, which had been Apple's on-device machine-learning framework for around 9 years. The new framework brings format-agnostic inference and support for large-model memory footprints. Even so, initial tests paint a much more nuanced picture of Apple's new AI framework and, in turn, its on-device models.
New benchmark tests show Apple's CoreAI "converges to a near-tie [with MLX] at a realistic 8B" model size for decoding
For the benefit of those who might not be aware, Apple launched its CoreML machine-learning framework back in 2017 to primarily run smaller, static machine-learning tasks such as image classification and tree ensembles. CoreAI is CoreML's brand-new successor that has been optimized for edge AI and on-device inference.
In contrast, MLX is an engine that is primarily geared towards research, training, and fine-tuning, and is locked to Apple's Metal GPU and unified memory architecture.
Now, a new benchmark test has just given us interesting insights into Apple's new CoreAI engine.
First, for small models such as the 0.6-billion-parameter Qwen3, CoreAI is around 2.47x faster than MLX at decoding on an M4 Mac. Similarly, on an iPhone 17 Pro, CoreAI is around 1.6x faster than MLX at decoding, again with the Qwen3 0.6B model. However, when the model size increases to a more practical 8 billion parameters (Qwen3 8B, M4 Max Mac), CoreAI is only 1.05x faster than MLX, which amounts to near-parity in decoding performance.
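To put those multipliers in perspective, here is a minimal Python sketch that converts each quoted speedup into the share of decode time saved relative to MLX. It assumes that "X times faster" means X times the token throughput, which the benchmark summary does not spell out; the function name and labels are illustrative only.

```python
# Speedup figures quoted above (CoreAI vs. MLX, decoding).
# Assumption: "Nx faster" means N times the tokens-per-second throughput.
SPEEDUPS = {
    "Qwen3 0.6B, M4 Mac": 2.47,
    "Qwen3 0.6B, iPhone 17 Pro": 1.6,
    "Qwen3 8B, M4 Max Mac": 1.05,
}


def time_saved_fraction(speedup: float) -> float:
    """Fraction of decode time saved when throughput is `speedup` times higher."""
    return 1.0 - 1.0 / speedup


for label, speedup in SPEEDUPS.items():
    saved = time_saved_fraction(speedup)
    print(f"{label}: {saved:.1%} less decode time than MLX")
```

Under that assumption, the 1.05x figure for the 8B model works out to less than 5 percent of decode time saved, which is why the result reads as a near-tie.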
Interestingly, under sustained workloads on the iPhone 17 Pro, the GPU throttles relatively quickly, allowing the CoreML/Apple Neural Engine combo to pull ahead in terms of performance retained. This combo also has the smallest memory footprint, but it is the slowest at decoding tasks.
Engines optimized for specific vendor-sourced models almost always trump general engines. For instance, Google's LiteRT-LM engine running its Gemma model was not only the fastest engine on the iPhone 17 Pro (55.4 tokens per second), but it also used 4.5× less RAM than Apple's own MLX framework (641 MB vs 2,900 MB).
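As a quick sanity check on those figures, the following Python sketch recomputes the RAM ratio and translates the quoted throughput into an approximate per-token latency. It uses only the numbers cited above; the constant and function names are hypothetical.

```python
# Figures quoted above for the iPhone 17 Pro.
LITERT_TOKENS_PER_SECOND = 55.4  # LiteRT-LM running Gemma
LITERT_RAM_MB = 641
MLX_RAM_MB = 2900


def ram_ratio(larger_mb: float, smaller_mb: float) -> float:
    """How many times more RAM the larger footprint uses."""
    return larger_mb / smaller_mb


def ms_per_token(tokens_per_second: float) -> float:
    """Average time per decoded token, in milliseconds."""
    return 1000.0 / tokens_per_second


ratio = ram_ratio(MLX_RAM_MB, LITERT_RAM_MB)
latency = ms_per_token(LITERT_TOKENS_PER_SECOND)
print(f"MLX uses about {ratio:.1f}x the RAM of LiteRT-LM")
print(f"LiteRT-LM averages about {latency:.1f} ms per token")
```

The ratio comes out at roughly 4.5, consistent with the benchmark's claim, and 55.4 tokens per second corresponds to about 18 ms per token.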
Finally, Apple Foundation Models were found to be "2× more energy-efficient per token than the GPU-backed runtimes, 4× more than CoreML/ANE."
