
A new post on Apple’s Machine Learning Research blog shows how much Apple’s M5 chip improves over the M4 when it comes to running local LLMs. Here are the details.
A little context
A few years ago, Apple released MLX. The company describes it as “an array framework that enables efficient and flexible machine learning on Apple silicon.”
In practice, MLX is an open source framework that lets developers build and run machine learning models natively on Apple silicon Macs, backed by APIs and interfaces that are already familiar in the AI world.
Here’s how Apple itself describes MLX:
MLX is an open source array framework that is efficient, flexible, and highly tuned for Apple silicon. MLX can be used for a wide variety of applications, ranging from numerical simulation and scientific computing to machine learning. MLX comes with built-in support for neural network training and inference, including text and image generation. MLX makes it easy to generate text or fine-tune large language models on Apple silicon devices.

MLX leverages the unified memory architecture of Apple silicon: MLX operations can run on either the CPU or the GPU without moving memory around. Its Python API closely follows NumPy, making it both familiar and flexible. MLX also offers higher-level neural network and optimizer packages, along with automatic differentiation and function transformations for graph optimization.
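As a sketch of what that unified-memory claim looks like in practice, here is a minimal example, assuming MLX is installed (`pip install mlx`, which requires macOS on an Apple silicon Mac):

```python
# Sketch of MLX's unified-memory model. Arrays live in memory shared by the
# CPU and GPU, so the same arrays can feed operations on either device
# without an explicit copy.
import mlx.core as mx

a = mx.random.uniform(shape=(4096, 4096))
b = mx.random.uniform(shape=(4096, 4096))

c = mx.matmul(a, b, stream=mx.gpu)  # run this matmul on the GPU
d = mx.matmul(a, b, stream=mx.cpu)  # and this one on the CPU, no transfers

mx.eval(c, d)  # MLX is lazy; this forces both computations to run
```

The `stream=` argument picks the device per operation; nothing in the program moves data between CPU and GPU buffers.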
One of the MLX packages currently available is MLX LM, which is dedicated to generating text and fine-tuning language models on Apple silicon Macs.
With MLX LM, developers and users can download most of the models available on Hugging Face and run them locally.
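In code, that workflow is only a few lines. A minimal sketch, assuming `mlx-lm` is installed (`pip install mlx-lm`, Apple silicon Mac required); the model repo below is one example from the mlx-community Hugging Face organization, and any MLX-compatible model should work:

```python
from mlx_lm import load, generate

# Downloads the weights from Hugging Face on first use, then runs locally.
# "mlx-community/Qwen3-8B-4bit" is an example repo name; substitute your own.
model, tokenizer = load("mlx-community/Qwen3-8B-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in one sentence.",
    max_tokens=128,
)
print(text)
```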
The framework also supports quantization, a compression technique that shrinks a model’s memory footprint so larger models can fit on a given machine. Quantization also speeds up inference, which is essentially the step where the model generates an answer to an input, or prompt.
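To see why quantization saves memory, here is a toy illustration of 4-bit symmetric quantization (a simplified scheme for demonstration, not necessarily the exact method MLX uses): each weight is mapped to one of 16 integer levels plus a shared scale factor.

```python
import numpy as np

# A handful of example BF16-style weights (float32 here for simplicity).
weights = np.array([0.12, -0.48, 0.33, 0.07, -0.91, 0.55], dtype=np.float32)

scale = np.abs(weights).max() / 7                          # signed 4-bit range is [-8, 7]
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
dequant = q.astype(np.float32) * scale                     # approximate reconstruction

# 16-bit weights take 2 bytes each; 4-bit weights take half a byte each.
bf16_bytes = weights.size * 2
int4_bytes = weights.size / 2

print(q.tolist())
print(float(np.max(np.abs(weights - dequant))))  # small reconstruction error
print(bf16_bytes / int4_bytes)                   # 4.0x memory reduction
```

The reconstruction error stays within half a quantization step, while the weights take a quarter of the memory.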
M5 vs M4
In a blog post, Apple showcases the new M5 chip’s improved inference performance, which it attributes to the chip’s new GPU Neural Accelerators. These provide dedicated matrix-multiplication operations, a critical building block of many machine learning workloads.
To illustrate the performance improvement, Apple used MLX LM to compare the time it takes from receiving a prompt to generating the first token on M4 and M5 MacBook Pros.
Or as Apple puts it:
We evaluate Qwen 1.7B and 8B in native BF16 precision, as well as 4-bit quantized Qwen 8B and Qwen 14B models. We also run two Mixture of Experts (MoE) benchmarks: Qwen 30B (3B active parameters, 4-bit quantized) and GPT OSS 20B (in native MXFP4 precision). The evaluation is performed with mlx_lm.generate, and results are reported as time to first token (in seconds) and generation rate (in tokens/second). For all of these benchmarks, the prompt size is 4096, and generation speed is measured over 128 additional generated tokens.
Here are the results:

One important detail here is that, under the hood, LLM inference generates the first token differently from the tokens that follow. In a nutshell, producing the first token is bound by compute, while generating each subsequent token is bound by memory bandwidth.

This is why Apple also evaluated the speed of generating 128 additional tokens, as mentioned above. Overall, the M5 showed a 19-27% performance improvement over the M4.
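The compute-bound versus memory-bound distinction can be made concrete with a rough, back-of-envelope estimate. The throughput figure below is a hypothetical assumption for illustration, not a number from Apple; only the 153GB/s bandwidth comes from Apple’s post.

```python
# Back-of-envelope roofline estimate of the two inference regimes.
params = 8e9              # 8B-parameter model
bytes_per_param = 2       # BF16: 2 bytes per weight
prompt_tokens = 4096      # prompt size used in Apple's benchmarks

compute_flops = 10e12     # ASSUMED sustained GPU throughput, FLOP/s
bandwidth = 153e9         # M5 memory bandwidth, bytes/s (from Apple's post)

# Prefill (time to first token) is compute-bound: roughly 2 FLOPs per
# parameter for each prompt token.
prefill_seconds = (2 * params * prompt_tokens) / compute_flops

# Decoding is bandwidth-bound: each new token re-reads all weights once.
tokens_per_second = bandwidth / (params * bytes_per_param)

print(round(prefill_seconds, 2), "s to first token (estimate)")
print(round(tokens_per_second, 1), "tokens/s (estimate)")
```

Notice that making the GPU faster mainly shortens the time to first token, while decoding speed only improves with more memory bandwidth, which is exactly the pattern Apple describes.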

Here’s what Apple says about these results:
For the architectures tested in this post, M5 delivers a 19-27% performance boost over M4, driven by its higher memory bandwidth (153GB/s for M5 versus 120GB/s for M4, a 28% increase). In terms of memory footprint, a MacBook Pro with 24GB can easily hold the 8B BF16 model or the 30B MoE 4-bit quantized model, keeping inference workloads under 18GB in both cases.
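Those figures are easy to sanity-check with simple arithmetic on the weight counts and precisions involved:

```python
# Sanity-checking the quoted numbers.
m4_bw, m5_bw = 120, 153                   # GB/s, from Apple's post
bandwidth_gain = (m5_bw - m4_bw) / m4_bw  # fractional increase

bf16_gb = 8e9 * 2 / 1e9      # 8B weights at 2 bytes each (BF16)
int4_gb = 30e9 * 0.5 / 1e9   # 30B weights at 0.5 bytes each (4-bit)

print(round(100 * bandwidth_gain, 1))  # 27.5 -- Apple rounds this to 28%
print(bf16_gb)   # 16.0 GB of weights for the 8B BF16 model
print(int4_gb)   # 15.0 GB for the 30B MoE 4-bit model
```

Both weight footprints sit comfortably under the 18GB figure Apple quotes, leaving headroom for the KV cache and activations on a 24GB machine.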
Apple also compared performance in an image generation workload, and reports that the M5 completed it more than 3.8 times faster than the M4.
You can read Apple’s full blog post here and learn more about MLX here.
