Macs powered by Apple silicon are increasingly popular among AI developers and researchers who want to experiment with the latest models and technologies. MLX lets users efficiently explore and run LLMs on a Mac, so researchers can try new inference and fine-tuning techniques and explore AI on their own hardware in a private environment. MLX works on all Apple silicon systems[1], and now takes advantage of the Neural Accelerators in the new M5 chip introduced with the new 14-inch MacBook Pro. Neural Accelerators provide dedicated matrix-multiplication operations, which are critical to many machine learning workloads and enable even faster model inference on Apple silicon, as demonstrated in this post.
What is MLX?
MLX is an open source array framework that is efficient, flexible, and highly tuned for Apple silicon. MLX can be used for a wide range of applications, from numerical simulation and scientific computing to machine learning. MLX has built-in support for neural network training and inference, including text and image generation. MLX makes it easy to generate text and fine-tune large language models on Apple silicon devices.
MLX leverages Apple silicon's unified memory architecture: MLX operations can run on either the CPU or the GPU without copying memory between them. Its Python API closely follows NumPy and is both familiar and flexible. MLX also has higher-level neural network and optimizer packages, along with automatic differentiation and function transformations for graph optimization.
Getting started with MLX in Python is easy:
pip install mlx
Please see the documentation for more information. MLX also has many samples that can serve as entry points for building and using many common ML models.
MLX Swift is built on the same core libraries as the MLX Python front end. It also includes several examples to help you get started developing machine learning applications in Swift. If you prefer something low-level, MLX provides easy-to-use C and C++ APIs that run on any Apple silicon platform.
Running LLMs on Apple silicon
MLX LM is a package built on top of MLX for text generation and language model fine-tuning. This allows you to run most of the LLMs available on Hugging Face. MLX LM can be installed as follows:
pip install mlx-lm
You can also start chatting with your favorite language model right in the terminal by simply running `mlx_lm.chat`.
MLX natively supports quantization, a compression approach that reduces the memory footprint of language models by storing model parameters at lower precision. Using `mlx_lm.convert`, models downloaded from Hugging Face can be quantized in seconds. For example, quantizing a 7B Mistral model to 4 bits takes only a simple command:
mlx_lm.convert \
--hf-path mistralai/Mistral-7B-Instruct-v0.3 \
-q \
--upload-repo mlx-community/Mistral-7B-Instruct-v0.3-4bit
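Back-of-the-envelope arithmetic shows why quantization matters: weight storage scales with bits per parameter (the helper below is illustrative; activations and the KV cache add some overhead on top).

```python
def weight_gb(params_billion, bits_per_param):
    """Approximate weight storage in GB (1 GB ~= 1e9 bytes)."""
    return params_billion * bits_per_param / 8

bf16 = weight_gb(7, 16)  # 14.0 GB for a 7B model at BF16
q4 = weight_gb(7, 4)     # 3.5 GB after 4-bit quantization, a 4x reduction
```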
Inference performance on M5 using MLX
The GPU Neural Accelerators introduced in the M5 chip provide dedicated matrix-multiplication operations that are important for many machine learning workloads. MLX leverages the Tensor Operations (TensorOps) and Metal Performance Primitives frameworks introduced in Metal 4 to support the Neural Accelerators' capabilities. To illustrate the performance of the M5 with MLX, we benchmark a set of LLMs of various sizes and architectures running on a MacBook Pro with an M5 and 24 GB of unified memory, and compare against a similarly configured M4 MacBook Pro.
We evaluate Qwen 1.7B and 8B at native BF16 precision, as well as 4-bit quantized Qwen 8B and Qwen 14B models. We also run two Mixture of Experts (MoE) benchmarks: Qwen 30B (3B active parameters, 4-bit quantized) and GPT OSS 20B (native MXFP4 precision). Evaluation is performed with `mlx_lm.generate` and reported as time to first token (in seconds) and generation speed (tokens/second). For all benchmarks, the prompt size is 4096 tokens, and generation speed is measured while generating 128 additional tokens.
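The two reported metrics can be measured with a small timing harness wrapped around any token stream, such as the iterator returned by `mlx_lm.stream_generate` (the harness name and structure below are illustrative, not the benchmark's actual code):

```python
import time

def time_generation(token_stream, max_tokens=128):
    """Return (time to first token in seconds, tokens/sec for subsequent tokens)."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # first token: dominated by prompt processing
        count += 1
        if count >= max_tokens:
            break
    elapsed_after_first = time.perf_counter() - start - ttft
    rate = (count - 1) / elapsed_after_first if count > 1 and elapsed_after_first > 0 else float("nan")
    return ttft, rate
```

In the benchmarks above, the stream would come from MLX LM with a 4096-token prompt and 128 generated tokens.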
Model performance is reported in terms of time to first token (TTFT) and the corresponding speedup of the M5 MacBook Pro relative to the M4.
Time to first token (TTFT)
In LLM inference, generating the first token is compute-bound and takes full advantage of the Neural Accelerators. The M5 reduces time to first token to under 10 seconds for the dense 14B architecture and under 3 seconds for the 30B MoE, delivering strong performance on these architectures on a MacBook Pro.
Subsequent token generation is limited by memory bandwidth rather than compute. For the architectures tested here, the M5 delivered a 19-27% performance improvement over the M4, thanks to its 28% higher memory bandwidth (153 GB/s on M5 vs. 120 GB/s on M4). In terms of memory footprint, the 24 GB MacBook Pro can easily hold the 8B model in BF16 or the 30B MoE in 4-bit quantization, keeping inference workloads under 18 GB for both of these architectures.
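The bandwidth-bound claim can be sanity-checked with a simple roofline estimate: each decoded token must read every weight from memory once, so generation speed is bounded by bandwidth divided by model size (the bandwidths below are those quoted above; the model size is an illustrative round number for an 8B BF16 model):

```python
def max_tok_per_sec(model_gb, bandwidth_gb_s):
    # Each decoded token streams all model weights from memory once
    return bandwidth_gb_s / model_gb

model_gb = 16.0                          # ~8B params at BF16 (2 bytes/param)
m4 = max_tok_per_sec(model_gb, 120.0)    # 7.5 tok/s ceiling on M4
m5 = max_tok_per_sec(model_gb, 153.0)    # ~9.6 tok/s ceiling on M5
speedup = m5 / m4                        # 153/120 = 1.275, the ~27% ceiling
```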
| Model | TTFT speedup | Generation speedup | Memory (GB) |
|---|---|---|---|
| Qwen3-1.7B-MLX-bf16 | 3.57 | 1.27 | 4.40 |
| Qwen3-8B-MLX-bf16 | 3.62 | 1.24 | 17.46 |
| Qwen3-8B-MLX-4bit | 3.97 | 1.24 | 5.61 |
| Qwen3-14B-MLX-4bit | 4.06 | 1.19 | 9.16 |
| gpt-oss-20b-MXFP4-Q4 | 3.33 | 1.24 | 12.08 |
| Qwen3-30B-A3B-MLX-4bit | 3.52 | 1.25 | 17.31 |
Table 1: Inference speedups achieved with different LLMs using MLX on an M5 MacBook Pro (relative to M4), for TTFT and subsequent token generation, with the corresponding memory requirements. TTFT is compute-bound, while token generation is memory-bandwidth-bound.
For ML workloads dominated by large matrix multiplications, MLX leverages the GPU Neural Accelerators to accelerate time to first token in language model inference by up to 4x over the M4 baseline. Similarly, generating a 1024×1024 image with FLUX-dev-4bit (12B parameters) in MLX is more than 3.8x faster on M5 than on M4. We continue to add features and improve the performance of MLX, and we look forward to the new architectures and models that the ML community will explore and run on Apple silicon.
Get started with MLX:
[1] MLX works on all Apple silicon systems and is easy to install with `pip install mlx`. To take advantage of the enhanced performance of the M5's Neural Accelerators, MLX requires macOS 26.2 or later.
