At the 2024 Worldwide Developers Conference (WWDC), Apple announced several improvements aimed at streamlining the deployment and boosting the performance of on-device AI models. Key among them are significant enhancements to Core ML Tools, shipped in the 8.0b1 pre-release.
These updates are designed to make deploying machine learning (ML) models on Apple devices more efficient and effective. Here we discuss the details of these innovations, their implications for developers, and the benefits for end users.
Explanation of important terms
Before we dive into the details of the update, let’s clarify some important terminology:
Palettization
This technique reduces the precision of model weights by grouping them into clusters and representing each cluster with a single value. It is analogous to reducing an image to a limited color palette: each pixel stores a small index into the palette instead of a full color value. In machine learning, palettization replaces weight values with indices into a compact lookup table, significantly reducing the size of the model.
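To make the idea concrete, here is a toy NumPy sketch of palettization (an illustration of the concept, not the coremltools implementation): weights are clustered with a simple k-means, and each weight is replaced by a 4-bit index into a 16-entry lookup table.

```python
import numpy as np

def palettize(weights, n_bits=4, n_iters=10):
    """Toy palettization: cluster weights into 2**n_bits values (k-means)
    and store per-weight indices plus a small lookup table (LUT)."""
    k = 2 ** n_bits
    flat = weights.ravel()
    lut = np.linspace(flat.min(), flat.max(), k)  # initial centroids
    for _ in range(n_iters):
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        for c in range(k):
            members = flat[idx == c]
            if members.size:
                lut[c] = members.mean()           # move centroid to cluster mean
    idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx.reshape(weights.shape).astype(np.uint8)

w = np.random.randn(64, 64).astype(np.float32)
lut, indices = palettize(w)        # 16-entry LUT + 4-bit indices per weight
reconstructed = lut[indices]       # "dequantized" weights
print("max palettization error:", np.abs(w - reconstructed).max())
```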
Quantization
Quantization is the process of reducing the precision of weights and activations from floating-point numbers, such as 32-bit floats, to lower-precision numbers, such as 8-bit integers. This compression technique reduces model size and accelerates inference, since lower-precision arithmetic runs faster on supporting hardware.
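As a concrete illustration, here is a minimal symmetric 8-bit quantizer in NumPy (a toy sketch of the idea, not production code):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization: x is approximated by scale * q."""
    scale = np.abs(x).max() / 127.0   # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = q.astype(np.float32) * scale  # dequantize
print("max rounding error:", np.abs(x - x_hat).max())
```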
Block-wise quantization
This variation of quantization splits the model weights into smaller blocks and quantizes each block separately with its own parameters, so the quantization adapts to the local range of values in each block, resulting in improved accuracy.
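Extending the sketch above, block-wise quantization simply gives every block its own scale (again a toy illustration):

```python
import numpy as np

def quantize_blockwise(x, block_size=32):
    """Quantize a 1-D weight vector in blocks, one scale per block."""
    blocks = x.reshape(-1, block_size)
    # Guard against all-zero blocks to avoid division by zero.
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales

x = np.random.randn(4096).astype(np.float32)
q, scales = quantize_blockwise(x)
x_hat = (q.astype(np.float32) * scales).ravel()
print("max error per element:", np.abs(x - x_hat).max())
```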
Pruning
Pruning is a compression technique that removes the weights that have the least impact on a model's predictions. The least important weights are set to zero, and the resulting sparse weight matrices can be stored efficiently using a sparse representation.
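A toy magnitude-pruning sketch in NumPy:

```python
import numpy as np

def prune(weights, sparsity=0.75):
    """Magnitude pruning: zero the smallest |w| so `sparsity` fraction is zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return weights * (np.abs(weights) >= threshold)

w = np.random.randn(64, 64).astype(np.float32)
w_pruned = prune(w)
print("fraction zeroed:", (w_pruned == 0).mean())  # ≈ 0.75
```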
Stateful Model
A stateful model is one that keeps track of information that needs to be passed across multiple runs of the model – that is, it maintains context between predictions. This is especially important in tasks like language modeling, where the model needs to remember previously generated words in order to generate the next ones correctly and coherently.
Core ML Tools (coremltools) is a Python package for converting third-party models into a format suitable for Core ML, Apple's framework for integrating machine learning models into apps. Core ML Tools supports conversion from popular libraries such as TensorFlow and PyTorch into the Core ML model package format.
Using the coremltools package, you can:
- Convert trained models from various libraries and different frameworks into the Core ML model package format.
- Read, write, and optimize Core ML models to reduce storage space, lower power consumption, and minimize inference latency.
- Make predictions with Core ML on macOS to verify that creation and conversion succeeded.
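For example, converting a PyTorch model through the long-standing TorchScript tracing path might look like the following sketch (the toy model and the input name are placeholders standing in for any trained network):

```python
import torch
import coremltools as ct

# A toy model standing in for any trained PyTorch network.
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU()).eval()
example_input = torch.rand(1, 4)

# Trace to TorchScript, then convert to the Core ML model package format.
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example_input.shape)],
)
mlmodel.save("ToyModel.mlpackage")
```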
Core ML provides a unified representation for all models, so apps can use Core ML APIs and user data to make predictions and fine-tune models directly on the user's device. This approach eliminates the need for a network connection, keeps user data private, and makes apps more responsive. Core ML leverages the CPU, GPU, and Neural Engine (NE) to optimize on-device performance while minimizing memory footprint and power consumption.
Now that we've covered the theory and terminology, let's dive into the new features and changes in the Core ML Tools 8.0b1 pre-release.
New Utilities and Stateful Models
Introduction of coremltools.utils.MultiFunctionDescriptor() and coremltools.utils.save_multifunction
These new utilities simplify producing ML programs with multiple functions that share weights, making your models more versatile and easier to use: you can load a specific function for prediction.
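A minimal sketch of how these utilities fit together, based on the 8.0b1 pre-release documentation (the .mlpackage paths and function names here are hypothetical, and the exact add_function signature should be verified against your installed version):

```python
from coremltools.utils import MultiFunctionDescriptor, save_multifunction

# Merge two single-function models that share weights into one ML program.
desc = MultiFunctionDescriptor()
desc.add_function("classifier.mlpackage", "main", "classifier")  # hypothetical path
desc.add_function("regressor.mlpackage", "main", "regressor")    # hypothetical path
desc.default_function_name = "classifier"
save_multifunction(desc, "combined.mlpackage")
```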
Core ML has also been enhanced to support stateful models: the converter can now generate models that use the new state types introduced in iOS 18 and macOS 15. These models can maintain information across inference runs, making them particularly useful for tasks that require the model to remember inputs it has seen in the past (for example, a language model's attention key-value cache).
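As a sketch of what stateful conversion looks like (modeled on the pre-release examples; treat the exact argument names as assumptions), a registered PyTorch buffer becomes Core ML state:

```python
import torch
import coremltools as ct

class Accumulator(torch.nn.Module):
    """Toy stateful model: a running sum preserved across prediction calls."""
    def __init__(self):
        super().__init__()
        # Registered buffers can be exposed as Core ML state.
        self.register_buffer("accumulator", torch.zeros(1))

    def forward(self, x):
        self.accumulator.add_(x)       # in-place update becomes a state write
        return self.accumulator * 1.0  # read out the updated state

traced = torch.jit.trace(Accumulator().eval(), torch.tensor([1.0]))
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1,))],
    # Declare the buffer as a state tensor (new in iOS 18 / macOS 15).
    states=[ct.StateType(wrapped_type=ct.TensorType(shape=(1,)), name="accumulator")],
    minimum_deployment_target=ct.target.iOS18,
)
```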
Advanced Compression Techniques
Core ML Tools now offers an expanded range of compression features to reduce model size while maintaining performance. The updated coremltools.optimize module now supports:
- Block-wise quantization: The model weights are split into small sections and quantized separately, allowing for more precise control over quantization.
- Grouped-channel palettization: Shares a lookup table across groups of channels rather than one table per tensor, reducing the number of unique weight values while allowing greater flexibility and precision.
- 4-bit weight quantization: This cuts storage needs in half compared to 8-bit quantization, further reducing model size.
- 3-bit palettization: Expands the available bit-depth options for palettization, representing weight cluster indices with only 3 bits for even higher compression.
These techniques, in addition to joint compression modes such as 8-bit look-up tables (LUTs) for palettization, and weight pruning combined with quantization or palettization, provide efficient tools to reduce model size and improve performance.
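As a hedged sketch, 4-bit block-wise weight quantization with the updated coremltools.optimize.coreml API might look like this (parameter names follow the 8.0b1 release notes and should be verified against your installed version; the model path is hypothetical):

```python
import coremltools as ct
import coremltools.optimize as cto

# Configure 4-bit, block-wise linear quantization of the weights.
op_config = cto.coreml.OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int4",              # new 4-bit weight option
    granularity="per_block",   # block-wise quantization
    block_size=32,
)
config = cto.coreml.OptimizationConfig(global_config=op_config)

mlmodel = ct.models.MLModel("model.mlpackage")  # hypothetical path
compressed = cto.coreml.linear_quantize_weights(mlmodel, config=config)
compressed.save("model_w4.mlpackage")
```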
Advanced API Improvements: Compression and Quantization
The coremltools.optimize module includes important API updates to support advanced compression techniques. For example, a new activation-quantization API based on calibration data can turn a W16A16 Core ML model (16-bit weights and activations) into a W8A8 model (8-bit weights and activations) to improve efficiency while maintaining accuracy. Additionally, coremltools.optimize.torch has gained both data-free compression methods and methods driven by calibration data, making it easier to optimize PyTorch models for Core ML.
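A hedged sketch of the calibration-based activation-quantization flow follows; in the 8.0b1 pre-release this appears to live under an experimental namespace, so the module path, names, and input shapes here are assumptions that may change by the final release:

```python
import numpy as np
import coremltools as ct
import coremltools.optimize as cto
from coremltools.optimize.coreml.experimental import (
    OpActivationLinearQuantizerConfig,
    linear_quantize_activations,
)

mlmodel = ct.models.MLModel("model_w8.mlpackage")  # hypothetical weight-quantized model

act_config = cto.coreml.OptimizationConfig(
    global_config=OpActivationLinearQuantizerConfig(mode="linear_symmetric")
)

# Calibration data: a list of input dicts keyed by the model's input names
# (the name "x" and the shape below are placeholders).
sample_data = [{"x": np.random.rand(1, 3, 224, 224)} for _ in range(16)]

mlmodel_w8a8 = linear_quantize_activations(mlmodel, act_config, sample_data)
```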
iOS 18/macOS 15 optimization
The latest operating systems support new operations such as constexpr_blockwise_shift_scale, constexpr_lut_to_dense, and constexpr_sparse_to_dense, which are essential for efficient model compression. Updated Gated Recurrent Unit (GRU) operations and a newly added PyTorch scaled_dot_product_attention operation improve performance, allowing Transformer models and other complex architectures to run properly on Apple silicon. These updates ensure more efficient execution and better utilization of hardware capabilities.
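For instance, a model using PyTorch's scaled_dot_product_attention should now convert to the corresponding Core ML operation when targeting iOS 18. A minimal sketch (shapes are illustrative):

```python
import torch
import coremltools as ct

class Attention(torch.nn.Module):
    def forward(self, q, k, v):
        # Expected to map onto Core ML's new scaled_dot_product_attention op.
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)

shape = (1, 8, 16, 64)  # (batch, heads, sequence, head_dim)
example = tuple(torch.rand(shape) for _ in range(3))
traced = torch.jit.trace(Attention().eval(), example)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=shape) for _ in range(3)],
    minimum_deployment_target=ct.target.iOS18,
)
```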
Experimental Torch Export Conversion
Support for converting models exported with torch.export enables a seamless, direct path from PyTorch to Core ML. The process includes:
- Importing the required libraries
- Exporting the PyTorch model with torch.export
- Converting the exported program into a Core ML model with coremltools.convert
This simplified process reduces the complexity of deploying PyTorch models to Apple devices while taking advantage of the enhanced performance of Core ML.
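Putting those steps together (the toy model is a placeholder, and torch.export support was experimental in this release):

```python
import torch
import coremltools as ct

# 1. A toy PyTorch model standing in for any network.
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU()).eval()
example_args = (torch.rand(1, 4),)

# 2. Export with torch.export.
exported_program = torch.export.export(model, example_args)

# 3. Convert the ExportedProgram directly to Core ML.
mlmodel = ct.convert(exported_program)
mlmodel.save("ToyModel.mlpackage")
```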
Multi-function model
Multi-function model support in Core ML Tools allows you to combine models with shared weights into a single ML program. This is advantageous for applications that perform multiple tasks, such as combining a common feature extractor with a classifier and a regressor. The save_multifunction utility ensures that shared weights are not duplicated, saving storage space.
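Loading one function from the combined package for prediction might look like this (the function_name argument follows the pre-release multifunction docs; the package path, input name, and shape are hypothetical):

```python
import numpy as np
import coremltools as ct

# Load just the "classifier" function from a multifunction package.
model = ct.models.MLModel("combined.mlpackage", function_name="classifier")
prediction = model.predict({"x": np.random.rand(1, 4).astype(np.float32)})
```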
Improved performance and bug fixes
The new 8.0b1 version of Core ML Tools also includes various bug fixes, enhancements, and optimizations for a smoother development experience. Known issues have been addressed, such as conversion failures and incorrect quantization scales in certain palettization modes, improving the reliability and accuracy of compressed models.
Benefits for end users
The coremltools 8.0b1 pre-release enhancements provide several significant benefits to end users, improving the overall experience of AI-powered applications:
- Improved performance: Smaller, optimized models load and run faster on device, resulting in quicker responses and smoother interactions.
- Reduced app size: Compressed models take up less space, making apps lighter and more storage-efficient, which is especially helpful for users with limited space on mobile devices.
- Enhanced features: Multi-function and stateful models let apps offer more complex and innovative features, with more sophisticated functionality and more intelligent behavior.
- Improved battery life: Optimized model execution reduces energy consumption during intensive AI operations, extending battery life.
- Enhanced privacy: On-device AI processes user data locally, eliminating the risk of sending it to external servers.
Conclusion
The pre-release of coremltools 8.0b1 represents a major step forward in on-device AI model deployment. Developers can create more efficient, compact, and versatile ML models with improved compression techniques, stateful model support, and multi-function model utilities. These advancements underscore Apple's commitment to providing robust tools to help developers harness the power of Apple silicon, ultimately enabling faster, more efficient, and more performant on-device AI applications.
As Core ML and its ecosystem evolve, the possibilities for innovation in AI-powered apps will continue to expand, opening the door to more sophisticated and user-friendly experiences.
In the next post, we'll demonstrate these new features in action on a sample project and show you how to apply them in a real-world scenario. Stay tuned!