FastVLM: Efficient Vision Encoding for Vision Language Models

Machine Learning


Vision Language Models (VLMs) enable visual understanding alongside textual input. They are typically built by passing visual tokens from a pretrained vision encoder to a pretrained Large Language Model (LLM) through a projection layer. By leveraging the rich visual representations of the vision encoder and the world knowledge and reasoning capabilities of the LLM, VLMs are useful for a wide range of applications, including accessibility assistants, UI navigation, robotics, and gaming.

VLM accuracy generally improves with higher input image resolution, creating a trade-off between accuracy and efficiency. For many production use cases, VLMs need to be both accurate and efficient enough to run on-device, delivering AI experiences that meet the low-latency demands of real-time applications while preserving privacy.

In a paper accepted at CVPR 2025, Apple ML researchers recently shared a new technique to address this challenge: FastVLM, a new type of VLM that significantly improves the accuracy-latency trade-off with a simple design. Leveraging a hybrid-architecture vision encoder designed for high-resolution images, FastVLM delivers accurate, fast, and efficient visual query processing, making it suitable for powering real-time applications on-device. The MLX-based inference code, model checkpoints, and an iOS/macOS demo app are available here.

Image resolution and the accuracy-latency trade-off

In general, VLM accuracy improves with higher image resolution, especially for tasks that require detailed understanding, such as document analysis, UI recognition, or answering natural-language queries about images. For example, in Figure 1 below, we ask the VLM about the sign that appears in the image. On the left, the model receives a low-resolution image and cannot answer correctly. On the right, the VLM receives a high-resolution image and correctly identifies the "do not enter" traffic sign.

Figure 1: Comparison of VLM performance with low-resolution (left) and high-resolution (right) input images.

While high-resolution images improve accuracy, they reduce efficiency in two ways: 1) higher-resolution images take longer for the vision encoder to process, and 2) the encoder produces more visual tokens, which increases the LLM prefill time. Both factors increase the time-to-first-token (TTFT), which is the sum of the vision encoding latency and the LLM prefill time. As shown in Figure 2 below, both vision encoding and LLM prefill time grow as image resolution increases, and at high resolutions the vision encoder latency becomes the dominant bottleneck. To address this, our work introduces FastVLM, a new vision language model that significantly improves efficiency without sacrificing accuracy.
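To make this decomposition concrete, the short sketch below splits TTFT into the two components described above: vision-encoder latency and LLM prefill time. The cost model (linear-plus-quadratic encoder cost, linear prefill cost), the patch size, and all constants are hypothetical placeholders chosen only to illustrate the trend, not measurements from the paper.

```python
# Hypothetical cost model illustrating the TTFT decomposition.
# All constants below are illustrative placeholders, not measurements from the paper.

def num_visual_tokens(resolution: int, patch: int = 14) -> int:
    """A ViT-style encoder emits one token per image patch."""
    return (resolution // patch) ** 2

def vision_encoder_ms(tokens: int) -> float:
    """Per-patch compute (linear) plus self-attention (quadratic) in the token count."""
    return 0.05 * tokens + 5e-4 * tokens ** 2

def llm_prefill_ms(tokens: int) -> float:
    """Prefill time grows with the number of visual tokens the LLM must ingest."""
    return 0.2 * tokens

def time_to_first_token_ms(resolution: int) -> float:
    tokens = num_visual_tokens(resolution)
    return vision_encoder_ms(tokens) + llm_prefill_ms(tokens)

for res in (224, 336, 672, 1024):
    tokens = num_visual_tokens(res)
    print(f"{res}px: {tokens} tokens, "
          f"encode {vision_encoder_ms(tokens):.0f} ms, "
          f"prefill {llm_prefill_ms(tokens):.0f} ms, "
          f"TTFT {time_to_first_token_ms(res):.0f} ms")
```

Under this toy model the encoder term overtakes the prefill term as resolution grows, mirroring the qualitative behavior shown in Figure 2.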

Figure 2: Vision encoder latency dominates at high resolutions. Breakdown of FastVLM's time-to-first-token across different image resolutions for a 1.5B-parameter VLM (FP16). The vision encoder is FastViT-HD and the LLM has 1.5B parameters.

A hybrid vision encoder provides the best accuracy-latency trade-off

To identify which architecture provides the best accuracy-latency trade-off, we systematically compared existing pretrained vision encoders, keeping everything else (training data, recipe, LLM, etc.) the same and varying only the vision encoder. In Figure 3 below, the x-axis shows TTFT and the y-axis shows the average accuracy across different VLM tasks. The two single points correspond to the popular transformer-based encoders ViT-L/14 and SigLIP-SO400M, which are pretrained on image-text data at their native resolutions. The figure also shows curves for ConvNeXT (a fully convolutional encoder) and FastViT (a hybrid encoder combining convolutional and transformer blocks) at various resolutions. FastViT, which builds on two previous works (FastViT, ICCV 2023, and MobileCLIP, CVPR 2024), achieves the best accuracy-latency trade-off among these vision encoders, while being roughly 8x smaller and 20x faster than ViT-L/14.

Figure 3: Comparison of different vision architectures for visual encoding in VLMs. All vision encoders are CLIP-pretrained, and all VLMs are trained with the same setup (dataset, recipe, LLM size). The FastViT hybrid architecture achieves the best accuracy-latency trade-off. Avg-5 is the average performance across the GQA, TextVQA, DocVQA, SeedBench, and POPE benchmarks.

FastViT-HD: The best vision encoder for VLMs

The FastViT hybrid backbone is well suited to efficient VLMs, but a larger vision encoder is needed to improve accuracy on challenging tasks. We initially explored simply scaling up the size of each FastViT layer, but this naive scaling made the encoder less competitive with other high-resolution encoders. To address this, we designed a new backbone, FastViT-HD, specifically for high-resolution images. Compared to FastViT, FastViT-HD includes an additional stage and is pretrained with the MobileCLIP recipe, generating fewer but higher-quality visual tokens.

FastViT-HD has better latency than FastViT on high-resolution images, but to evaluate which works best in a VLM, we compared their performance when paired with LLMs of different sizes. We evaluated different (image resolution, LLM size) pairs using three LLMs with 0.5B, 1.5B, and 7B parameters (each corresponding to a curve in Figure 4 below), pairing each LLM with the vision backbone run at different resolutions.

As shown in Figure 4, using very high-resolution images with a small LLM is not always the best choice; instead of increasing the resolution further, it can be better to switch to a larger LLM. For each backbone, we plot a dashed Pareto-optimal curve, which indicates the best (image resolution, LLM size) combination for a given runtime budget (here, TTFT). Comparing the Pareto-optimal curves, FastVLM (based on FastViT-HD) offers a much better accuracy-latency trade-off than FastViT-based models: it can be up to 3x faster at the same accuracy. Note that FastViT was already shown above to be significantly better than purely transformer-based or purely convolutional encoders.
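To illustrate how such a Pareto-optimal curve is obtained, the sketch below keeps only the (image resolution, LLM size) configurations that are not dominated by any other configuration, i.e., no other configuration is both faster and more accurate. The example data points are invented for demonstration and are not results from the paper.

```python
from typing import NamedTuple

class Config(NamedTuple):
    resolution: int   # input image resolution
    llm_size: str     # LLM parameter count, e.g. "0.5B"
    ttft_ms: float    # time-to-first-token (runtime budget)
    accuracy: float   # average benchmark accuracy

def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep configurations for which no other configuration has both lower TTFT and higher accuracy."""
    frontier, best_accuracy = [], float("-inf")
    for cfg in sorted(configs, key=lambda c: c.ttft_ms):
        if cfg.accuracy > best_accuracy:   # strictly better than every faster configuration
            frontier.append(cfg)
            best_accuracy = cfg.accuracy
    return frontier

# Invented example configurations (not measurements from the paper).
configs = [
    Config(256, "0.5B", 40, 55.0),
    Config(1024, "0.5B", 200, 58.5),   # high resolution, small LLM: dominated below
    Config(512, "0.5B", 80, 58.0),
    Config(512, "1.5B", 150, 62.0),    # switching to a larger LLM beats raising resolution
    Config(768, "7B", 600, 68.0),
]
for cfg in pareto_frontier(configs):
    print(cfg)
```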

Figure 4: Comparison of the FastViT and FastViT-HD backbones paired with LLMs of different sizes and run at different image resolutions. The dashed lines show the Pareto-optimal curves for the two vision backbones. Note that the x-axis is on a log scale. Avg-5 is the average performance across the GQA, TextVQA, DocVQA, SeedBench, and POPE benchmarks.

FastVLM: A new VLM based on FastViT-HD

FastViT-HD is a hybrid convolution-transformer architecture consisting of a convolutional stem, three convolutional stages, and two subsequent stages of transformer blocks. Before each stage, a patch-embedding layer reduces the spatial dimensions of the input tensor by a factor of 2. Using FastViT-HD as the vision encoder, FastVLM is constructed with a simple multi-layer perceptron (MLP) module that projects the visual tokens into the embedding space of the LLM, as shown in Figure 5.
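As a schematic illustration of how these pieces connect, the sketch below wires a generic vision backbone (standing in for FastViT-HD, treated here as a black box), an MLP projector, and an LLM. The module names, hidden sizes, and the two-layer projector are assumptions made for illustration; this is not the released implementation.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Projects visual tokens into the LLM embedding space (illustrative two-layer MLP)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_tokens)

class FastVLMSketch(nn.Module):
    """Vision encoder -> MLP projector -> LLM, following the structure described in the text."""
    def __init__(self, vision_encoder: nn.Module, projector: MLPProjector, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a FastViT-HD-style hybrid backbone
        self.projector = projector
        self.llm = llm

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.vision_encoder(image)      # (B, N_visual, vision_dim)
        visual_embeds = self.projector(visual_tokens)   # (B, N_visual, llm_dim)
        # Projected visual tokens are prepended to the text embeddings for the LLM prefill.
        return self.llm(torch.cat([visual_embeds, text_embeds], dim=1))
```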

Figure 5: Overview of the FastVLM architecture. FastVLM uses FastViT-HD, a new vision encoder that incorporates multi-scale pooling, additional self-attention layers, and downsampling, generating 4x fewer tokens than FastViT and 16x fewer than ViT-L/14 at an image resolution of 336×336.

FastVLM outperforms token pruning and merging methods

Previous work on accelerating VLMs has adopted complex token merging or pruning techniques to reduce the number of visual tokens and thereby speed up the LLM prefill (and thus reduce the time-to-first-token). As shown in Figure 6 below, FastVLM achieves higher overall accuracy across different visual token counts (corresponding to different input resolutions) compared to these approaches. Thanks to the high-quality visual tokens produced by the FastViT-HD encoder, FastVLM does not require complex token pruning or merging, which makes it simpler to deploy.
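For context, here is a minimal sketch of one generic form of token merging: average-pooling each 2x2 neighborhood on the visual token grid to cut the token count by 4x. This is an illustrative stand-in rather than any specific published pruning or merging method, and, as noted above, FastVLM does not rely on such a step.

```python
import torch
import torch.nn.functional as F

def merge_tokens_2x2(visual_tokens: torch.Tensor, grid: int) -> torch.Tensor:
    """Reduce visual tokens 4x by average-pooling each 2x2 neighborhood of the token grid.

    visual_tokens: (batch, grid * grid, dim) tokens laid out on a square grid.
    """
    b, n, d = visual_tokens.shape
    assert n == grid * grid and grid % 2 == 0
    x = visual_tokens.view(b, grid, grid, d).permute(0, 3, 1, 2)  # (B, D, H, W)
    x = F.avg_pool2d(x, kernel_size=2)                            # (B, D, H/2, W/2)
    return x.permute(0, 2, 3, 1).reshape(b, n // 4, d)

# Example: 576 tokens (a 24x24 grid, as with ViT-L/14 at 336px) -> 144 tokens.
tokens = torch.randn(1, 576, 1024)
print(merge_tokens_2x2(tokens, grid=24).shape)  # torch.Size([1, 144, 1024])
```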

Figure 6: Average performance of FastVLM at different input image resolutions (corresponding to different numbers of visual tokens), compared with various token pruning and merging methods. The y-axis is the average performance across the GQA, TextVQA, ScienceQA, SeedBench, and POPE benchmarks.

FastVLM and dynamic tiling

As mentioned earlier, VLM accuracy improves with input resolution, especially for tasks that require understanding fine-grained details. Dynamic tiling (for example, AnyRes) is a common way to process very high-resolution images: the image is split into smaller tiles, each tile is processed independently by the vision encoder, and all resulting tokens are passed to the LLM, as shown below.
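A minimal sketch of this tiling idea is shown below, assuming a simple fixed-grid split and a generic `encode` callable standing in for the vision encoder. Practical AnyRes-style implementations also choose the tile grid adaptively based on the image's size and aspect ratio, which is omitted here.

```python
import torch
import torch.nn.functional as F

def split_into_tiles(image: torch.Tensor, tile: int = 336) -> list[torch.Tensor]:
    """Split a (C, H, W) image into non-overlapping tile x tile crops."""
    _, h, w = image.shape
    return [
        image[:, y:y + tile, x:x + tile]
        for y in range(0, h - tile + 1, tile)
        for x in range(0, w - tile + 1, tile)
    ]

def encode_with_tiling(image: torch.Tensor, encode, tile: int = 336) -> torch.Tensor:
    """Encode each tile plus a low-resolution view of the full image, then concatenate all tokens."""
    tiles = split_into_tiles(image, tile)
    # Low-resolution version of the full image, resized down to a single tile.
    lowres = F.interpolate(image.unsqueeze(0), size=(tile, tile),
                           mode="bilinear", align_corners=False).squeeze(0)
    token_groups = [encode(view) for view in tiles + [lowres]]  # each: (N_tokens, dim)
    return torch.cat(token_groups, dim=0)  # all visual tokens are passed to the LLM
```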

Figure 7: The dynamic tiling (AnyRes) approach encodes different sub-regions of the image (tiles) individually and passes all tokens to the LLM, along with a low-resolution version of the full image.

Since FastVLM processes high-resolution images natively and efficiently, we examined whether combining FastVLM with dynamic tiling would further improve the accuracy-latency trade-off. Figure 8 below shows that FastVLM without tiling (blue curves) achieves a better accuracy-latency trade-off than dynamic tiling (pink points) up to very high image resolutions.

Figure 8: For FastVLM, dynamic tiling (AnyRes) only helps at the highest resolutions, and then only with a small tile grid (2×2). Tile grid sizes are given in parentheses. Note that the x-axis is on a log scale. Avg-5 is the average performance across the GQA, TextVQA, DocVQA, SeedBench, and POPE benchmarks.

FastVLM is faster and more accurate than popular VLMs of comparable size

Finally, we compared FastVLM with other popular VLMs. Figure 9 below shows two curves for FastVLM: one evaluated with AnyRes tiling (to reach the highest accuracy) and one without, each spanning three different LLM sizes. As indicated by the arrows, FastVLM is significantly faster and more accurate than popular models of comparable size: it is 85x faster than LLaVA-OneVision (0.5B LLM), 5.2x faster than SmolVLM (~0.5B LLM), and 21x faster than Cambrian-1 (7B LLM).

Figure 9: Comparison of FastVLM with popular VLMs. The arrows highlight comparisons with similarly sized VLMs, where FastVLM achieves higher accuracy while being significantly faster. The y-axis is the average performance across the ChartQA, TextVQA, DocVQA, OCRBench, AI2D, MMMU, and ScienceQA benchmarks.

To further demonstrate FastVLM's on-device efficiency, we have released an iOS/macOS demo app based on MLX. Figure 10 shows an example of FastVLM running locally on an iPhone GPU; its near-real-time performance enables new on-device features and experiences.

Figure 10: Demo app running the FastVLM 0.5B model on an iPhone 16 Pro. The time-to-first-token is displayed on screen, highlighting near-real-time performance.

Conclusion

By combining visual and textual understanding, VLMs can power a wide variety of useful applications. Because the accuracy of these models generally scales with the resolution of the input image, there is a trade-off between accuracy and efficiency, which limits the value of VLMs in applications that require both high accuracy and high efficiency.

FastVLM addresses this trade-off by leveraging a hybrid-architecture vision encoder built for high-resolution images. With its simple design, FastVLM surpasses previous approaches in both accuracy and efficiency, enabling visual query processing suitable for real-time, on-device applications.


