Accelerate Gemini Nano models on Pixel using frozen multi-token predictions



On-device models like Gemini Nano and Gemma make it possible to carry powerful large language models (LLMs) in your pocket. This technology lets you perform everyday tasks on your phone, such as instantly summarizing long notifications or proofreading important text messages, without sending any private data off your device. However, for these features to be useful to everyday users, they must run quickly and efficiently.

Achieving this kind of speed on mobile devices is a huge challenge. Unlike vast server environments, mobile phones operate under tight energy budgets and hard memory (RAM) limits. Additionally, standard language models generate text “autoregressively”: they process and output only one word (or token) at a time. This step-by-step process can create bottlenecks that underutilize the phone’s processing power, tax its memory bandwidth, and ultimately lead to a poor user experience and a drained battery.

To overcome this bottleneck, we present a new architecture that improves multi-token prediction (MTP) on the existing “frozen” Gemini Nano v3 model. Building on previous approaches such as the EAGLE framework and Confident Adaptive Language Modeling (CALM), we designed new architectural components that maximize the efficiency gains of MTP specifically for mobile environments. Recent announcements also highlight the use of MTP to accelerate Gemma 4 and make it available to developers.
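To make the contrast with one-token-at-a-time decoding concrete, here is a minimal sketch of a generic draft-and-verify loop in the style of speculative decoding, assuming greedy decoding. It is not the architecture described in this article: `toy_base_next` and `toy_draft_next` are hypothetical stand-ins for a frozen base model and a cheap drafter, and the token rule is invented purely for illustration. The sketch counts base-model passes, on the assumption that a real system checks all drafted positions in one parallel forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def toy_base_next(context: List[Token]) -> Token:
    """Hypothetical frozen base model: a deterministic next-token rule."""
    return (context[-1] * 3 + 1) % 7


def toy_draft_next(context: List[Token]) -> Token:
    """Hypothetical cheap drafter: matches the base model except after token 5."""
    if context[-1] == 5:
        return 0  # deliberate mistake so that verification has something to reject
    return toy_base_next(context)


def generate_autoregressive(
    base_next: NextFn, prompt: List[Token], n_new: int
) -> Tuple[List[Token], int]:
    """Baseline: one base-model call per generated token."""
    tokens = list(prompt)
    calls = 0
    for _ in range(n_new):
        tokens.append(base_next(tokens))
        calls += 1
    return tokens, calls


def generate_draft_and_verify(
    base_next: NextFn, draft_next: NextFn, prompt: List[Token], n_new: int, k: int
) -> Tuple[List[Token], int]:
    """Draft k tokens cheaply, then verify them in one (simulated) base-model pass."""
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    passes = 0
    while len(tokens) < target_len:
        # 1) Draft k candidate tokens with the cheap drafter.
        draft: List[Token] = []
        ctx = list(tokens)
        for _ in range(k):
            candidate = draft_next(ctx)
            draft.append(candidate)
            ctx.append(candidate)

        # 2) One verification pass. The Python loop below stands in for a single
        #    parallel forward pass of the base model over all drafted positions.
        passes += 1
        ctx = list(tokens)
        for candidate in draft:
            expected = base_next(ctx)
            if candidate != expected:
                ctx.append(expected)  # keep the base model's token, drop the rest
                break
            ctx.append(candidate)
        else:
            ctx.append(base_next(ctx))  # all drafts accepted: one bonus token

        tokens = ctx[:target_len]
    return tokens, passes


baseline, calls = generate_autoregressive(toy_base_next, [1], 12)
fast, passes = generate_draft_and_verify(toy_base_next, toy_draft_next, [1], 12, k=3)
print(baseline == fast, calls, passes)
```

Because every drafted token is checked against the base model's own greedy choice, the output is identical to the baseline; only the number of base-model passes changes. How much is saved depends on how often the drafter agrees with the base model, which is why drafter quality matters in practice.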

Today’s article explores the extreme constraints inherent in edge computing and how this approach addresses them. Recently introduced to the Pixel 9 and 10 series, it works as an out-of-the-box speedup. For users, this means that features like AI notification summaries and proofreading generate text much faster while consuming less energy. For developers, it solves a major pain point by providing fast on-device AI without the need to fine-tune separate, memory-intensive drafting models for each new task.


