● More resource-efficient small language models (SLMs) offer a promising alternative to today’s large general-purpose LLMs.
● Mixture of Experts (MoE), model fusion, and retrieval-augmented generation (RAG) are some of the techniques being explored to build AI compact enough for smartphone and edge computing deployments.
● However, energy rebound effects can still occur: efficiency gains from smaller, leaner models may simply encourage the deployment of ever-larger systems.
As researcher Sasha Luccioni pointed out in a June 2025 interview with Hello Future, the carbon footprint of today’s highly powerful large language models (LLMs) is an increasing cause for concern. Alternative approaches built around lighter, more specialized small language models could help alleviate this problem. “When we talk about AI, we often think of one monolithic model that can do everything, but this versatility comes at a price.” These small-scale models can be enhanced with techniques such as retrieval-augmented generation (RAG) and tool calling, which let them interact with external resources to handle customer support or power specific modules for personalized learning.
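For illustration, here is a minimal sketch of the retrieval pattern behind RAG: relevant documents are looked up at query time and prepended to the prompt, so knowledge lives outside the model’s weights. The `embed()`, `cosine()` and `answer()` helpers are hypothetical toys, not any specific library’s API; a real system would use dense vector embeddings and an actual model call.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# embed() is a toy bag-of-words stand-in for a real dense encoder,
# and answer() only assembles the prompt; the model call itself is
# deliberately left out.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A tiny external knowledge base, e.g. customer-support snippets.
documents = [
    "Refunds are processed within 14 days of the return request.",
    "The warranty covers manufacturing defects for two years.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    # The retrieved context is prepended so a small model can answer
    # from it instead of storing the knowledge in its weights.
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(answer("How long do refunds take?"))
```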
Just like humans, we can provide AI with the ability to remember past interactions.
Potential and operational challenges of personalization
“However, it is important to note,” Gwenolé Lecorvé adds, “that personalization and downsizing are two very different things, even though they may overlap.” Personalization does not necessarily require creating a dedicated model for each user. Instead, it can rely on “overlays that adapt model behavior to specific themes, contexts, and users.” These additional layers guide the generation process without touching the core model, as the sketch below illustrates.
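One common way to implement such an overlay is a LoRA-style low-rank adapter added on top of frozen weights. The following is only a hedged sketch of that idea; the dimensions and variable names are illustrative assumptions, not a specific framework’s API.

```python
# Hypothetical sketch of an "overlay": a low-rank adapter (as in
# LoRA) added on top of a frozen weight matrix. Only A and B would
# be trained; the core model weight W_frozen is never updated.
import numpy as np

rng = np.random.default_rng(0)
d = 8   # hidden size of the frozen base layer (arbitrary here)
r = 2   # adapter rank; r << d keeps the overlay tiny

W_frozen = rng.standard_normal((d, d))   # core model weight, untouched
A = rng.standard_normal((d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                     # trainable up-projection (starts at 0)

def forward(x: np.ndarray) -> np.ndarray:
    # The overlay adds a low-rank correction; while B is all zeros,
    # the layer behaves exactly like the untouched base model.
    return x @ W_frozen + x @ A @ B

x = rng.standard_normal((1, d))
print(forward(x).shape)  # (1, 8)
```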
The need for this approach stems from a major hurdle: memory. “A model’s capabilities are fixed. In other words, when a model learns something new, some of its existing knowledge and capabilities may be eroded” — a phenomenon known as catastrophic forgetting. Unlike with the human brain, we are still trying to work out which specific parts of a transformer model encode which knowledge and skills. Turning to external sources of knowledge, such as RAG and deep research, helps work around this limitation.
By storing interaction history (in raw or encoded form), these techniques can build knowledge sources that a model can draw on. “If you interact with a model frequently,” the researcher explains, “we can, just like with humans, give it the ability to remember past interactions.” This memory takes the form of external systems that models can leverage to better contextualize and personalize their responses; a toy version follows.
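As an illustration of such an external memory, here is a minimal sketch assuming a simple keyword-overlap recall; the `store()` and `recall()` helpers are hypothetical, and a real system would rank turns with embeddings or a RAG index.

```python
# Toy external interaction memory: the history lives outside the
# frozen model and relevant past turns are recalled into the prompt.
from collections import deque

class InteractionMemory:
    def __init__(self, max_turns: int = 100):
        self.history = deque(maxlen=max_turns)  # raw interaction log

    def store(self, user_msg: str, model_msg: str) -> None:
        self.history.append((user_msg, model_msg))

    def recall(self, query: str, k: int = 2) -> list[tuple[str, str]]:
        # Naive keyword overlap; real systems would use embeddings.
        words = set(query.lower().split())
        scored = sorted(
            self.history,
            key=lambda turn: len(words & set(turn[0].lower().split())),
            reverse=True,
        )
        return scored[:k]

memory = InteractionMemory()
memory.store("My cat is named Mochi.", "Nice to meet Mochi!")
print(memory.recall("What is my cat called?"))
```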
Modularity and hybrid architectures
As Gwenolé Lecorvé explains, there are other approaches to building smaller, more specialized AI systems, particularly Mixture of Experts (MoE) architectures. “Instead of relying on one large network, MoE combines several specialized sub-models, or experts. Only a small number of these experts are activated for any given question, reducing compute and memory requirements.” Certain versions of Llama 4, Qwen3, GPT-OSS, and Mistral models already use this approach.
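A minimal sketch of the routing step that gives MoE its efficiency: a gate scores all experts, but only the top-k are actually executed for a given input. The sizes and the dense stand-in “experts” below are illustrative assumptions, not the layout of any of the models named above.

```python
# Illustrative Mixture-of-Experts routing: score every expert,
# run only the top-k, and mix their outputs with softmax weights.
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, top_k = 16, 8, 2

W_gate = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ W_gate                    # gate scores every expert...
    chosen = np.argsort(logits)[-top_k:]   # ...but only the top-k run
    weights = np.exp(logits[chosen])
    weights /= weights.sum()               # softmax over the chosen experts
    # Only top_k of the n_experts matrices are touched per input,
    # which is where the compute and memory savings come from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

x = rng.standard_normal(d)
print(moe_layer(x).shape)  # (16,)
```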
In reality, however, researchers warn that the alluring prospect of improved power efficiency through MoE could ultimately lead to an energy rebound effect. “With MoE experts, there is nothing stopping us from deploying very large models, for example up to 20 billion parameters.” This is precisely the strategy espoused by Yann LeCun with the JEPA (Joint Embedding Predictive Architecture) model, which is designed to understand and predict physical behavior in the real world. Although such models need only 62 hours of direct interaction data to learn how to navigate new situations, this “fine-tuning” window rests on a large foundation of prior training that has already consumed vast amounts of data.
In addition to this myriad of alternatives (many of which are not yet fully mature), there is also the possibility of merging models. As Gwenolé Lecorvé explains, this means combining specialized models into a more comprehensive system. “This is still an experimental approach because we don’t know how the internal parameters will interact. Combining two models is like superimposing two drawings; we cannot guarantee that the results will match.” A minimal sketch of the idea follows.
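The simplest concrete instance of model fusion is element-wise weight averaging (sometimes called a “model soup”). The sketch below is purely illustrative: it assumes two models with identical architectures, and, as the quote above warns, nothing guarantees the merged parameters cooperate.

```python
# Sketch of the simplest form of model fusion: interpolating the
# weights of two same-shaped models. Random matrices stand in for
# real checkpoints here.
import numpy as np

rng = np.random.default_rng(2)
shape = (4, 4)

weights_a = rng.standard_normal(shape)  # e.g. a model tuned for support
weights_b = rng.standard_normal(shape)  # e.g. a model tuned for tutoring

alpha = 0.5  # interpolation coefficient between the two parents
merged = alpha * weights_a + (1 - alpha) * weights_b

print(merged.shape)  # (4, 4): same architecture, blended behavior
```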
What about the future?
The researcher believes that the key to a lean yet powerful LLM is hybridity. “Like a hybrid car, you can build multi-layered systems that switch between different engines: small models for basic tasks, medium models for moderate difficulty, and large models for complex requests.” Routing mechanisms that distribute workloads this way are already built into the architecture of models such as GPT-5. “A small model can handle 90% of tasks locally on a smartphone, and the remaining 10% of requests can be offloaded to a more powerful model over the network.” This frugal approach reduces strain on infrastructure and is within reach: as Gwenolé Lecorvé points out, we already have all the building blocks; we just need to put them together. Of course, “specialized knowledge and craftsmanship will be required,” but the direction is clear.
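To make the routing idea concrete, here is a hedged sketch of a confidence-based cascade. The `small_model()` and `large_model()` functions and the threshold are placeholder assumptions standing in for a real on-device model and a remote one.

```python
# Toy cascade router: try the small on-device model first and
# escalate only low-confidence requests to a larger remote model.
def small_model(query: str) -> tuple[str, float]:
    # Pretend the local model returns an answer and a confidence score.
    confidence = 0.95 if len(query.split()) < 10 else 0.40
    return f"[local answer to: {query}]", confidence

def large_model(query: str) -> str:
    # Stand-in for a network call to a more powerful remote model.
    return f"[remote answer to: {query}]"

def route(query: str, threshold: float = 0.8) -> str:
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer            # most traffic stays on-device
    return large_model(query)    # only the hard tail is offloaded

print(route("What time is it?"))
print(route("Summarize the regulatory implications of this 40-page contract."))
```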

