A new approach to continual learning



Google has introduced Nested Learning, a new machine learning approach designed to address the catastrophic forgetting challenge in continual learning. Announced in a paper at NeurIPS 2025, Nested Learning restructures a single model as a system of interconnected multilevel optimization problems that are optimized simultaneously. This paradigm bridges the gap between model architectures and optimization algorithms by treating them as essentially the same concept (different levels of optimization, each with its own context flow and update rate), ultimately enabling greater computational depth and improved long-context memory management.

Nested learning: a new machine learning paradigm

Nested learning presents a new machine learning paradigm that views models as interconnected multilevel optimization problems. This approach bridges the gap between a model’s architecture and its training algorithm, recognizing them as different “levels” of optimization. By making this inherent structure explicit, nested learning introduces a new dimension to AI design, enabling learning components with greater computational depth and ultimately addressing problems such as catastrophic forgetting.

The core of nested learning is the recognition that a complex model is actually a set of nested, or parallel, optimization problems, each with its own “context flow,” a separate stream of information it learns from. From this perspective, current deep learning methods work by compressing these internal context flows. The paper further shows that concepts such as backpropagation and attention mechanisms can be formalized as simple associative memory modules, revealing a uniform, reusable structure for model design.
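To make the associative-memory framing of attention concrete, here is a minimal NumPy sketch (class and variable names are our own, not from the paper) showing that a single-query, unscaled softmax attention read is exactly a similarity-weighted recall from a key–value associative memory:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class AssociativeMemory:
    """Stores key -> value pairs; a read is a similarity-weighted recall."""
    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def write(self, key, value):
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])

    def read(self, query):
        weights = softmax(self.keys @ query)   # attention weights
        return weights @ self.values           # weighted recall

rng = np.random.default_rng(0)
K, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
mem = AssociativeMemory(dim=4)
for k, v in zip(K, V):
    mem.write(k, v)

q = rng.normal(size=4)
# Reading the memory reproduces single-query (unscaled) softmax attention:
assert np.allclose(mem.read(q), softmax(K @ q) @ V)
```

The same uniform structure (a write rule plus a similarity-based read) is what lets different components be treated as interchangeable memory modules.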

As a proof of concept, the “Hope” architecture was developed using nested learning principles. Hope is a self-modifying recurrent architecture powered by a “continuum memory system” (CMS), a spectrum of memory modules that are updated at different frequencies. Experiments show that Hope achieves lower perplexity and higher accuracy on language modeling and common-sense reasoning tasks than standard recurrent models and transformers, validating the nested learning paradigm.

Dealing with catastrophic forgetting in machine learning

Nested learning provides a new approach to machine learning designed to address “catastrophic forgetting,” where learning a new task degrades performance on previously learned tasks. This paradigm views a single model not as one monolithic process but as a system of interconnected multilevel optimization problems. By recognizing that the model architecture and the training rules are essentially the same concept, nested learning makes it possible to build learning components with greater computational depth. This mirrors the neuroplasticity of the human brain and is key to reducing forgetting and enabling continual learning.

The core of nested learning lies in assigning an update frequency to each interconnected optimization problem and ordering the problems into levels. This framework extends the memory concept of models such as transformers into a “continuum memory system” (CMS), which views memory as a spectrum of modules, each updated at its own rate, yielding a richer and more effective system for continual learning than standard approaches with only a few update levels.
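The idea of ordering optimization problems by update frequency can be sketched in a few lines. The level names and periods below are illustrative assumptions, not values from the paper:

```python
# Hypothetical sketch: three "levels", each updated at its own frequency.
# In nested learning terms, faster levels adapt within-context while slower
# levels consolidate information over longer horizons.
update_every = {"fast": 1, "medium": 8, "slow": 64}   # steps between updates

def train(steps, apply_update):
    for step in range(1, steps + 1):
        for level, period in update_every.items():
            if step % period == 0:
                apply_update(level, step)

counts = {"fast": 0, "medium": 0, "slow": 0}
def bump(level, step):
    counts[level] += 1

train(64, bump)
# over 64 steps: the fast level updated every step, the slow level only once
```

Ordering levels this way is what gives the framework its "nesting": each slower level sees the cumulative effect of the faster levels inside it.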

As a proof of concept, the researchers designed “Hope,” a self-modifying architecture that leverages nested learning and CMS blocks. Experiments demonstrate Hope’s superior performance in language modeling and long-context reasoning, with lower perplexity and higher accuracy than state-of-the-art recurrent models and standard transformers. This demonstrates the power of nested learning and the potential to build more capable AI systems that learn continually without forgetting.

We believe that the nested learning paradigm provides a solid foundation for bridging the gap between the limited, forgetful nature of current LLMs and the remarkable continual learning capacity of the human brain.

Nested learning paradigm and model design

Nested learning views a machine learning model not as a single process but as a system of interconnected multilevel optimization problems operating simultaneously. The paradigm recognizes that a model’s architecture and its training rules are essentially the same concept: different “levels” of optimization, each with its own “context flow” and update rate. By exploiting this inherent structure, nested learning enables the design of AI components with greater computational depth and addresses problems such as catastrophic forgetting through multi-timescale updates.

Nested learning also reveals that existing deep learning methods inherently compress their internal context flows. The researchers show that the training process itself, specifically backpropagation, can be modeled as an associative memory that maps data points to their local error values. Key architectural components, such as the transformer’s attention mechanism, can likewise be formalized as simple associative memory modules. This framing enables a “continuum memory system” (CMS) in which memory is a spectrum of modules, each updated at its own frequency.
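As an illustration of this "training as associative memory" view (a sketch under our own simplification, not the paper's formulation): an online gradient step on a linear layer can be read as writing an association between a data point and its local error:

```python
import numpy as np

rng = np.random.default_rng(1)
W = np.zeros((2, 2))   # a linear layer, viewed as the "memory" being written
lr = 0.1

for _ in range(200):
    x = rng.normal(size=2)
    target = x[::-1]                 # toy task: swap the two coordinates
    error = W @ x - target           # local error for this data point
    W -= lr * np.outer(error, x)     # "write": associate x with its error

# After these writes, W approximately implements the coordinate swap.
```

Each gradient update has the same write-rule shape as the associative memory above, which is what allows architecture and optimizer to be described in one language.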

As a proof of concept, a self-modifying variant of the Titans architecture, called “Hope,” was designed using nested learning principles. Hope leverages unbounded levels of in-context learning and CMS blocks to scale to larger context windows, essentially optimizing its own memory through a self-referential process. Experiments show that Hope achieves lower perplexity and higher accuracy on language modeling and common-sense reasoning tasks than state-of-the-art recurrent models and standard transformers.

Implementing nested learning: deep optimizers and continuum memory

Nested learning proposes a new machine learning paradigm that views models as interconnected multilevel optimization problems. This approach bridges the gap between a model’s architecture and its training algorithm, recognizing them as different “levels” of optimization. By recognizing this inherent structure, nested learning aims to create more capable AI with greater computational depth and ultimately address problems such as catastrophic forgetting, where learning new information compromises the retention of old knowledge.

The concept of a “continuum memory system” (CMS) is central to nested learning. In a traditional transformer, attention over the current sequence acts as short-term memory while the feedforward networks act as long-term memory. CMS extends this division by envisioning memory as a spectrum of modules, each updated at its own specific frequency. Different modules can therefore prioritize and retain information at different timescales, producing a richer and more effective memory system for continual learning and improving overall learning capacity.
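A hedged sketch of the continuum-memory idea (the class name, periods, and averaging rule are illustrative assumptions, not the paper's design): a bank of memory modules, each folding in new information at its own period, so the bank as a whole retains information across several timescales:

```python
import numpy as np

class MemoryBank:
    def __init__(self, dim, periods):
        self.periods = periods                       # update period per module
        self.slots = [np.zeros(dim) for _ in periods]
        self.history = []

    def step(self, x, t):
        self.history.append(x)
        for i, period in enumerate(self.periods):
            if t % period == 0:
                # fold the most recent chunk into this module
                chunk = np.mean(self.history[-period:], axis=0)
                self.slots[i] = 0.5 * self.slots[i] + 0.5 * chunk

# periods form a spectrum: per-token, short-range, and long-range memory
bank = MemoryBank(dim=3, periods=[1, 4, 16])
for t in range(1, 33):
    bank.step(np.full(3, float(t)), t)
# the fast slot tracks recent inputs; the slow slot lags, summarizing history
```

With inputs that grow over time, the fast slot ends up near the latest values while the slowest slot still reflects the older part of the stream, which is the multi-timescale retention the section describes.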

As a proof of concept, the researchers developed a self-modifying architecture called “Hope” based on the principles of nested learning. A variant of the Titans architecture, Hope leverages unbounded levels of in-context learning and incorporates CMS blocks to handle larger context windows. Experiments validate the new paradigm by showing that Hope achieves lower perplexity and higher accuracy on language modeling and common-sense reasoning tasks than state-of-the-art recurrent models and standard transformers.

Hope: Self-modifying architecture and experimental results

Hope is a self-modifying recurrent architecture designed as a proof of concept for the nested learning paradigm. Built as a variant of the Titans architecture, it leverages unbounded levels of in-context learning and incorporates continuum memory system (CMS) blocks to handle larger context windows. Unlike the standard Titans architecture, which has only two levels of parameter updates, Hope optimizes its own memory through a self-referential process, effectively creating an architecture with looped, unbounded learning levels.
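Hope's self-referential optimization is far richer than anything a few lines can show, but the flavor of "learning how it learns" can be sketched with a toy hypergradient rule (a hypothetical example, not the paper's method) in which the learning rate of the inner update is itself adapted by an outer update:

```python
# inner objective: loss(w) = (w - 3.0)**2, whose gradient is:
def grad(w):
    return 2.0 * (w - 3.0)

w, lr, meta_lr = 0.0, 0.01, 0.001
prev_g = 0.0
for _ in range(500):
    g = grad(w)
    # outer level: adapt the learning rate from successive gradients
    # (hypergradient-style rule); inner level: the usual weight update.
    lr = max(lr + meta_lr * g * prev_g, 1e-4)
    w -= lr * g
    prev_g = g
# the inner level converges toward w = 3 while the outer level raised lr
```

The two coupled updates, one modifying the parameters and one modifying the update rule itself, are a minimal instance of the nested, self-referential loop the section attributes to Hope.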

Experiments were conducted to verify the performance of nested learning, CMS, and Hope across language modeling, long-context reasoning, continual learning, and knowledge incorporation. The results, detailed in the researchers’ paper, show that Hope achieves lower perplexity and higher accuracy than state-of-the-art recurrent models and standard transformers on a variety of language modeling and common-sense reasoning tasks.

A core innovation behind Hope is its “continuum memory system” (CMS), which views memory as a spectrum of modules, each updating at its own specific frequency. This goes beyond the short-term/long-term division of the standard transformer, which splits memory between sequence modeling (attention) and feedforward networks, and allows for a richer and more effective memory system for continual learning.

When it comes to continuous learning and self-improvement, the human brain is the gold standard. It adapts through neuroplasticity, the remarkable ability to change its structure in response to new experiences, memories, and learning.


