The past decade has seen tremendous advances in machine learning (ML), driven largely by powerful neural network architectures and the algorithms used to train them. Yet despite the success of large language models (LLMs), fundamental challenges remain, particularly around continual learning: the ability of a model to actively acquire new knowledge and skills over time without forgetting old ones.
When it comes to continual learning and self-improvement, the human brain is the gold standard. It adapts through neuroplasticity, the remarkable capacity to change its structure in response to new experiences, memories, and learning. Without this ability, a person is limited to their immediate context, as in anterograde amnesia. Current LLMs face an analogous limitation: their knowledge is confined to either the immediate context of the input window or the static information learned during pre-training.
The simple approach of continually updating model parameters with new data often results in “catastrophic forgetting” (CF), where learning new tasks comes at the expense of proficiency in old tasks. Researchers have traditionally combated CF through architectural adjustments and improved optimization rules. But for too long, we have treated model architecture (network structure) and optimization algorithms (training rules) as two separate things, preventing us from achieving truly integrated and efficient learning systems.
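Catastrophic forgetting is easy to reproduce even in the simplest setting. The following toy sketch (our illustration, not from the paper) trains one linear model with SGD on two regression tasks in sequence; after training on task B, the model's error on task A climbs back up:

```python
import numpy as np

# Toy illustration of catastrophic forgetting: a single linear model
# trained sequentially on two regression tasks with plain SGD. All
# names and numbers here are ours, chosen only for the demo.
rng = np.random.default_rng(1)
dim = 8
w_a = rng.standard_normal(dim)  # ground-truth weights for task A
w_b = rng.standard_normal(dim)  # ground-truth weights for task B

def sgd_on_task(w, w_true, steps=500, lr=0.05):
    """Online SGD on squared error for one task; mutates and returns w."""
    for _ in range(steps):
        x = rng.standard_normal(dim)
        w -= lr * (w @ x - w_true @ x) * x  # gradient of 0.5*(w.x - y)^2
    return w

def task_error(w, w_true):
    return np.linalg.norm(w - w_true)

w = np.zeros(dim)                  # shared parameters for both tasks
w = sgd_on_task(w, w_a)
err_a_before = task_error(w, w_a)  # small: task A was just learned
w = sgd_on_task(w, w_b)
err_a_after = task_error(w, w_a)   # large: learning B overwrote A

print(f"task-A error after A: {err_a_before:.3f}, after B: {err_a_after:.3f}")
```

Because both tasks share the same parameters and nothing protects the old solution, every step toward task B is a step away from task A.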
Our paper, “Nested Learning: The Illusion of Deep Learning Architectures,” presented at NeurIPS 2025, introduces nested learning to bridge this gap. Nested learning treats a single ML model not as one continuous process, but as a system of interconnected, multi-level learning problems that are optimized simultaneously. We argue that a model's architecture and the rules used to train it (i.e., the optimization algorithm) are fundamentally the same concept: they differ only in their “level” of optimization, and each level has its own internal flow of information (“context flow”) and its own update rate. Recognizing this inherent structure gives nested learning a new dimension for designing more capable AI, allowing us to build learning components with greater computational depth and, ultimately, helping to address problems such as catastrophic forgetting.
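To make the "levels with their own update rates" idea concrete, here is a minimal sketch (our own simplification, not the paper's construction): each level is just parameters plus an update rule plus a clock, and a "fast" level that updates every step behaves like in-context adaptation while a "slow" level that updates rarely behaves like (pre-)training:

```python
import numpy as np

# Hypothetical sketch of nested learning levels. A "level" is nothing
# but an optimization problem with its own update frequency; the names
# (`make_level`, `maybe_update`) and numbers are ours, for illustration.
rng = np.random.default_rng(0)

def make_level(dim, lr, every):
    """A learning level = parameters + an update rule + an update rate."""
    return {"W": np.zeros((dim, dim)), "lr": lr, "every": every}

def maybe_update(level, x, target, step):
    """One gradient step on squared error, taken only on this level's clock."""
    if step % level["every"] == 0:
        pred = level["W"] @ x
        grad = np.outer(pred - target, x)  # d/dW of 0.5*||W x - target||^2
        level["W"] -= level["lr"] * grad

dim = 4
fast = make_level(dim, lr=0.1, every=1)   # inner level: updates every step
slow = make_level(dim, lr=0.05, every=8)  # outer level: updates rarely

# A toy objective both levels share: learn to echo the input (target = x).
for step in range(64):
    x = rng.standard_normal(dim)
    for level in (fast, slow):
        maybe_update(level, x, target=x, step=step)

def err(level):
    """Distance from the ideal solution (the identity map)."""
    return np.linalg.norm(level["W"] - np.eye(dim))

print(f"fast-level error: {err(fast):.3f}, slow-level error: {err(slow):.3f}")
```

The two levels run the same update rule on the same data stream; only their clocks differ, yet the fast level ends up far closer to the solution. This is the sense in which "architecture" (fast, per-token adaptation) and "optimizer" (slow parameter training) can be read as the same mechanism at different levels.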
We test and validate nested learning through a proof-of-concept, self-modifying architecture we call “Hope.” Hope achieves superior performance on language modeling and demonstrates better long-context memory management than existing state-of-the-art models.
