The constraints inherent in the Large Language Model (LLM) context window (the finite memory that determines how much input an AI can process at once) have long been considered a fundamental bottleneck for truly long-context tasks. That bottleneck has now been decisively broken. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have introduced recursive language models (RLMs), a new inference strategy that extends the effective input length to more than 10 million tokens, delivering performance improvements of up to two orders of magnitude over current frontier models.
The paper, written by Alex L. Zhang, Tim Kraska, and Omar Khattab, addresses the critical issue of “context rot,” a phenomenon in which LLM performance degrades rapidly as the context grows. The decline is especially pronounced in complex tasks that require deep reasoning and comparisons across disparate parts of a very large input, such as analyzing large codebases or detailed research documents. Traditional attempts to manage long contexts often rely on “context compression or compaction,” in which the input is repeatedly summarized whenever it exceeds a length threshold. This approach is inherently lossy and sacrifices important details for brevity, leading to catastrophic failures on multi-hop reasoning tasks, as demonstrated by the sharp drop in performance that base models such as GPT-5 exhibit as token length increases.
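To make that failure mode concrete, the sketch below shows the kind of compaction loop such baselines use. It is illustrative only, not the authors’ code: `llm` stands in for any hypothetical text-in/text-out model call, and the token threshold and whitespace tokenizer are arbitrary placeholders.

```python
from typing import Callable

MAX_TOKENS = 100_000  # assumed threshold, not taken from the paper

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def compacting_agent(llm: Callable[[str], str], chunks: list[str], question: str) -> str:
    context = ""
    for chunk in chunks:
        context += "\n" + chunk
        if count_tokens(context) > MAX_TOKENS:
            # Lossy step: earlier detail is irreversibly summarized away.
            context = llm("Summarize the following, keeping key facts:\n" + context)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")
```

Every time the summarization step fires, details that a later question might depend on are discarded for good, which is exactly the lossiness behind the multi-hop failures described above.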
The core insight behind RLMs is a paradigm shift in how input prompts are handled. Rather than feeding the entire bulk of the input directly to the neural network (a resource-intensive and often wasteful approach), the input is “instead treated as part of the environment that the LLM can symbolically interact with.” An RLM loads the long prompt as a variable in a Python Read-Eval-Print Loop (REPL) environment. The LLM therefore never has to hold the entire text in its context window at once; instead, it writes and runs code to recursively query that external variable for the relevant snippets. This allows the model to “peek into programmatic snippets of variables, decompose them, and call themselves recursively.” The strategy fundamentally bypasses the physical limitations of the transformer architecture and turns the core model into an intelligent search-and-reasoning engine capable of detailed, iterative analysis over inputs of arbitrary length.
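The idea can be made concrete with a short sketch. In the actual system the model itself writes and executes code like this against the prompt variable held in the REPL; here, one possible decompose-and-recurse strategy is hard-coded for illustration, and `llm` is a hypothetical text-in/text-out callable.

```python
from typing import Callable

def recursive_lm(llm: Callable[[str], str], document: str, question: str,
                 depth: int = 0) -> str:
    # Base case: the piece is small enough to read directly.
    if len(document) < 20_000 or depth >= 2:
        return llm(f"{document}\n\nQuestion: {question}")

    # Peek at the variable instead of loading it all into context.
    preview = document[:2_000]

    # Filter paragraphs that mention terms from the question (one naive
    # stand-in for the code the model might write itself).
    keywords = [w.lower() for w in question.split() if len(w) > 3]
    relevant = [p for p in document.split("\n\n")
                if any(k in p.lower() for k in keywords)]

    # Recurse over the selected pieces, then synthesize the partial answers.
    partials = [recursive_lm(llm, piece, question, depth + 1)
                for piece in relevant[:10]]
    return llm(f"Document preview:\n{preview}\n\n"
               "Partial findings:\n" + "\n".join(partials) +
               f"\n\nAnswer the question: {question}")
```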
The empirical results are striking and convincing, especially when RLMs are evaluated on benchmarks designed to test complex long-context processing. Across four diverse tasks, covering deep research, information aggregation, code repository understanding (CodeQA), and synthetic pairwise reasoning (OOLONG), RLMs “showed very strong performance even at scales of 10 million+ tokens, dramatically outperforming all other approaches in long context handling.” On the demanding BrowseComp+ (1K document) task, an RLM powered by GPT-5 achieved 91.33% accuracy, far surpassing the base model and the summarization agent, both of which often failed catastrophically when processing inputs of 6 to 11 million tokens.
Importantly, this performance does not come at a prohibitive cost. The study reports that “RLM is up to three times cheaper while maintaining strong performance across all tasks because the model can selectively display context.” Because the LLM queries only the relevant chunks of the external context variable rather than processing the entire input sequence for every generated token, inference costs are comparable to or lower than a base model call in most cases. This combination of operational efficiency and superior performance represents a major advance for organizations working with large unstructured datasets, from forensic discovery and defense intelligence analysis to large-scale code repository management.
This breakthrough highlights a deeper trend in AI development: the growing importance of scaffolding and inference strategies built around foundation models. RLMs are inherently model-agnostic, meaning the technique can be applied to existing LLMs, whether closed (GPT-5) or open source (Qwen3-Coder). The researchers note that the recursive sub-call capability offers “strong advantages for information-dense inputs,” essentially leveraging the LLM’s inherent reasoning abilities to efficiently navigate and synthesize large amounts of data. The limitation was never strictly the intelligence of the model, but the constrained mechanism by which data was delivered to it. By offloading context to an external, searchable environment, the system circumvents the physical context-window limit entirely and shows that strategic software design can yield performance gains previously thought achievable only through brute-force scaling of model parameters.
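A hypothetical sketch of that model independence: the recursive scaffold is just a thin wrapper around any completion function. It reuses the `recursive_lm` helper sketched earlier, and the client functions named in the usage comments are placeholders, not real SDK calls.

```python
from typing import Callable

LLM = Callable[[str], str]  # any text-in / text-out model

def make_rlm(llm: LLM) -> LLM:
    def rlm(prompt: str) -> str:
        # Convention assumed here: question and document separated by "---".
        question, _, document = prompt.partition("\n---\n")
        return recursive_lm(llm, document, question)
    return rlm

# Usage with a closed or an open-source model (placeholder clients):
# rlm_gpt5 = make_rlm(call_gpt5)         # e.g., wraps a GPT-5 API client
# rlm_qwen = make_rlm(call_qwen3_coder)  # e.g., wraps a local Qwen3-Coder server
```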
The combination of a REPL environment for handling long inputs and recursive sub-calls is essential for tasks that require semantic understanding and aggregation across large numbers of data points. For example, on tasks like OOLONG that require reasoning across semantically distinct chunks of input, the recursive approach lets the model build complex logical structures step by step, something traditional summarization and compression methods cannot achieve without significant information loss. The ability to dynamically query and reason over the full, uncompressed context, rather than relying on a static summary, transforms the LLM from a powerful predictor into a true long-context reasoning agent.
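As a rough illustration of that step-by-step aggregation, and again assuming a hypothetical `llm` callable and a pre-split, non-empty list of chunks, partial results can be merged pairwise so that no chunk is ever reduced to a single lossy global summary.

```python
from typing import Callable

def aggregate(llm: Callable[[str], str], chunks: list[str], question: str) -> str:
    # Map: reason over each chunk in full, with the question in view.
    partials = [llm(f"{c}\n\nExtract everything relevant to: {question}")
                for c in chunks]

    # Reduce: merge findings pairwise, keeping intermediate detail explicit
    # instead of collapsing everything into one static summary.
    while len(partials) > 1:
        merged = []
        for i in range(0, len(partials), 2):
            pair = "\n---\n".join(partials[i:i + 2])
            merged.append(llm("Combine these findings without dropping details:\n" + pair))
        partials = merged
    return llm(f"Findings:\n{partials[0]}\n\nQuestion: {question}")
```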
