New memory structure allows AI models to think longer and faster while using less power


Researchers at the University of Edinburgh and NVIDIA have introduced a new method that lets large language models reason more deeply without increasing their size or energy usage. The research, presented at the NeurIPS artificial intelligence conference, addresses core technical barriers that limit how much today's AI systems can “think” when solving complex problems in math, science, and coding.

The focus of the research is a memory structure known as a key-value cache (often shortened to KV cache). This cache stores intermediate results each time the model generates a new word or token. The cache grows with every added inference step, and if it grows too large, performance degrades rapidly. Even powerful GPUs struggle when they need to repeatedly retrieve large amounts of stored data during inference, the stage in which a model responds to prompts.

Rather than forcing models to shorten their reasoning or relying on coarse-grained memory reduction, the research team has developed a new approach that allows models to manage memory more intelligently. Their technique, called Dynamic Memory Sparsification (DMS), allows models to compress memory by up to 8x while maintaining or increasing accuracy.
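To give a sense of scale, the rough sketch below estimates how large a KV cache can get and what an 8x reduction would mean. The model dimensions (32 layers, 8 key-value heads, head size 128, 16-bit values) are illustrative assumptions, not figures from the research, and the function name is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in bytes.

    Each token stores one key vector and one value vector per layer and
    per KV head. All default dimensions are hypothetical.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


for tokens in (1_000, 8_000, 32_000):
    full = kv_cache_bytes(tokens)
    compressed = full / 8  # best-case 8x compression reported in the article
    print(f"{tokens:>6} tokens: {full / 2**20:8.1f} MiB "
          f"-> {compressed / 2**20:7.1f} MiB at 8x")
```

Because the cache scales linearly with both sequence length and batch size, the same saving can be spent either on longer reasoning chains or on serving more requests at once.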

Average absolute gain of DMS over the original LLM during inference-time scaling on reasoning tasks, for the same number of KV cache memory reads (a proxy for latency). (Credit: arXiv)

Dr. Edoardo Ponti, a GAIL Fellow and Lecturer in Natural Language Processing at the School of Informatics at the University of Edinburgh, said the technique changes what a model can do within the same time limit. “In a nutshell, our model can reason faster with the same quality. Therefore, for the same amount of time spent on inference, the model can explore more and longer inference threads. This improves our ability to solve complex problems in mathematics, science, and coding,” he said.

Why memory slows down modern AI inference

Scaling inference time has become a major focus of AI research. When a model faces a difficult problem, it often solves the problem step-by-step, producing long chains of intermediate inferences. Each step adds a new entry to the KV cache. As these entries accumulate, the model spends more time retrieving stored data than generating new insights.
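A toy calculation illustrates why accumulated cache entries dominate the cost. The sketch assumes that every generation step reads the whole cache and that compression shrinks the cache uniformly by a fixed ratio. Both are simplifications made here for illustration, not claims from the research, and the function names are hypothetical.

```python
def total_kv_reads(prompt_len, new_tokens, compression=1.0):
    """Total cache entries read while generating `new_tokens` tokens.

    Simplifying assumptions: each step reads the full cache, and the cache
    holds 1/compression of the tokens seen so far.
    """
    return sum((prompt_len + t) / compression
               for t in range(1, new_tokens + 1))


def max_tokens_within(budget, prompt_len, compression):
    """Longest chain that fits within a fixed budget of cache reads."""
    used, n = 0.0, 0
    while True:
        cost = (prompt_len + n + 1) / compression
        if used + cost > budget:
            return n
        used += cost
        n += 1


budget = total_kv_reads(prompt_len=100, new_tokens=1000)
print("read budget of an uncompressed 1000-token chain:", budget)
print("chain length at 8x compression within that budget:",
      max_tokens_within(budget, prompt_len=100, compression=8))
```

Under these assumptions the read cost grows roughly with the square of the chain length, so a fixed budget buys a noticeably longer chain once the cache is compressed.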

Previous attempts to address this bottleneck have focused on aggressive memory trimming. Some methods removed tokens based on fixed rules, which reduced memory usage but often compromised accuracy. Other approaches have learned which parts of memory to keep, but they require lengthy retraining runs, which can be expensive and impractical.

The Edinburgh and NVIDIA team aimed to strike a balance. They wanted a system that could learn what to keep, avoid sudden information loss, and be added to existing models without extensive retraining.

During each inference step (left), the incoming key-value pair (k_t, v_t) may be selected for later eviction based on the predicted binary decision α_bin ∈ {0, 1} (only the sequence of keys is shown for clarity). As soon as the pair falls out of the sliding window, the eviction takes place. During training (right), this behavior is induced by an additional attention mask. Eviction decisions are relaxed from binary to continuous α ∈ [0, 1]. (Credit: arXiv)

How dynamic memory sparsification works

DMS allows the model to determine which tokens are no longer required, but delays their removal. Rather than removing information the moment it is marked for eviction, the model leaves it visible for a short time window. This delay gives the model a chance to transfer useful details to other retained tokens.
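The delayed eviction described above can be sketched as a small simulation. This is a minimal illustration, not the researchers' implementation: the function name, the list-based cache, and the exact rule for when a token leaves the window (age of at least `window` steps) are all assumptions made here.

```python
def run_delayed_eviction(evict_flags, window=16):
    """Simulate delayed eviction over a stream of tokens.

    evict_flags[t] is the binary decision taken when token t arrives
    (1 = marked for eviction). A marked token stays readable until it is
    `window` steps old and is dropped only then. Returns the cache
    contents (token positions) after every step.
    """
    cache = []    # positions currently held in the cache
    history = []
    for t, _ in enumerate(evict_flags):
        cache.append(t)
        # Drop tokens that were marked and have now left the sliding window.
        cache = [p for p in cache
                 if not (evict_flags[p] and t - p >= window)]
        history.append(list(cache))
    return history


flags = [0, 1, 1, 0, 1, 0, 0, 0]
delayed = run_delayed_eviction(flags, window=3)
immediate = run_delayed_eviction(flags, window=0)  # drop as soon as marked
print("delayed, after step 3:  ", delayed[3])
print("immediate, after step 3:", immediate[3])
print("final cache (delayed):  ", delayed[-1])
```

Both variants end with the same set of retained tokens. The difference is that with a window, marked tokens remain readable for a few more steps, which is the period in which the model can move their useful content elsewhere.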

The system relies on a learned mask that gradually reduces the influence of tokens as the sequence grows. Testing showed that immediate deletion rapidly degrades the quality of inference. In contrast, delayed eviction maintained stable performance even when memory was compressed to one-quarter or one-eighth of its original size.
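One plausible way to picture the training-time mask is as an extra additive term on the attention scores. In the sketch below, each token j has a continuous eviction score alpha between 0 and 1, and once a query is at least `window` positions past j, log(1 - alpha) is added to the score for j. This particular formula and the function names are assumptions made for illustration. The paper's exact formulation may differ, and the mask would be applied on top of the usual causal mask.

```python
import math


def soft_eviction_mask(alphas, window):
    """Additive attention mask from continuous eviction scores.

    mask[i][j] is added to the score of query i attending to key j.
    Keys still inside the window are untouched. Older keys are damped by
    log(1 - alpha_j): 0 for alpha = 0, -inf for alpha = 1.
    Assumes window >= 1 so a token can always attend to itself.
    """
    n = len(alphas)
    mask = [[0.0] * n for _ in range(n)]
    for i in range(n):            # query position
        for j in range(i + 1):    # keys at or before the query
            if i - j >= window:
                a = alphas[j]
                mask[i][j] = -math.inf if a >= 1.0 else math.log1p(-a)
    return mask


def softmax(scores):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


alphas = [0.0, 0.5, 1.0, 0.0]
mask = soft_eviction_mask(alphas, window=2)
# Uniform raw scores for the last query, so only the mask shapes the result.
weights = softmax([0.0 + m for m in mask[3]])
print([round(w, 3) for w in weights])
```

With alpha at 0.5, the token outside the window keeps half the weight it would otherwise receive, which shows how a continuous score can fade a token out gradually instead of deleting it abruptly.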

Another important benefit is efficiency during training. The researchers were able to adapt large models using approximately 1,000 training steps and achieve 8x compression. Previous methods often required tens of thousands of steps and still struggled at high compression levels. This makes DMS practical for a wide range of existing language models.

Better results on demanding benchmarks

To test whether memory compression actually helps inference, the team evaluated DMS on demanding benchmarks that require multi-step thinking. These include AIME 2024 and MATH-500 for math, GPQA Diamond for advanced science questions, and LiveCodeBench for coding.

Model latency (y-axis) at different context lengths (x-axis). Top: Comparing the effects of different model sizes (Qwen-R1 1.5B, 7B, 32B) for the same batch size (32). Bottom: Comparing the effects of different batch sizes (32, 64, 128) for the same model (Qwen3-8B). (Credit: arXiv)

Across these tests, the model using DMS achieved higher accuracy under the same computational constraints. On AIME 2024, a qualifying exam for the USA Mathematical Olympiad, compressed models scored about 12 points higher on average than uncompressed models using the same number of memory reads. On GPQA Diamond, which includes biology, chemistry, and physics questions written by PhD-level experts, scores improved by more than 8 points. LiveCodeBench results showed an improvement of about 10 points.

The team tested multiple model sizes, including 1.5 billion, 7 billion, and 32 billion parameter versions of Qwen, along with a Llama model. In most cases, DMS outperformed both heuristic memory cuts and other learned sparsification techniques.

Faster response and lower energy costs

“Beyond accuracy, our team measured real-world performance. When the context length was short, the compressed and uncompressed models performed similarly. However, as the context became longer, the compressed model avoided the spike in latency. This allowed it to process deeper inference without consuming too much GPU memory,” Dr. Ponti told The Brighter Side of News.

“Throughput was also improved. When running the model at the largest batch size that would fit in memory, the DMS-enabled system processed far more queries per minute. For large-scale deployments, this translates to lower costs and lower power usage per task,” he continued.

“The same method can also be used in a different way. Instead of doing deeper inference on a single query, compressing memory allows the model to respond to more users at once. This reduces energy usage per response, which is an important consideration as AI systems scale,” he concluded.

GSM8K 0-shot scores for Llama 3.2 1B Instruct across different compression variants. Left: Delayed eviction with a 16-token window (default) consistently maintains the model's reasoning ability, while immediate eviction causes rapid degradation. As compression increases, the difference in quality only widens. Right: DMS requires orders of magnitude less training data than DMC (Dynamic Memory Compression, an earlier learned method). This was also observed for the Qwen 2.5 R1 models at parameter scales of 1.5B, 7B, and 32B. (Credit: arXiv)

Reliability beyond complex reasoning

The research team also investigated whether memory compression had a negative impact on everyday tasks. The DMS model closely matched the performance of the original model across general knowledge, conversation, instruction following, and coding benchmarks. There was even a slight performance improvement on some coding tasks.

The study also revealed patterns in how the model uses memory. Early layers tended to preserve more detailed information, while later layers compressed more aggressively. Compression also increased as sequences got longer, suggesting that the model relies less on early tokens once enough context is established.

Practical implications of the research

The findings suggest a clear path to more efficient and capable AI systems. DMS can improve problem solving in fields that rely on complex analysis, such as science, engineering, and software development, by enabling models to make deeper inferences without using additional hardware. This method also supports sustainability goals by reducing energy usage per task.

In practical terms, this approach could benefit AI systems running on devices with limited or slow memory, such as smart home devices and wearable technologies. Data centers may also be able to handle higher workloads without increasing power demands.

Dr. Ponti and his team are continuing their research through the European Research Council-funded AToM-FM project, which focuses on understanding how large-scale AI systems store, forget, and reuse information more efficiently.

The research results are available online on the preprint server arXiv.





