Dramatically cutting the compute cost of LLM reasoning: thinking at linear cost

Machine Learning


Using reinforcement learning to give large language models (LLMs) reasoning abilities is certainly effective, but it comes at a steep cost.

Such models generate a long chain of thought (LongCoT) before answering a question, and increasing the number of "thinking tokens" tends to improve the model's capability. As with any reinforcement learning problem, there is an environment that determines how trajectories are generated.

For reasoning LLMs, this environment is very simple and often overlooked: the state is the prompt concatenated with all previously generated reasoning tokens, and the action is the next token sampled from the policy (i.e., the reasoning LLM).
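To make this concrete, here is a minimal sketch of one such LongCoT rollout, assuming a token-level policy interface; `policy.sample` and `EOS_ID` are illustrative names, not from the paper:

```python
# Minimal sketch of the standard LongCoT thinking environment (illustrative only).
# Assumes `policy.sample(state)` returns one sampled token id and EOS_ID marks the end of thinking.

def longcot_rollout(policy, prompt_tokens, max_think_tokens, EOS_ID):
    state = list(prompt_tokens)          # state = prompt + everything generated so far
    trajectory = []
    for _ in range(max_think_tokens):
        action = policy.sample(state)    # action = next token sampled from the policy
        trajectory.append(action)
        state.append(action)             # the state (and attention context) keeps growing
        if action == EOS_ID:
            break
    return trajectory
```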

Although this design looks elegant, the state is unbounded: it keeps growing as the chain of thought gets longer. For attention-based policies, this means the compute cost of the whole trajectory grows quadratically with thinking length.
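To see where the quadratic term comes from, note that with causal attention the t-th generated token attends over roughly t tokens of context, so a thought of length T costs on the order of

```latex
\sum_{t=1}^{T} O(t) \;=\; O\!\left(\tfrac{T(T+1)}{2}\right) \;=\; O(T^{2})
```

attention operations in total.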

Many methods have been proposed to reduce the computational cost of long thinking in reasoning LLMs, including length regularization, pruning, and training objectives with early stopping.

Recently, a research team from several institutions, including Mila and Microsoft Research, took a different approach and asked a different question: what if the environment never caused a quadratic increase in compute cost in the first place?

They proposed a new paradigm in which the policy reasons over a fixed-size state. They call such a policy a Markovian Thinker.

Paper title: The Markovian Thinker

Paper link: https://arxiv.org/abs/2510.06557v1

Model link: https://huggingface.co/collections/McGill-NLP/the-markovian-thinker-68debd2919c4ae47f50706cd

Code repository: https://github.com/McGill-NLP/the-markovian-thinker

Amirhossein Kazemnejad, one of the study's three co-lead authors, said in a post on X that Delethink's effectiveness comes from rethinking the reinforcement learning thinking environment itself, and that the scale and effectiveness of Markovian thinking suggest reasoning LLMs could be built differently, perhaps with non-quadratic architectures.

The Markovian Thinker

The central idea of the Markovian Thinker is to restructure the reinforcement learning setup so that the effective state the policy reads is bounded, regardless of total thinking length. The direct impact is significant: longer thinking requires only linear compute and constant memory, thereby decoupling how long the model thinks from how much context it must process.

They instantiated this idea as the Delethink paradigm: a reinforcement learning environment that encourages Markovian behavior by organizing the reasoning process into a sequence of fixed-size chunks.

Delethink recasts the thinking environment as a chunked Markov process: generation proceeds in fixed-size chunks, and at each chunk boundary the environment resets the context to a new prompt consisting of the original query plus a short carryover from the previous chunk.

This forces the policy to learn to advance its reasoning across chunks by maintaining a textual state, producing a "Markovian Thinker."
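A rough sketch of such a chunked rollout is below, assuming a chunk size and a short carryover length as the environment's hyperparameters; the helper names are illustrative and this is not the paper's actual implementation:

```python
# Illustrative sketch of a Delethink-style chunked rollout (not the paper's exact code).
# The policy only ever sees: original query + a short carryover + the current chunk.

def delethink_rollout(policy, query_tokens, chunk_size, carryover_len, num_chunks, EOS_ID):
    carryover = []                                     # textual state passed between chunks
    full_trace = []
    for _ in range(num_chunks):
        state = list(query_tokens) + list(carryover)   # context is reset at every chunk boundary
        chunk = []
        for _ in range(chunk_size):
            action = policy.sample(state)
            chunk.append(action)
            state.append(action)
            if action == EOS_ID:
                return full_trace + chunk              # model decided to stop thinking
        full_trace += chunk
        carryover = chunk[-carryover_len:]             # keep only a short tail as the next state
    return full_trace
```

The key property is that the policy's context never exceeds the query length plus the carryover plus one chunk, no matter how many chunks the model thinks for.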

In contrast, the LongCoT environment concatenates tokens without limit, so its state (and model context) continues to grow as the trajectory lengthens.

The pseudocode for Algorithm 1 shows the training process for a single query.

Please refer to the original paper for details. The upshot is that, with this design, both the generation and the backpropagation stages of a policy update scale linearly with thinking length under Delethink, whereas they scale quadratically under LongCoT. The figure below shows how FLOPs, memory, backpropagation time, and generation time change for LongCoT and Delethink as the thinking length grows from n tokens to nS tokens.
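A back-of-the-envelope version of that comparison, assuming attention cost dominates and the thought consists of S chunks of n tokens each:

```latex
\text{LongCoT:}\quad O\big((nS)^2\big) = O(n^2 S^2)\ \text{(quadratic in the number of chunks } S)
\qquad
\text{Delethink:}\quad S \cdot O(n^2) = O(n^2 S)\ \text{(linear in } S)
```

Memory behaves similarly: LongCoT's context grows to nS tokens, while Delethink's stays bounded by roughly one chunk plus the query and carryover.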

Impressive results

In the team's experiments, Delethink's results are striking. Trained with Delethink using 8K-token chunks, a DeepSeek R1-Distill 1.5B model can think up to 24K tokens, and under the same 24K thinking budget its performance on math benchmarks matches or exceeds that of LongCoT-RL.

With more test-time compute, Delethink continues to improve and deliver further gains even after LongCoT-RL's performance has saturated.

They also trained the R1-Distill 1.5B model with Delethink to think up to 96K tokens. With only a modest number of additional training steps, it reached 49% accuracy on AIME'24, with an average solution length of 36K tokens.

The effect of linear compute is substantial. Based on the team's estimates, at an average thinking length of 94K tokens, LongCoT-RL training would cost about 27 H100-months, whereas Delethink needs only about 7 H100-months.

Why is it effective?

To investigate why Delethink training is effective, they also analyzed the model's performance during the reinforcement learning initialization stage.

They observed that the R1-Distill series models (1.5B to 14B) can sample Markov trajectories in a zero-shot manner without any additional training or prompting, and can even recover most of the performance of standard LongCoT.

This strong initialization (i.e., a large number of in-distribution positive samples that match the expected behavior) provides a favorable starting point for reinforcement learning.

They further studied reasoning models with up to 120B parameters in the Delethink environment. For example, GPT-OSS 120B (Agarwal et al., 2025) demonstrates robust Markovian thinking across multiple domains, including PhD-level problems, programming tasks, math competitions, and crossword puzzles.

Taken together, these results demonstrate that Delethink is compatible with and scalable to state-of-the-art models.

Conclusion

The success of Markovian thinking shows that, by decoupling thinking length from context size, next-generation reasoning models could in principle think over millions of tokens. It also highlights that the reinforcement learning environment, often treated as fixed, is in fact a powerful lever for progress.

This also suggests that sequence architectures with non-quadratic complexity may be especially well suited to reasoning models, since the thinking process can be made effectively Markovian.

This article is from the WeChat official account "Almost Human" (ID: mosthuman2014), author: Panda. Republished by 36Kr with permission.


