Researchers are tackling the challenge of optimizing large language models (LLMs) for complex multi-turn reasoning tasks. Shichao Ma, Zhiyuan Ma, and Ming Yang of Ant Group Co., Ltd. Tiansuan Lab, along with Xiaofan Li, Xing Wu, and Jintao Du, show that current reinforcement learning methods suffer from a “double homogenization dilemma”: they fail to recognize the value of individual reasoning steps and struggle to estimate advantages accurately. The team’s Turn-Level Stage Aware Policy Optimization (TSPO) framework addresses this issue by preserving process-level signals and introducing a reward mechanism that widens reward differences within a group, achieving average performance improvements of up to 24% on LLMs such as Qwen2.5-3B and 7B.
This study addresses the “double homogenization dilemma,” a critical limitation of current reinforcement learning (RL) frameworks: intermediate reasoning steps are not properly recognized and rewarded, which prevents effective learning.
The dilemma takes two forms: process homogenization, where diverse reasoning paths receive the same reward, and within-group homogenization, where coarse-grained rewards limit advantage estimation during training. TSPO preserves process-level signals, distinguishes between successful and failed reasoning steps, and increases reward variance within groups without relying on external reward models or manual annotations.
By rewarding the turn in which correct information is first retrieved, TSPO encourages more efficient and accurate information retrieval. Across a variety of question answering datasets, TSPO outperforms existing state-of-the-art baselines, with average performance improvements of 24% and 13.6% on the Qwen2.5-3B and 7B models, respectively.
The improvement comes from addressing two limitations of outcome-level rewards: they are sparse, and they compress the entire reasoning process into a single scalar value, obscuring the quality of intermediate steps. The researchers observed that existing methods which attempt to address these issues through process-level supervision often require costly annotations or rely on proprietary models with limited versatility.
TSPO avoids these drawbacks by using the First-Occurrence Latent Reward (FOLR) mechanism, which allocates rewards based on the first occurrence of the correct answer, thereby preserving process signals and increasing reward variance within the training group. This research paves the way for more effective training of LLMs on complex tasks and could lead to advances in areas such as open-domain question answering and mathematical reasoning.
Turn-level reward allocation to improve reasoning in large language models
The researchers have identified a “double homogenization dilemma” that hinders reinforcement learning (RL) frameworks for search-augmented reasoning in large language models (LLMs). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, which dynamically allocates a partial reward to the specific turn where the ground-truth answer first appears.
This approach ensures that useful intermediate reasoning steps are credited even if the final answer is wrong. The experiments used seven diverse question answering datasets to evaluate the performance of TSPO. The research team implemented TSPO on both the Qwen2.5-3B and Qwen2.5-7B models and compared the results with state-of-the-art baseline methods.
Performance was measured using exact match (EM) as the primary metric, allowing a direct assessment of answer accuracy. TSPO outperformed existing methods, achieving an average performance improvement of 24% for the Qwen2.5-3B model and 13.6% for the Qwen2.5-7B model. The double homogenization dilemma it targets is a problem in which process-level signals are lost during training and reward variance collapses within a group of trajectories.
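The exact match metric mentioned above can be sketched as follows. The article does not say how answers are normalized before comparison, so the normalization here (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common question answering practice, and the function names are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: the article does not specify its normalization rules.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the prediction equals any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))


def average_em(predictions: list[str], gold_answer_lists: list[list[str]]) -> float:
    """Mean exact match over a dataset; 0.0 for an empty dataset."""
    scores = [exact_match(p, g) for p, g in zip(predictions, gold_answer_lists)]
    return sum(scores) / len(scores) if scores else 0.0
```

Because EM is all-or-nothing per question, a reward built directly on it carries no information about how close a wrong trajectory came to the answer.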
Experiments show that current methods often treat trajectories with successful and unsuccessful information retrieval identically, hindering effective learning. The team measured the impact of this dilemma by analyzing trajectories along two axes, outcome accuracy and process integrity, and identifying four categories: complete failure, near miss, complete success, and success without retrieval.
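The four-way split can be illustrated with a small sketch. It reads "process integrity" as whether the retrieved evidence contained the correct answer, and reads the fourth category as a correct final answer whose evidence never contained it; both readings, along with every name below, are assumptions rather than the authors' definitions.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    final_answer_correct: bool   # outcome accuracy
    retrieved_gold_answer: bool  # process integrity (assumed reading)


def categorize(traj: Trajectory) -> str:
    """Map a trajectory onto one of the four categories."""
    if traj.final_answer_correct:
        if traj.retrieved_gold_answer:
            return "complete_success"
        return "success_without_retrieval"
    if traj.retrieved_gold_answer:
        return "near_miss"
    return "complete_failure"


def outcome_reward(traj: Trajectory) -> float:
    """Outcome-only reward: looks at the final answer and nothing else."""
    return 1.0 if traj.final_answer_correct else 0.0


near_miss = Trajectory(final_answer_correct=False, retrieved_gold_answer=True)
failure = Trajectory(final_answer_correct=False, retrieved_gold_answer=False)
# Different categories, identical outcome-only reward.
print(categorize(near_miss), outcome_reward(near_miss))
print(categorize(failure), outcome_reward(failure))
```

The two printed trajectories fall into different categories yet receive the same reward, which is the process-level homogenization the article describes.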
The data show that there is no success without retrieval, confirming that successful retrieval is essential for synthesizing a correct answer. The study also found that near-miss attempts and complete failures receive the same zero reward, demonstrating process-level reward homogenization. To address this, TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, which assigns a partial reward to the turn in which the ground-truth answer first appears.
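A minimal sketch of first-occurrence reward allocation follows. The article does not give the size of the partial reward, how an occurrence is detected, or how the partial reward combines with the outcome reward. The substring check, the `partial_reward=0.5` default, and the choice to place the outcome reward on the last turn are all illustrative assumptions, and the function names are hypothetical.

```python
from typing import Optional


def first_occurrence_turn(
    evidence_per_turn: list[str], gold_answers: list[str]
) -> Optional[int]:
    """Index of the first turn whose retrieved evidence mentions a gold answer.

    Assumption: an occurrence is a case-insensitive substring match.
    """
    golds = [g.strip().lower() for g in gold_answers if g.strip()]
    for turn_index, evidence in enumerate(evidence_per_turn):
        lowered = evidence.lower()
        if any(gold in lowered for gold in golds):
            return turn_index
    return None


def turn_level_rewards(
    evidence_per_turn: list[str],
    gold_answers: list[str],
    final_correct: bool,
    partial_reward: float = 0.5,  # hypothetical value, not from the article
) -> list[float]:
    """One reward per turn: partial credit at first occurrence, outcome at the end."""
    rewards = [0.0] * len(evidence_per_turn)
    first = first_occurrence_turn(evidence_per_turn, gold_answers)
    if first is not None:
        rewards[first] += partial_reward
    if final_correct and rewards:
        rewards[-1] += 1.0  # assumed placement of the outcome reward
    return rewards


evidence = [
    "Berlin is the capital of Germany.",
    "Paris is the capital of France.",
    "France borders Spain.",
]
# A near miss: the answer was retrieved at turn 1 but the final answer was wrong.
print(turn_level_rewards(evidence, ["Paris"], final_correct=False))
```

Under these assumptions a near miss earns partial credit at the turn where the answer surfaced, while a complete failure still earns nothing, so the two are no longer indistinguishable.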
This preserves process-level signals, increases reward variance within the group, and prevents advantages from vanishing during training. These improvements are achieved without external reward models or additional human annotations. Experiments across seven diverse question answering datasets show that TSPO outperforms state-of-the-art baselines.
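Why low within-group variance makes advantages vanish can be seen in a small sketch. It assumes a group-relative advantage of the kind used by GRPO-style methods, where each trajectory's reward is normalized by the group mean and standard deviation; the article does not spell out TSPO's exact advantage formula, so this is an illustration of the failure mode rather than the authors' estimator.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the group's mean and standard deviation.

    Assumption: GRPO-style normalization; `eps` guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Outcome-only rewards: every trajectory in the group failed, so all are 0.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))

# With a hypothetical partial reward for one near miss, the group has variance
# again and the near-miss trajectory receives a positive advantage.
print(group_advantages([0.0, 0.5, 0.0, 0.0]))
```

In the first group every advantage is zero, so the policy gradient carries no signal; in the second, the single partially rewarded trajectory is pushed up relative to the rest.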
The results show an average performance improvement of 24% for the Qwen2.5-3B model and 13.6% for the Qwen2.5-7B model. These measurements indicate that TSPO addresses the double homogenization dilemma and that LLMs can learn more efficiently and achieve higher accuracy in multi-turn reasoning tasks. TSPO assigns a partial reward to the first turn in which the correct answer appears within the retrieved evidence, preserving process-level information and increasing reward variance without the need for additional annotations or reward models.
Experiments across seven question answering benchmarks demonstrate that TSPO consistently outperforms existing baseline methods, achieving average performance improvements of 24% and 13.6% on the Qwen2.5-3B and 7B models, respectively, while maintaining computational efficiency. The authors acknowledge that FOLR has limitations: it relies on accurate retrieval, since it assumes that the correct answer is present in the retrieved data, and it currently targets only search-augmented reasoning tasks.
Future research will focus on adapting TSPO to a wider range of task types and refining the FOLR mechanism for scenarios where retrieval is incomplete or unnecessary. The findings highlight the importance of fine-grained turn-level rewards for training LLMs on complex reasoning tasks, address key challenges in reinforcement learning for search-augmented reasoning systems, and could improve the performance and efficiency of LLM agents.
