Reinforcement learning increasingly underpins the capabilities of large language models, driving the advances in reasoning and coding seen in models such as the GPT-o series, DeepSeek-R1, and Kimi-K1.5. Yuzhen Zhou of Carnegie Mellon University, Jiang Lee and Yusheng Su of Advanced Micro Devices, Inc., and colleagues including Gowtham Ramesh and LMSYS Org's Zilin Zhu and Xiang Long address a key bottleneck hindering that progress. Their research introduces Active Partial Rollouts in Reinforcement Learning (APRIL), which improves efficiency by actively managing rollout requests and recycling incomplete responses. The approach tackles the problem of long-tail responses stalling the rollout phase, keeping GPUs busy during training while preserving accuracy: it raises rollout throughput by up to 44% and final accuracy by up to 8% across a variety of tasks. APRIL is framework- and hardware-agnostic, as shown by its integration with the Slime RL framework and its compatibility with AMD GPUs, and represents a substantial step toward scalable and efficient reinforcement learning.
Reinforcement learning frameworks for large language models
A range of projects and frameworks target reinforcement learning for large language models. DAPO provides an open-source system for reinforcement learning at scale, while Slime is a high-performance post-training framework designed for efficient scaling. LlamaRL presents a distributed, asynchronous framework for large-scale language model training, and another system focuses on optimizing reinforcement learning through a user-friendly scaling library. Further research explores approaches such as StreamRL, which aims to make reinforcement learning for language models scalable and resilient, along with Group Sequence Policy Optimization (GSPO) and reinforcement learning from human feedback (RLHF) training.
Specific optimization techniques have also been developed. SortedRL accelerates training through online length-aware scheduling, and researchers have investigated how scaling test-time compute can be more effective than increasing model parameters. SpecExec provides massively parallel speculative decoding for interactive language model inference on consumer devices, while other work focuses on learning to reason under off-policy guidance and on leveraging off-policy data in reinforcement learning training. Infrastructure and system-level optimization are equally important, with advances including Orca, a distributed serving system for Transformer-based models, and fully sharded data parallel (FSDP) training.
SGLang provides efficient execution of structured language model programs, contributing to system-wide performance, and Transformer Reinforcement Learning (TRL) is another related project. Researchers also publish detailed technical reports on model architectures, such as the one for the Qwen3 models. Foundational work remains relevant as well, including simple statistical gradient-following algorithms for connectionist reinforcement learning and investigations into the secrets of reinforcement learning from human feedback in large language models.
Traditional methods often suffer from low GPU utilization: rollout generation takes a long time, and shorter sequences that finish early must wait for the longest one in the batch to complete. To overcome this, the team designed a system that intentionally over-provisions rollout requests to the inference engine, beyond the standard batch size. Once the target number of completed rollouts is reached, the system actively terminates the remaining unfinished sequences. Their partial outputs are kept and resumed in later steps rather than discarded, which prevents wasted computation and minimizes idle GPU time.
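The over-provision, terminate, and resume cycle described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation and not the Slime or SGLang API: the names `Rollout`, `rollout_step`, and `simulate` are hypothetical, each response has a fixed toy length standing in for the point where the model would stop, and every decode iteration is charged the same cost no matter how many requests are in flight (which flatters over-provisioning compared with real hardware).

```python
from dataclasses import dataclass
from itertools import count, cycle


@dataclass
class Rollout:
    prompt_id: int
    target_len: int   # toy stand-in for where the response would end
    tokens: int = 0   # tokens generated so far; kept when the rollout is aborted

    @property
    def done(self) -> bool:
        return self.tokens >= self.target_len


def rollout_step(buffer, lengths, ids, batch_size, over_factor):
    """Run one rollout phase; return (finished, leftover, decode_iterations)."""
    # Resume buffered rollouts first, then top up with fresh prompts until
    # over_factor * batch_size requests are in flight (over-provisioning).
    active = list(buffer)
    while len(active) < over_factor * batch_size:
        active.append(Rollout(next(ids), next(lengths)))

    iterations = 0
    # Stop as soon as batch_size rollouts are complete; the rest are aborted.
    while sum(r.done for r in active) < batch_size:
        for r in active:
            if not r.done:
                r.tokens += 1  # one decode iteration = one token per live request
        iterations += 1

    # Hand exactly batch_size finished rollouts to the trainer. Aborted partials
    # and any surplus finished rollouts are carried into the next step.
    finished = [r for r in active if r.done][:batch_size]
    taken = {r.prompt_id for r in finished}
    leftover = [r for r in active if r.prompt_id not in taken]
    return finished, leftover, iterations


def simulate(steps, batch_size, over_factor, pattern):
    """Return (total decode iterations, final buffer) over several steps."""
    lengths, ids, buffer, total = cycle(pattern), count(), [], 0
    for _ in range(steps):
        finished, buffer, used = rollout_step(
            buffer, lengths, ids, batch_size, over_factor
        )
        assert len(finished) == batch_size
        total += used
    return total, buffer


if __name__ == "__main__":
    pattern = [2, 2, 2, 10]  # three short responses and one long-tail response
    print("synchronous baseline:", simulate(3, 4, 1, pattern)[0])
    print("over-provisioned 2x: ", simulate(3, 4, 2, pattern)[0])
```

With `over_factor=1` the function reduces to the synchronous baseline, where every step waits for its longest response. With `over_factor=2` each step ends as soon as four rollouts are complete, and the long responses keep accumulating tokens across steps in the buffer. In this toy setting the long rollouts are repeatedly deferred, so a real system would also need to decide how to prioritize resumed requests; that policy is not modeled here.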
By recycling these partial results, APRIL systematically reduces the long-tail effect caused by varying rollout lengths, significantly cutting GPU idle time and increasing overall training efficiency. Experiments show that APRIL improves throughput by at least 20% with commonly used reinforcement learning algorithms, including GRPO, DAPO, and GSPO, and across a variety of large language models. The team also tested APRIL rigorously and found that it not only accelerates rollout generation but also converges faster, raising final accuracy by around 2% to 5% across tasks. The system is designed for wide compatibility: it is integrated into the Slime reinforcement learning framework and can be deployed on both NVIDIA and AMD GPUs.
In summary, the researchers addressed the substantial computational cost of rollout generation, which currently dominates reinforcement learning training time, by actively reusing partial rollouts rather than discarding them. By over-provisioning requests and recycling responses, APRIL avoids wasted computation and increases throughput by up to 44% with some commonly used algorithms, while maintaining accuracy and achieving up to 8% higher final accuracy on a variety of tasks. The authors describe the method as a step toward more efficient training pipelines and expect the principles behind APRIL to stimulate further advances in adaptive rollout strategies and in the design of reinforcement learning frameworks that fully utilize modern hardware.
