BroRL scales reinforcement learning with broadened exploration, sustaining gains past thousands of training steps through hundreds of rollouts per example

Machine Learning


Reinforcement learning is increasingly essential to the development of complex artificial intelligence systems, and researchers continue to seek ways to improve its performance and scalability. Jian Hu, Mingjie Liu, Ximing Lu, and colleagues have made a significant advance in this field with BroRL, a new approach that broadens exploration during reinforcement learning. Their work addresses a key limitation of existing methods, in which performance plateaus after prolonged training, by dramatically increasing the number of rollouts examined per example. The strategy is informed by a detailed analysis of how probability mass shifts during learning, ensures continued improvement beyond the saturation points observed in previous methods, and ultimately achieves state-of-the-art results on challenging large language model benchmarks. By exploring a broader range of possibilities more thoroughly, BroRL revives performance after existing methods reach their limits, paving the way for more powerful and capable AI systems.

Diverse rollouts improve reasoning in language models

This study introduces BroRL, a reinforcement learning method that enhances the reasoning ability of large language models. Reinforcement learning trains agents to make decisions within an environment so as to maximize reward. The core principle behind BroRL is that increasing the diversity of experiences during training is more effective than simply extending the training duration. The researchers achieve this diversity by sharply increasing the number of sampled completions, known as rollouts, generated from each prompt during learning. The approach consistently outperforms existing methods, which often reach a performance plateau.
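As a concrete illustration, here is a minimal sketch of broadened rollout sampling for a single prompt, assuming a GRPO-style group-relative advantage; the `generate` and `reward` callables are hypothetical stand-ins for the model's sampler and the task's verifier, not the paper's implementation.

```python
# Minimal sketch of broadened rollout sampling (assumptions noted above).
from statistics import mean, stdev

def broadened_rollouts(prompt, generate, reward, n_rollouts=512):
    """Sample n_rollouts completions for one prompt and score them."""
    completions = [generate(prompt) for _ in range(n_rollouts)]
    rewards = [reward(prompt, c) for c in completions]
    # Group-relative advantage: normalize each reward against the group.
    mu = mean(rewards)
    sigma = stdev(rewards) or 1.0  # guard against a zero-variance group
    advantages = [(r - mu) / sigma for r in rewards]
    return list(zip(completions, advantages))
```

With hundreds of rollouts per prompt rather than a handful, each update is computed from a much broader sample of the model's possible answers.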

As the rollout size increases, the model can consider a wider range of possibilities during each training update. The results consistently show that BroRL outperforms existing methods across a variety of tasks, achieving statistically significant improvements and avoiding performance plateaus. The study demonstrates that larger rollout sizes are important for improving the performance of large language models on complex reasoning tasks. BroRL is also efficient, achieving comparable or better results with less GPU time and a similar number of total generated samples.
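To make the sample-budget point concrete, the following toy arithmetic (with assumed, illustrative numbers rather than the paper's settings) shows how a larger rollout size trades many narrow updates for fewer, broader ones at the same total number of generated samples.

```python
# Illustrative budget arithmetic; the configurations are assumptions.
narrow = {"rollouts_per_prompt": 16, "steps": 8192}
broad = {"rollouts_per_prompt": 512, "steps": 256}
for cfg in (narrow, broad):
    total = cfg["rollouts_per_prompt"] * cfg["steps"]
    print(cfg, "->", total, "total samples")  # both configs: 131072
```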

BroRL extends reinforcement learning with broadened exploration

This study pioneers BroRL, a method that scales reinforcement learning by dramatically increasing the number of rollouts per example. The approach addresses the performance plateaus often encountered when training steps alone are increased. The researchers grounded the method in a detailed analysis of how probability mass shifts between correct and incorrect tokens during reinforcement learning, finding that sampled rollouts consistently contribute to increasing the probability of correct answers. To test this theoretical framework, they developed a token-level simulator that reflects the per-token update analysis.
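The paper's simulator is not reproduced here, but a self-contained toy version in the same spirit can be sketched: a softmax policy over a small vocabulary is updated by REINFORCE with a group-relative baseline from N sampled rollouts, and the total probability mass on the "correct" tokens is tracked. All sizes, rewards, and the learning rate below are illustrative assumptions.

```python
# Toy token-level simulator (a sketch under the assumptions stated above).
import numpy as np

def simulate(n_rollouts, vocab=64, n_correct=4, steps=200, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    logits = np.zeros(vocab)
    correct = np.arange(n_correct)       # first n_correct tokens count as correct
    history = []
    for _ in range(steps):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        history.append(probs[correct].sum())  # track correct-token mass
        samples = rng.choice(vocab, size=n_rollouts, p=probs)
        rewards = np.where(np.isin(samples, correct), 1.0, -1.0)
        adv = rewards - rewards.mean()   # group-relative baseline
        grad = np.zeros(vocab)
        for tok, a in zip(samples, adv):
            grad[tok] += a               # sampled part of d log pi / d logits
            grad -= a * probs            # softmax coupling term
        logits += lr * grad / n_rollouts
    return history
```

Under this toy model, comparing `simulate(16)` with `simulate(512)` illustrates the qualitative claim: the larger rollout size yields a steadier rise in correct-token probability mass.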

Experiments show that increasing the rollout size minimizes the influence of unsampled actions, yielding steadier policy updates and continued accumulation of probability mass on correct tokens. The results show that larger rollout sizes accelerate the growth of the correct probability mass and increase the fraction of correct tokens that improve at each step. Building on this, the team applied BroRL to large language models and achieved state-of-the-art results across diverse benchmarks. The work establishes a principled method for robust, continued improvement in reinforcement learning, turning theoretical guarantees into practical gains on complex reasoning tasks.

BroRL revives reinforcement learning performance gains

The scientists achieved sustained improvements in reinforcement learning performance by systematically increasing the number of rollouts per example, a technique they call BroRL. The study addresses the performance plateau observed when reinforcement learning is scaled through training steps alone, showing that performance can be revived and continually improved beyond established checkpoints. The team's approach rests on a mass-balance analysis indicating that increasing the number of rollouts expands the overall probability mass of correct tokens. Simulations confirm that a sufficiently large rollout size ensures growth in the probability mass of all correct tokens and effectively eliminates knowledge contraction. BroRL achieves state-of-the-art results across diverse benchmarks while remaining both data- and compute-efficient, highlighting its practicality for real deployments.
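Schematically, the decomposition described above can be written as follows; the notation is an assumption reconstructed from this description, not the paper's exact equation.

```latex
% Schematic one-step change in correct-token probability mass, with N rollouts.
\Delta P_{\text{correct}} \;=\;
\underbrace{\Delta_{\text{sampled-correct}} + \Delta_{\text{sampled-incorrect}}}_{\text{nonnegative in expectation, per the analysis above}}
\;+\;
\underbrace{\Delta_{\text{unsampled}}}_{\text{sign indeterminate; vanishes as } N \to \infty}
```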

Rollout size stabilizes language model learning

This work establishes rollout size as a key factor in scaling reinforcement learning for large language models: increasing the number of rollouts sampled per prompt consistently improves performance beyond the limits reached by simply increasing training steps. The researchers found that performance plateaus arise not from fundamental limitations of the learning process but from instability caused by insufficient exploration of possible solutions. Through a mass-balance analysis, they identified an "unsampled term" that contributes to this instability and showed that increasing the rollout size systematically reduces its effect, ensuring a more reliable learning signal. Importantly, the improvement comes with better computational efficiency, shifting the bottleneck from memory to compute. The study provides a path toward continual learning and improved reasoning ability in large language models.
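The intuition behind the unsampled term can be checked in closed form: under a fixed distribution, the expected probability mass on tokens that never appear among N independent samples is the sum of p_i * (1 - p_i)^N over tokens i, which shrinks as N grows. The toy distribution below is an illustrative assumption.

```python
# Closed-form check that unsampled probability mass shrinks with rollout size.
import numpy as np

rng = np.random.default_rng(0)
vocab = 1000
probs = rng.dirichlet(np.ones(vocab) * 0.1)  # a skewed toy distribution

for n in (16, 64, 256, 1024):
    # Expected mass on tokens never drawn in n i.i.d. samples.
    unsampled = np.sum(probs * (1.0 - probs) ** n)
    print(f"N={n:5d}  expected unsampled mass = {unsampled:.3f}")
```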


