Researchers are grappling with a major bottleneck in reinforcement learning (RL): the computational cost of training large language models (LLMs). NVIDIA’s Haocheng Xi, Charlie Ruan, and Peiyuan Liao, along with Yujun Lin, Han Cai, and others, identified a critical flaw in current approaches to accelerating RL with FP8 precision: the numerical mismatch between the training and rollout phases causes instability and reduced accuracy. Their new framework, Jet-RL, addresses this issue by applying FP8 precision uniformly across both training and rollout, dramatically reducing numerical discrepancies and enabling significantly faster and more stable learning. Experiments show up to a 41% training-phase speedup and a 16% end-to-end speedup without sacrificing accuracy.
The team started from a computational inefficiency inherent in traditional RL for LLMs: the rollout phase often consumes over 70% of total training time. Their work presents the first comprehensive study of FP8 RL training and reveals that the commonly used BF16-training + FP8-rollout strategy suffers from severe instability and reduced accuracy, especially for long rollouts and difficult tasks.
The study traces these failures to the numerical mismatch between training and inference, which makes the approach implicitly off-policy: the discrepancy accumulates over long generated sequences. Motivated by these observations, the researchers propose Jet-RL, which employs a unified FP8 precision flow for both training and rollout. This minimizes numerical discrepancies and eliminates the need for inefficient step-to-step calibration. Extensive experiments validate Jet-RL’s effectiveness, demonstrating up to a 33% speedup in the rollout phase, a 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training. Importantly, Jet-RL maintains stable convergence across all settings with negligible accuracy loss, a significant improvement over existing methods.
The innovation lies in establishing a truly on-policy FP8 training paradigm that is robust and adaptable across diverse training configurations. By employing a mixed group-wise and block-wise quantization scheme along with a state-of-the-art FP8 GEMM kernel, Jet-RL delivers significant speedups for end-to-end RL training, paving the way for more efficient and powerful LLMs capable of tackling increasingly complex reasoning tasks. This research lays the foundation for future advances in LLM training and may enable AI systems with enhanced problem-solving abilities and broader applications.
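The paper’s exact quantization granularity and kernels are not reproduced here, but the core idea of block-wise quantization — scaling each block of values into the FP8 E4M3 range by its own absmax — can be sketched in NumPy. This is a minimal simulation only: the block size, the E4M3 rounding helper, and the function names are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

E4M3_MAX = 448.0  # largest normal value representable in FP8 E4M3

def fp8_e4m3_round(x):
    """Round values to the nearest FP8 E4M3 grid point (sketch: normals only,
    subnormal handling omitted for brevity)."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mant, exp = np.frexp(x)            # x = mant * 2**exp, 0.5 <= |mant| < 1
    mant_q = np.round(mant * 16) / 16  # keep 3 mantissa bits (+ implicit lead bit)
    return mant_q * np.exp2(exp)

def blockwise_quantize(w, block=128):
    """Per-block absmax scaling into the FP8 range, then FP8 rounding."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / E4M3_MAX
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero blocks
    q = fp8_e4m3_round(w / scale)             # values now lie on the FP8 grid
    return q, scale

def dequantize(q, scale):
    """Recover approximate original values from quantized blocks."""
    return (q * scale).reshape(-1)
```

Because each block carries its own scale, one outlier only degrades resolution within its own block rather than across the whole tensor — the usual motivation for block-wise over per-tensor quantization.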
FP8 reinforcement learning: stability and optimization
Scientists have identified a significant bottleneck in reinforcement learning (RL) training of large language models (LLMs): the rollout phase, which consumes more than 70% of total training time. To address this, the researchers conducted a comprehensive study of FP8 RL training and stress-tested the common BF16-training + FP8-rollout strategy. Experiments reveal that this approach suffers from training instability and reduced accuracy, especially for long rollouts and complex tasks. The study identifies the numerical discrepancy between training and inference, arising from the method’s effectively off-policy nature, as the root cause.
Motivated by these findings, the team designed Jet-RL, a new FP8 RL training framework built for robust and stable optimization. Crucially, Jet-RL employs a unified FP8 precision flow for both training and rollout, minimizing numerical discrepancies and eliminating the need for inefficient step-to-step calibration. The researchers implemented this by converting all computations (actor updates, policy evaluations, and rollout generation) to FP8, ensuring consistency across the training pipeline. This stands in sharp contrast to existing methods, which keep BF16 precision during training and quantize to FP8 only during rollout.
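The precision mismatch described above can be illustrated with a toy calculation. This is a sketch only: `to_fp8_sim` is a crude stand-in for real FP8 casting, and a 64-element dot product stands in for a full forward pass.

```python
import numpy as np

def to_fp8_sim(v, fp8_max=448.0):
    """Crude FP8 E4M3 stand-in: absmax scale, keep 3 mantissa bits, rescale."""
    scale = np.abs(v).max() / fp8_max
    mant, exp = np.frexp(v / scale)
    return (np.round(mant * 16) / 16) * np.exp2(exp) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=64)   # toy "weights"
x = rng.normal(size=64)   # toy "activations"

# Mixed flow: rollout scores tokens with FP8 operands,
# while the trainer scores them in higher precision.
logit_rollout = float(to_fp8_sim(w) @ to_fp8_sim(x))
logit_trainer = float(w @ x)

# Unified flow: the trainer uses exactly the rollout's quantized operands.
logit_unified = float(to_fp8_sim(w) @ to_fp8_sim(x))

print(logit_rollout - logit_trainer)   # nonzero gap: the policy being updated
                                       # is not the policy that generated data
print(logit_rollout - logit_unified)   # 0.0: training and rollout agree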
Figure 2 shows how the rollout phase dominates latency, accounting for more than 75% of total time once rollouts exceed 8,000 tokens, a bottleneck that Jet-RL effectively mitigates. Figure 3 highlights how the BF16-train + FP8-rollout method fails as rollout length increases, while Jet-RL maintains its performance, demonstrating the effectiveness of the unified FP8 precision flow. This work opens new directions for efficient RL training and enables the development of more powerful, resource-efficient LLMs.
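Why a rollout-phase speedup translates into a meaningful end-to-end gain follows from simple Amdahl-style arithmetic. The sketch below is an illustrative estimate only; the paper’s reported 1.16x end-to-end figure comes from specific model and hardware configurations, not from this formula.

```python
def end_to_end_speedup(p_rollout, s_rollout, s_train=1.0):
    """Amdahl-style estimate: rollout takes fraction p_rollout of a step,
    and each phase is sped up by its own factor."""
    return 1.0 / (p_rollout / s_rollout + (1.0 - p_rollout) / s_train)

# With rollout at 75% of step time and only a 1.33x FP8 rollout speedup:
print(round(end_to_end_speedup(0.75, 1.33), 2))  # → 1.23
```

The larger the rollout fraction, the closer the end-to-end gain tracks the rollout speedup — which is why the >70% rollout share makes FP8 rollout (and, in Jet-RL, FP8 training too) worth the numerical care.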
Jet-RL accelerates reinforcement learning with FP8 precision
Scientists achieved a 33% speedup in the rollout phase of reinforcement learning (RL) training with the new Jet-RL framework. This breakthrough targets a major bottleneck in training large language models (LLMs): the rollout phase traditionally consumes more than 70% of total training time. The research team showed that combining BF16 training with an FP8 rollout strategy suffers from instability and reduced accuracy, especially for long rollouts and complex tasks. Analysis traced this to the numerical discrepancy between training and inference caused by the approach’s effectively off-policy nature.
To overcome these limitations, the researchers developed Jet-RL, an FP8 RL training framework that applies a unified FP8 precision flow to both training and rollout. This approach minimizes numerical discrepancies and eliminates the need for inefficient step-to-step calibration, yielding highly stable and robust RL optimization. Experiments confirm that Jet-RL achieves up to a 41% speedup in the training phase itself and a 16% end-to-end speedup compared to standard BF16 training. Measurements show that the method maintains stable convergence across all settings tested, with negligible accuracy loss.
The team designed the framework to use the same quantization precision for both training and inference, resolving the policy mismatch and streamlining optimization. Jet-RL employs a mixed group-wise and block-wise quantization scheme and leverages a state-of-the-art FP8 GEMM kernel to accelerate end-to-end RL training. Comprehensive experiments across diverse models, datasets, and rollout configurations validate Jet-RL’s effectiveness, showing that it stabilizes training and minimizes the gap between training and rollout. Specifically, a 32B model achieved a 1.33x speedup in the rollout phase, while an 8B model saw a 1.41x speedup in the training phase and a 1.16x end-to-end speedup.
Compared to the BF16-train + FP8-rollout method, which typically incurs more than 5% accuracy degradation, Jet-RL reduces the loss to approximately 1%. These findings confirm that Jet-RL provides a robust solution for efficient low-precision RL training, delivering significant speedups without compromising performance. The study thus identifies the common BF16-train + FP8-rollout paradigm as a source of training instability and reduced accuracy under prolonged rollout generation and difficult tasks.
Jet-RL resolves precision discrepancies and significantly increases speed
Scientists demonstrated significant instability and accuracy collapse with the common RL training strategy of BF16 precision for training and FP8 precision for rollout, especially on long or complex tasks. Their analysis revealed that this performance drop stems from the numerical mismatch between training and inference inherent in this off-policy setup. To address it, the researchers introduced Jet-RL, a new FP8 RL framework that uses a unified FP8 precision flow for both training and rollout, effectively minimizing these discrepancies and eliminating costly step-to-step calibration. Extensive experiments confirmed Jet-RL’s effectiveness: speedups of up to 33% in the rollout phase, 41% in training, and 16% end-to-end compared to BF16 training, with stable convergence and negligible accuracy loss.
The authors acknowledge that their findings are based on specific large language models and tasks, and that further research is needed to explore Jet-RL’s generalizability across diverse architectures and problem domains. Future work may investigate adaptive precision schemes and combinations with other acceleration techniques to further optimize the RL training pipeline. This study establishes a robust and efficient method for FP8 RL training, paving the way for more scalable and resource-aware intelligent systems.
👉 More information
🗞 Jet-RL: Enabling on-policy FP8 reinforcement learning with unified training and rollout precision flows
🧠 ArXiv: https://arxiv.org/abs/2601.14243
