Recent advances in robotics increasingly rely on vision-language-action (VLA) models, but scaling these systems requires vast amounts of expensive, human-generated data, and the resulting models struggle to adapt to new situations. Haozhan Li, Yuxin Zuo, and Jiale Yu, together with colleagues, address these challenges by introducing SimpleVLA-RL, a new framework that uses reinforcement learning to train VLA models more efficiently. The approach significantly reduces the need for large datasets, remains robust on unfamiliar tasks, and outperforms existing supervised learning methods on benchmarks such as LIBERO and RoboTwin. The work also reveals surprising behavior during training: the policy discovers previously unseen manipulation patterns, suggesting a path toward more adaptive and intelligent robotic systems.
Robot learning through language and reinforcement
Vision-language-action (VLA) models have emerged as a powerful approach to robot manipulation. Despite recent advances from large-scale pre-training and supervised fine-tuning, these models face two challenges: large-scale, human-operated robot manipulation data is scarce, and the models generalize poorly to new tasks. Inspired by the successful application of reinforcement learning to large language models, the researchers investigate whether the same technique can improve the sample efficiency and adaptability of robot learning. This work uses reinforcement learning to train VLA models, which reduces reliance on large datasets, lets robots learn from limited demonstrations and explore new task variations, and ultimately yields more robust and reliable robotic systems.
Robot learning with reinforcement learning
SimpleVLA-RL is a new framework for training robots with reinforcement learning that addresses limitations of current vision-language-action (VLA) models. The researchers recognized that scaling VLA models requires large amounts of human-operated robot data, which is expensive and scarce, and that these models often struggle to generalize. Inspired by recent successes in using reinforcement learning to improve reasoning in large language models, the team investigated whether a similar approach could enhance long-horizon action planning in VLA systems. The work builds on an existing reinforcement learning framework for language models and adapts it, as SimpleVLA-RL, to the unique challenges of robot control.
Key innovations include VLA-specific trajectory sampling, optimized loss calculations, and parallel multi-environment rendering to accelerate training. Experiments show that SimpleVLA-RL achieves state-of-the-art performance on the robot benchmarks LIBERO and RoboTwin 1.0 and 2.0, consistently improving success rates by 10-15%. The team also observed a marked improvement in data efficiency: with only one demonstration per task, reinforcement learning raised the success rate on LIBERO from 17.1% to 91.7%. Furthermore, the system generalized well across spatial arrangements, objects, and tasks. A surprising result is the discovery of "pushcut," a novel behavior pattern that the policy exhibited during training, indicating that the system learned strategies not present in the initial supervised data. Policies trained in simulation were successfully transferred to real-world robots, demonstrating the potential for practical deployment without extensive real-world training data.
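The article does not spell out the update rule, so the following is only a minimal sketch of one common recipe from reinforcement learning for language models: sample a group of rollouts for the same task, score each with a binary success reward, and normalize rewards within the group to obtain advantages. The binary reward, the group normalization, and the names `rollout`, `sample_group`, and `group_advantages` are assumptions made for illustration, not the authors' implementation.

```python
import random
from statistics import mean, pstdev


def rollout(success_prob, rng):
    """Hypothetical stand-in for one simulated episode.

    Returns 1.0 if the task succeeded and 0.0 otherwise (assumed binary reward).
    """
    return 1.0 if rng.random() < success_prob else 0.0


def sample_group(success_prob, group_size, seed=0):
    """Collect rewards from several rollouts of the same task."""
    rng = random.Random(seed)
    return [rollout(success_prob, rng) for _ in range(group_size)]


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two successes and two failures: successes get positive advantages,
# failures get negative ones.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])

# A simulated group of eight rollouts with an assumed 50% success rate.
sampled_rewards = sample_group(success_prob=0.5, group_size=8, seed=0)
```

In this sketch, a group in which every rollout succeeds or every rollout fails yields all-zero advantages, so such a group would contribute no learning signal under this scheme.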
Robot manipulation can be learned from limited experience
This study introduces SimpleVLA-RL, a new framework that uses reinforcement learning to train vision-language-action models for robotic manipulation. The approach addresses important limitations of current methods, which rely on large datasets of human-operated robot trajectories that are expensive and difficult to obtain, and which often struggle to generalize to new situations. SimpleVLA-RL improves performance on benchmark tasks, surpassing existing supervised fine-tuning methods and achieving state-of-the-art results on both simulated and real robotic platforms. The framework not only reduces the need for extensive pre-recorded data but also delivers more robust performance on unfamiliar tasks and environments. During training, the system unexpectedly discovered a new strategy called "pushcut," which indicates an ability to learn beyond the patterns present in the initial training data.
