
Manually designing reward functions is time-consuming and can lead to unintended consequences. This is a major obstacle to developing general-purpose decision-making agents based on reinforcement learning (RL).
Previous video-based learning methods reward the agent when its current observation is close to the expert’s observation. Because these rewards are conditioned only on the current observation, they cannot capture behavior that is meaningful over time. Generalization is further hampered by adversarial training techniques, which are prone to mode collapse.
Researchers at the University of California, Berkeley, have developed a new method for extracting rewards from video prediction models, called Video Prediction Rewards (VIPER). VIPER can learn reward functions from raw videos and generalize to domains not seen during training.
First, VIPER trains a video prediction model on videos of expert behavior. An agent is then trained with reinforcement learning to maximize the log-likelihood of its own trajectories under that video prediction model. In effect, the agent’s trajectory distribution is optimized to match that of the video model. Because the likelihood under the video model is used directly as the reward signal, the agent learns to follow a trajectory distribution similar to the one the video model has learned. Unlike observation-level rewards, the rewards provided by a video model measure the temporal consistency of behavior. In addition, evaluating a likelihood is much faster than performing a video model rollout, which shortens training time and allows more interactions with the environment.
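The reward described above can be written as r_t = log p(x_{t+1} | x_{1:t}), the video model's log-likelihood of the next frame given the frames so far. The following is a minimal Python sketch of that idea, not the authors' implementation. The function `toy_log_likelihood` is a hypothetical stand-in for a trained video prediction model (it simply assumes the next frame is Gaussian around a linear extrapolation of the last two frames), and the names `viper_style_rewards`, `context_len`, and `sigma` are illustrative assumptions.

```python
import numpy as np


def toy_log_likelihood(context, next_frame, sigma=0.1):
    """Hypothetical stand-in for a trained video model: log p(next_frame | context).

    Assumption for illustration only: the next frame is Gaussian with
    standard deviation `sigma`, centered on a linear extrapolation of the
    last two context frames (or on the last frame if only one is available).
    """
    if len(context) >= 2:
        mean = 2 * context[-1] - context[-2]
    else:
        mean = context[-1]
    diff = next_frame - mean
    d = diff.size
    return float(
        -0.5 * np.sum(diff ** 2) / sigma ** 2
        - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    )


def viper_style_rewards(frames, log_likelihood, context_len=4):
    """Per-step rewards r_t = log p(x_{t+1} | x_{1:t}), with a truncated context."""
    rewards = []
    for t in range(1, len(frames)):
        # Use at most `context_len` preceding frames as conditioning context.
        context = frames[max(0, t - context_len):t]
        rewards.append(log_likelihood(context, frames[t]))
    return np.array(rewards)


# Compare a smooth trajectory with an erratic one (4-dimensional "frames").
rng = np.random.default_rng(0)
smooth = np.array([np.full(4, 0.05 * t) for t in range(10)])
jerky = rng.normal(size=(10, 4))

print("smooth return:", viper_style_rewards(smooth, toy_log_likelihood).sum())
print("jerky return: ", viper_style_rewards(jerky, toy_log_likelihood).sum())
```

In this toy setup the smooth trajectory receives the higher total reward because it is the one the stand-in model finds predictable. In an actual system, the likelihood function would come from a video model trained on expert videos, and the resulting rewards would be passed to whichever RL algorithm is in use.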
Across 15 DMC tasks, 6 RLBench tasks, and 7 Atari tasks, the team conducted extensive experiments, demonstrating that VIPER can achieve expert-level control without using task rewards. The findings show that VIPER-trained RL agents outperform adversarial imitation learning across the board. Because VIPER only supplies the reward signal, it is agnostic to the choice of RL agent. The video model also generalizes to arm-task combinations not encountered during training, even in the small-dataset regime.
The researchers believe that using large pre-trained conditional video models could allow for more flexible reward functions. Building on recent breakthroughs in generative modeling, they hope their work will give the community a foundation for scalable reward specification from unlabeled videos.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their practical applications.
