
The success of many reinforcement learning (RL) methods relies on dense reward functions, whose design can be difficult, requiring specialized knowledge and extensive trial-and-error. Sparse rewards, such as binary task completion signals, are easy to obtain but make exploration much harder for RL algorithms. This raises a natural question: can dense reward functions be learned in a data-driven manner to address these challenges?
Existing research on reward learning often overlooks the importance of reusing rewards for new tasks. For learning reward functions from demonstrations, known as inverse RL, adversarial imitation learning (AIL) has gained particular attention. Inspired by GANs, AIL uses a policy network to generate trajectories and a discriminator to distinguish them from expert demonstrations. However, AIL rewards are entangled with the policy they were trained against and cannot be reused across tasks, which limits generalization to new tasks.
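For context, here is a minimal sketch of a GAIL-style discriminator and reward. The architecture and the specific reward form, -log(1 - D(s, a)), are common AIL choices used for illustration, not details taken from this work.

```python
import torch
import torch.nn as nn

# Minimal GAIL-style setup (sizes and reward form are illustrative assumptions).
class Discriminator(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: expert-like vs. policy-generated
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def ail_reward(disc: Discriminator, obs, act):
    # A common AIL reward: -log(1 - D(s, a)), where D estimates the probability
    # that a transition came from the expert. Because D is trained against the
    # current policy's distribution, this reward is hard to reuse later.
    d = torch.sigmoid(disc(obs, act))
    return -torch.log(1.0 - d + 1e-8)
```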
Researchers from the University of California, San Diego propose Dense reward learning from Stages (DrS), a novel approach to learning reusable rewards that uses sparse rewards, rather than demonstrations, as the supervisory signal for classifying agent trajectories. Concretely, a discriminator is trained to classify success and failure trajectories according to the binary sparse reward. Transitions within success trajectories are assigned higher rewards and transitions within failure trajectories lower rewards, and this ordering is kept consistent throughout training, so once training is complete the rewards are reusable. Expert demonstrations can also be included as additional success trajectories, but they are not required, since a sparse reward is cheap to obtain and is often part of the task definition itself.
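The reward-learning step can be pictured as ordinary binary classification. The sketch below (reusing the Discriminator module above) assumes transitions have already been separated into success and failure buffers; the batching and optimizer details are illustrative, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def discriminator_update(disc, optimizer, success_batch, failure_batch):
    """One classification step: success transitions toward 1, failures toward 0."""
    obs_s, act_s = success_batch   # transitions from successful trajectories
    obs_f, act_f = failure_batch   # transitions from failed trajectories

    logits_s = disc(obs_s, act_s)
    logits_f = disc(obs_f, act_f)

    # Higher discriminator output should mean "more success-like", which is
    # what lets the logit serve as a dense progress signal afterward.
    loss = (F.binary_cross_entropy_with_logits(logits_s, torch.ones_like(logits_s))
            + F.binary_cross_entropy_with_logits(logits_f, torch.zeros_like(logits_f)))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```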
DrS consists of two phases: reward learning and reward reuse. In the reward learning phase, a classifier is trained to distinguish successful from unsuccessful trajectories using the sparse reward as supervision; this classifier then acts as a dense reward generator. In the reward reuse phase, the learned dense reward is used to train a new RL agent on a test task. For multi-stage tasks, a stage-specific discriminator is trained for each stage, ensuring effective dense guidance throughout task progression.
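One plausible way to compose the per-stage discriminators into a single dense reward is sketched below: a coarse stage index dominates, while a bounded tanh term shapes progress within the stage. The stage_of indicator and the exact combination rule are assumptions here, not the paper's verbatim formula.

```python
import torch

def dense_reward(stage_discs, stage_of, obs, act):
    # `stage_of` maps states to the stage index they are in; for simplicity it
    # is assumed here to return a single int for the whole batch.
    k = stage_of(obs)
    # A bounded shaping term keeps within-stage progress in (-1, 1), so that
    # reaching a later stage always yields a strictly higher reward.
    shaping = torch.tanh(stage_discs[k](obs, act)).squeeze(-1)
    return k + shaping
```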
The proposed method was evaluated on three challenging physical manipulation task families: Pick-and-Place, Turn Faucet, and Open Cabinet Door, each involving a variety of objects. The evaluation focused on the reusability of the learned rewards, using non-overlapping training and test object sets within each task family: rewards were learned by manipulating the training objects, then reused to train new agents on the test objects. The Soft Actor-Critic (SAC) algorithm was used for all evaluations. Results showed that the learned rewards outperformed the baseline rewards across all task families and were in some cases comparable to human-engineered rewards. Semi-sparse rewards achieved only limited success, while other reward learning methods failed.
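To make the pieces concrete, here is a small smoke test wiring the sketches above together on random tensors; in the actual experiments, batches would come from SAC rollouts on the training and test objects rather than random data.

```python
import torch

obs_dim, act_dim, n_stages = 32, 8, 3  # arbitrary dimensions for illustration
stage_discs = [Discriminator(obs_dim, act_dim) for _ in range(n_stages)]
opts = [torch.optim.Adam(d.parameters(), lr=3e-4) for d in stage_discs]

obs, act = torch.randn(64, obs_dim), torch.randn(64, act_dim)

# Reward-learning phase: one classification step per stage (illustration only;
# real training alternates discriminator updates with RL rollouts).
for disc, opt in zip(stage_discs, opts):
    discriminator_update(disc, opt, (obs, act),
                         (torch.randn_like(obs), torch.randn_like(act)))

# Reward-reuse phase: relabel a batch of transitions with the frozen reward.
stage_of = lambda o: 0  # placeholder stage indicator; task-specific in practice
with torch.no_grad():
    rewards = dense_reward(stage_discs, stage_of, obs, act)
print(rewards.shape)  # torch.Size([64])
```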
In conclusion, this study introduces DrS, a data-driven approach to learning dense reward functions from sparse rewards, evaluated on robotic manipulation tasks where the learned rewards transfer to objects of different shapes, demonstrating the effectiveness of DrS. By simplifying the reward design process, this approach could help RL applications scale up across a variety of scenarios. The multi-stage version of the approach does have two main limitations. First, how to acquire knowledge of the task structure remains unexplored and could potentially be addressed using large language models or information-theoretic approaches. Second, relying on stage indicators can pose challenges when training RL agents directly in real-world settings, although, if needed, stage information can be obtained using tactile sensors or visual detection/tracking methods.
Check out the paper. All credit for this research goes to the researchers of this project.

Asjad is a consulting intern at Marktechpost. He is pursuing a degree in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is constantly researching applications of machine learning in healthcare.
