TD3-BST: A machine learning algorithm that uses an uncertainty model to dynamically adjust regularization strength



https://arxiv.org/abs/2404.16399

Reinforcement learning (RL) is a learning paradigm in which an agent interacts with an environment, gathers experience, and aims to maximize the reward it receives. Because this involves a continual loop of deploying the policy and collecting new experience, it is referred to as online RL. Both on-policy and off-policy RL require online interaction, which may be impractical in certain domains due to experimental or environmental constraints. Offline RL algorithms are instead designed to extract effective policies from static, pre-collected datasets.
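
To make the distinction concrete, here is a minimal Python sketch of the two data regimes. The env, policy, and update_fn interfaces are hypothetical placeholders for illustration, not from the paper.

```python
import random

def online_rl_step(env, policy, replay_buffer):
    """Online RL: the agent must act in the live environment to gather data."""
    state = env.observe()                        # hypothetical env interface
    action = policy(state)
    next_state, reward = env.step(action)        # requires real interaction
    replay_buffer.append((state, action, reward, next_state))

def offline_rl_epoch(dataset, update_fn, batch_size=256):
    """Offline RL: learn purely from a fixed, pre-collected dataset."""
    batch = random.sample(dataset, batch_size)   # no environment calls at all
    update_fn(batch)
```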

Offline RL algorithms leverage static datasets to learn effective, broadly applicable policies, and many recent approaches have met with great success. However, achieving the reported performance typically requires significant hyperparameter tuning specific to each dataset, and evaluating each configuration means deploying the policy in the environment. This is a major obstacle to adopting these algorithms in real-world settings. A central technical challenge is the evaluation of out-of-distribution (OOD) actions: the learned value function can assign erroneously high values to actions unsupported by the dataset, which the policy then exploits.
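
The snippet below (illustrative, not from the paper) shows the core of the OOD problem: a Q-network fitted only on dataset actions still returns a value for any action you feed it, and for actions far outside the data that value is pure extrapolation, often inflated, and exactly what a Q-maximizing policy will latch onto.

```python
import torch
import torch.nn as nn

# A Q-network over a concatenated (state, action) input; dims are arbitrary.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

state = torch.zeros(1, 3)
in_dist_action = torch.tensor([[0.1]])    # close to actions seen in the dataset
ood_action = torch.tensor([[50.0]])       # far outside the dataset's support

q_in = q_net(torch.cat([state, in_dist_action], dim=-1))   # trained region
q_ood = q_net(torch.cat([state, ood_action], dim=-1))      # pure extrapolation
```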

Researchers at Imperial College London introduce TD3-BST (TD3 with Behavioral Supervisor Tuning), an algorithm that uses an uncertainty model to dynamically adjust the strength of regularization. The trained uncertainty model is incorporated into a regularized policy objective, yielding TD3-BST. The uncertainty network lets the learned policy maximize the Q-value around the dataset's modes while applying regularization only where it is needed. When tested on the D4RL benchmark, TD3-BST outperforms prior methods and achieves state-of-the-art performance.
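
As a rough sketch of the idea, the snippet below weights a TD3-BC-style behavioral-cloning term by the uncertainty model's output at the policy's own actions. The morse_net callable returning a certainty in [0, 1] is an assumption here, and the exact weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def policy_loss(q_net, morse_net, policy, states, dataset_actions):
    """Sketch of an uncertainty-tuned, TD3-BC-style policy objective.

    morse_net(s, a) is assumed to return a certainty in [0, 1]: near 1 for
    in-distribution actions, near 0 for OOD ones. The BC penalty is strong
    exactly where the uncertainty model distrusts the policy's actions, so
    regularization is applied only where it is needed.
    """
    pi_actions = policy(states)
    q_values = q_net(states, pi_actions)
    lam = 1.0 / q_values.abs().mean().detach()     # TD3-BC-style Q scaling
    uncertainty = 1.0 - morse_net(states, pi_actions)
    bc = F.mse_loss(pi_actions, dataset_actions,
                    reduction="none").sum(-1, keepdim=True)
    return (-lam * q_values + uncertainty * bc).mean()
```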

Tuning TD3-BST is simple and straightforward: the key hyperparameters of the Morse network are the kernel scale λ and the temperature. For high-dimensional action spaces, increasing λ tightens the region preserved around each mode. Training with Morse-weighted behavioral cloning (BC) reduces the influence of BC losses from distant modes and lets the policy commit to and optimize around a single mode. Furthermore, the study shows that sampling some OOD actions remains important within the TD3-BST framework, to a degree that depends on λ.
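
A minimal sketch of such a certainty model is shown below, assuming a Gaussian kernel exp(-λ‖z‖²) applied to the network's output; the real architecture and training objective (which is where the temperature enters) are omitted, so treat this as illustrative rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class MorseNet(nn.Module):
    """Morse-style certainty model (Gaussian kernel assumed; illustrative).

    f maps a (state, action) pair to a latent vector z; the certainty
    exp(-lam * ||z||^2) equals 1 where f outputs zero (a learned mode) and
    decays toward 0 away from it. A larger kernel scale `lam` makes the
    certainty fall off faster, tightening the region around each mode.
    """

    def __init__(self, state_dim, action_dim, latent_dim=32, lam=1.0):
        super().__init__()
        self.lam = lam
        self.f = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, states, actions):
        z = self.f(torch.cat([states, actions], dim=-1))
        return torch.exp(-self.lam * z.pow(2).sum(-1, keepdim=True))
```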

Simpler offline RL methods, called one-step algorithms, can also learn policies from offline datasets. These rely on weighted BC, which has limitations, and relaxing the policy objective plays an important role in improving performance. To address this, the BST objective is integrated into the existing IQL algorithm, learning stronger policies while preserving in-sample policy evaluation. This variant, IQL-BST, was tested using the same setup as the original IQL; the results closely match the original IQL, with slightly weaker performance on large datasets. However, relaxing the weighted BC with a BST objective yields good performance, especially on the more difficult medium and large datasets.
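
One way such a relaxation could look is sketched below, contrasting IQL's standard advantage-weighted BC extraction with a BST-style version that maximizes Q directly under an uncertainty-scaled BC penalty. The q_net, v_net, and morse_net callables and the way the terms are combined are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def iql_policy_loss(q_net, v_net, policy, states, actions, beta=3.0):
    """Standard IQL policy extraction: purely advantage-weighted BC."""
    adv = (q_net(states, actions) - v_net(states)).detach()
    weight = torch.exp(beta * adv).clamp(max=100.0)  # clipped for stability
    mse = ((policy(states) - actions) ** 2).sum(-1, keepdim=True)
    return (weight * mse).mean()

def iql_bst_policy_loss(q_net, morse_net, policy, states, actions):
    """Relaxed extraction: maximize Q directly, with the BC penalty scaled
    by the uncertainty model's distrust of the policy's own actions. The
    critics are still trained in-sample as in IQL (not shown)."""
    pi_actions = policy(states)
    q_values = q_net(states, pi_actions)
    lam = 1.0 / q_values.abs().mean().detach()
    mse = ((pi_actions - actions) ** 2).sum(-1, keepdim=True)
    return (-lam * q_values
            + (1.0 - morse_net(states, pi_actions)) * mse).mean()
```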

In conclusion, researchers from Imperial College London introduced TD3-BST, an algorithm that dynamically adjusts the regularization strength using an uncertainty model. Compared to previous methods on the Gym locomotion tasks, TD3-BST achieves the highest scores and performs better when learning from suboptimal data. Additionally, integrating policy regularization with ensemble-based uncertainty sources improves performance. Future work includes developing different methods for estimating uncertainty, alternative uncertainty measures, and principled ways of combining multiple sources of uncertainty.


Please check out the paper. All credit for this study goes to the researchers of this project.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding its real-world impact, and he aims to explain complex AI concepts in a clear and accessible way.
