HyPO: A hybrid reinforcement learning algorithm that uses offline data for contrastive preference optimization and online unlabeled data for KL regularization.



https://arxiv.org/abs/2406.01462

A key aspect of AI research is fine-tuning large language models (LLMs) to align their outputs with human preferences. This fine-tuning enables AI systems to generate useful, relevant responses that are in line with user expectations. The current paradigm focuses on improving these models by learning from human preference data, sidestepping the difficulty of manually specifying reward functions for different tasks. The two predominant techniques in this area are online reinforcement learning (RL) and offline contrastive methods, each with its own advantages and challenges.

A central challenge in fine-tuning LLMs to reflect human preferences is the limited coverage of static datasets. Such datasets often fail to adequately represent the diverse and dynamic range of human preferences encountered in real-world applications. The coverage issue is particularly pronounced when models are trained only on pre-collected data, which can lead to suboptimal performance. This highlights the need for methods that combine static datasets with data sampled online during training when tuning models to human preferences.

Existing approaches to preference fine-tuning of LLMs include online RL techniques such as Proximal Policy Optimization (PPO) and offline contrastive methods such as Direct Preference Optimization (DPO). Online RL techniques follow a two-step procedure: training a reward model on a fixed offline preference dataset, followed by RL training with on-policy data (a rough sketch of the reward-model step appears below). This approach has the advantage of on-policy feedback but is computationally intensive. In contrast, offline contrastive methods optimize the policy based only on pre-collected data, avoiding the need for online sampling, but they can suffer from overfitting and limited generalization.
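As a hedged illustration (not code from the paper), the snippet below sketches the first step of the online RL pipeline: fitting a reward model on a fixed offline preference dataset with a Bradley-Terry-style loss. The function name and the use of plain reward tensors are simplifications for clarity.

```python
# Minimal sketch of step one of the online RL pipeline: fitting a reward model
# on a fixed offline preference dataset (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push r(x, y_chosen) above r(x, y_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random scores standing in for reward-model outputs.
chosen = torch.randn(8, requires_grad=True)
rejected = torch.randn(8, requires_grad=True)
loss = reward_model_loss(chosen, rejected)
loss.backward()  # the fitted reward model then drives RL training on on-policy samples
```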

Researchers from Carnegie Mellon University, Aurora Innovation, and Cornell University have introduced Hybrid Preference Optimization (HyPO). This hybrid approach combines the strengths of online and offline techniques, aiming to improve model performance while maintaining computational efficiency: HyPO uses offline data for the preference-optimization objective and online unlabeled data for Kullback-Leibler (KL) regularization, which keeps the model close to the reference policy and helps it generalize beyond the training data.

HyPO's algorithmic framework uses offline data for the DPO objective and online samples to control the reverse KL divergence from the reference policy. The algorithm iteratively updates the model parameters by optimizing the DPO loss while adding a KL regularization term estimated from online samples. This hybrid approach addresses the shortcomings of purely offline methods, such as overfitting and limited dataset coverage, while retaining much of the benefit of online RL methods without their computational complexity.
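The following is a minimal sketch of this objective under the description above, not the authors' implementation: it combines the standard DPO loss on offline preference pairs with a Monte-Carlo estimate of the reverse KL computed from unlabeled completions sampled from the current policy. The names hypo_loss, beta, and lam are illustrative.

```python
# Minimal sketch of the HyPO objective (not the authors' code): the offline DPO
# loss plus a reverse-KL regularizer estimated from online, unlabeled samples.
import torch
import torch.nn.functional as F

def hypo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), offline preferred
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), offline dispreferred
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    policy_online_logps: torch.Tensor,    # log pi_theta(y | x), y sampled from pi_theta
    ref_online_logps: torch.Tensor,       # log pi_ref(y | x) for those same samples
    beta: float = 0.1,                    # DPO temperature
    lam: float = 0.1,                     # weight of the KL regularizer
) -> torch.Tensor:
    # Offline part: the standard DPO loss over preference pairs.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Online part: Monte-Carlo estimate of the reverse KL, KL(pi_theta || pi_ref),
    # using unlabeled completions drawn from the current policy.
    kl_estimate = (policy_online_logps - ref_online_logps).mean()

    return dpo_loss + lam * kl_estimate

# Toy usage with random log-probabilities standing in for model outputs.
loss = hypo_loss(*(torch.randn(4) for _ in range(6)))
```

In this sketch, the online completions would be regenerated from the current policy at each iteration, which is what lets the regularizer act beyond the coverage of the offline dataset.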

HyPO's performance was evaluated on several benchmarks, including the TL;DR summarization task and popular chat benchmarks such as AlpacaEval 2.0 and MT-Bench. The results were impressive, with HyPO achieving a 46.44% win rate on the TL;DR task using the Pythia 1.4B model, compared to 42.17% for the DPO method. With the Pythia 2.8B model, HyPO achieved a 50.50% win rate, significantly outperforming DPO's 44.39%. Additionally, HyPO showed strong control over the reverse KL divergence, with values of 0.37 and 2.51 for the Pythia 1.4B and 2.8B models, respectively, compared to 0.16 and 2.43 for DPO.

HyPO also showed notable improvements in common chat benchmarks. For example, in the MT-Bench evaluation, the HyPO fine-tuned model achieved average scores of 8.43 and 8.09 on turns 1 and 2, respectively, beating the DPO fine-tuned model's scores of 8.31 and 7.89. Similarly, in AlpacaEval 2.0, HyPO achieved 30.7% and 32.2% win rates on turns 1 and 2, respectively, compared to 28.4% and 30.9% for DPO.

Experimental results highlight HyPO's ability to mitigate the overfitting problem commonly seen in offline contrastive methods. For example, when trained on the TL;DR dataset, HyPO maintains a significantly lower average validation KL than DPO, indicating better alignment with the reference policy and reduced overfitting. This ability to leverage online data for regularization allows HyPO to achieve more robust performance across a range of tasks.

In conclusion, Hybrid Preference Optimization (HyPO), by effectively combining offline and online data, overcomes the limitations of existing methods and improves the alignment of large language models with human preferences. The performance gains demonstrated in the empirical evaluation highlight HyPO's potential to enable more accurate and reliable AI systems.




๐Ÿ Join the fastest growing AI research newsletter, read by researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft & more…





Source link

Leave a Reply

Your email address will not be published. Required fields are marked *