
Large language models (LLMs) have made significant progress in recent years, largely thanks to their improved ability to follow human instructions. Reinforcement learning from human feedback (RLHF) is the primary technique for aligning LLMs with human intent. It works by optimizing against a reward function, which can either be a separate model or be reparameterized within the LLM's own policy.
This reward function is learned from human preference data over prompt-response pairs. The diversity of responses in the preference data is a key factor in how effective the alignment is: it keeps the reward model from getting trapped in a local optimum and supports the development of more adaptive and capable language models.
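As a concrete illustration of this step, pairwise preference data is commonly fit with a Bradley-Terry-style objective, in which the reward model is trained so that the preferred (chosen) response outscores the rejected one. The PyTorch sketch below shows that standard loss; the function name and the toy scores are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: maximize the log-sigmoid of the reward margin.

    chosen_rewards / rejected_rewards are scalar reward-model scores for the
    human-preferred and dispreferred responses to the same prompt.
    """
    margin = chosen_rewards - rejected_rewards
    # Negative log-likelihood under the Bradley-Terry preference model.
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up scores for a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(bradley_terry_loss(chosen, rejected))
```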
Alignment can be performed either offline or online. Offline alignment relies on manually curating a range of responses to pre-determined prompts, which makes it hard to cover the breadth of natural language. In contrast, online alignment uses an iterative procedure: responses are sampled from the LLM, feedback on them is collected, and the resulting preference data is used to train the reward model.
In principle, the randomness of sampling allows some exploration of out-of-distribution (OOD) regions. In practice, however, the LLM's only objective in most online RLHF settings is to maximize the expected reward estimated from the collected data. With such passive exploration, responses tend to cluster around local optima, which can lead to overfitting and premature convergence while high-reward regions stay unexplored.
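Concretely, the policy update in a typical online RLHF round solves a KL-regularized reward-maximization problem of roughly the following form (a standard formulation written here for illustration, not quoted from the paper), where $\pi_{\mathrm{ref}}$ is the reference policy and $\beta$ controls the KL penalty:

$$
\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\!\left[r(x,y)\right]\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right]
$$

Because new samples are drawn only from the current reward-maximizing policy, exploration stays passive: generation concentrates wherever the current reward estimate is already high.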
Preference optimization has proven highly effective in aligning LLMs with human goals, especially when combined with reinforcement learning from human feedback. Iteratively gathering online feedback from humans or AI on model outputs typically yields more capable reward models and better-aligned LLMs than offline tuning on a fixed dataset. However, obtaining a globally accurate reward model requires systematic exploration that generates diverse responses across the vast space of natural language, a requirement that random sampling from a standard reward-maximizing LLM cannot satisfy.
To address this issue, the researchers propose a bilevel objective that is optimistically biased toward potentially high-reward responses so that the model actively explores out-of-distribution (OOD) regions. The resulting approach, called Self-Exploring Language Models (SELM), solves the inner-level problem with a reparameterized reward function, eliminating the need for a separate reward model, and iteratively updates the LLM with a simple objective.
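To make the idea concrete, the sketch below augments a standard DPO loss with a small optimism bonus on the implicit reward of the chosen response, nudging the model toward regions it currently believes are high-reward. This is a simplified illustration of the exploration bias described above; the exact form of SELM's optimism term, and hyperparameters such as alpha and beta, should be taken from the paper and official code rather than from this sketch.

```python
import torch
import torch.nn.functional as F

def selm_style_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    ref_chosen_logps: torch.Tensor,
                    ref_rejected_logps: torch.Tensor,
                    beta: float = 0.1,
                    alpha: float = 0.001) -> torch.Tensor:
    """DPO loss plus an illustrative exploration (optimism) bonus.

    The log-probabilities are summed over response tokens for each
    (prompt, chosen, rejected) triple. The alpha-weighted bonus below is a
    stand-in for SELM's optimism term, not the paper's exact formulation.
    """
    # Implicit rewards under the DPO reparameterization: r = beta * log(pi / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO loss on the preference pairs.
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Optimism bonus: push up responses the implicit reward already rates highly,
    # so the next round of sampling probes potentially high-reward regions.
    optimism_bonus = chosen_rewards

    return (dpo_loss - alpha * optimism_bonus).mean()
```

In an iterative pipeline, responses would be re-sampled from the updated model after each round, new preference labels collected, and the loss applied again.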
Compared with Direct Preference Optimization (DPO), SELM improves exploration efficiency and mitigates the indiscriminate favoring of unseen extrapolated responses. In experiments, fine-tuning the Zephyr-7B-SFT and Llama-3-8B-Instruct models with SELM significantly improves performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, and SELM also performs well on common academic benchmarks across a variety of settings.
In conclusion, by ensuring that LLMs not only follow instructions faithfully but also explore a wide range of possible responses, this approach marks a meaningful advance in aligning LLMs with human intent, ultimately leading to more capable and reliable language models.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year undergraduate student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialisation in Artificial Intelligence and Machine Learning.
She is an avid Data Science enthusiast with strong analytical and critical thinking skills and a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.