Reinforcement learning from human feedback (RLHF) uses a reward model trained on human preferences to tune large language models (LLMs), encouraging them to produce generations that receive high rewards. However, RLHF has several open problems. First, fine-tuning is often limited to a relatively small dataset, which can make the model overly specialized and cause it to forget the broad knowledge acquired during pre-training. This can degrade the reasoning ability of LLMs and their performance on NLP benchmarks. Second, maximizing an imperfect reward model (RM) can backfire, because the LLM may learn to exploit the RM's flaws. Finally, RLHF can reduce the diversity of outputs, causing the policy to collapse toward similar responses.
The paper discusses two related topics. The first is how to merge models. Recently, merging deep models in weight space, rather than in prediction space as traditional ensembles do, has attracted a lot of attention. This approach is called weight averaging (WA), and its most common form is linear interpolation (LERP). LERP was initially used to average checkpoints from a single run, either uniformly or with an exponential moving average (EMA). The second topic is the benefits of model merging. WA improves generalization by reducing variance and memorization and by flattening the loss landscape. Moreover, merging weights combines the strengths of the individual models, which makes it useful in multi-task settings.
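To make the weight-space operations above concrete, here is a minimal sketch of LERP, uniform checkpoint averaging, and an EMA update. It assumes each model is represented as a dictionary mapping parameter names to NumPy arrays; the function names and the `mu` update rate are illustrative choices, not taken from the paper.

```python
import numpy as np

def lerp(weights_a, weights_b, t):
    """Linear interpolation (1 - t) * a + t * b, applied to every parameter."""
    return {name: (1.0 - t) * weights_a[name] + t * weights_b[name]
            for name in weights_a}

def uniform_average(checkpoints):
    """Uniform weight average of a list of checkpoints with identical keys."""
    names = checkpoints[0].keys()
    return {name: sum(ckpt[name] for ckpt in checkpoints) / len(checkpoints)
            for name in names}

def ema_update(ema_weights, new_weights, mu):
    """One EMA step: ema <- (1 - mu) * ema + mu * new (mu is an assumed rate)."""
    return lerp(ema_weights, new_weights, mu)
```

Note that all three operations act on the parameters themselves, so the merged model costs the same at inference time as a single model, unlike an ensemble that averages predictions.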
A team from Google DeepMind proposed Weight Averaged Rewarded Policies (WARP), a method to tune LLMs and optimize the KL-reward Pareto front of solutions, that is, the trade-off between reward and Kullback-Leibler (KL) divergence from the initial model. WARP uses three types of WA at three stages of the tuning process, each for a different reason. First, it uses an exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it merges independently fine-tuned policies into a single policy by spherical linear interpolation (SLERP). Finally, it linearly interpolates between the merged model and the initialization to recover features from pre-training. This process is repeated, with each final model serving as the starting point for the next iteration, progressively improving the KL-reward Pareto front and yielding higher rewards at a fixed KL.
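The second and third stages can be sketched as follows. This is a simplified illustration under stated assumptions, not the team's implementation: weights are flattened into 1-D NumPy vectors, SLERP is applied to the differences between each fine-tuned policy and the shared initialization (a common formulation, assumed here), and the names `lam` and `eta` are hypothetical interpolation coefficients.

```python
import numpy as np

def slerp(theta_init, theta_1, theta_2, lam=0.5, eps=1e-8):
    """Spherical interpolation of two policies around a shared initialization."""
    delta_1 = theta_1 - theta_init
    delta_2 = theta_2 - theta_init
    # Angle between the two weight differences.
    norm_product = np.linalg.norm(delta_1) * np.linalg.norm(delta_2)
    cos_omega = np.dot(delta_1, delta_2) / (norm_product + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.sin(omega) < eps:
        # Nearly parallel differences: fall back to linear interpolation.
        return theta_init + (1.0 - lam) * delta_1 + lam * delta_2
    coef_1 = np.sin((1.0 - lam) * omega) / np.sin(omega)
    coef_2 = np.sin(lam * omega) / np.sin(omega)
    return theta_init + coef_1 * delta_1 + coef_2 * delta_2

def interpolate_towards_init(theta_init, theta_merged, eta):
    """Linear interpolation between the initialization (eta=0) and the merged model (eta=1)."""
    return (1.0 - eta) * theta_init + eta * theta_merged
```

In this sketch, sweeping `eta` from 0 to 1 traces a path from the initialization to the merged policy, which is one way to trade reward against KL; the result for a chosen `eta` would then seed the next iteration.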
In the team's experiments, the Gemma “7B” LLM was fine-tuned with RLHF to become a better conversational agent. REINFORCE policy gradients were used to optimize the KL-regularized reward. On-policy samples were generated from a dataset of conversational prompts at temperature 0.9 with a batch size of 128, and training used the Adam optimizer with a learning rate of 10⁻⁶ and 100 warm-up steps. SLERP was applied to each of the 28 layers individually. Note that the experiments rely on the highest-capacity reward model available, and therefore cannot use an oracle control RM for evaluation.
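As a rough illustration of what optimizing a KL-regularized reward with REINFORCE involves, the sketch below computes the regularized reward per sample and the value of a mean-baseline surrogate loss. It assumes the standard form of the objective (reward minus `beta` times the log-ratio between the policy and the anchor); the value of `beta`, the mean baseline, and the function names are assumptions, since the article does not specify them.

```python
import numpy as np

def kl_regularized_rewards(rewards, logp_policy, logp_anchor, beta):
    """Per-sample reward minus beta times the sampled log-ratio log(pi / pi_anchor)."""
    return rewards - beta * (logp_policy - logp_anchor)

def reinforce_surrogate_loss(logp_policy, regularized_rewards):
    """Value of the REINFORCE surrogate loss with a mean baseline.

    In an autodiff framework the advantages would be treated as constants,
    so that the gradient flows only through logp_policy.
    """
    advantages = regularized_rewards - regularized_rewards.mean()
    return -np.mean(advantages * logp_policy)
```

Here `logp_policy` and `logp_anchor` stand for the sequence log-probabilities of each sampled response under the current policy and under the anchor, which in WARP's first stage would be the EMA of the policy.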
The trained policy was compared side-by-side with the Mistral and Mixtral LLMs. Each policy generated candidate answers for a set of prompts described in the Gemma technical report. As in Gemini 1.5, side-by-side preference ratings were calculated, with scores of ±1.5, ±1, and ±0.5 given for “much better”, “better”, and “slightly better”, respectively, and ties given a score of 0. A positive score means the policy was preferred. The results indicate that WARP is effective, as the proposed policy was preferred over the Mistral variants and outperformed the previous Gemma “7B” release.
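The scoring scheme described above can be sketched as a simple lookup and average. The label strings, including the labels for negative ratings, and the use of a plain mean are assumptions made for illustration; only the numeric scale comes from the description above.

```python
# Scores from the perspective of the policy being evaluated.
# Label strings are hypothetical; only the numeric scale is from the article.
SCORES = {
    "much better": 1.5,
    "better": 1.0,
    "slightly better": 0.5,
    "tie": 0.0,
    "slightly worse": -0.5,
    "worse": -1.0,
    "much worse": -1.5,
}

def mean_preference_score(ratings):
    """Average side-by-side score; a positive value means the policy is preferred."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(SCORES[label] for label in ratings) / len(ratings)
```

For example, the ratings `["much better", "tie", "slightly worse", "better"]` average to 0.5, which would count as a preference for the evaluated policy.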
In conclusion, the Google DeepMind team introduced WARP, a novel RLHF technique for tuning LLMs and optimizing the KL-reward Pareto front of solutions. The technique uses three distinct model merging stages: (a) an exponential moving average as a dynamic anchor during RL, (b) spherical interpolation to combine multiple independently rewarded policies, and (c) interpolation towards a shared initialization. Applied iteratively, WARP improves the KL-reward Pareto front and tunes LLMs while preserving knowledge from pre-training, and it compares favorably with state-of-the-art baselines. In the future, WARP may help create safe and powerful AI systems by improving alignment and stimulating further research into model merging techniques.
All credit for this research goes to the researchers of this project.

Sajjad Ansari is a final-year undergraduate student at the Indian Institute of Technology Kharagpur. As a technology enthusiast, he explores practical applications of AI, with a focus on understanding AI technologies and their real-world impact. He aims to explain complex AI concepts in a clear and accessible manner.
