
Reinforcement learning (RL) fine-tuning is a critical step in training a language model (LM) to behave in a desired way and align with human preferences. In today's applications, RL fine-tuning must balance multiple objectives that reflect different human preferences and use cases. Multi-objective fine-tuning (MOFT) is therefore needed to train a single LM that serves several objectives at once and to overcome the limitations of single-objective fine-tuning (SOFT). For LMs, MOFT has been explored through prompt-based and parameter-based methods. Prompt-based methods fine-tune the LM by including the reward weights directly in the prompt. However, this approach can give weaker control over the model and is sensitive to how the weights are phrased. Additionally, zero-shot MOFT methods can perform poorly at intermediate weightings that were not encountered during training.
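For concreteness, here is a minimal Python sketch of what prompt-based conditioning can look like: the reward weights are spelled out in text and prepended to the user's prompt. The function name, the reward names, and the exact phrasing are illustrative assumptions, not details from the papers discussed here.

```python
def weight_conditioned_prompt(user_prompt, reward_weights):
    """Illustrative prompt-based conditioning: spell the reward weights out
    in text and prepend them to the user's prompt. The exact phrasing is a
    design choice, and model behavior can be sensitive to it."""
    tags = ", ".join(f"{name}: {w:.2f}" for name, w in reward_weights.items())
    return f"[reward weights -> {tags}]\n{user_prompt}"

# Example with hypothetical reward names:
print(weight_conditioned_prompt("Summarize this article.",
                                {"helpfulness": 0.8, "brevity": 0.2}))
```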
The two main techniques for multi-reward alignment (MOFT) are prompt-based and parameter-based conditioning. Prompt-based conditioning includes approaches such as Personalized Soups (PS), which personalizes a language model (LM) with custom prompts built from binary weights over the different rewards. On the parameter side, Rewarded Soups (RS) provides a zero-shot method that averages the parameters of separately trained LMs at inference time. A recent paper also embeds reward weights as singular values within the AdaLoRA framework. For KL realignment, decoding-time realignment (DeRa) linearly mixes logits between the reference model π_ref and an LM learned through SOFT with the smallest KL weight.
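As a rough sketch of the parameter-averaging idea behind RS (not the authors' exact code), the helper below blends separately fine-tuned checkpoints with a user-chosen weighting; the checkpoint names in the usage comment are hypothetical.

```python
def rewarded_soup(state_dicts, weights):
    """Blend several single-reward checkpoints by weighted parameter averaging.

    state_dicts: one state_dict per separately fine-tuned policy.
    weights: non-negative floats (one per reward) summing to 1.
    Returns a new state_dict that can be loaded for zero-shot inference.
    """
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Usage sketch (checkpoint names are illustrative):
# merged = rewarded_soup([helpfulness_sd, brevity_sd], [0.7, 0.3])
# model.load_state_dict(merged)
```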
The Google team proposed a general MOFT framework called Conditional Language Policy (CLP), which combines parameter-space conditioning with multi-task training. This method is more controllable than purely prompt-based techniques because it builds on the parameter conditioning used in RS. Moreover, by training on varied reward weightings, CLP produces higher-quality responses than zero-shot methods such as RS while offering the same or better controllability. In a series of experiments, the team found that CLP Pareto-dominates RS and is easier to steer than prompt-based MOFT, and that these advantages hold under a variety of conditions, including different reward choices and model sizes.
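A minimal sketch of what parameter-space conditioning can look like, assuming a simple "base weights plus per-reward deltas" parameterization rather than CLP's exact design: the layer mixes its weights according to the supplied reward weighting at forward time.

```python
import torch
import torch.nn as nn

class ConditionedLinear(nn.Module):
    """Sketch of parameter-space conditioning (not the authors' exact design):
    a shared base weight plus one learnable delta per reward, mixed on the fly
    by the user-supplied reward weighting."""

    def __init__(self, d_in, d_out, num_rewards):
        super().__init__()
        self.base = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.deltas = nn.ParameterList(
            [nn.Parameter(torch.zeros(d_out, d_in)) for _ in range(num_rewards)]
        )

    def forward(self, x, reward_weights):
        # reward_weights: 1-D tensor of length num_rewards (summing to 1).
        w = self.base
        for rw, delta in zip(reward_weights, self.deltas):
            w = w + rw * delta
        return x @ w.T

# Example: a layer conditioned on two rewards, queried at weighting [0.3, 0.7].
layer = ConditionedLinear(d_in=16, d_out=8, num_rewards=2)
out = layer(torch.randn(4, 16), torch.tensor([0.3, 0.7]))
```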
The proposed method, CLP, builds on parameter-averaging techniques to learn a set of parameters that can be combined, at inference time, into a conditioned LM for any weighting over the rewards and the KL regularizer. The learning algorithm samples different weightings during training, improving the Pareto front for all weightings at once; in effect, it performs multi-task learning across weightings to maximize the MOFT objective. Automated evaluations using Gemini 1.0 Ultra show that CLP is more steerable and produces better responses than existing baselines. The team also proposed new theory characterizing when zero-shot methods can be nearly Pareto-optimal relative to policies tailored to the individual rewards.
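The multi-task idea can be illustrated with a toy Python loop: each step samples a fresh reward weighting (here from a Dirichlet distribution) and scalarizes per-objective reward scores with it before the RL update. The reward names, batch values, and sampling distribution are stand-ins, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def scalarize(reward_matrix, weighting):
    """Combine per-objective reward scores into one scalar per sample."""
    return reward_matrix @ weighting

# Toy illustration of the multi-task idea: each step draws a fresh weighting
# over two hypothetical rewards ("helpfulness", "brevity"), so a single
# conditioned policy is trained to do well across the whole Pareto front.
for step in range(3):
    weighting = rng.dirichlet(np.ones(2))        # e.g. array([0.64, 0.36])
    batch_rewards = rng.normal(size=(4, 2))      # stand-in reward-model scores
    scalar_rewards = scalarize(batch_rewards, weighting)
    # scalar_rewards (minus a KL penalty) would feed a standard RL update
    # for the policy conditioned on `weighting`.
    print(step, weighting.round(2), scalar_rewards.round(2))
```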
Benchmark results were obtained in three settings: a single reward with multiple KL regularization strengths, two rewards with fixed KL regularization, and three rewards with fixed KL regularization. In the single-reward setting, CLP is twice as computationally efficient as DeRa at inference time, since DeRa makes two LM calls per token. Multi-task training allows CLP to improve over the zero-shot RS baseline in response quality. In addition, full-CLP and attn-CLP maintain a more spread-out, steerable Pareto front than logit-CLP and the prompt-based baselines; overall, attn-CLP strikes a good balance between Pareto front quality and steerability while using fewer parameters than current baselines.
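To make the inference-cost comparison concrete, here is a hedged sketch assuming HuggingFace-style models that return a `.logits` field: a DeRa-style decoder needs logits from both the reference model and the SOFT-aligned model at every token, while a weight-conditioned model produces its logits in a single call (the `reward_weights` argument is a hypothetical conditioning input).

```python
import torch

@torch.no_grad()
def dera_style_next_logits(ref_model, aligned_model, input_ids, lam):
    """Decode-time realignment sketch: two forward passes per token, then a
    linear mix of the logits (lam = 0 recovers the reference model,
    lam = 1 the SOFT-aligned model)."""
    ref_logits = ref_model(input_ids).logits[:, -1, :]
    aligned_logits = aligned_model(input_ids).logits[:, -1, :]
    return (1.0 - lam) * ref_logits + lam * aligned_logits

@torch.no_grad()
def conditioned_next_logits(conditioned_model, input_ids, reward_weights):
    """A weight-conditioned model (CLP-style) needs only one forward pass per
    token; `reward_weights` is an assumed conditioning input, not a real API."""
    outputs = conditioned_model(input_ids, reward_weights=reward_weights)
    return outputs.logits[:, -1, :]
```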
In this paper, the team introduced Conditional Language Policy (CLP), a flexible framework for MOFT that uses multi-task training and parameter-efficient fine-tuning to create adaptive language models (LMs) that can balance multiple distinct rewards. The paper includes extensive benchmarking and ablation studies to understand the factors that make LMs steerable within the CLP framework. The team also presented theoretical results on when zero-shot approaches suffice and why multi-task training is needed for near-optimal behavior. Future work includes exploring other conditioning mechanisms such as soft tokens, automating the adjustment of the weight-sampling distribution, and handling nonlinear reward scalarization.
Please check out the paper. All credit for this research goes to the researchers of this project.

Sajjad Ansari is a final year undergraduate student at the Indian Institute of Technology Kharagpur. As a technology enthusiast, he delves into practical applications of AI, with a focus on understanding how AI technologies impact the real world. He aims to express complex AI concepts in a clear and understandable manner.