Over the past few years, large-scale language models have received a great deal of attention from researchers and the general public due to their impressive capabilities. Models such as GPT-3 can generate human-like text, converse with users, summarize documents, answer questions, and even write code. In many scenarios, the quality of the generated text plays an important role in evaluating a language model: for a good user experience, users expect models to produce error-free, executable code or to write poetry that exhibits a certain level of creativity. A loss function is therefore needed that captures these attributes. Most research so far has relied on loss functions based on next-token prediction or similar criteria. A promising alternative direction, however, is to incorporate human feedback as a measure of performance and to use that feedback as a loss for optimizing models. This idea is known as Reinforcement Learning from Human Feedback (RLHF), and several powerful models such as ChatGPT, GPT-4, and Claude currently employ this technique.
Adding another model to the list of successful applications of RLHF, Hugging Face researchers have used RLHF to train a model that answers questions from Stack Exchange. The result is StackLLaMA, a 7B-parameter language model based on Meta’s LLaMA model, trained with Hugging Face’s Transformer Reinforcement Learning (TRL) library. The researchers fine-tuned Meta’s original LLaMA model using a combination of three main strategies: supervised fine-tuning (SFT), reward/preference modeling (RM), and Reinforcement Learning from Human Feedback (RLHF). The model is publicly available, and the entire training pipeline ships as part of the TRL library.
The Hugging Face researchers point out that RLHF is only a fine-tuning step, so choosing a capable initial model is an important preliminary decision. For their purposes, they chose the recently introduced LLaMA family of language models developed by Meta AI. This collection of foundation language models outperforms GPT-3 and is available in sizes ranging from 7B to 65B parameters; the researchers used the 7B-parameter model for their experiments. They also note that a good dataset plays an important role in providing appropriate human feedback. Here they chose the StackExchange dataset, which includes more than 10 million question-and-answer pairs on a wide range of topics, including code snippets from StackOverflow. Another useful feature of this dataset is that it records the number of upvotes and labels the accepted answer for each question, which helped greatly in building the reward model.
Before training a reward model and tuning it with reinforcement learning, the Hugging Face team first fine-tuned the model on the target domain (in their case, question answering) with a causal language modeling objective. To train efficiently on a subset of the StackExchange dataset, the team used a technique called packing: rather than padding short sequences to a fixed length or truncating long ones, many tokenized examples are concatenated (separated by an end-of-sequence token) and the resulting stream is cut into blocks that exactly fill the model’s context window, so no compute is wasted on padding tokens. The model is then trained for a few thousand steps, completing the fine-tuning stage.

The next step was to train the reward model. Because fine-tuning models with RLHF directly from live human annotations is very time-consuming and labor-intensive, the researchers instead trained a model that mimics how humans would evaluate text. One such strategy is to predict a score, or a binary value indicating whether an answer is good or bad. Since the StackExchange dataset contains at least two answers for every question, the researchers derived a preferred answer from the upvote-based score and trained the reward model to rank it above the alternative. Applied to a subset of the dataset, this methodology yielded a final accuracy of 67%, which is quite commendable given how difficult the task is even for a human annotator.
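The packing step described above can be sketched in a few lines. This is a minimal illustration, not the TRL implementation; the function name `pack_sequences` and the plain integer-list representation of token IDs are assumptions for the example.

```python
def pack_sequences(tokenized_examples, block_size, eos_token_id):
    """Concatenate tokenized examples, separated by an EOS token, and cut the
    stream into fixed-length blocks so no padding tokens are needed."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_token_id)
    # Drop the trailing remainder that does not fill a whole block.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```

Every block has exactly `block_size` tokens, so batches fill the context window completely; the trade-off is that a block may start mid-example, which causal language modeling tolerates well.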
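The pairwise preference objective mentioned above is commonly implemented as a negative log-sigmoid of the score margin between the chosen and rejected answers. The sketch below shows that loss on plain floats; the function name and scalar interface are assumptions for illustration (in practice the scores come from a neural reward model and the loss is computed over batches of tensors).

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """loss = -log(sigmoid(r_chosen - r_rejected)).

    Small when the reward model scores the preferred answer higher,
    large when it ranks the rejected answer above the preferred one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Minimizing this loss pushes the model to assign a higher score to the human-preferred answer in each pair, which is exactly the ranking signal the upvote data provides.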
With the fine-tuned language model and the reward model in hand, the final step was to run the RL loop. This procedure can be summarized in three main stages: generating responses from prompts, evaluating the responses with the reward model, and using those evaluations to perform an optimization step on the reinforcement learning policy. Based on previous work on training language models with RL, the researchers observed that a model can learn to exploit the reward model by generating complete gibberish that nonetheless receives high rewards. To combat this, they added a penalty to the reward that discourages the trained policy from drifting too far from the original model.
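A common form of this penalty, used in RLHF work generally, subtracts a per-token KL estimate (the log-probability under the trained policy minus the log-probability under a frozen reference copy) from the reward-model score. The sketch below shows the shape of that computation on scalars; the function name and the coefficient `beta` are assumptions for illustration, not the exact TRL formulation.

```python
def kl_penalized_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """Reward-model score minus a scaled KL estimate.

    If the policy assigns its output much higher probability than the frozen
    reference model does, the KL term grows and the reward shrinks, so
    reward hacking via off-distribution gibberish is penalized."""
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate
```

When the policy matches the reference, the penalty vanishes and the reward-model score passes through unchanged; the further the policy drifts, the more of its reward it forfeits.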
In a nutshell, the Hugging Face researchers’ work can be summarized as creating a human-annotated dataset, fine-tuning a language model to the domain, training a reward model, and finally training the model with RL. StackLLaMA is a stepping stone in the RLHF world, but the model is far from perfect, and there are ongoing issues the Hugging Face team is working hard to resolve, such as occasional spikes in the loss that lead to training instability. The model is now open for educational and research purposes through the TRL library. The team also explicitly states that prompts typed into the app are collected in order to further fine-tune the model, so users should refrain from entering sensitive or personal information.
Check out the demo, code, and blog. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 18k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more.
🚀 Check out 100 AI Tools in the AI Tools Club
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about machine learning, natural language processing, and web development. She enjoys learning more about the technical field by participating in various challenges.