The emergence of ChatGPT in 2022 completely changed the way the world perceives artificial intelligence. ChatGPT's incredible performance led to the rapid development of other powerful LLMs.
Roughly speaking, ChatGPT is an upgraded version of GPT-3. However, compared to previous GPT versions, this time the OpenAI developers did not simply use more data and a more complex model architecture. Instead, they designed a remarkable technique that enabled a breakthrough.
In this article, we will explain RLHF, a fundamental algorithm implemented at the core of ChatGPT that pushes past the limits of human annotation for LLMs. The algorithm is based on proximal policy optimization (PPO), but the reinforcement learning details are not the focus of this article, so we will keep the explanation simple.
NLP development before ChatGPT
To set the context properly, let's recall how LLMs were developed before ChatGPT. In most cases, LLM development consisted of two stages.

Pre-training consists of language modeling: the task in which the model attempts to predict a hidden token from its context. The probability distribution that the model produces for the hidden token is compared to the ground-truth distribution to compute the loss, which is then backpropagated. In this way, the model learns the semantic structure of the language and the meaning behind words.
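As a minimal sketch of this idea (PyTorch and the toy tensor shapes here are my own assumptions, not taken from the article), the pre-training objective boils down to a cross-entropy loss between the model's predicted distribution and the ground-truth token:

```python
import torch
import torch.nn.functional as F

# Toy setup: vocabulary of 10 tokens, 4 hidden positions to predict.
vocab_size = 10
logits = torch.randn(4, vocab_size)        # model's raw scores for each hidden position
target_tokens = torch.tensor([2, 7, 1, 9]) # ground-truth tokens at those positions

# Cross-entropy compares the predicted distribution with the true token;
# this is the loss that gets backpropagated during pre-training.
loss = F.cross_entropy(logits, target_tokens)
print(loss.item())
```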
If you want to know more about the pre-training and fine-tuning framework, check out my article about BERT.
The model is then fine-tuned on downstream tasks, which may include a variety of objectives such as text summarization, translation, text generation, and question answering. In many situations, fine-tuning requires enough labeled text samples for the model to generalize properly and avoid overfitting.
This is where the limits of fine-tuning appear. Data annotation is usually a time-consuming task performed by humans. Take question answering as an example: to build a training set, you need a manually labeled dataset of questions and answers, where every question has an accurate answer provided by a human. For example:

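As a purely hypothetical illustration (these pairs are invented for this article, not taken from any real dataset), such annotated data might look like:

```python
# Hypothetical (question, answer) pairs written by human annotators.
qa_pairs = [
    {"question": "What is the capital of France?", "answer": "Paris."},
    {"question": "Who wrote 'War and Peace'?", "answer": "Leo Tolstoy."},
]
```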
The reality is that training LLMs requires millions or even billions of (question, answer) pairs. This annotation process is very time-consuming and does not scale well.
RLHF
Now that you understand the main issue, it is the perfect moment to dive into the details of RLHF.
If you already use ChatGPT, you have probably come across a situation where ChatGPT asks you to choose which of two answers to your prompt is better.

This information is actually used to continually improve ChatGPT. Let's understand how.
First of all, it is important to note that choosing the better of two answers is a much easier task for humans than writing an accurate answer to an open question from scratch. The idea we are about to explore is based precisely on that: to create an annotated dataset, humans simply select the better of two possible answers.

Response generation
In LLMs, there are several possible ways to generate a response from the predicted token probability distribution:
- Greedy decoding: given an output distribution p over tokens, the model always deterministically selects the token with the highest probability.

- Sampling: given an output distribution p over tokens, the model randomly samples a token according to the assigned probabilities.

The second method (sampling) results in more varied model behavior and allows diverse text sequences to be generated; a minimal sketch of both decoding strategies follows below. For now, let's assume that we generate many pairs of such sequences for the same prompts. The resulting dataset of pairs is labeled by humans: for every pair, humans are asked which of the two output sequences better fits the input prompt. This annotated dataset is used in the next step.
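Here is a minimal sketch of both strategies, assuming PyTorch and a toy probability distribution:

```python
import torch

# Toy probability distribution over a vocabulary of 5 tokens.
probs = torch.tensor([0.05, 0.60, 0.10, 0.20, 0.05])

# Greedy decoding: always pick the most probable token (deterministic).
greedy_token = torch.argmax(probs)

# Sampling: draw a token according to the assigned probabilities (stochastic),
# which is what makes it possible to generate two different responses
# for the same prompt.
sampled_token = torch.multinomial(probs, num_samples=1)

print(greedy_token.item(), sampled_token.item())
```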
In the context of RLHF, an annotated dataset created this way is called “Human Feedback”.
Reward model
Once the annotated dataset is created, it is used to train a so-called “reward” model. Its goal is to learn to numerically estimate how good or bad a given answer to the initial prompt is. Ideally, we want the reward model to produce positive values for good responses and negative values for bad ones.
As for the reward model's architecture, it is exactly the same as the original LLM except for the last layer: instead of outputting a text sequence, the model outputs a single float value, the estimated quality of the response.
You must pass both the initial prompt and the generated response as input to the reward model.
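Below is a minimal PyTorch sketch of this idea; the backbone, dimensions, and the choice of pooling the last position are placeholders of mine, since in practice the backbone is the pretrained LLM itself:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Same backbone as the LLM, but the head outputs a single scalar reward."""
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                       # pretrained transformer (placeholder here)
        self.value_head = nn.Linear(hidden_size, 1)    # replaces the token-prediction head

    def forward(self, prompt_and_response: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(prompt_and_response)    # (batch, seq_len, hidden)
        last_hidden = hidden[:, -1, :]                 # use the final position's hidden state
        return self.value_head(last_hidden).squeeze(-1)  # one reward per sequence

# Dummy usage with a stand-in backbone that just returns its input.
model = RewardModel(backbone=nn.Identity(), hidden_size=16)
fake_inputs = torch.randn(2, 8, 16)   # (batch, seq_len, hidden), already "embedded"
print(model(fake_inputs).shape)       # torch.Size([2]): one reward per (prompt, response)
```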
Loss function
Since the annotated dataset has no numerical labels, you might logically ask how the reward model can learn this regression task. That is a reasonable question. To address it, we use an interesting trick: we pass both the good and the bad answer through the reward model, which ultimately outputs two different estimates (rewards).
Next, we construct a loss function that compares these two rewards.
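Written out explicitly, this is presumably the standard pairwise ranking loss used for reward models (the exact form is my assumption, but it is consistent with the (0, 0.69) bound discussed below):

$$\mathcal{L}(r_{+}, r_{-}) = -\log \sigma(r_{+} - r_{-}),$$

where σ is the sigmoid function, r₊ is the reward assigned to the better response, and r₋ is the reward assigned to the worse one. Note that when r₊ = r₋, the loss equals log 2 ≈ 0.69.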

Let's plug some values into the loss function and analyze its behavior. Below is a table with the plugged-in values:

You can quickly observe two interesting insights.
- When the difference between r₊ and r₋ is negative, i.e. the better response received a lower reward than the worse one, the loss grows roughly in proportion to the reward difference. This means the model needs to be adjusted significantly.
- When the difference between r₊ and r₋ is positive, i.e. the better response received a higher reward than the worse one, the loss is bounded within the much smaller interval (0, 0.69).
The nice thing about such a loss function is that the model learns appropriate reward values for generated text on its own; we humans do not need to explicitly score every response numerically. Annotators simply provide a binary preference.
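A small numeric sketch of this behavior, assuming the pairwise loss above and PyTorch:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_plus: torch.Tensor, r_minus: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(r_plus - r_minus)): small when the better response gets the
    # higher reward, large when the ordering is wrong.
    return -F.logsigmoid(r_plus - r_minus)

# Correct ordering: better response scored higher -> loss stays below log(2) ≈ 0.69.
print(pairwise_reward_loss(torch.tensor(2.0), torch.tensor(-1.0)).item())  # ~0.049

# Wrong ordering: better response scored lower -> loss grows with the gap.
print(pairwise_reward_loss(torch.tensor(-1.0), torch.tensor(2.0)).item())  # ~3.049
```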
Training the original LLM
The trained reward model is then used to train the original LLM. To do this, a series of new prompts is sent to the LLM, which generates output sequences. Next, each input prompt is fed into the reward model together with the generated output sequence to estimate how good the response is.
The resulting numerical estimate is then used as feedback to the original LLM, which performs a weight update. A very simple yet elegant approach!

In most cases, this last step of adjusting the model's weights is performed with a reinforcement learning algorithm, usually Proximal Policy Optimization (PPO).
If you are not familiar with reinforcement learning or PPO, you can roughly think of this step as backpropagation, just as in regular machine learning algorithms, even though that is not technically precise.
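Here is a deliberately simplified sketch of that mental model: a plain REINFORCE-style policy-gradient update with dummy tensors standing in for the LLM's outputs, rather than real PPO (which adds a clipped objective and a KL penalty against the original model):

```python
import torch

# Dummy stand-ins: log-probabilities the LLM assigned to the tokens it generated
# for one response (in reality these come from the model's forward pass).
generated_log_probs = torch.tensor([-1.2, -0.7, -2.1], requires_grad=True)

# Scalar score from the trained reward model for this (prompt, response) pair.
reward = torch.tensor(0.8)

# REINFORCE-style objective: increase the probability of responses that the
# reward model scores highly.
loss = -(reward * generated_log_probs.sum())
loss.backward()

print(generated_log_probs.grad)  # gradients that would drive the LLM's weight update
```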
Inference
During inference, only the original trained model is used. At the same time, the model can be continually improved in the background by collecting user prompts and periodically asking users which of two responses is better.
Conclusion
In this article, we studied RLHF, a very efficient and scalable technique for training modern LLMs. The elegant combination of an LLM and a reward model significantly simplifies the annotation task performed by humans.
RLHF is used at the core of many popular models, including ChatGPT, Claude, Gemini, and Mistral.
Resources
All images are from the author unless otherwise stated.