At the core of reinforcement learning, models are trained to prioritize outputs that yield stronger rewards, optimizing for quality as well as accuracy. In classic deep reinforcement learning, an agent acts within its environment and learns from the outcomes of its actions. For an LLM, the "action" is producing text, and the reward signal reflects how good that output is according to the chosen objective.
In reality, this process unfolds over several stages. It starts with a pre-trained model built on a large text dataset that underlies everything the model knows. From there, supervised fine-tuning (SFT) sharpens the model’s behavior using high-quality instructional data, teaching it not just what to say, but how to respond.
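At its core, SFT minimizes the negative log-likelihood of the target response tokens given the instruction. The sketch below illustrates that objective in isolation; the probability values are made up for the example, since in practice they would come from a pretrained transformer's output distribution.

```python
import math

def sft_loss(token_probs):
    """Average negative log-likelihood of the target response tokens.

    token_probs: the probability the model assigned to each correct
    next token of the reference response. SFT updates the model's
    weights to push this loss down, i.e. to make the demonstrated
    response more likely.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for a good reference answer.
confident = sft_loss([0.9, 0.8, 0.95, 0.9])   # low loss
uncertain = sft_loss([0.2, 0.3, 0.1, 0.25])   # high loss
```

A model that already assigns high probability to the demonstrated tokens incurs a small loss, so fine-tuning mostly reshapes behavior on the cases where its defaults diverge from the instructional data.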
While the base model may "know" the correct answer to a medical question, SFT teaches it to surface the most important information, flag uncertainty, and avoid false confidence. That behavior is the difference between a knowledgeable model and a model that can actually be trusted.
After that, the reinforcement learning phase begins. Human annotators or an AI system rank the model's outputs, and that preference data is used to train a reward model (essentially a learned signal for what "good" looks like).
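Reward models are commonly trained with a pairwise (Bradley-Terry style) loss: given a preferred and a rejected completion, the loss pushes the reward of the preferred one above the rejected one. A minimal sketch, with the reward scores as placeholder numbers standing in for a real reward model's outputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    Minimized when the reward model scores the human-preferred
    completion well above the rejected one.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))

# Correct ranking: preferred output scored higher -> small loss.
loss_good = reward_model_loss(2.0, -1.0)
# Inverted ranking: preferred output scored lower -> large loss.
loss_bad = reward_model_loss(-1.0, 2.0)
```

Note that only the *difference* between the two scores matters, which is why reward model scores are meaningful as rankings rather than as absolute values.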
A policy model is then optimized against that signal using a reinforcement learning algorithm such as Proximal Policy Optimization (PPO). PPO uses policy gradient updates, while a KL divergence constraint prevents the model from drifting too far from its original behavior. Alternatively, Direct Preference Optimization (DPO) skips the reward model entirely and folds preference learning directly into the training objective, which is then minimized by ordinary gradient descent.
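The two objectives above can be sketched in a few lines. The first function is PPO's clipped surrogate objective for a single action; the second is the DPO loss, which compares the policy's log-probabilities on chosen vs. rejected completions against a frozen reference model. The numeric inputs are illustrative, not values from any real training run.

```python
import math

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio = pi_new(a|s) / pi_old(a|s). Clipping caps how much a
    single update can move the policy, keeping training stable.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss: preference learning without an explicit reward model.

    The implicit 'reward' is the policy's log-prob margin relative to
    a frozen reference policy, scaled by beta.
    """
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# PPO: a ratio of 2.0 is clipped to 1.2, limiting the update.
obj = ppo_clipped_objective(ratio=2.0, advantage=1.0)

# DPO: the policy prefers the chosen completion more than the
# reference does, so the loss drops below log(2).
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The design difference is visible here: PPO needs a separately trained reward model to compute advantages, whereas DPO needs only the preference pairs and a reference copy of the policy.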
The result is a model that does more than predict the probability distribution of the next token: it learns to generate output that reflects human preferences, domain-specific goals, and real-world constraints.
