DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

GRPO

GRPO⁹ is the RL algorithm used to train DeepSeek-R1-Zero and DeepSeek-R1. It was originally proposed to simplify the training process and reduce the resource consumption of proximal policy optimization (PPO)³¹, which is widely used in the RL stage of LLMs³². The GRPO pipeline is shown in Extended Data Fig. 2.

For each question q, GRPO samples a group of outputs {o₁, o₂, …, o_G} from the old policy ${\pi}_{{\theta}_{{\rm{old}}}}$ and then optimizes the policy model π_θ by maximizing the following objective:

$$\begin{array}{ll}{{\mathcal{J}}}_{{\rm{GRPO}}}(\theta) = & {\mathbb{E}}\left[q \sim P(Q),{\{{o}_{i}\}}_{i=1}^{G} \sim {\pi}_{{\theta}_{{\rm{old}}}}(O|q)\right]\\ & \frac{1}{G}\mathop{\sum}\limits_{i=1}^{G}\left(\min\left(\frac{{\pi}_{\theta}({o}_{i}|q)}{{\pi}_{{\theta}_{{\rm{old}}}}({o}_{i}|q)}{A}_{i},\,\text{clip}\left(\frac{{\pi}_{\theta}({o}_{i}|q)}{{\pi}_{{\theta}_{{\rm{old}}}}({o}_{i}|q)},1-\epsilon,1+\epsilon\right){A}_{i}\right)-\beta\,{{\mathbb{D}}}_{{\rm{KL}}}({\pi}_{\theta}||{\pi}_{{\rm{ref}}})\right),\end{array}$$

(1)

$${{\mathbb{D}}}_{{\rm{KL}}}({\pi}_{\theta}||{\pi}_{{\rm{ref}}})=\frac{{\pi}_{{\rm{ref}}}({o}_{i}|q)}{{\pi}_{\theta}({o}_{i}|q)}-\log\frac{{\pi}_{{\rm{ref}}}({o}_{i}|q)}{{\pi}_{\theta}({o}_{i}|q)}-1,$$

(2)

where π_ref is a reference policy, ϵ and β are hyperparameters, and A_i is the advantage, calculated using the group of rewards {r₁, r₂, …, r_G} corresponding to the outputs within each group:

$${A}_{i}=\frac{{r}_{i}-{\rm{mean}}(\{{r}_{1},{r}_{2},\cdots ,{r}_{G}\})}{{\rm{std}}(\{{r}_{1},{r}_{2},\cdots ,{r}_{G}\})}.$$

(3)
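As a minimal sketch of equation (3), the snippet below normalizes the rewards of one sampled group. The function name `group_advantages`, the use of the population standard deviation and the small `eps` guard against a zero denominator are assumptions made for illustration; the text does not specify them.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, following equation (3)."""
    mean = statistics.fmean(rewards)
    # Assumption: population standard deviation. The source does not say
    # whether the population or the sample estimator is used.
    std = statistics.pstdev(rewards)
    # Assumption: eps avoids division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Four outputs for one question, two judged correct and two incorrect.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantage is computed relative to the group mean, a group in which every output receives the same reward yields zero advantage for all outputs.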

A comparison of GRPO and PPO is given in Supplementary Information, Section 1.3.
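The objective in equations (1) and (2) can be sketched for a single group as follows. This is an illustrative reading, not the training code: it works on whole-output log-probabilities exactly as the equations are written, and the names `kl_estimate` and `grpo_objective` as well as the example values of `eps` and `beta` are assumptions.

```python
import math


def kl_estimate(logp_theta, logp_ref):
    """Equation (2): r - log(r) - 1 with r = pi_ref / pi_theta."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


def grpo_objective(logp_theta, logp_old, logp_ref, advantages, eps, beta):
    """Equation (1) for one group: mean clipped surrogate minus KL penalty."""
    total = 0.0
    for lt, lo, lr, adv in zip(logp_theta, logp_old, logp_ref, advantages):
        ratio = math.exp(lt - lo)  # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        surrogate = min(ratio * adv, clipped * adv)
        total += surrogate - beta * kl_estimate(lt, lr)
    return total / len(advantages)


# Hypothetical numbers: one output whose probability rose by a factor e^0.5.
value = grpo_objective([-1.0], [-1.5], [-1.5], [1.0], eps=0.2, beta=0.0)
```

With a positive advantage and a ratio above 1 + ϵ, the surrogate is capped at (1 + ϵ)·A_i, so `value` equals 1.2 in this example.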

Reward design

The reward is the source of the training signal and determines the direction of RL optimization. For DeepSeek-R1-Zero, we use rule-based rewards to deliver precise feedback for data in the mathematical, coding and logical reasoning domains. For DeepSeek-R1, we extend this approach by incorporating both rule-based rewards for reasoning-oriented data and model-based rewards for general data, thereby increasing the adaptability of the learning process across diverse domains.

Rule-based rewards

Our rule-based reward system consists mainly of two types of rewards: accuracy rewards and format rewards.

Accuracy rewards evaluate whether the response is correct. For example, for mathematical problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), which allows reliable rule-based verification of correctness. Similarly, for code competition prompts, a compiler can be used to evaluate the model's responses against a suite of predefined test cases, thereby generating objective feedback on correctness.

Format rewards complement the accuracy reward by enforcing specific formatting requirements. In particular, the model is incentivized to encapsulate its reasoning process within designated tags, specifically <think> and </think>. This explicitly delineates the model's thought process, improving interpretability and facilitating subsequent analysis.

$${{\rm{reward}}}_{{\rm{rule}}}={{\rm{reward}}}_{{\rm{acc}}}+{{\rm{reward}}}_{{\rm{format}}}$$

(4)

The accuracy reward and the format reward are combined with equal weight. Notably, we refrain from applying neural reward models, whether outcome-based or process-based, to reasoning tasks. This decision is based on our observation that neural reward models are susceptible to reward hacking during large-scale RL. Furthermore, retraining such models requires substantial computational resources and introduces additional complexity into the training pipeline, thereby complicating the overall optimization process.
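A rule-based reward of this kind might look like the sketch below for a mathematical prompt. Several details are assumptions for illustration only: the `<think>`/`</think>` tag names, the `\boxed{...}` answer convention, exact string matching against the reference answer, and the 0/1 value of each component. Only the equal-weight sum comes from the text.

```python
import re

# Assumption: reasoning must open the response and sit inside <think> tags.
THINK_PATTERN = re.compile(r"^<think>.*?</think>", re.DOTALL)
# Assumption: the final answer is given as \boxed{...} without nested braces.
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(response):
    """1.0 if the response starts with a complete <think>...</think> block."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response, reference_answer):
    """1.0 if the last boxed answer matches the reference exactly."""
    answers = BOXED_PATTERN.findall(response)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == reference_answer.strip() else 0.0


def rule_reward(response, reference_answer):
    """Equal-weight sum of the accuracy and format rewards."""
    return accuracy_reward(response, reference_answer) + format_reward(response)


good = "<think>2 + 2 = 4</think> The answer is \\boxed{4}."
bad = "The answer is \\boxed{5}."
```

A real verifier would need to normalize equivalent answers (for example 0.5 and 1/2), and code prompts would be scored by running test cases rather than by string matching.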

Model-based rewards

For general data, we rely on reward models to capture human preferences in complex and nuanced scenarios. We build on the DeepSeek-V3 pipeline and use a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, so that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate potential risks, biases or harmful content that may arise during the generation process.

Helpful reward model

For helpful reward model training, we first generate preference pairs by prompting DeepSeek-V3 using the Arena-Hard prompt format listed in Supplementary Information, Section 2.2, where each pair consists of a user query along with two candidate responses. For each preference pair, we query DeepSeek-V3 four times, randomly assigning the responses as either Response A or Response B to mitigate positional bias. The final preference score is determined by averaging the four independent judgements, and we retain only those pairs whose score difference (Δ) exceeds 1 to ensure meaningful distinctions. In addition, to minimize length-related bias, we ensure that the chosen and rejected responses have comparable lengths across the dataset. In total, we curated 66,000 data pairs for training the reward model. The prompts used in this dataset are all non-reasoning questions and are sourced either from publicly available open-source datasets or from users who have explicitly consented to share their data for the purpose of model improvement. The architecture of the reward model is consistent with that of DeepSeek-R1, with the addition of a reward head designed to predict scalar preference scores.

$${{\rm{reward}}}_{{\rm{helpful}}}={{\rm{RM}}}_{{\rm{helpful}}}({{\rm{Response}}}_{{\rm{A}}},{{\rm{Response}}}_{{\rm{B}}})$$

(5)

The helpful reward model was trained with a batch size of 256 and a learning rate of 6 × 10⁻⁶ for a single epoch over the training dataset. The maximum sequence length during training is set to 8,192 tokens, whereas no explicit limit is imposed during reward model inference.
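The pair-construction procedure described above (four judgements, randomized A/B positions, averaging, and a threshold on the score difference) can be sketched as follows. The `judge` callable is a hypothetical stand-in for a DeepSeek-V3 query that returns the score of Response A minus the score of Response B; the actual prompt format and scoring scale are given in the Supplementary Information and are not reproduced here. `length_judge` is a toy judge used only to make the example runnable.

```python
import random
import statistics


def score_pair(judge, query, resp_1, resp_2, n_queries=4, rng=None):
    """Average preference for resp_1 over resp_2 across randomized orderings."""
    rng = rng or random.Random(0)
    diffs = []
    for _ in range(n_queries):
        swap = rng.random() < 0.5  # randomize which response is shown as A
        a, b = (resp_2, resp_1) if swap else (resp_1, resp_2)
        diff = judge(query, a, b)  # hypothetical: score(A) - score(B)
        diffs.append(-diff if swap else diff)
    return statistics.fmean(diffs)


def build_pair(judge, query, resp_1, resp_2, min_delta=1.0):
    """Keep the pair only if the averaged score difference exceeds min_delta."""
    delta = score_pair(judge, query, resp_1, resp_2)
    if abs(delta) <= min_delta:
        return None
    chosen, rejected = (resp_1, resp_2) if delta > 0 else (resp_2, resp_1)
    return {"query": query, "chosen": chosen, "rejected": rejected}


def length_judge(query, a, b):
    """Toy judge for illustration only: prefers the longer response."""
    return float(len(a) - len(b))


pair = build_pair(length_judge, "q", "a much longer answer", "hi")
```

Undoing the swap before averaging means a judge that systematically favours one position contributes errors of opposite sign across orderings, which is the point of the randomization.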

Safety reward model

To assess and improve the safety of the model, we curated a dataset of 106,000 prompts with model-generated responses annotated as “safe” or “unsafe” according to predefined safety guidelines. Unlike the pairwise loss used for the helpful reward model, the safety reward model was trained using a point-wise methodology to distinguish between safe and unsafe responses. The training hyperparameters are the same as those of the helpful reward model.

$${{\rm{reward}}}_{{\rm{safety}}}={{\rm{RM}}}_{{\rm{safety}}}({\rm{Response}})$$

(6)
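To make the distinction between pairwise and point-wise training concrete, the sketch below contrasts two common loss forms on scalar reward-head outputs. The text does not state the exact losses used, so both the Bradley-Terry style pairwise loss and the binary cross-entropy point-wise loss are assumptions chosen as typical examples.

```python
import math


def pairwise_loss(score_chosen, score_rejected):
    """Assumed Bradley-Terry style loss: -log sigmoid(s_chosen - s_rejected)."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))


def pointwise_loss(score, is_safe):
    """Assumed binary cross-entropy on one response, with label 1 = safe."""
    p_safe = 1.0 / (1.0 + math.exp(-score))
    return -math.log(p_safe) if is_safe else -math.log(1.0 - p_safe)


# A pairwise example needs two responses; a point-wise example needs one label.
helpful_example = pairwise_loss(score_chosen=2.0, score_rejected=0.0)
safety_example = pointwise_loss(score=3.0, is_safe=True)
```

The pairwise form only constrains the difference between two scores, whereas the point-wise form ties each score to an absolute safe/unsafe decision, which suits data labelled per response.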

For general queries, each instance is categorized as belonging to either the safety dataset or the helpfulness dataset. The general reward, reward_general, assigned to each query corresponds to the respective reward defined for the associated dataset.

Training details

Training details of DeepSeek-R1-Zero

To train DeepSeek-R1-Zero, we set the learning rate to 3 × 10⁻⁶, the Kullback-Leibler (KL) coefficient to 0.001 and the sampling temperature to 1 for rollout. For each question, we sample 16 outputs with a maximum length of 32,768 tokens before the 8.2k step and 65,536 tokens afterwards. As a result, both the performance and the response length of DeepSeek-R1-Zero jump significantly at the 8.2k step, with training continuing for a total of 10,400 steps, corresponding to 1.6 training epochs. Each training step consists of 32 unique questions, resulting in a training batch size of 512 per step. Every 400 steps, we replace the reference model with the latest policy model. To accelerate training, each rollout generates 8,192 outputs, which are randomly split into 16 mini-batches and trained for only a single inner epoch.
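The batch figures quoted above fit together arithmetically, as the short check below shows. The variable names are illustrative, and reading each mini-batch as one training step is an interpretation of the text rather than something it states outright.

```python
questions_per_step = 32
outputs_per_question = 16
# 32 questions x 16 sampled outputs each = 512 outputs per training step.
batch_size = questions_per_step * outputs_per_question

outputs_per_rollout = 8192
# 8,192 rollout outputs split into mini-batches of 512 gives 16 mini-batches.
mini_batches = outputs_per_rollout // batch_size
# One rollout therefore covers 8,192 / 16 = 512 distinct questions.
questions_per_rollout = outputs_per_rollout // outputs_per_question

print(batch_size, mini_batches, questions_per_rollout)
```

Under that interpretation, a single rollout phase supplies the data for 16 consecutive optimization steps, which amortizes the cost of generation.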

Training details of the first RL stage

In the first stage of RL, we set the learning rate to 3 × 10⁻⁶, the KL coefficient to 0.001, the GRPO clip ratio ϵ to 10 and the sampling temperature to 1 for rollout. For each question, we sample 16 outputs with a maximum length of 32,768 tokens. Each training step consists of 32 unique questions, resulting in a training batch size of 512 per step. Every 400 steps, we replace the reference model with the latest policy model. To accelerate training, each rollout generates 8,192 outputs, which are randomly split into 16 mini-batches and trained for only a single inner epoch. However, to mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target-language words in the chain of thought (CoT).

$${{\rm{reward}}}_{{\rm{Language}}}=\frac{{\rm{Num}}({{\rm{Words}}}_{{\rm{target}}})}{{\rm{Num}}({\rm{Words}})}$$

(7)
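Equation (7) can be sketched as a simple word-level ratio. How words are segmented and how a word is assigned to a language are not specified in the text, so the regular-expression tokenizer and the toy `is_english_word` predicate below are assumptions for illustration only.

```python
import re


def language_consistency_reward(cot, is_target_word):
    """Fraction of words in the chain of thought that are in the target language."""
    # Assumption: a "word" is a maximal run of word characters.
    words = re.findall(r"\w+", cot)
    if not words:
        return 0.0
    return sum(1 for w in words if is_target_word(w)) / len(words)


def is_english_word(word):
    """Toy predicate (assumption): ASCII alphabetic tokens count as English."""
    return word.isascii() and word.isalpha()


reward = language_consistency_reward("the answer 是 four", is_english_word)
```

In this example three of the four tokens are classified as English, so the reward is 0.75. A practical implementation would need proper segmentation for languages written without spaces.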

Although ablation experiments in Supplementary Information, Section 2.6 show that such alignment results in a slight degradation in model performance, this reward aligns with human preferences and makes the output more readable. We apply the language consistency reward to both reasoning and non-reasoning data by adding it directly to the final reward.

Note that the clip ratio plays an important role in training. A lower value can truncate the gradients of a large number of tokens, thereby degrading the performance of the model, whereas a higher value may cause instability during training. Details of the RL data used in this stage are provided in Supplementary Information, Section 2.3.
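The effect of the clip ratio on gradients follows from the min/clip form of the GRPO surrogate: a term contributes no gradient once its probability ratio moves past the clip boundary in the direction favoured by its advantage. The sketch below counts such terms for two values of ϵ; the function names and the example ratios are hypothetical.

```python
def gradient_is_clipped(ratio, advantage, eps):
    """True when the clipped surrogate is flat in the ratio for this term."""
    if advantage > 0:
        return ratio > 1.0 + eps
    if advantage < 0:
        return ratio < 1.0 - eps
    # Zero advantage gives zero gradient too, but not because of clipping.
    return False


def clipped_fraction(ratios, advantages, eps):
    """Share of terms whose gradient is removed by clipping."""
    flags = [gradient_is_clipped(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(flags) / len(flags)


ratios = [0.5, 0.9, 1.1, 3.0]          # hypothetical pi_theta / pi_theta_old
advantages = [-1.0, -1.0, 1.0, 1.0]
tight = clipped_fraction(ratios, advantages, eps=0.2)
loose = clipped_fraction(ratios, advantages, eps=10.0)
```

With ϵ = 0.2, half of these terms are clipped; with ϵ = 10, none are, and the lower bound 1 − ϵ is negative, so a ratio can never fall below it.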

Training details of the second RL stage

Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we follow the methodology outlined for DeepSeek-R1-Zero, which uses rule-based rewards to guide learning in the mathematical, coding and logical reasoning domains. During the training process, we observe that the CoT often exhibits language mixing, particularly when the RL prompts involve several languages. For general data, we use reward models to guide training. Ultimately, integrating reward signals with diverse data distributions enables us to develop a model that not only excels in reasoning but also prioritizes helpfulness and harmlessness. Given a batch of data, the reward can be formulated as follows:

$${\rm {reward}} = {{\rm {reward}}}_{{\rm {Reasoning}}}+{{\rm {reward}}}_{{\rm {general}}}+{{\rm {reward}}}_{{\rm {Language}}} $$

(8)

where

$${{\rm{reward}}}_{{\rm{Reasoning}}}={{\rm{reward}}}_{{\rm{rule}}}$$

(9)

$${{\rm{reward}}}_{{\rm{general}}}={{\rm{reward}}}_{{\rm{reward\_model}}}+{{\rm{reward}}}_{{\rm{format}}}$$

(10)
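One way to read equations (8) to (10) at the level of a single sample is sketched below: a reasoning sample receives the rule-based reward, a general sample receives the reward-model score plus a format term, and the language consistency reward is added in both cases. The dictionary keys, the `kind` field, and the treatment of the inapplicable term as zero are assumptions for illustration, as is taking the second term of equation (10) to be a format reward.

```python
def total_reward(sample):
    """Combine reward components for one sample, following equations (8)-(10)."""
    if sample["kind"] == "reasoning":
        # Equation (9): reasoning reward is the rule-based reward.
        task_reward = sample["reward_rule"]
    else:
        # Equation (10): reward-model score plus an assumed format term.
        task_reward = sample["reward_model"] + sample["reward_format"]
    # Equation (8): the language consistency reward is added directly.
    return task_reward + sample["reward_language"]


reasoning_sample = {"kind": "reasoning", "reward_rule": 2.0, "reward_language": 0.5}
general_sample = {
    "kind": "general",
    "reward_model": 0.25,
    "reward_format": 1.0,
    "reward_language": 0.75,
}
totals = [total_reward(reasoning_sample), total_reward(general_sample)]
```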

The second stage of RL retains most of the parameters of the first stage. The key difference is a reduced temperature of 0.7, as we find that higher temperatures in this stage lead to incoherent generation. The stage comprises a total of 1,700 training steps, during which general instruction data and preference-based rewards are incorporated only in the final 400 steps. We find that more training steps with the model-based preference reward signal may lead to reward hacking, which is documented in Supplementary Information, Section 2.5.
