DeepSeek-R1 paper lands on the cover of Nature, with Liang Wenfeng as corresponding author

Machine Learning


What a surprise!

But that's only natural!

The cover of Nature's latest issue features research on DeepSeek-R1.

This is the paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," which DeepSeek first posted on arXiv in January this year. The corresponding author of the Nature paper is Liang Wenfeng.

Paper link: https://www.nature.com/articles/s41586-025-09422-z

In its cover introduction, Nature writes:

If you can train a large model to plan the steps needed to solve a problem, it can often solve the problem better. This kind of "reasoning" resembles the way humans handle more complex problems, but it poses a major challenge for artificial intelligence, typically requiring manual intervention to add labels and annotations. In this week's issue, DeepSeek researchers reveal how models can be trained to reason with minimal human input.

The DeepSeek-R1 model is trained with reinforcement learning: the model receives a high-score reward for answering a mathematical problem correctly and is penalized for answering incorrectly. As a result, it learns to reason, working through problems step by step and revealing those steps, which makes it more likely to arrive at the right answer. This also enables DeepSeek-R1 to self-verify and self-reflect, checking its work before giving an answer to a new question, which improves its performance on programming and graduate-level science problems.

Furthermore, in this issue, Nature praises DeepSeek-R1's openness as a model.

It is worth noting that R1 is considered the first major large language model to pass peer review at a prestigious academic journal.

"This is a very welcome precedent. Without an industry norm of publicly sharing most of the research and development process, it is difficult to assess the potential risks of these systems," said Lewis Tunstall, a machine-learning engineer at Hugging Face and one of the paper's reviewers.

In response to the review comments, the DeepSeek team not only toned down anthropomorphic descriptions of the model in the paper, but also added technical details on training data and safety. "Going through rigorous peer review undoubtedly provides an effective test of a model's reliability and practical value. Other companies should follow this example," said Huan Sun, an AI researcher at Ohio State University.

Clearly, today's AI industry is full of flashy press-conference demos and constantly refreshed leaderboard scores.

However, as the article points out, benchmark tests can be "gamed." Submitting a model's design, methodology, and limitations to independent external experts for review can effectively squeeze out that hype.

Peer review acts as an impartial "gatekeeper." AI companies should move away from self-promotion, like the proverbial melon seller praising his own melons, toward backing their claims with solid evidence and reproducible processes.

Thus, beyond the scientific value of the DeepSeek-R1 paper itself, being the first LLM to pass peer review at a mainstream journal may carry a deeper "procedural value."

Bringing LLMs into an independent peer-review system is an important step from "technology competition" toward "scientific discipline," and it matters for curbing industry hype and building public trust.

Next, let's review this groundbreaking research. We also recommend reading the paper published in Nature, along with its supplementary materials, for more details:

DeepSeek-R1's multi-stage pipeline

Previous work relied heavily on large amounts of supervised data to improve model performance. The DeepSeek team opened up a new line of thinking: large-scale reinforcement learning can significantly improve a model's reasoning ability even without supervised fine-tuning (SFT) as a cold start, and adding a small amount of cold-start data makes the results even better.

To demonstrate this, they developed DeepSeek-R1-Zero. Specifically, DeepSeek-R1-Zero features three distinctive design choices:

First, it uses Group Relative Policy Optimization (GRPO) to reduce training costs. GRPO does not require a critic model of the same size as the policy model; instead, it estimates the baseline directly from group scores.
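The group-relative idea can be sketched in a few lines. This is our illustration, not DeepSeek's implementation: for each prompt, a group of outputs is sampled, and each output's reward is normalized against the group's mean and standard deviation, so no separate value network is needed.

```python
# Sketch of GRPO's group-relative advantage (illustrative names).
# A group of G outputs is sampled per prompt; each output's advantage is
# its reward normalized by the group's mean and standard deviation.
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Return the group-relative advantage for each sampled output."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Example: 4 sampled answers for one prompt, graded 1.0 (correct) or 0.0.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline comes from the group itself, correct answers get positive advantages and incorrect ones negative, without any learned critic.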

Second is the reward design. How rewards are designed determines the direction of RL optimization. DeepSeek's solution is two complementary rule-based reward mechanisms: accuracy rewards and format rewards.
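A toy version of these two rule-based signals might look as follows. This is our sketch under assumed tag conventions, not DeepSeek's exact implementation: the accuracy reward checks the extracted final answer against the ground truth, and the format reward checks that reasoning and answer are wrapped in the expected tags.

```python
import re

# Toy complementary rewards (our sketch): accuracy checks correctness of
# the final answer; format checks the response structure.
def accuracy_reward(response: str, ground_truth: str) -> float:
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == ground_truth else 0.0

def format_reward(response: str) -> float:
    ok = re.fullmatch(r"<think>.*?</think>\s*<answer>.*?</answer>",
                      response.strip(), re.DOTALL)
    return 1.0 if ok else 0.0

resp = "<think>2 + 2 = 4</think><answer>4</answer>"
total = accuracy_reward(resp, "4") + format_reward(resp)  # 2.0
```

Because both checks are deterministic rules rather than learned models, they are cheap to evaluate and hard for the policy to exploit via reward hacking.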

Third is the training template. Building on GRPO and the reward design, the team created a simple template, shown in Table 1, to guide the base model. The template requires DeepSeek-R1-Zero to first produce its reasoning process and then give the final answer. This standardizes the basic structure without restricting or biasing the content, for example, it does not mandate reflective reasoning or any particular problem-solving strategy. This minimal-intervention design makes it possible to clearly observe the model's progress during RL.
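The template is roughly of the following form (paraphrased from the paper's Table 1; see the paper for the exact wording): reasoning goes in `<think>` tags, the final answer in `<answer>` tags, and nothing constrains how the model reasons in between.

```python
# Paraphrase of the Table 1 training template (see the paper for the
# exact wording). The prompt slot is filled in per question.
TEMPLATE = (
    "A conversation between User and Assistant. The User asks a question, "
    "and the Assistant solves it. The Assistant first thinks about the "
    "reasoning process in the mind and then provides the answer. The "
    "reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively. User: {prompt} Assistant:"
)

prompt_text = TEMPLATE.format(prompt="What is 7 * 6?")
```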

During training, DeepSeek-R1-Zero exhibited remarkable self-evolution. It learned to generate hundreds to thousands of reasoning tokens, allowing it to explore and refine its thought process in greater depth.

As training progressed, the model also developed several advanced behaviors, including reflection and the exploration of alternative problem-solving approaches. None of these were pre-programmed; they emerged naturally in the reinforcement-learning environment.

In particular, the team observed an interesting "aha moment": during the middle stage of training, DeepSeek-R1-Zero learned to allocate more thinking time to a problem by re-evaluating its initial approach. This is part of the appeal of reinforcement learning: given the right reward mechanism, a model can develop advanced problem-solving strategies on its own.

However, DeepSeek-R1-Zero has limitations, such as poor readability and answers that mix languages.

Cold-start reinforcement learning

Unlike DeepSeek-R1-Zero, R1 begins with a small amount of long chain-of-thought (CoT) data: to prevent the base model from going through an unstable cold-start phase early in RL training, the team constructed and collected this data and fine-tuned the model on it as the initial RL actor. To collect the data, they explored several approaches: few-shot prompting with long CoT examples, directly prompting the model to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.

DeepSeek collected thousands of cold-start samples and fine-tuned DeepSeek-V3-Base on them as the starting point for RL. Compared with DeepSeek-R1-Zero, the advantages of cold-start data include:

Readability: a key limitation of DeepSeek-R1-Zero is that its output is often hard to read; responses may mix multiple languages or lack the markdown formatting that highlights answers for users. In contrast, when creating cold-start data for R1, the team designed a readable format that includes a summary at the end of each response and filters out reader-unfriendly responses.

Potential: by carefully designing the cold-start data format with human priors, the team observed better performance than DeepSeek-R1-Zero. The team believes iterative training is a better path for reasoning models.

Reasoning-oriented reinforcement learning

After fine-tuning DeepSeek-V3-Base on the cold-start data, the team applied the same large-scale reinforcement-learning process as for DeepSeek-R1-Zero. This stage focuses on improving the model's reasoning ability, particularly on reasoning-intensive tasks such as coding, mathematics, science, and logical reasoning.

To alleviate the mixed-language problem, the team introduced a language-consistency reward during RL training, computed as the proportion of target-language words in the CoT. Although ablation experiments show this alignment causes a slight drop in model performance, the reward makes the output more readable and better aligned with human preferences.
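As a minimal sketch (our illustration, with a deliberately crude language detector): the reward is the fraction of tokens in the chain of thought that belong to the target language. Here we treat ASCII-only tokens as English; a real implementation would use a proper language identifier.

```python
# Minimal sketch of a language-consistency reward: the fraction of tokens
# in the chain of thought that are in the target language. ASCII-only
# tokens stand in for "English" here; this is a toy detector.
def language_consistency_reward(cot: str) -> float:
    tokens = cot.split()
    if not tokens:
        return 0.0
    target = [t for t in tokens if t.isascii()]
    return len(target) / len(tokens)
```

This signal would then be summed with the accuracy reward, which matches the paper's observation that a simple proportion-based term is enough to push outputs toward a single language.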

Finally, the team formed the final reward for reasoning tasks by directly summing the accuracy reward and the language-consistency reward. They then ran reinforcement-learning (RL) training on the fine-tuned model until it converged on the reasoning tasks.

Rejection sampling and supervised fine-tuning

Once reasoning-oriented reinforcement learning converges, the team uses the resulting checkpoint to collect SFT (supervised fine-tuning) data for the next round. At this stage, data from other domains is incorporated to enhance the model's capabilities in writing, role-playing, and other general tasks.

The team curated reasoning prompts and generated reasoning trajectories by performing rejection sampling from the RL training checkpoint above. The dataset was then expanded by merging additional data, some of which used a generative reward model: ground truth and model predictions were fed into DeepSeek-V3 for judgment.

Additionally, the team filtered out chains of thought containing mixed languages, long paragraphs, and code blocks. For each prompt, they sampled multiple answers and kept only the correct ones. In the end, the team collected about 600,000 reasoning-related training samples.
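The sample-then-filter loop can be sketched as follows (an illustration with stand-in functions, not DeepSeek's code): draw several completions per prompt from the RL checkpoint, grade each one, and keep only the correct ones as SFT pairs.

```python
import random

# Illustrative rejection-sampling loop (generate/grade are stand-ins).
def reject_sample(prompts, generate, grade, n_samples=4):
    sft_data = []
    for prompt in prompts:
        for _ in range(n_samples):
            answer = generate(prompt)
            if grade(prompt, answer):  # keep only correct completions
                sft_data.append({"prompt": prompt, "response": answer})
    return sft_data

# Toy stand-ins for the model and the rule-based grader.
random.seed(0)
fake_generate = lambda p: random.choice(["4", "5"])
fake_grade = lambda p, a: a == "4"
data = reject_sample(["2+2?"], fake_generate, fake_grade)
```

Every kept sample is correct by construction, which is what makes the filtered set usable as supervised fine-tuning data.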

Reinforcement learning for all scenarios

To further align the model with human preferences, a second stage of reinforcement learning is applied, aimed at improving the model's helpfulness and harmlessness while continuing to refine its reasoning ability.

Specifically, the researchers trained the model using a combination of reward signals and diverse prompt distributions. For reasoning data, they follow the method of DeepSeek-R1-Zero, using rule-based rewards to guide learning in mathematics, code, and logical reasoning. For general data, they use reward models to capture human preferences in complex, nuanced scenarios.
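The routing between the two reward sources might be sketched like this (our illustration; the domain labels and stub functions are assumptions, not DeepSeek's API): reasoning prompts go through rule-based grading, everything else through a learned preference reward model.

```python
# Sketch of routing stage-2 RL samples to the right reward signal.
def combined_reward(sample, rule_grade, preference_model):
    if sample["domain"] in {"math", "code", "logic"}:
        return rule_grade(sample)        # deterministic, rule-based
    return preference_model(sample)      # learned human-preference score

# Stand-ins: a rule grader and a fixed-score preference model.
rule_grade = lambda s: 1.0 if s["answer"] == s["gold"] else 0.0
pref_model = lambda s: 0.8
r1 = combined_reward({"domain": "math", "answer": "4", "gold": "4"},
                     rule_grade, pref_model)
r2 = combined_reward({"domain": "chat", "answer": "hi"},
                     rule_grade, pref_model)
```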

Ultimately, this integration of reward signals and diverse data distributions yields a model that excels at reasoning while prioritizing helpfulness and harmlessness.

Distillation: giving small models reasoning capabilities

To equip more efficient small models with DeepSeek-R1's reasoning capabilities, the team directly fine-tuned open-source models such as Qwen and Llama on the 800,000 samples curated with DeepSeek-R1. The results show that this simple distillation method significantly improves the reasoning ability of small models.
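Distillation here is plain supervised fine-tuning on the teacher's outputs: the 800,000 R1-generated samples become (prompt, response) pairs, and the student minimizes the negative log-likelihood of the teacher's response tokens. A minimal loss sketch (our simplification in pure Python, not a training loop):

```python
import math

# SFT distillation objective in miniature: negative mean log-likelihood of
# the teacher-written response tokens under the student model.
def sft_loss(token_log_probs):
    """token_log_probs: the student's log-probability for each token of
    the teacher's response. Returns the mean negative log-likelihood."""
    return -sum(token_log_probs) / len(token_log_probs)

# Example: student assigns probabilities 0.5 and 0.25 to the two tokens.
loss = sft_loss([math.log(0.5), math.log(0.25)])
```

The notable finding is that this vanilla SFT objective alone, with no RL on the student, transfers much of the teacher's reasoning behavior.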

Thanks to the technical innovations above, the team's extensive benchmark tests show that DeepSeek-R1 achieves performance comparable to the industry's top SOTA reasoning models.

See the original paper for more information.

This article is from the WeChat official account "Machine Heart" (ID: almosthuman2014), which focuses on AI. It is republished here with permission via 36kr.


