Reinforcement Learning Teachers for Test-Time Scaling





Summary

Here we present a new way to teach large language models (LLMs) to reason.



Reinforcement-Learned Teachers (RLTs) are teacher models trained to generate explanations from question-answer pairs, optimized to improve the student model's understanding. Rather than being rewarded for solving problems from scratch, teachers are rewarded for how effectively their explanations help students recover the correct solution.

Many advanced reasoning models, such as DeepSeek R1, follow a two-stage training process: first a teacher model is trained, then its output is used to train the student model that becomes the final product. Traditionally, these teacher models are trained with expensive reinforcement learning (RL), in which the model must learn to solve complex problems from scratch and is rewarded only when it reaches the right answer. This process is slow, expensive, and often narrowly focused, and it requires carefully filtering the teacher's output to ensure students can learn from it effectively.

Our method tackles these challenges directly. Instead of "learning to solve," our new Reinforcement-Learned Teachers (RLTs) "learn to teach": like a great human instructor, they produce clear, step-by-step explanations based on known solutions. The key is to provide the teacher with both the question and its correct answer during training. The teacher is then rewarded not for solving the problem itself, but for how useful its explanation is to students. This feedback loop keeps teacher training aligned with its true purpose (helping students), and it also lets us use small, efficient models that could not solve the problems on their own.
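The teacher's input described above can be sketched in a few lines. This is a minimal, hypothetical illustration: the function name and the prompt template are our own assumptions, not the exact format used in the paper.

```python
# Hypothetical sketch of how an RLT input prompt might be assembled.
# The teacher receives both the question and the known correct answer;
# its only job is to produce a step-by-step explanation linking them.

def build_teacher_prompt(question: str, answer: str) -> str:
    """Assemble a prompt containing the question AND its solution."""
    return (
        "You are a teacher. Explain, step by step, how to get from "
        "the question to the given solution.\n"
        f"Question: {question}\n"
        f"Solution: {answer}\n"
        "Explanation:"
    )

prompt = build_teacher_prompt("What is 17 * 23?", "391")
```

Because the answer is already in the prompt, the teacher never has to solve the problem itself, which is what allows small models to fill this role.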



Average performance across the 2024 American Invitational Mathematics Examination (AIME), competition mathematics, and the graduate-level Q&A benchmark (GPQA).

The results are striking. A compact teacher with only 7B parameters teaches reasoning skills better than models orders of magnitude larger, making it far faster and cheaper to train advanced AI. This holds not only for students of the same size (26.3% on the task suite, versus 18.9% when distilling from the 671B-parameter DeepSeek R1), but also for 32B students much larger than the teacher itself (37.6% vs. 34.4% using R1). We have released the code, a research report, and open models to support broader innovation in AI.

The role of reinforcement learning in reasoning models



With the recent rise of LLMs with advanced reasoning capabilities, such as DeepSeek R1, a powerful technique called reinforcement learning (RL) has come to the fore. Through RL, expensive LLMs learn to solve complex math, coding, and logic problems from scratch, improving by trial and error: past correct attempts are made more likely ("reinforced"). While highly effective, this approach has important drawbacks. Most notably, RL-trained models tend to be narrow and specialized: they are good at the tasks they were trained on, but they fail to generalize to a wider range of applications.

To sidestep this limitation, researchers often use a two-stage "learning to solve" training process. First, a large teacher model is trained with RL to solve problems. Its outputs are then carefully filtered and reused as training data for the student model, which becomes the final product. This second phase is often referred to as distillation or cold-starting. However, two major issues further constrain this process.
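The filtering step in this traditional pipeline can be sketched as follows. The trace format and the helper names are illustrative assumptions, not the actual pipeline from any specific system.

```python
# Simplified sketch of the "solve then distill" filtering step:
# keep only teacher traces whose final answer matches the ground truth.
# The "Answer: ..." convention below is an assumed trace format.

def extract_final_answer(trace: str) -> str:
    """Assume each trace ends with a line like 'Answer: 391'."""
    for line in reversed(trace.strip().splitlines()):
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return ""

def filter_traces(traces, ground_truth):
    """Keep only (question, trace) pairs with a correct final answer."""
    return [
        (q, t) for (q, t) in traces
        if extract_final_answer(t) == ground_truth[q]
    ]

traces = [
    ("17*23", "Multiply 17 by 23.\nAnswer: 391"),
    ("17*23", "Guess.\nAnswer: 400"),
]
kept = filter_traces(traces, {"17*23": "391"})  # only the correct trace survives
```

Note that this filter discards every trace the teacher gets wrong, which is part of what makes the traditional pipeline wasteful: incorrect but partially instructive traces are simply thrown away.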

  • First, reasoning-oriented RL training is only effective for models that can already solve challenging tasks, which limits its applicability to the largest and most expensive teacher LLMs.
  • Second, there is a significant misalignment between the teacher's objective during RL training and its intended role at test time. In "learning to solve," teachers are trained to solve problems from scratch on their own, rather than to produce clear, informative outputs suited to teaching student models.

Learning to teach through reinforcement learning



To overcome the limitations of "learning to solve," we introduce a new class of models inspired by how real teachers work. An RLT is given both the question and the correct answer for each problem in its input prompt: a good teacher does not need to rediscover mathematical theorems, only to explain them. Its job is to connect the dots with helpful, step-by-step explanations that student models can learn from.

What makes this approach powerful is how the teachers are trained. RLTs are trained to maximize the clarity and instructiveness of their explanations, much as human teachers are evaluated by their students' understanding. Specifically, given the teacher's explanation of a problem, if the student model can easily recover the correct solution, that is a signal the teacher did a good job. We quantify student understanding using log-probabilities: a metric akin to how clearly the student followed the lesson. See the paper for full technical details.
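The log-probability idea can be illustrated with a self-contained toy. A real implementation would take the per-token log-probabilities of the solution from a forward pass of the student LLM conditioned on the question and explanation, and the paper's reward includes additional terms; the single-term version and the stand-in "student" below are simplifying assumptions.

```python
import math

# Toy sketch of an RLT-style reward: score an explanation by the
# average log-probability the *student* assigns to the solution tokens
# when conditioned on the explanation. The toy student below stands in
# for a real LLM forward pass.

def toy_student_logprobs(solution_tokens, explanation):
    """Stand-in for a student LLM: solution tokens that appear in the
    explanation get high probability, unseen tokens get low probability."""
    return [
        math.log(0.9) if tok in explanation else math.log(0.1)
        for tok in solution_tokens
    ]

def rlt_reward(solution_tokens, explanation):
    """Mean per-token log-probability of the solution under the student."""
    lps = toy_student_logprobs(solution_tokens, explanation)
    return sum(lps) / len(lps)

solution = ["17", "*", "23", "=", "391"]
good = rlt_reward(solution, "17 times 23: write 17 * 23 = 391, step by step.")
bad = rlt_reward(solution, "It is obvious.")
```

Here `good > bad`: the explanation that walks through the solution makes the solution tokens more predictable for the student, so the teacher earns a higher reward without ever being asked to solve the problem itself.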

Our new "learning to teach" approach addresses both problems of the traditional "learning to solve" framework. First, the new training loop aligns teacher training much more closely with its true purpose: helping students via distillation/cold-starting. Second, by supplying both the question and its correct answer, we can use small, efficient teacher models that could not solve the problems on their own. Together, these properties make our method faster, cheaper, and more effective at producing strong reasoning students.

The unreasonable effectiveness of small, specialized teachers

We put this approach to the test by comparing our small RLT model, with only 7 billion parameters, against the best-known methods in the field. These competing methods use much larger models such as DeepSeek R1 and QwQ, combined with additional help from tools such as GPT-4o-mini to clean up their outputs before training the student model.

Still, our much smaller RLT outperformed them across multiple challenging benchmarks in mathematics and science (see table below, top group). Using the same Qwen2.5 student models, the same questions, and the same evaluation setup, the RLT produced better results with far less computation, setting a new bar for both efficiency and effectiveness in teaching reasoning to language models.

The results were just as impressive when we scaled up the students. Our 7B teacher successfully trained a 32B student model, more than four times its own size. This shows that small, specialized teachers can transfer deep reasoning skills even to much larger students.



Performance on the 2024 American Invitational Mathematics Examination (AIME), competition mathematics, and the graduate-level Q&A benchmark (GPQA). Our RLTs deliver performance improvements and complement traditional reinforcement learning for problem solving.

We also found that our approach complements traditional RL. When RLT distillation is used as a starting point, it helps the student model reach even higher levels of performance with subsequent RL (see plot below). And from a cost perspective, the difference is dramatic: training the 32B student took less than a day on a single compute node, whereas traditional RL would take months on the same hardware.



Average performance across the 2024 American Invitational Mathematics Examination (AIME), competition mathematics, and the graduate-level Q&A benchmark (GPQA). Our RLTs deliver performance improvements and complement traditional reinforcement learning for problem solving.

Qualitative inspection reveals clear differences between the explanations produced by our RLTs and the distillation traces from DeepSeek R1. The traditional RL model's outputs often appear to rely on external tools such as calculators and include off-topic linguistic patterns such as joking asides. In contrast, the RLT's explanations are more focused, use clear and direct language, and even add logical steps that R1 omitted. These improvements carry over to the student language model's learning, mirroring the concision and clarity of expert educators.



Compared with DeepSeek R1's reasoning traces, the RLT's explanations avoid confusing language and add extra logical steps to help the student.

The future: a new frontier for more capable, cheaper reasoning models

Our RLT framework rethinks how we build reasoning models. Rather than training models to solve problems from scratch, we train them to clearly explain known solutions, just like skilled human educators. This shift allows RL to be applied in domains that language models would otherwise find too difficult to tackle directly.

RLTs can slash the training costs of advanced models. Instead of relying on massive systems at every stage, we can train small, specialized teachers and use them to efficiently teach much larger models. This inverts the traditional scaling paradigm: the heaviest lifting is handled by a compact, affordable model that unlocks powerful capabilities in the students it trains.

The framework also hints at something even more intriguing for the future: a model that plays both the teacher and the student role at once. By generating explanations for its own benefit, such a system could learn to teach itself better over time. This idea echoes the vision of the Darwin Gödel Machine, where models evolve through self-reflection and recursive self-improvement.



Sakana AI

Are you interested in joining us? For more information, see our career opportunities.


