Semantic soft bootstrapping enables long-context inference in LLMs without reinforcement learning, achieving accuracy gains of 10.6% and 10%



The ability of large language models to reason about complex problems benefits greatly from chain-of-thought reasoning, but training these models typically relies on computationally expensive reinforcement learning. Purbesh Mitra and Sennur Ulukus at the University of Maryland tackle this challenge with a new approach called semantic soft bootstrapping. Their method avoids reinforcement learning by employing a self-distillation technique in which the model learns about the correctness of its responses from subtly different contextual cues. The process automatically generates training data from raw question-answer pairs, allowing the model to refine its reasoning and achieve substantial accuracy gains on difficult mathematical benchmarks, advancing long-context inference without the limitations of traditional reinforcement learning.

Logit matching improves LLM inference

Scientists have developed a new technique, SLiM (Supervised Logit Matching), to enhance the reasoning capabilities of large language models after initial training. SLiM offers a simpler and more efficient alternative to techniques such as reinforcement learning by directly matching the model's internal output scores, known as logits, against those of a carefully designed "teacher" model. The teacher produces outputs for both correct and incorrect solutions, encoding valuable inferential information even from flawed attempts. The core of SLiM is an offline distillation process, so no continuous interaction or human feedback is required during training.
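As a rough illustration of what "storing the teacher's output scores" might look like, here is a minimal NumPy sketch that converts per-token teacher logits into probability distributions for offline use. The function names and the temperature value are illustrative assumptions, not the paper's API.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the vocabulary axis."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def extract_soft_labels(teacher_logits, temperature=2.0):
    """Turn per-token teacher logits (seq_len x vocab_size) into
    probability distributions ("soft labels") stored for offline training.
    The temperature of 2.0 is a hypothetical choice for illustration."""
    return softmax(teacher_logits, temperature)

# Toy example: logits for 3 answer tokens over a 5-word vocabulary.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(3, 5))
soft_labels = extract_soft_labels(teacher_logits)
```

Because the distillation is offline, these distributions can be computed once per curated example and reused across training epochs.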

The researchers present both correct and incorrect answers to the teacher model, and the student model learns to attend to tokens in the answer sequence so as to match the teacher's logits. Results show that SLiM significantly improves performance on difficult reasoning benchmarks such as GSM8K, MATH500, and AIME2024, outperforming existing techniques. Importantly, SLiM achieves these gains without generating longer responses. The method begins by instructing the base model to generate multiple solution attempts, or "rollouts," for a given problem, then automatically classifying them as correct or incorrect and creating a curated dataset for subsequent training. SSB then constructs a specialized prompt that combines the original problem statement with a representative correct solution and a contrasting incorrect solution. Given this prompt, the base model acts as a "teacher," producing a single, detailed, verified solution that refines and explains the reasoning process.
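The rollout-curation and contrastive-prompt steps above can be sketched in a few lines of Python. Everything here is a toy stand-in: `sample_rollout` deterministically fakes the LLM call, and the prompt wording is an assumption about what such a contrastive prompt could look like, not the paper's exact template.

```python
def sample_rollout(problem, seed):
    """Stand-in for sampling from the base LLM: a deterministic toy that
    answers correctly on even seeds and incorrectly on odd ones."""
    answer = "42" if seed % 2 == 0 else "41"
    return f"reasoning trace for {problem!r} (seed {seed})", answer

def curate_pair(problem, gold_answer, n_rollouts=8):
    """Sample rollouts, auto-label each against the known answer, and
    return one correct and one incorrect representative (or None)."""
    correct, incorrect = [], []
    for seed in range(n_rollouts):
        text, answer = sample_rollout(problem, seed)
        (correct if answer == gold_answer else incorrect).append(text)
    if not correct or not incorrect:
        return None  # cannot build a contrastive pair for this problem
    return correct[0], incorrect[0]

def build_teacher_prompt(problem, good, bad):
    """Contrastive prompt pairing the problem with one correct and one
    incorrect solution, from which the teacher explains and re-solves."""
    return (
        f"Problem: {problem}\n\n"
        f"Correct solution:\n{good}\n\n"
        f"Incorrect solution:\n{bad}\n\n"
        "Explain why the correct solution works and where the incorrect "
        "one fails, then give a verified step-by-step solution."
    )

pair = curate_pair("What is 6 * 7?", "42")
prompt = build_teacher_prompt("What is 6 * 7?", *pair)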

The researchers extract token-level logits from the teacher model's answers and store them as "soft labels" representing probability distributions over possible answer tokens. During training, the student model learns to match the teacher's token distributions using a KL-based distillation loss, which avoids reward hacking while steering the model's output toward the correct response. This setup overcomes the limitations of traditional reinforcement learning by using the same base model as both teacher and student and training the model on its own generated reasoning. The team curated a dataset of paired examples by processing a large number of questions, enabling efficient offline distillation without human intervention or online reinforcement learning loops.

Experiments with the Qwen2.5-3B-Instruct model on the GSM8K dataset show significant accuracy improvements on difficult benchmarks, including MATH500 and AIME2024. Detailed analysis of the training process reveals stable dynamics, with the loss and gradient norm gradually decreasing over time. Notably, completion length did not increase significantly during training, suggesting that stronger reasoning does not necessarily require longer chains of thought or more tokens. The approach builds paired training samples from existing question-answer data and enables efficient offline distillation without human annotation or reinforcement learning. The authors acknowledge that further research is needed into the sample efficiency and scaling behavior of the method with larger models and broader datasets, and they suggest extending it to a wider range of domains.
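The KL-based distillation objective described above can be written down compactly. The sketch below, in plain NumPy, computes a token-averaged KL(teacher || student) between stored teacher soft labels and the student's predicted distributions; the function names and the absence of a temperature term are simplifying assumptions for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_distillation_loss(student_logits, teacher_soft_labels):
    """Token-averaged KL(teacher || student) between stored teacher
    soft labels and the student's predicted token distributions."""
    log_q = log_softmax(student_logits)           # student log-probs
    p = teacher_soft_labels                       # teacher probabilities
    safe_log_p = np.log(np.where(p > 0, p, 1.0))  # treat 0*log(0) as 0
    per_token = np.where(p > 0, p * (safe_log_p - log_q), 0.0).sum(axis=-1)
    return per_token.mean()

# When the student already matches the teacher, the loss is zero.
student = np.array([[2.0, 0.0, -1.0], [0.5, -0.5, 1.5]])
teacher = np.exp(log_softmax(student))
matched = kl_distillation_loss(student, teacher)

# A student whose distribution drifts from the teacher incurs a positive loss.
shifted = kl_distillation_loss(student + np.array([0.0, 3.0, 0.0]), teacher)
```

Minimizing this quantity with respect to the student's parameters pulls each answer-token distribution toward the teacher's, which is what lets the method train offline from stored logits rather than from a scalar reward signal.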

👉 More information
🗞 Semantic Soft Bootstrap: Long Context Inference in LLM without Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2512.05105


