Researchers are increasingly recognizing the potential of large language models to encode abstract concepts within their learned features. Goodfire AI’s Aaditya Vikram Prasad, Connor Watts, and Jack Merullo, along with Dhruvil Gala, Owen Lewis, and Thomas McGrath, demonstrate a new application of these capabilities as a source of scalable supervision for open-ended tasks. Their work tackles the critical problem of hallucinations in language models by introducing RLFR, a reinforcement learning pipeline that uses feature-based rewards to identify and correct uncertain claims. The approach significantly reduces hallucination rates, cutting them by 58% on Gemma-3-12B-IT, while providing a path toward more interpretable and controllable AI systems and marking a shift in how model understanding can be leveraged to improve learning.
Leveraging internal factuality representations to reduce language model hallucinations
Researchers have developed a new way to reduce inaccuracies in large language models by leveraging internal features that represent concepts such as factuality. The study introduces RLFR (Reinforcement Learning from Feature Rewards), a pipeline that repurposes these internal model features as a scalable reward signal for open-ended tasks.
Traditionally, such features have been used to monitor and steer model behavior at test time, but this work shows their potential as direct supervision signals during training. The core innovation lies in translating a model’s internal “beliefs”, as measured through a probing framework, into dense, inexpensive rewards for reinforcement learning.
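To make this concrete, here is a minimal sketch of a feature-based reward, assuming a linear probe over one layer’s hidden states; the function and tensor names are illustrative and not taken from the paper.

```python
import torch

# Minimal sketch of a feature-based reward, assuming a linear probe
# (probe_w, probe_b) trained separately to read out "factuality" from
# the model's hidden states. All names are illustrative, not the
# paper's actual implementation.
def feature_reward(hidden_states: torch.Tensor,
                   probe_w: torch.Tensor,
                   probe_b: torch.Tensor) -> torch.Tensor:
    """hidden_states: (seq_len, d_model) activations from one layer.
    Returns a dense per-token reward in [0, 1]."""
    logits = hidden_states @ probe_w + probe_b  # (seq_len,)
    return torch.sigmoid(logits)                # higher = more confidently factual
```

Because the probe can score every token or claim rather than only a finished response, the resulting reward is dense, which is part of what makes it cheap enough to use inside a reinforcement learning loop.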
The pipeline specifically targets the persistent problem of hallucination, teaching models to identify and correct potentially incorrect statements. By flagging candidate hallucinated claims, the system trains the model to intervene and revise its response whenever uncertainty about factual accuracy is detected.
Furthermore, feature-based rewards enable efficient, scalable test-time computation, steering the model toward more reliable outputs. Applying the pipeline to Gemma-3-12B-IT produced a policy that hallucinates 58% less than the original model while maintaining performance on established benchmarks.
The study introduces a new paradigm in which supervision is expressed in the language of model features rather than relying on external validation. A key component is a decomposed probing protocol that flags hallucinations and rewards the model for subsequently retracting and correcting them. This approach proved approximately 90 times cheaper per reward computation than ground-truth supervision, a significant computational saving. The work paves the way for more reliable language models that can tackle complex, open-ended tasks by making effective use of their own internal representations.
Reinforcement learning from internal model features for hallucination reduction
The research reduced model hallucinations through a decomposed probing protocol that exploits model features. The pipeline first detects potential hallucinations by reading out internal model features, then rewards retractions and revisions that address the flagged claims.
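One plausible way to express this two-stage protocol as reward shaping is sketched below; the Claim structure, threshold, and coefficients are hypothetical, chosen only to illustrate the incentive structure (reward factual claims, reward retraction of uncertain ones, penalize uncertain claims left standing).

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    probe_score: float  # probe's factuality estimate in [0, 1]
    retracted: bool     # did the policy later hedge or correct this claim?

def shaped_reward(claims: list[Claim], threshold: float = 0.5) -> float:
    """Illustrative reward shaping, not the paper's actual coefficients:
    reward claims the probe deems factual, partially reward retraction
    of uncertain claims, penalize uncertain claims left standing."""
    total = 0.0
    for c in claims:
        if c.probe_score >= threshold:
            total += 1.0        # confidently factual claim
        elif c.retracted:
            total += 0.5        # uncertain claim that was retracted/corrected
        else:
            total -= 1.0        # uncertain claim asserted without correction
    return total / max(len(claims), 1)
```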
Specifically, the study implemented a reinforcement learning (RL) pipeline, RLFR, that uses these features as the reward function. The work centered on Gemma-3-12B-IT and produced a policy that is markedly less prone to hallucination: experiments showed the resulting policy hallucinated 58% less than the original model while maintaining performance on established benchmarks.
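The summary does not specify RLFR’s underlying RL algorithm, so the following is only a skeleton of how a dense feature reward could plug into a generic policy-gradient update; every name here is an assumption.

```python
import torch

# Generic REINFORCE-style update driven by a dense feature reward.
# This is an assumption about the general shape of such a pipeline,
# not RLFR's actual training algorithm.
def policy_gradient_step(log_probs: torch.Tensor,
                         rewards: torch.Tensor,
                         optimizer: torch.optim.Optimizer) -> float:
    """log_probs: (seq_len,) log-probabilities of the sampled tokens,
    with gradients attached. rewards: (seq_len,) per-token feature
    rewards (e.g., from feature_reward above)."""
    baseline = rewards.mean()                        # simple variance-reduction baseline
    loss = -((rewards - baseline) * log_probs).sum() # push up probability of rewarded tokens
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```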
Feature-based rewards proved to be an efficient alternative to external verifiers, at approximately 90 times lower cost per reward computation than ground-truth supervision. Beyond enabling RL, the feature-based rewards also facilitate scalable test-time computation.
Standard techniques such as Best-of-N sampling were adopted to improve the performance of the trained policies, using the reward function to select the most reliable completion from a set of generated outputs. The work highlights that features encode abstract concepts such as factuality and intent and have traditionally been used for monitoring and steering, but proposes their use as scalable supervision for open-ended tasks. This framework represents a new direction in interpretability research, positioning features as supervision signals for deliberately instilling desired properties in models.
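Best-of-N selection with a feature-based reward is straightforward to sketch; here, generate and score_completion are placeholders, with score_completion assumed to aggregate the probe readout (for example, the mean per-token reward) over a completion.

```python
def best_of_n(prompt: str, generate, score_completion, n: int = 8) -> str:
    """Sample n completions for a prompt and return the one the
    feature-based reward ranks as most reliable."""
    completions = [generate(prompt) for _ in range(n)]
    return max(completions, key=score_completion)
```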
Reinforcement learning from internal features significantly reduces hallucinations in large language models
The new reinforcement learning pipeline achieved a 58% reduction in hallucination rates. The study introduces RLFR (Reinforcement Learning from Feature Rewards), which uses internal model features as a reward function for open-ended tasks. A novel probing framework identifies candidate hallucinated claims, allowing the model to intervene and correct its completion when factual uncertainty is detected.
Applied to Gemma-3-12B-IT, the pipeline markedly reduced hallucinated responses while maintaining performance on established benchmarks. This work establishes a basis for supervision expressed in the language of model features and represents a shift in exploiting interpretability to teach complex behavior.
The work focuses on mitigating hallucinations, a persistent challenge in large language models, by reinforcing factuality through reward signals derived from internal feature readouts. These readouts are calibrated to reflect the model’s confidence in the validity of its claims, providing a dense and inexpensive supervision signal.
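The summary does not say how these readouts are fit or calibrated; a common baseline in the interpretability literature is a logistic probe over activations with known labels, sketched here as an assumption rather than the paper’s method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical probe-fitting step.
# X: (n_claims, d_model) activations gathered at claim positions;
# y: (n_claims,) binary labels (1 = factual, 0 = hallucinated).
def fit_factuality_probe(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, y)
    return probe

# probe.predict_proba(X_new)[:, 1] then provides the dense,
# confidence-like readout used as the reward signal.
```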
The resulting policy shows a 58% reduction in hallucination propensity and significantly improved factual accuracy compared to the original model. The pipeline also facilitates scalable test-time computation based on the same reward features extracted from the model. The probing framework effectively measures the model’s beliefs about concepts relevant to the downstream task, such as the factual accuracy of a statement.
This allows the creation of reward signals that directly target open-ended behavior without relying on costly external validation. By repurposing these features as fine-grained supervision, the work avoids the limitations of using large language models as judges, which can be expensive and poorly calibrated.
The study introduces new applications of model features beyond their traditional use for monitoring and steering at test time. The approach enables reinforcement learning for behaviors that are difficult or impossible to verify directly, and opens the possibility of training models to exhibit more desirable and complex properties. The pipeline’s success with Gemma-3-12B-IT suggests broader applicability across a variety of open-ended tasks and model architectures.
Reducing hallucinations through reinforcement learning from learned feature rewards
Large language models can learn features that represent abstract concepts such as factual accuracy and intent. These features are typically used to monitor and steer model behavior at inference time. Recent research introduces an alternative application: using these features as a form of scalable supervision for more general tasks.
Specifically, this work addresses the problem of reducing hallucinated output, i.e., factually incorrect statements, a behavior that is desirable to curb but difficult to train against directly. A reinforcement learning pipeline called RLFR was developed to use these learned features as a reward function.
The pipeline incorporates new methods for identifying potentially hallucinated claims, allowing the model to intervene and correct its output when uncertainty about factual correctness is detected. The system also enables efficient test-time computation based on reward features read out from the model’s internal representations.
Applying this process to the Gemma-3-12B-IT model resulted in a policy that reduced hallucination rates by 58% without compromising performance on established benchmarks. The study introduces a novel approach that exploits interpretability inside language models to facilitate learning on open-ended tasks.
An important finding is that a language model’s ability to encode abstract concepts can be repurposed as a reward for training the model to make fewer factual errors. This offers a path to more reliable generated text without extensive human annotation or task-specific training data.
The authors acknowledge that the current pipeline relies on a probing framework to identify hallucinated claims, which is imperfect and may introduce its own biases. Future research directions include exploring more robust methods for identifying and correcting inaccuracies and extending the approach to open-ended tasks beyond hallucination mitigation.
