Large language models (LLMs) often produce hallucinations, unsupported content that undermines reliability. Although most prior work frames hallucination detection as a binary classification task, many real-world applications require identifying hallucinated spans, which is a multi-step decision-making process. This naturally raises the question of whether explicit reasoning can help with the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning and show that CoT reasoning is likely to produce at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that encourages reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance problem. Experiments on the RAGTruth benchmark (summarization, question answering, and data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.
† National Taiwan University, Taiwan
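
To make the idea of a span-level reward concrete, the sketch below scores a set of predicted hallucination spans against reference spans using a character-level F1, with a fixed reward when the model correctly predicts that nothing is hallucinated. The abstract does not specify the exact reward used by RL4HS, so the formulation and function names here are illustrative assumptions rather than the paper's method.

```python
# Hypothetical span-level reward: character-level F1 between predicted and
# reference hallucination spans (an assumed formulation, not RL4HS's exact one).
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def span_f1_reward(pred: List[Span], gold: List[Span]) -> float:
    """Scalar reward for one generated prediction."""
    if not pred and not gold:
        return 1.0  # correctly predicted "no hallucination"
    if not pred or not gold:
        return 0.0  # missed every span, or predicted spans where none exist
    # Compare at the character level so partial overlaps earn partial credit.
    pred_chars = {i for s, e in pred for i in range(s, e)}
    gold_chars = {i for s, e in gold for i in range(s, e)}
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


# Example: a prediction that partially covers the reference span gets a
# reward between 0 and 1.
print(span_f1_reward(pred=[(10, 25)], gold=[(12, 30)]))
```

In a group-relative setup such as GRPO, a reward like this would be computed for each of several sampled CoT predictions on the same input, and each sample's advantage would be its reward relative to the group mean; this is a description of the general GRPO recipe, not of RL4HS's specific class-aware variant.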
