Reasoning about videos remains a major hurdle for artificial intelligence systems: current multimodal language models often produce seemingly logical explanations that lack coherence or a strong connection to the visual content itself. Mohammad Maas, Hanouna Rashid, and Fahad Shahbaz Khan of the Mohamed bin Zayed University of Artificial Intelligence, along with Salman Khan of the same university and the Australian National University, tackle this problem with a new approach that strengthens consistency and grounded reasoning. Their study identified key weaknesses in existing models using diagnostic metrics that assess the consistency between reasoning and answers and the reliance on visual cues, revealing a tendency to favor linguistic shortcuts over the actual video content. To overcome this, the team developed a reinforcement learning technique that refines both the temporal grounding and the logical flow of the model's reasoning, resulting in Video-R2, a model that consistently achieves higher accuracy and reliability in video understanding across a variety of benchmarks.
Early vision-language models and approaches
In recent years, significant advances have been made in vision-language models, systems that understand visual information and combine it with textual explanations. Models such as LLaVA and MiniGPT-4 pair the strengths of large language models with visual processing capabilities, while others such as Ferret and InternVL3 focus on tasks like identifying objects in images and videos. Video-LLaMA extends this functionality to understanding video content, and benchmarks such as MMMU and MMVU are used to assess multidisciplinary and multimodal understanding. Researchers are also exploring techniques to improve reasoning, such as chain-of-thought, tree-of-thought, and ReAct prompting, which encourage models to break down complex problems into smaller steps.
Understanding how these models perceive and recall spatial information, known as “thinking in space,” is also an important area of research, leveraging datasets such as ActivityNet-QA, CLEVRER, NExT-QA, MMMU, MLVU, and Video-Explorer to assess performance on tasks such as video question answering and object detection. Reinforcement learning approaches such as Visionary-R1 and Perception-R1 are used to enhance visual reasoning capabilities, and evaluation frameworks such as lmms-eval are essential for assessing large multimodal models. Ongoing research focuses on improving training data and model architectures, reflecting the growing emphasis on multimodal learning, robust reasoning, and reliable evaluation in artificial intelligence.
Reasoning consistency through post-training
In this study, the researchers introduce a novel reinforcement learning framework designed to improve video reasoning capabilities. They observed that current models often prioritize linguistic cues over the actual visual content, producing reasoning traces that lack logical consistency. To address this, they developed two diagnostic metrics: Think-Answer Consistency (TAC), which measures the agreement between the reasoning steps and the final answer, and the Video Attention Score (VAS), which measures the reliance on visual information. To improve temporal accuracy and reasoning consistency, the team implemented a two-step post-training process that begins with supervised fine-tuning to generate intermediate reasoning steps linked to the video timeline.
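The article describes TAC and VAS only at a high level, so the following is a minimal sketch of how such diagnostics could be computed, assuming TAC compares an answer extracted from the reasoning trace with the final answer, and VAS contrasts accuracy with and without the video frames. All function names, data keys, and the `model.generate` interface are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of TAC- and VAS-style diagnostics (names are illustrative).

def think_answer_consistency(samples, extract_answer_from_trace):
    """Fraction of samples whose reasoning trace implies the same answer
    as the final answer the model actually emitted."""
    consistent = 0
    for s in samples:
        implied = extract_answer_from_trace(s["reasoning_trace"])
        consistent += int(implied == s["final_answer"])
    return consistent / len(samples)

def video_attention_score(model, samples):
    """Gap between accuracy with real frames and accuracy with the video
    withheld; a small gap suggests reliance on linguistic priors."""
    with_video = without_video = 0
    for s in samples:
        pred_seen = model.generate(frames=s["frames"], question=s["question"]).answer
        pred_blind = model.generate(frames=None, question=s["question"]).answer
        with_video += int(pred_seen == s["gold_answer"])
        without_video += int(pred_blind == s["gold_answer"])
    return (with_video - without_video) / len(samples)
```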
The second step applies Group Relative Policy Optimization (GRPO) guided by a newly developed temporal alignment reward (TAR), which scores the agreement between predicted and reference timestamps and grants the reward for accurate temporal grounding only when the reasoning and the final answer agree. The resulting model, Video-R2, was tested across 11 benchmarks and demonstrated improvements in both TAC and VAS, showing that gains in temporal grounding and reasoning consistency lead to more reliable video understanding. The team also curated a reasoning dataset aligned with timestamps, providing a solid foundation for temporal alignment and reasoning supervision.
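The exact form of TAR is not spelled out in this summary. One plausible reading, sketched below, is a temporal intersection-over-union between the predicted and reference timestamp spans that is awarded only when the answer implied by the reasoning matches the final answer; both the span format and the gating rule are assumptions.

```python
def temporal_iou(pred_span, ref_span):
    """Intersection-over-union of two (start, end) timestamp spans in seconds."""
    start = max(pred_span[0], ref_span[0])
    end = min(pred_span[1], ref_span[1])
    inter = max(0.0, end - start)
    union = (pred_span[1] - pred_span[0]) + (ref_span[1] - ref_span[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_alignment_reward(pred_span, ref_span, trace_answer, final_answer):
    """Reward temporal grounding only when reasoning and answer agree,
    mirroring the consistency gating described in the text."""
    if trace_answer != final_answer:
        return 0.0
    return temporal_iou(pred_span, ref_span)
```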
Visually grounded reasoning metrics reveal limitations
This work demonstrates advances in video reasoning and addresses the challenge of ensuring that models produce logically sound and visually grounded reasoning. The researchers observed that current video reasoning systems often rely heavily on linguistic priors rather than the actual visual content, leading to inconsistent reasoning. To quantify this, they introduced two diagnostic metrics, Think-Answer Consistency (TAC) and Video Attention Score (VAS), which measure the agreement between reasoning and final answer and the reliance on visual cues, respectively. Analysis across 11 benchmarks revealed a significant reliance on linguistic shortcuts, highlighting the need for stronger visual grounding and logical consistency.
To address these limitations, the team developed a reinforcement learning framework that incorporates the novel temporal alignment reward (TAR). This two-step process begins with supervised fine-tuning, followed by reinforcement learning with TAR to encourage temporally accurate and self-consistent reasoning. Results show significant improvements in both TAC and VAS across multiple benchmarks: the resulting system, Video-R2, consistently achieves high scores, indicating improved reasoning quality. The team rigorously evaluated Video-R2 against existing models and consistently observed gains in TAC, VAS, and overall accuracy, demonstrating that enhanced temporal and logical consistency leads to more reliable and grounded video reasoning.
Visual grounding improves video reasoning performance
This work addresses a key challenge for multimodal large language models: reasoning about dynamic visual content in videos. The research team identified that existing models often produce seemingly convincing reasoning traces that are logically inconsistent or insufficiently grounded in the actual visual evidence. To quantify these issues, they developed two diagnostic metrics: Think-Answer Consistency, which assesses the agreement between the reasoning steps and the final answer, and the Video Attention Score, which measures the model's reliance on visual cues. To improve video reasoning, the researchers propose Video-R2, a reinforcement learning approach that combines supervised fine-tuning with an optimization step guided by the temporal alignment reward. This two-step process encourages temporally accurate and logically consistent reasoning. Results show that Video-R2 consistently achieves higher scores in Think-Answer Consistency, Video Attention Score, and overall accuracy across several benchmarks, indicating that strengthening temporal grounding and reasoning consistency leads to more reliable video understanding.
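For context on the optimization step, GRPO scores each sampled response relative to the other responses drawn for the same prompt, removing the need for a separate value network. The sketch below shows only that generic group-relative normalization step; Video-R2's exact reward mix and training loop are not specified in this summary.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: each sampled response is scored relative to
    the mean and spread of rewards within its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for four responses sampled from the same video question.
print(group_relative_advantages([0.8, 0.2, 0.5, 0.9]))
```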
