New critic model enhances AI perception and reasoning

Machine Learning


Researchers are addressing a critical gap in artificial intelligence by developing more robust and reliable critic models for evaluating complex AI systems. Tianyi Xiong, Shihao Wang, and Guilin Liu of the University of Maryland, College Park, together with Yi Dong, Ming Li, Heng Huang, Jan Kautz, and Zhiding Yu, present PhyCritic, a new multimodal critique model designed specifically for physical AI tasks. The work moves beyond the general visual domain to focus on tasks that require perception, causal inference, and planning. PhyCritic employs a two-stage reinforcement learning pipeline to strengthen both directed perception and judgment stability, clearly improving performance on multimodal judgment benchmarks while also enhancing perceptual and reasoning abilities on grounded tasks.

In this study, we introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage reinforcement learning with verifiable rewards (RLVR) pipeline. Unlike traditional visual recognition tasks, physical AI demands that models interpret complex multi-view observations, understand object affordances, infer causal relationships, and evaluate how hypothetical actions would play out in real-world environments. This paradigm includes safety-critical areas such as 3D perception and spatial grounding, robot-centric interaction understanding, and autonomous driving. As these systems scale, multimodal evaluation becomes increasingly important for measuring whether model outputs are physically accurate, visually grounded, and aligned with human judgment.

Despite advances in multimodal large language models (MLLMs), progress on reliable multimodal critic models has been slow: existing reward and judge models focus mainly on general domains such as captioning, STEM reasoning, and image question answering. Evaluating physical AI is fundamentally different; it requires assessing the plausibility of causal relationships, adherence to physical constraints, and respect for temporal, spatial, and dynamic limits. Recent work extends multimodal judging and RL-based critic training to physical scenarios, and early efforts such as DriveCritic highlight the importance of domain-specific judgment abilities. However, existing critics lack physical awareness and often cannot distinguish visually consistent but physically impossible inferences, and their training data targets broad multimodal evaluation rather than physically grounded scenarios involving manipulation, affordance inference, or embodied 3D interaction. Because such critics do not base their verdicts on their own physical understanding of the question, their judgments can be inconsistent.
The goal of this research is to fill that gap with a new class of multimodal critics designed specifically for physical AI, able to deliver grounded, stable, and physically correct evaluations of multimodal responses spanning physical perception, causal inference, and the assessment of actions and plans. PhyCritic is built on the principle that a strong physics critic should act like an expert human judge, solving the problem itself before evaluating other models' responses, which motivates a self-referential style of critique.

The framework uses a two-stage RLVR pipeline. Stage 1 is a physical-skills warm-up that applies standard Group Relative Policy Optimization (GRPO) to a small set of embodiment-related question-answer pairs, strengthening core physical perception and reasoning abilities. Stage 2 trains the critic to first generate its own internal reasoning and prediction for a question, then uses GRPO with both critique and self-prediction rewards to evaluate candidate responses with explicit reference to that self-prediction, promoting stable verdicts and physics-aware, coherent reasoning.

To rigorously evaluate judgment performance in physical contexts, the researchers introduce PhyCritic-Bench, a new benchmark built from diverse embodied datasets, including RoboVQA, BridgeData V2, HoloAssist, and AgiBot World. PhyCritic-Bench contains high-quality physical reasoning questions derived from Cosmos-Reason1 and paired candidate answers scored against verifiable ground truth, enabling fine-grained evaluation of inference correctness, visual grounding, and causal plausibility. The main contribution of this work is a self-referential critic learning framework that explicitly grounds evaluation in the model's own physical awareness and reasoning, implemented as a two-stage RLVR pipeline with GRPO.
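The two-stage recipe described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: GRPO normalizes each rollout's reward against the mean and standard deviation of its group (avoiding a separate learned value function), and the Stage-2 reward combines a critique term with a self-prediction term. The reward weights `w_critic` and `w_self`, and the binary correctness signals, are hypothetical assumptions for illustration.

```python
import statistics

def grpo_advantages(rewards):
    """Group Relative Policy Optimization (GRPO) style advantages:
    each rollout's reward is normalized by the group mean and std,
    so no learned value baseline is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

def combined_reward(critique_correct, self_prediction_correct,
                    w_critic=0.7, w_self=0.3):
    """Stage-2 reward (hypothetical weighting): the critic is rewarded
    both for issuing the right verdict on a candidate answer and for its
    own self-prediction matching the verifiable ground truth."""
    return w_critic * float(critique_correct) + w_self * float(self_prediction_correct)

# Example: a group of 4 rollouts for one physics question, with
# (verdict correct?, self-prediction correct?) outcomes per rollout.
outcomes = [(True, True), (True, False), (False, True), (False, False)]
rewards = [combined_reward(c, s) for c, s in outcomes]
advs = grpo_advantages(rewards)
```

Rollouts that both judge correctly and predict correctly receive the largest positive advantage, pushing the policy toward verdicts that agree with the critic's own grounded solution.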
We also developed PhyCritic, a multimodal critic specialized in evaluating perception, causal inference, and planning in physical AI scenarios, and built a high-quality physics critique dataset across diverse embodied domains with paired candidate responses and verifiable preference labels. Across the physical reasoning benchmarks Cosmos-Reason1, CV-Bench, and EgoPlan-Bench2, and the popular reward benchmarks VL-RewardBench and Multimodal RewardBench, PhyCritic clearly outperforms all open-source 7B/8B baselines. These results show that critic models benefit greatly from a self-referential physical grounding, and that physical AI calls for a new generation of physics-aware multimodal judge models.

The continued pursuit of truly intelligent artificial systems requires more than just bigger models; we need robust methods for evaluating their inferences. PhyCritic represents a subtle but important step beyond general image evaluation toward critics that specialize in the complexities of physical understanding. For years, AI evaluation has relied on human judgment and on proxies that are easily fooled by superficial correlations, and building critics that can assess how an AI arrives at its answers, including its causal reasoning and its grasp of physics, has proven difficult. Grounding verdicts in the critic's own solution is a smart way to improve consistency and accuracy, and it addresses a major weakness of many current rating systems. That said, although PhyCritic is good at judging an AI's responses, it is itself a learned model and remains susceptible to biases in its training data. Moreover, the focus on physical AI narrows the scope: assessing creativity, nuanced expression, or ethical considerations still requires different approaches.
In the future, a real possibility is to integrate such expert critics into broader evaluation frameworks, potentially leading to automated systems that can not only score the output of an AI, but also diagnose its weaknesses and guide further training and development. The ultimate goal is not just to build AI that performs better, but to build AI that thinks better, and that requires a much sharper eye than we currently have.
