AI video generators tested on how well they understand the world

Researchers are increasingly asking whether generative video models truly understand the principles that govern the physical world. Mingxin Liu and Tencent Youth Lab at Shanghai Jiao Tong University, along with Shuran Ma and Shibei Meng at Beijing Normal University, introduced RISE-Video, a new benchmark designed to assess a model’s ability to infer and adhere to implicit world rules during video generation. Rather than simply rating visual appeal, the study examines the cognitive reasoning capabilities of text-to-video models. Comprising 467 annotated examples and a novel multidimensional evaluation protocol, RISE-Video provides a rigorous testbed for assessing model intelligence across domains such as common sense reasoning and spatial mechanics, with the aim of guiding the development of more realistic and intelligent generative models.

Although current models excel at creating visually realistic videos, their ability to understand and accurately simulate implicit world rules remains largely unknown.

This study addresses this critical gap by shifting the focus from aesthetic qualities to deep cognitive inferences within video synthesis. RISE-Video consists of a meticulously curated dataset of 467 human-annotated video samples, spanning eight different inference categories, including common sense, spatial mechanics, and specialized subject areas.
This framework introduces a multidimensional evaluation protocol that utilizes four key metrics: inference consistency, temporal consistency, physical rationality, and visual quality. This holistic approach ensures that the generated videos comply not only with visual plausibility but also with the underlying cognitive and physical constraints determined by the input instructions.
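One way to picture the four-metric protocol is as a per-video score record that is averaged metric by metric across a model's outputs. The sketch below is illustrative only: the metric names come from the article, while the 0 to 1 scale, the `VideoScores` class, and the helper functions are assumptions rather than the benchmark's actual scoring code.

```python
from dataclasses import dataclass
from statistics import mean

# Metric names follow the article; the [0, 1] scale is an assumption.
METRICS = (
    "inference_consistency",
    "temporal_consistency",
    "physical_rationality",
    "visual_quality",
)


@dataclass(frozen=True)
class VideoScores:
    """Hypothetical container for one generated video's four scores."""
    inference_consistency: float
    temporal_consistency: float
    physical_rationality: float
    visual_quality: float

    def __post_init__(self):
        # Reject scores outside the assumed [0, 1] range.
        for name in METRICS:
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def per_metric_means(scores):
    """Average each metric separately over a list of VideoScores."""
    if not scores:
        raise ValueError("scores must not be empty")
    return {name: mean(getattr(s, name) for s in scores) for name in METRICS}


def weakest_metric(scores):
    """Name of the metric with the lowest average score."""
    means = per_metric_means(scores)
    return min(means, key=means.get)
```

Keeping the metrics separate, rather than collapsing them into one number, makes it visible when a model scores well on visual quality but poorly on inference consistency, which is the pattern the study reports.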

To make evaluation scalable, the authors developed an automated pipeline in which large multimodal models (LMMs) emulate human-centered evaluation by answering inference-aware questions and prompts about each video. Extensive experiments on 11 state-of-the-art text-to-video models reveal widespread deficiencies in simulating complex scenarios governed by implicit constraints.
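The question-based judging idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Judge` type, `inference_score`, `make_lookup_judge`, and the example questions are all hypothetical, and the multimodal model call is replaced by a lookup table so the example runs on its own.

```python
from typing import Callable, Dict, Sequence, Tuple

# Assumption: a "judge" is any callable that takes a video reference and a
# yes/no question and returns True or False. A real pipeline would wrap a
# multimodal model here; that call is deliberately left abstract.
Judge = Callable[[str, str], bool]
Check = Tuple[str, bool]  # (question, expected answer)


def inference_score(video: str, checks: Sequence[Check], judge: Judge) -> float:
    """Fraction of questions whose judged answer matches the expected one."""
    if not checks:
        raise ValueError("at least one check is required")
    hits = sum(judge(video, question) == expected for question, expected in checks)
    return hits / len(checks)


def make_lookup_judge(answers: Dict[str, bool]) -> Judge:
    """Stand-in judge backed by a fixed question-to-answer table (demo only)."""
    def judge(video: str, question: str) -> bool:
        return answers[question]
    return judge


# Hypothetical questions for the footprint-formation scenario.
FOOTPRINT_CHECKS = [
    ("Do footprints remain in the sand after the person walks on?", True),
    ("Do footprints appear ahead of the person before they step there?", False),
]
```

A video that shows footprints both behind and ahead of the walker would pass the first check and fail the second, for a score of 0.5 under this scheme.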

These findings provide important insights to advance the development of future generative models that can more accurately simulate the world. Validation confirms a high degree of agreement between the LMM-based evaluation pipeline and human judgment, suggesting its potential as a reliable and cost-effective alternative to large-scale human evaluation.
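The article does not say which statistic was used to measure agreement between the automated pipeline and human raters. As one common option, the sketch below computes Cohen's kappa, which corrects raw agreement for agreement expected by chance; the function name and the pass/fail labels in the test data are illustrative assumptions.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa of 1.0 means the two raters agree on every item, while 0.0 means they agree no more often than chance would predict.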

The benchmark spans eight reasoning dimensions: empirical reasoning, common sense reasoning, temporal reasoning, social reasoning, perceptual reasoning, spatial reasoning, subject-specific reasoning, and logical reasoning. This taxonomy covers inference situations in video synthesis ranging from low-level perceptual cues to high-level abstract reasoning. The study shows that current systems struggle even with basic inference tasks and highlights a clear need for rule-aware evaluation and model development.

Building the RISE-Video dataset and defining inference categories

The study is backed by a dataset of 467 samples, meticulously annotated by humans and designed to rigorously evaluate the inference capabilities of Text-Image-to-Video (TI2V) synthesis models. The dataset, called RISE-Video, is divided into eight categories of inferential knowledge, each targeting a specific aspect of video understanding and generation under structured constraints.

These categories are common sense knowledge, subject matter knowledge, perceptual knowledge, social knowledge, logical ability, experiential knowledge, spatial knowledge, and temporal knowledge, and together they provide a comprehensive testbed for model intelligence. Within common sense knowledge, the study evaluates models on phenomena such as footprint formation, skin response to mosquito bites, and the progression of dental caries, probing their understanding of everyday physics, biological responses, and health habits.

Subject knowledge is assessed across physics, chemistry, geography, and sport, probing understanding of principles that range from electricity and chemical reactions to river formation and soccer shooting techniques. Perceptual knowledge is assessed by manipulating size, color, number, position, and occlusion, which tests how robustly the generated video is grounded in what is visually present.

Assessment of social knowledge focuses on recognizing emotions from facial expressions, following social rules such as proper waste disposal, and reflecting cultural customs such as food traditions. Logical skills are tested through game action, puzzle solving, and geometric reasoning, which require reasoning based on structured constraints.

Experiential knowledge is scrutinized by assessing the ability to infer intent from cues, identify individuals, understand sequences of steps, and apply contextual knowledge. Finally, spatial knowledge is assessed through perspective transformation, object placement, and structural reasoning, reflecting the importance of 3D understanding in video generation.

To make evaluation scalable, an automated pipeline built on large multimodal models (LMMs) emulates human-centered evaluation using the four metrics: inference consistency, temporal consistency, physical rationality, and visual quality. This framework enables extensive experiments on 11 state-of-the-art TI2V models, uncovering widespread deficiencies in simulating complex scenarios under implicit constraints and providing insights for advancing generative models that simulate the world.

Inference performance across different video generation scenarios

The RISE-Video benchmark consists of 467 meticulously human-annotated samples across eight inference categories. These categories cover a wide variety of scenarios and provide a structured testbed for evaluating model intelligence across dimensions such as common sense and spatial mechanics. The framework's multidimensional evaluation protocol comprises inference consistency, temporal consistency, physical rationality, and visual quality.

This approach checks that the generated video adheres to the cognitive and physical constraints mandated by the input instructions. To make evaluation scalable, the authors developed an automated pipeline that uses large multimodal models to emulate human-centered evaluation. Experiments on 11 state-of-the-art Text-Image-to-Video models reveal widespread deficiencies in simulating complex scenarios under implicit constraints.

According to the reported data distribution, logical ability accounts for 83 samples in the benchmark and common sense knowledge for 50, reflecting a focus on core reasoning abilities. Spatial knowledge is represented by 33 samples and social knowledge by 78, indicating broad coverage of reasoning types. The distribution further lists experiential knowledge with 23 samples, perceptual knowledge with 1 sample, and temporal knowledge with 3 samples.
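The per-category counts quoted above cover seven of the eight categories, so a quick tally shows how much of the 467-sample total they account for. The figures in the sketch are copied from the article; the variable names are illustrative, and nothing here says how the remaining samples are distributed.

```python
TOTAL_SAMPLES = 467  # benchmark size stated in the article

# Per-category counts exactly as quoted in the article.
reported_counts = {
    "logical ability": 83,
    "common sense knowledge": 50,
    "spatial knowledge": 33,
    "social knowledge": 78,
    "experiential knowledge": 23,
    "perceptual knowledge": 1,
    "temporal knowledge": 3,
}

itemized = sum(reported_counts.values())
not_itemized = TOTAL_SAMPLES - itemized

print(f"Itemized: {itemized} of {TOTAL_SAMPLES}")  # Itemized: 271 of 467
print(f"Not itemized: {not_itemized}")             # Not itemized: 196
```

The 196 samples left over are not broken down in the figures quoted here; subject matter knowledge is the one category for which no count is given.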

Puzzle solving accounts for 17 samples and geometric reasoning for 14, highlighting the inclusion of more complex cognitive tasks. Analysis of video length shows 19 medium-length videos, 15 short videos, and 14 long videos. The proposed evaluation pipeline is highly consistent with human judgment, suggesting that LMM-based evaluation can serve as a reliable and cost-effective alternative to large-scale human evaluation.

Inference flaws limit generation of complex scenarios in text-to-video models

A new benchmark, RISE-Video, goes beyond simple visual fidelity assessment to systematically evaluate the inference capabilities of text-to-video generative models. The benchmark consists of 467 meticulously annotated video samples across eight categories designed to test a variety of reasoning skills, including common sense understanding, spatial awareness, and expert knowledge.

The evaluation uses four metrics (inference consistency, temporal consistency, physical rationality, and visual quality) to provide an overall assessment of the generated videos. A key component of this effort is an automated evaluation pipeline that uses large multimodal models to mimic human judgment and enable scalable, detailed analysis.

Extensive testing of 11 state-of-the-art text-to-video models reveals consistent weaknesses in simulating complex scenarios governed by implicit rules, despite generally good visual quality. The authors acknowledge potential biases in the automated assessment pipeline related to acceptance rates. These biases can lead the pipeline to overestimate the quality of near-perfect output and make it harder to distinguish high-quality videos from defective ones.

These findings highlight the significant gap between achieving visually realistic videos and ensuring consistency with the underlying world rules in current generative models. The development of RISE-Video is intended to facilitate more rigorous evaluation of text-to-video systems and encourage future research focused on designing and training models that prioritize inference ability along with visual fidelity. Further research could consider extending the benchmark with more complex inference scenarios and refining the automatic evaluation pipeline to address identified biases.
