New benchmark confirms AI video generators look great but still don’t understand the world

Modern video generators such as Sora 2, Seedance 2.0, and Veo 3.1 produce increasingly impressive clips. But a new benchmark from Tsinghua University confirms that visual quality and real-world understanding are two different things.

Rather than focusing on image quality, WorldReasonBench tests whether a model can take a starting scene and continue it in a physically, socially, logically, and informationally meaningful way.

Consider a basic test case: give the generator an image of an apple on a branch and tell it to make the apple drop. The result may look great, with smooth motion, realistic textures, and nice lighting, while the physics are still fundamentally wrong: the apple may fly upwards, pop like a balloon, or fall straight down without the branch so much as bending. Standard quality metrics only reward how realistic such a video looks; WorldReasonBench is designed to capture exactly that gap.

Four color-coded quadrants display WorldReasonBench's 22 task categories with sample images and prompts, grouped into the world knowledge, human-centered, logical reasoning, and information-based dimensions, with tasks such as domino toppling, car washing, logic puzzles, and diagram interpretation.
WorldReasonBench divides video generator ratings into four reasoning dimensions with 22 subcategories, ranging from physics mechanics to diagram logic. |Image: Wu et al.

WorldReasonBench includes approximately 400 test cases across four areas: world knowledge (physics, weather, cultural norms), human-centered scenes (object handling, social interactions), logical reasoning (mathematics, geometry, scientific experiments), and information-based reasoning (reading data and diagrams).
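To make the taxonomy concrete, here is a minimal sketch in Python of how one such test case could be represented. The field names (`dimension`, `subcategory`, `checks`) and the example questions are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """Hypothetical schema for one WorldReasonBench case; field names are illustrative."""
    dimension: str        # "world_knowledge", "human_centered", "logical", or "information"
    subcategory: str      # one of the 22 task types, e.g. "physics_mechanics"
    start_image: str      # initial scene the model must continue
    prompt: str           # instruction describing what should happen next
    checks: list = field(default_factory=list)  # structured questions about process and end state

apple_case = TestCase(
    dimension="world_knowledge",
    subcategory="physics_mechanics",
    start_image="apple_on_branch.png",
    prompt="The apple detaches from the branch and falls to the ground.",
    checks=[
        "Does the apple accelerate downward after detaching?",
        "Does the branch recoil once the weight is released?",
        "Is the apple intact when it reaches the ground?",
    ],
)
```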

Two-part flowchart. At the top is the WorldReasonBench pipeline, which consists of a taxonomy, data collection with Qwen Image Edit, and prompt design with human supervision. The WorldRewardBench pipeline at the bottom features 13 video models, each with 8 generated videos, 15 annotators, and re-annotation when discrepancies are large.
The setup splits into the WorldReasonBench task catalog and WorldRewardBench, a companion benchmark in which 13 video models go head-to-head. |Image: Wu et al.

Scoring is done in two stages. First, process-aware methods use structured questions to check whether the video reaches the correct end state in a plausible way. A second pass then rates reasoning quality, temporal consistency, and visual aesthetics. Alongside the benchmark, the team also released WorldRewardBench, a dataset of approximately 6,000 video comparisons ranked by trained annotators.
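A rough sketch of what such two-stage scoring could look like, assuming a generic vision-language judge with hypothetical `answer_yes_no` and `rate` methods; the paper's actual pipeline will differ in its details.

```python
def score_video(video, case, judge):
    """Hedged sketch of two-stage scoring (not the authors' actual code).

    `judge` stands in for any vision-language model wrapper with two assumed
    methods: answer_yes_no(video, question) -> bool and rate(video, axis) -> int in [1, 5].
    """
    # Stage 1: process-aware reasoning score -- the fraction of structured
    # checks about the unfolding process and end state that the video passes.
    passed = sum(judge.answer_yes_no(video, q) for q in case.checks)
    reasoning = passed / len(case.checks)

    # Stage 2: conventional quality axes, rated 1-5 and normalized to [0, 1].
    temporal = judge.rate(video, "temporal consistency") / 5
    visual = judge.rate(video, "visual aesthetics") / 5

    return {"reasoning": reasoning, "temporal": temporal, "visual": visual}
```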

Commercial models lead by a wide margin, but every model fails at logical reasoning

The researchers tested five commercial systems (Sora 2, Kling, Wan 2.6, Seedance 2.0, Veo 3.1-Fast) and six open-source models (LTX 2.3, Wan 2.2-14B, UniVideo, HunyuanVideo 1.5, Cosmos-Predict 2.5, LongCat-Video). The commercial generators scored roughly twice as high as the open-source models on the core reasoning metrics, with no statistical overlap between the two groups.

Three case studies: Veo 3.1 renders two rows of dominoes toppling in a physically impossible way, while Seedance 2.0 animates the wrong mechanism for the gripper robot and fails to reproduce the rotational motion of the cable expected in the schematic. Red marks highlight each error.
Even videos that seem convincing fall apart on closer inspection. Domino toppling, crane games, and simple circuits all trip up the tested models. |Image: Wu et al.

ByteDance’s Seedance 2.0 came out on top, ranking first in nearly 9 out of 10 statistical reruns. Veo 3.1-Fast was best in terms of world knowledge, and Sora 2 was best in human-centric scenes. Seedance 2.0 also outperformed Veo 3.1-Fast, Kling, and Wan 2.6 in human evaluation.

More important than the rankings are the shared weaknesses. Logical reasoning is the hardest category for every model tested: even the best commercial systems score far below their own overall averages here, and most open-source models fail almost completely. Information-based reasoning is the second most difficult area, especially when a task requires physically grounded transitions or keeping text and numbers accurate across frames.

Table showing per-dimension and overall scores for five closed-source and six open-source video models across the four reasoning dimensions. Seedance 2.0 leads with an overall Score_PR of 39.8 and Veo 3.1-Fast achieves the highest single-dimension score of 55.0 in world knowledge, but no open-source model reaches an overall score above 17.9.
Closed-source models such as Seedance 2.0 and Veo 3.1-Fast outperform their open-weight rivals by roughly a factor of two across all reasoning dimensions. |Image: Wu et al.

The study also introduces a metric that tracks how many checks a video gets right during the dynamic, process-based phase rather than in a static final snapshot. Commercial models score much higher here, which points to where open-source models really fall short: understanding cause and effect rather than how things look.
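The idea behind such a metric can be illustrated with a toy calculation that counts checks passed during the unfolding process separately from checks about the final frame. The function and numbers below are purely illustrative, not the paper's formula.

```python
def process_vs_snapshot(process_checks, endstate_checks):
    """Toy illustration: score a clip on process checks (does each causal step
    happen correctly?) separately from end-state checks (does the final frame
    look right?). Inputs are lists of booleans."""
    process_rate = sum(process_checks) / len(process_checks)
    endstate_rate = sum(endstate_checks) / len(endstate_checks)
    return process_rate, endstate_rate

# A clip can nail the final frame while getting most of the causal steps wrong:
print(process_vs_snapshot([False, False, True], [True, True]))  # (0.333..., 1.0)
```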

Open-source generators improve the most when they get more detailed prompts that spell out step by step what should happen, as in the sketch below. They depend on this kind of prompt scaffolding far more than their commercial rivals, which may itself be a side effect of the commercial models' stronger reasoning capabilities.
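As a purely illustrative example (not taken from the paper), the gap between a terse prompt and a step-by-step prompt for the apple scene might look like this:

```python
# Illustrative only -- not prompts from the paper.
terse_prompt = "The apple falls from the branch."

step_by_step_prompt = (
    "The apple's stem snaps. The apple detaches from the branch and "
    "accelerates downward under gravity. The branch springs back up. "
    "The apple hits the ground, bounces slightly, and rolls to a stop."
)
```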

Automatic scoring closely tracks human judgment

To validate the approach, the team compared its metrics against human rankings of the video comparisons. The core metrics closely match human judgment and clearly outperform conventional AI judges that compare videos in pairs.
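One standard way to run such a validation is a rank correlation between automatic scores and human rankings. The sketch below uses SciPy's Spearman correlation on entirely made-up per-model numbers, purely to show the shape of the comparison; it is not the study's evaluation code.

```python
from scipy.stats import spearmanr

# Made-up per-model numbers: automatic benchmark scores (higher = better)
# and the average human rank for the same models (1 = best).
auto_scores = [40.1, 35.6, 30.2, 22.8, 14.5]
human_ranks = [1, 2, 3, 4, 5]

# Negate the ranks so that a higher automatic score should line up with a better rank.
rho, p_value = spearmanr(auto_scores, [-r for r in human_ranks])
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```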

A web interface for annotating logic puzzles. The input screen and prompts are displayed at the top. Below that is a grid of eight generated videos, each rated on a scale of 1 to 5 for inference accuracy, temporal consistency, and visual quality.
Fifteen trained annotators score eight anonymized model videos across three axes for each case; they do not know which model made which video. |Image: Wu et al.

This conclusion is consistent with a growing body of evidence: despite substantial advances in resolution, length, and controllability, the leap from pixel generator to reliable world model has not yet happened. Getting there will likely depend less on visual sophistication and more on a better grasp of causal mechanisms and the ability to keep information consistent over time. Benchmarks, data, and code are available on GitHub.

An international team of researchers recently reached a similar conclusion: Sora 2 and Veo 3.1 fall far short of human performance on reasoning tasks. Whether video generators qualify as “world models” at all remains a matter of debate in AI research. Meta’s Yann LeCun considers systems like Sora a dead end, while DeepMind CEO Demis Hassabis sees Google’s Veo as a step toward a world model. OpenAI shut down Sora as a commercial video generator, but the team kept it around and focused on researching world models. The proposed definition, called OpenWorldLib, explicitly excludes pure text-to-video models from this category.
