Modern video generators such as Sora 2, Seedance 2.0, and Veo 3.1 produce increasingly impressive clips. But a new benchmark from Tsinghua University confirms that visual quality and real-world understanding are two different things.
Rather than focusing on image quality, WorldReasonBench tests whether a model can take a starting scene and continue it in a physically, socially, logically, and informationally meaningful way.
Consider a basic test case: give the generator an image of an apple on a branch and tell it to drop the apple. The results may look great, with smooth movement, realistic textures, and nice lighting, while the physics are still fundamentally wrong. The apple may fly upward, pop like a balloon, or fall while the branch stays rigidly in place. Standard quality metrics reward exactly that surface realism and overlook the broken physics. WorldReasonBench is designed to capture that gap.

WorldReasonBench includes approximately 400 test cases across four areas: world knowledge (physics, weather, cultural norms), human-centered scenes (object handling, social interactions), logical reasoning (mathematics, geometry, scientific experiments), and information-based reasoning (reading data and diagrams).

Scoring is done in two stages. First, process-aware methods use structured questions to check whether the video reaches the correct end state in a plausible way. A second pass then evaluates visual quality, temporal consistency, and aesthetics of the generated video. Alongside the benchmark, the team also released WorldRewardBench, a dataset of approximately 6,000 video comparisons ranked by trained annotators.
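To make the two-stage idea concrete, here is a minimal sketch of how such a scorer could combine the stages. The article does not publish the benchmark's rubric or weights, so every field name and the multiplicative aggregation below are assumptions, not the paper's actual formula:

```python
# Hypothetical sketch of a WorldReasonBench-style two-stage scorer.
# All field names and the aggregation rule are assumptions.

from dataclasses import dataclass

@dataclass
class VideoEval:
    # Stage 1: process-aware checks (structured yes/no questions)
    reaches_end_state: bool      # does the video arrive at the correct outcome?
    plausible_process: bool      # does it get there in a physically sensible way?
    # Stage 2: conventional quality checks, each scored in [0, 1]
    visual_quality: float
    temporal_consistency: float
    aesthetics: float

def score(ev: VideoEval) -> float:
    """Reasoning gates the score: a beautiful but wrong video scores low."""
    reasoning = (int(ev.reaches_end_state) + int(ev.plausible_process)) / 2
    quality = (ev.visual_quality + ev.temporal_consistency + ev.aesthetics) / 3
    return reasoning * quality  # assumed multiplicative combination

# A smooth, pretty clip in which the apple pops like a balloon:
print(score(VideoEval(False, False, 0.9, 0.9, 0.9)))  # → 0.0
```

The multiplicative combination captures the article's point: polished visuals cannot compensate for a causally wrong outcome.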
Commercial models lead by a wide margin, but every model struggles with logic
The researchers tested five commercial systems (Sora 2, Kling, Wan 2.6, Seedance 2.0, Veo 3.1-Fast) and six open-source models (LTX 2.3, Wan 2.2-14B, UniVideo, HunyuanVideo 1.5, Cosmos-Predict 2.5, LongCat-Video). The commercial generators scored roughly twice as high as the open-source models on the core reasoning metrics, with no statistical overlap between the two groups.

ByteDance’s Seedance 2.0 came out on top, ranking first in nearly nine out of ten statistical reruns. Veo 3.1-Fast led in world knowledge, and Sora 2 in human-centered scenes. Seedance 2.0 also outperformed Veo 3.1-Fast, Kling, and Wan 2.6 in human evaluation.
More important than the rankings are the shared weaknesses. Logical reasoning is the most difficult category for all models tested: even the best commercial systems fall far below their own overall averages here, and most open-source models fail almost completely. Information-based reasoning is the second most difficult area, especially when the task requires physically grounded transitions or accurately preserving text and numbers across frames.

The study also introduces a metric that tracks how many correct answers come from the dynamic, process-based checks rather than from a static snapshot of the final frame. Commercial models score much higher here, which pinpoints where open-source models actually fall short: understanding cause and effect, not rendering how things look.
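A small sketch can illustrate what such a process-based metric might measure. The article does not give the paper's exact definition, so the check structure and formula below are assumptions for illustration only:

```python
# Hypothetical sketch of a process-vs-snapshot metric.
# "process" checks require watching the whole trajectory;
# "snapshot" checks only inspect the final frame.
# The exact definition in the paper is not given in the article.

def process_score(checks: list[dict]) -> float:
    """Fraction of process checks answered correctly."""
    process = [c for c in checks if c["kind"] == "process"]
    if not process:
        return 0.0
    return sum(c["correct"] for c in process) / len(process)

checks = [
    {"kind": "process",  "correct": True},   # apple detaches, then accelerates downward
    {"kind": "process",  "correct": False},  # branch never springs back
    {"kind": "snapshot", "correct": True},   # final frame: apple on the ground
]
print(process_score(checks))  # → 0.5
```

A model that only gets the final frame right would score well on snapshot checks but poorly on a metric like this, which is the distinction the researchers are after.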
Open-source generators improve the most when given more detailed prompts that spell out step by step what should happen. They depend far more heavily on prompt quality than their commercial rivals, which may itself be a side effect of the commercial models' stronger reasoning capabilities.
Automatic scoring matches human judgment
To validate their approach, the team compared their metrics against human rankings of video comparisons. The core metrics correlate closely with human judgment and clearly outperform conventional AI judges that compare videos in pairs.

This conclusion is consistent with a growing body of evidence: despite substantial advances in resolution, length, and controllability, the transition from pixel generators to reliable world models has not yet occurred. Getting there will likely depend less on visual sophistication and more on a better grasp of causal mechanisms and the ability to keep information consistent over time. Benchmark, data, and code are available on GitHub.
An international team of researchers recently reached a similar conclusion: Sora 2 and Veo 3.1 fall far short of human performance on reasoning tasks. Whether video generators qualify as "world models" at all remains a matter of debate in AI research. Meta's Yann LeCun considers systems like Sora a dead end, while DeepMind CEO Demis Hassabis sees Google's Veo as a step toward a world model. OpenAI shut down Sora as a commercial video generator, but the team kept it alive and refocused on world-model research. The proposed definition, called OpenWorldLib, explicitly excludes pure text-to-video models from this category.
