PhyCo: Bridging physics and video generation


Modern video diffusion models excel at visual synthesis but struggle to capture the nuances of physical interaction. Objects float unrealistically, collisions lack convincing dynamics, and material interactions often defy the laws of physics. This gap limits their applicability in scenarios where physical fidelity is required.

Introducing PhyCo: Continuous, Grounded Physical Control

The PhyCo framework, detailed on the project page, addresses this critical limitation by introducing continuous, interpretable, physically grounded control over video generation. This is achieved through a multi-pronged approach that leverages new datasets and novel training methodologies. The researchers present a large dataset of more than 100,000 photorealistic simulation videos in which parameters such as friction, restitution, deformation, and force are systematically varied across a variety of scenarios. This dataset forms the basis for training models to understand and reproduce physical behavior.
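Systematically varying physical parameters across scenarios amounts to enumerating a grid of simulation configurations, one per video. The sketch below illustrates this idea; the parameter names, value ranges, and scenario labels are illustrative assumptions, not the authors' actual configuration.

```python
from itertools import product

# Illustrative parameter ranges for a physics-varied simulation dataset.
FRICTION = [0.1, 0.3, 0.5, 0.7, 0.9]
RESTITUTION = [0.0, 0.25, 0.5, 0.75, 1.0]
FORCE = [1.0, 5.0, 10.0]
SCENARIOS = ["ball_drop", "block_slide", "cloth_fall"]

def enumerate_configs():
    """Yield one simulation config per parameter combination."""
    for scenario, mu, e, f in product(SCENARIOS, FRICTION, RESTITUTION, FORCE):
        yield {"scenario": scenario, "friction": mu,
               "restitution": e, "force": f}

configs = list(enumerate_configs())
print(len(configs))  # 3 scenarios * 5 frictions * 5 restitutions * 3 forces = 225
```

Each config would drive one rendered simulation video, so the dataset covers the full cross-product of scenarios and physical parameters.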

Physics-based fine-tuning and VLM-based optimization

At the core of PhyCo is a physics-based fine-tuning process. The pre-trained diffusion model is augmented with a ControlNet conditioned on pixel-aligned physical property maps, allowing physical properties to be incorporated directly into the generation process. Additionally, VLM-guided reward optimization is employed: a fine-tuned vision-language model evaluates the generated video with respect to the physical property of interest, providing differentiable feedback that iteratively improves the model's physical realism. Importantly, this approach lets PhyCo produce physically consistent, controllable outputs by varying physical attributes, without requiring an explicit simulator or geometry reconstruction during inference.
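The ControlNet-style conditioning can be pictured as a learned residual: a pixel-aligned property map (say, per-pixel friction) is encoded and added to the frozen base model's features. The toy sketch below shows only this residual structure; the shapes, names, and the zero-initialized control weight are assumptions for illustration, not the authors' code.

```python
# Toy sketch of ControlNet-style conditioning on a physical property map.
def control_residual(property_map, weight, scale=1.0):
    """Encode the pixel-aligned property map into a feature residual."""
    return [[scale * weight * p for p in row] for row in property_map]

def conditioned_features(base_features, property_map, weight):
    """Frozen base features plus the learned control residual."""
    residual = control_residual(property_map, weight)
    return [[b + r for b, r in zip(brow, rrow)]
            for brow, rrow in zip(base_features, residual)]

# A 2x2 "frame": uniform base features, higher friction in the right column.
base = [[1.0, 1.0], [1.0, 1.0]]
friction_map = [[0.0, 0.5], [0.0, 0.5]]

# With a zero-initialized control weight the branch is a no-op, so
# fine-tuning starts from the pre-trained model's behavior.
print(conditioned_features(base, friction_map, weight=0.0))  # [[1.0, 1.0], [1.0, 1.0]]
print(conditioned_features(base, friction_map, weight=1.0))  # [[1.0, 1.5], [1.0, 1.5]]
```

Because the residual is pixel-aligned, changing the property map in one region perturbs only the corresponding features, which is what makes the control spatially interpretable.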

Beyond synthetic data: Scalable and versatile video generation

PhyCo’s impact is demonstrated on the Physics-IQ benchmark, where it significantly outperforms strong existing baselines in physical realism. Human studies further validate the framework, confirming that physical attributes can be controlled with greater clarity and fidelity. This work presents a scalable path toward generative video models that not only achieve physical consistency but also generalize effectively to real-world scenarios, moving beyond the limitations of purely synthetic training environments. PhyCo represents a significant step toward more reliable and controllable AI-generated video.

© 2026 StartupHub.ai.
