Chinese technology giant Meituan has released LongCat-Video, a new model that it claims represents a breakthrough in text-to-video generation by producing consistent, high-resolution clips of up to five minutes. The company has open-sourced the model on GitHub and Hugging Face to support broader research collaboration.
According to Meituan, LongCat-Video is built on a Diffusion Transformer (DiT) architecture and supports three modes: text-to-video, image-to-video, and video continuation. The model can turn text prompts or a single reference image into smooth 720p/30 fps sequences, or extend existing footage into longer scenes with consistent style, motion, and physics.
The researchers said their model addresses a persistent challenge in generated video: maintaining quality and temporal stability over long periods of time. LongCat-Video can generate several consecutive minutes of content without the typical frame degradation that affects most diffusion-based systems.
Meituan described LongCat-Video as a step toward “world model” AI that can learn real-world geometry, semantics, and motion to simulate physical environments. This model is publicly available through Meituan’s repositories on GitHub and Hugging Face.
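For readers who want to try the released checkpoints, the weights can be fetched with the standard huggingface_hub client. The snippet below is a minimal sketch only; the repository identifier shown is an assumption based on the article and should be verified against Meituan's Hugging Face page before running.

    # Minimal sketch: download the open-sourced LongCat-Video files from Hugging Face.
    # The repo_id below is an assumed identifier, not confirmed by the article.
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(
        repo_id="meituan-longcat/LongCat-Video",  # assumed repository id; verify first
        local_dir="./LongCat-Video",              # where to place the checkpoint files
    )
    print(f"Model files downloaded to: {local_path}")

From there, usage follows whatever inference scripts Meituan ships in the GitHub repository; the download step above only stages the files locally.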

