New AI system pushes time limits for generated videos

A team of EPFL researchers has taken a major step towards solving drift, the problem that causes AI-generated video sequences to become disjointed after a few seconds. Their breakthrough paves the way for AI video without time constraints.

Today, with the help of AI, anyone can create realistic images in just a few clicks. Generating video, however, is a much more complex task. Existing AI models can produce videos lasting less than 30 seconds before the footage degrades into inconsistent shapes, colors, and logic. This problem is called drift, and computer scientists have been working on it for years. At EPFL, researchers in the Visual Intelligence for Transportation (VITA) lab have developed a video generation method that essentially eliminates drift, taking a new approach that addresses errors rather than avoiding or ignoring them. Their method feeds the model's own errors back into training so that it can learn from its mistakes.

Teach the machine to mess up

Because of drift, a generated video becomes more unrealistic as it progresses. This happens because video generation models typically work autoregressively: each newly generated frame serves as the starting point for the next one. Any error in a frame (a blurred face, a slightly deformed object) is therefore carried forward and magnified in the following frames, and the errors compound as the sequence continues. “The problem is that models are trained only on flawless data, but when used in real-world situations, they need to know how to handle inputs that contain their own errors,” says Professor Alexandre Alahi, director of the VITA lab.
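To see why drift is unavoidable in a purely autoregressive setup, consider this toy sketch (it is not the EPFL model, just an illustration): a predictor that copies the previous frame with a tiny imperfection at each step, whose errors accumulate over the sequence.

```python
import numpy as np

# Toy illustration of drift (not the EPFL model): each frame is predicted
# from the previous one, and a small per-step imperfection compounds.

rng = np.random.default_rng(0)

def predict_next(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a learned frame predictor: copies its input and adds
    a small error, as any real model inevitably does."""
    return frame + rng.normal(scale=0.01, size=frame.shape)

frame = np.zeros((64, 64))          # idealized "first frame"
for step in range(1, 301):
    frame = predict_next(frame)     # each output becomes the next input
    if step % 100 == 0:
        # deviation from the drift-free ideal (still all zeros)
        print(f"frame {step}: mean abs error = {np.abs(frame).mean():.3f}")
```

The printed error grows with sequence length: early mistakes are fed back in as inputs and amplified, which is exactly why long generated videos become incoherent.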

A new method invented at EPFL, called error recycling, successfully eliminates drift. The researchers first have the model generate a video, then measure its error, the gap between each generated frame and the frame that should have been generated, according to a variety of metrics. These errors are stored in memory. They are then deliberately fed back into the model's inputs the next time it is trained, forcing it to cope with the kind of flawed data it will face in real-world conditions. As a result, the model gradually learns how to get back on track after receiving imperfect data, returning to frames that look clear and logically ordered to a human viewer, even if the preceding frames were distorted. Trained this way, the model becomes more robust and learns to stabilize the video after a defective frame is generated. “Unlike humans, generative AI has little idea how to recover from failure, which leads to drift. So we teach the model how to do that and how to remain stable despite imperfections,” says Wuyang Li, a postdoctoral researcher at the lab.
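The following is a minimal sketch of that training idea as described above. The names (`error_bank`, `recycle`, `training_step`) and the bank size are hypothetical; the actual SVI implementation surely differs in its details and metrics.

```python
import torch

# Hypothetical sketch of error recycling: bank the model's own generation
# errors and inject them back into training inputs so the model learns to
# recover. Illustrative only; not the actual SVI code.

error_bank: list[torch.Tensor] = []   # stored residuals from past steps

def recycle(frame: torch.Tensor) -> torch.Tensor:
    """Corrupt a clean training frame with one of the model's own past errors."""
    if not error_bank:
        return frame
    err = error_bank[torch.randint(len(error_bank), (1,)).item()]
    return frame + err

def training_step(model, prev_frame, target_frame, optimizer, loss_fn):
    # 1) Predict the next frame from a (possibly corrupted) input, mimicking
    #    the flawed inputs the model will see at generation time.
    noisy_prev = recycle(prev_frame)
    pred = model(noisy_prev)

    # 2) Measure the error: the gap between the generated frame and the
    #    frame that should have been generated; store it for recycling.
    residual = (pred - target_frame).detach()
    error_bank.append(residual)
    if len(error_bank) > 10_000:       # keep the bank bounded
        error_bank.pop(0)

    # 3) Standard supervised update: the model must map imperfect inputs
    #    back to clean, on-track frames.
    loss = loss_fn(pred, target_frame)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design point is that the corruption is not random noise but the model's own measured mistakes, so training conditions match what the model actually encounters when generating long sequences.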

“Our method includes adjustments that make the AI program's output more stable without requiring a lot of processing power or huge datasets,” Alahi says. “It's like training pilots in rough weather instead of clear blue skies.” The method is built into a system called Stable Video Infinity (SVI), which can produce high-quality videos lasting more than a few minutes.

SVI, which is available as open source, was evaluated by comparing a large number of SVI-generated videos with the same sequences produced by another AI system. It is scheduled to be presented at the International Conference on Learning Representations (ICLR 2026), which will be held in April. Professionals from a variety of fields are interested in the technology, including audiovisual production, animation, and video games. “We have hard numbers that prove the effectiveness of our AI system,” Li says. “Our work was featured by one of the AI community's biggest YouTubers and received over 150,000 views and 6,000 upvotes within a few weeks. And our open-source repository has received over 190,000 stars on GitHub, a code hosting site, demonstrating our influence within the community.” The new method will also help VITA Lab researchers design autonomous systems that are safer, more effective, and able to interact seamlessly with humans.

Multimodal AI that combines video, images, and audio

VITA Lab researchers have also developed another method, called LayerSync, that takes a similar error-recycling approach; it too will be presented at ICLR. With this method, the AI model recycles not only its visible errors but also its internal logic. “Some parts of the model are better able to understand the meaning behind the images,” Alahi says. “LayerSync lets these more ‘specialized’ parts guide the other parts during training, as if the model were correcting itself from within. As a result, the model trains faster, because it uses its own signals to monitor the process without requiring additional data or external models. And it produces higher-quality content, whether video, images, or sound.”
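One plausible reading of that description is an auxiliary self-guidance loss, where a more “specialized” layer's features supervise another layer's. The sketch below is speculative: the layer indices, projection head, and loss weight are all assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

# Speculative sketch of layer-to-layer self-guidance: a deeper, more
# semantically rich layer (detached, acting as an internal teacher)
# supervises a shallower layer's representation. Not the actual
# LayerSync code; all specifics here are assumptions.

def layersync_loss(hidden_states: list[torch.Tensor],
                   proj: torch.nn.Module,
                   guide_idx: int = -2,
                   student_idx: int = 2) -> torch.Tensor:
    """Align a shallow layer's features with a detached deeper layer's
    features via cosine similarity, so the model guides itself without
    extra data or an external teacher model."""
    guide = hidden_states[guide_idx].detach()    # internal teacher signal
    student = proj(hidden_states[student_idx])   # map into the guide's space
    return 1.0 - F.cosine_similarity(student, guide, dim=-1).mean()

# During training this term would be added to the main objective, e.g.:
#   total_loss = generation_loss + lambda_sync * layersync_loss(...)
```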
