Moving images without the hassle: Text2Video-Zero is an AI model that turns text-to-image models into zero-shot video generators

AI Video & Visuals


Source: https://arxiv.org/abs/2303.13439

The last few months have seen the rise of generative AI models. They have moved quickly from producing low-resolution, face-like images to high-resolution, photorealistic ones. Simply by describing what we want to see, we can now obtain unique, photorealistic images. Perhaps even more impressive, diffusion models can also be used to generate videos.

Diffusion models are a major contributor to generative AI. They take a text prompt and produce output matching its description, gradually transforming a set of random numbers into an image or video, adding detail until it looks like the description. These models learn from datasets containing millions of samples, so they can generate new visuals similar to what they have seen before. However, those datasets themselves can be a significant issue.
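To make the "gradually transform random numbers" idea concrete, here is a toy sketch of an iterative denoising loop. It is purely illustrative: a real diffusion model uses a trained neural network to predict the noise at each step, whereas this stand-in simply nudges the noise toward a known target.

```python
import numpy as np

def toy_denoise(target, steps=50, seed=0):
    """Illustrative only: start from pure noise and iteratively
    nudge it toward a 'clean' target array. This mimics the shape
    of a diffusion sampling loop; a real model would predict the
    update with a trained network instead of knowing the target."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from pure noise
    for t in range(steps):
        # stand-in for the learned denoising step
        x = x + (target - x) / (steps - t)
    return x

target = np.linspace(0.0, 1.0, 16)  # a tiny stand-in "image"
result = toy_denoise(target)        # noise gradually becomes the target
```

After the final step the noise has been fully refined into the target, which is the intuition behind sampling an image from a text-conditioned diffusion model.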

Training a diffusion model for video generation from scratch is not feasible in most cases: it requires very large datasets and the hardware to match. Building such datasets is possible only at a handful of institutions around the world, and accessing or collecting this data is costly and out of reach for most people. A more practical path is to adapt an existing model to your use case.

🚀 Check out 100 AI Tools in the AI Tools Club

Even if you manage to prepare text-video datasets with millions, if not billions, of pairs, you still have to find the hardware power needed to train those massive models. The high cost of video diffusion models therefore makes it difficult for most users to customize these technologies to their needs.

What if there were a way around this requirement? Is there a way to reduce the cost of training a video diffusion model? Time to meet Text2Video-Zero.

Text2Video-Zero is a zero-shot text-to-video generative model, meaning it requires no video-specific training or fine-tuning. It takes a pre-trained text-to-image model and transforms it into a temporally consistent video generator. After all, a video is just a series of images displayed in quick succession to give the impression of motion, so generating those images in sequence sounds like a simple solution.

However, you cannot simply run an image generation model hundreds of times and stitch the outputs together at the end. This does not work because there is no way to guarantee the model will draw the same objects in every frame. We need a way to ensure the model's temporal consistency.

To ensure temporal consistency, Text2Video-Zero uses two lightweight modifications.

First, it enriches the latent vectors of the generated frames with motion information to keep the global scene and background temporally consistent. Instead of sampling each frame's latent vector independently at random, motion dynamics are added to the latent codes. However, these latent vectors are not constrained enough to pin down a specific color, shape, or identity, which still leads to temporal inconsistencies, especially for foreground objects. A second modification is needed to address this issue.
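A minimal sketch of this first idea: derive every frame's latent from one shared base latent, shifted along a global motion direction, rather than sampling fresh noise per frame. The function name, the per-frame integer shift, and the use of a plain `np.roll` are illustrative assumptions; the paper applies a proper warping of the latent codes.

```python
import numpy as np

def motion_enriched_latents(base_latent, num_frames, dx=1, dy=0):
    """Sketch: build each frame's latent by translating a shared
    base latent by k * (dx, dy). The shared content keeps the scene
    consistent across frames; the growing shift encodes motion.
    (A plain integer roll stands in for the paper's latent warping.)"""
    frames = []
    for k in range(num_frames):
        shifted = np.roll(base_latent, shift=(k * dy, k * dx), axis=(0, 1))
        frames.append(shifted)
    return np.stack(frames)

rng = np.random.default_rng(0)
base = rng.standard_normal((8, 8))  # stand-in for a latent feature map
latents = motion_enriched_latents(base, num_frames=4)
```

Because every frame's latent is a transformed copy of the same base noise, the background stays coherent from frame to frame instead of changing at random.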

The second modification concerns the attention mechanism. To leverage cross-frame attention while reusing a pre-trained diffusion model without retraining, each self-attention layer is replaced with cross-frame attention, with every frame's attention directed at the first frame. This helps Text2Video-Zero preserve the context, appearance, and identity of foreground objects throughout the sequence.
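The cross-frame attention idea can be sketched as follows: each frame supplies its own queries, but the keys and values are always computed from the first frame's features. Projection matrices and multi-head details are omitted, and all names here are illustrative, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(frames_q, frame0_feats, d):
    """Sketch: every frame's queries attend to keys/values taken
    from the FIRST frame's features, instead of the frame's own
    features as in ordinary self-attention. Anchoring all frames
    to frame 0 keeps object appearance consistent over time."""
    k = v = frame0_feats              # keys/values anchored to frame 0
    outputs = []
    for q in frames_q:                # one (tokens x d) query matrix per frame
        attn = softmax(q @ k.T / np.sqrt(d))
        outputs.append(attn @ v)      # each output is a mix of frame-0 features
    return np.stack(outputs)

rng = np.random.default_rng(0)
d = 16
frame0 = rng.standard_normal((10, d))  # first frame's token features
frames = [frame0] + [rng.standard_normal((10, d)) for _ in range(3)]
out = cross_frame_attention(frames, frame0, d)
```

Since every output token is a convex combination of first-frame features, later frames cannot drift toward entirely new appearances, which is exactly the consistency the modification is after.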

Experiments show that these modifications produce high-quality, temporally consistent videos, even though they require no training on large-scale video data. Moreover, the approach is not limited to text-to-video synthesis: it also applies to conditional and specialized video generation, as well as video editing with text instructions.


Check out the paper and GitHub. Don't forget to join our 19k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com.


Ekrem Çetinkaya received his B.Sc. in 2018 and his M.Sc. in 2019 from Özyeğin University, Istanbul, Turkey. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.



