How does an AI model generate video?

But you don't want just any image. You want the image you specified, usually with a text prompt. So the diffusion model is paired with a second model that guides each step of the cleanup process, such as a large language model (LLM) trained to match images with text descriptions. That second model pushes the diffusion model toward images it considers a good match for the prompt.
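The idea of a guide steering each cleanup step can be illustrated with a deliberately tiny sketch. This is not a real diffusion model: the "image" is just a short list of numbers, the `target` list stands in for what the prompt describes, and the function name `guided_denoise` and its `steps` and `guidance` parameters are all hypothetical choices made for illustration. It only shows the general shape of the loop, which starts from random noise and nudges the result a little closer to the guide's preference at every step.

```python
import random


def guided_denoise(target, steps=50, guidance=0.2, seed=0):
    """Toy stand-in for guided cleanup (not an actual diffusion model).

    Starts from random noise and, at every step, moves each value a
    fraction `guidance` of the way toward `target`, which plays the role
    of "what the prompt describes".
    """
    rng = random.Random(seed)
    # Begin with pure noise, one random value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(steps):
        # The guide supplies a direction (target - current); take a small step.
        x = [xi + guidance * (ti - xi) for xi, ti in zip(x, target)]
    return x


target = [0.0, 0.5, 1.0, 0.5]  # hypothetical "image" the prompt asks for
result = guided_denoise(target)
error = max(abs(r - t) for r, t in zip(result, target))
print(f"largest remaining difference from the target: {error:.6f}")
```

In a real system the guide does not know the finished image in advance. It only scores how well the current attempt matches the text, and the diffusion model does the actual noise removal.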

A side note: this LLM does not pull the links between text and images out of thin air. Most text-to-image and text-to-video models today are trained on large datasets containing billions of pairings of text and images, or text and video, that have been scraped from the internet (a practice that many creators are very unhappy about). This means that what comes out of such models is a distillation of the world as it is represented online, distorted by bias (and pornography).

It is easiest to imagine a diffusion model working with images. However, the technique can be used with many kinds of data, including audio and video. To generate a movie clip, the diffusion model must clean up a sequence of images (the consecutive frames of a video) instead of just one image.

What is a latent diffusion model?

All of this requires a huge amount of computation (read: energy). That is why most diffusion models used for video generation rely on a technique called latent diffusion. Instead of processing the raw data (the millions of pixels in each video frame), the model works in what is called a latent space, where the video frames (and the text prompt) are compressed into a mathematical code that captures only the important features of the data and discards the rest.
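A rough back-of-the-envelope calculation shows why working in a latent space saves so much effort. Every number below is an assumption chosen for illustration, not a figure from any particular model: a 1920x1080 RGB frame, a 5-second clip at 24 frames per second, and a compressor that shrinks each frame by a factor of 8 in width and height while keeping 4 values per position.

```python
# All figures here are illustrative assumptions, not taken from a specific model.
WIDTH, HEIGHT, CHANNELS = 1920, 1080, 3   # one full-HD RGB frame
FPS, SECONDS = 24, 5                      # a short clip
DOWNSAMPLE = 8                            # assumed shrink factor per side
LATENT_CHANNELS = 4                       # assumed values kept per latent position

frames = FPS * SECONDS

# Values the model would have to process if it worked on raw pixels.
raw_per_frame = WIDTH * HEIGHT * CHANNELS
raw_total = raw_per_frame * frames

# Values it processes when each frame is first compressed into a latent code.
latent_per_frame = (WIDTH // DOWNSAMPLE) * (HEIGHT // DOWNSAMPLE) * LATENT_CHANNELS
latent_total = latent_per_frame * frames

ratio = raw_total / latent_total

print(f"raw values per clip:    {raw_total:,}")
print(f"latent values per clip: {latent_total:,}")
print(f"reduction factor:       {ratio:.0f}x")
```

Under these assumptions the model handles roughly 48 times fewer values per clip. Real systems differ in their exact compression, and some also compress across time, so the savings vary.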

Something similar happens every time you stream a video over the internet. The video is sent from the server to your screen in compressed form so that it reaches you faster, and when it arrives, your computer or TV converts it back into a watchable video.
