How our team optimizes our infrastructure to minimize latency for AI video processing



Over the past year, AI-generated video diffusion models have dramatically increased visual realism, as we’ve seen with OpenAI’s Sora 2, Google’s Veo 3, Runway Gen-4, and more. AI video generation has truly reached an inflection point, and the latest models can create stunning clips with lifelike visuals.

However, the way these models are built does not allow them to be used interactively and in real time, and when most AI practitioners talk about AI video, they focus on producing clips for later viewing. For many, the idea of taking live video input from a camera and instantly transforming it with AI still feels years away.

Most of the obstacles here are architectural: these models generate chunks of video frames sequentially through a series of complex, computationally intensive steps. The model must finish processing each chunk before it can start on the next, which inevitably introduces delays and rules out live AI video streaming.

Decart’s team decided to see whether these obstacles could be circumvented. Our recently released model, LSD v2, validated the idea that minimizing delay is primarily a matter of approach. To make it work, we developed and implemented a number of cutting-edge techniques that we believe can be applied to a wide range of AI models.

Using these techniques, we optimized the underlying infrastructure required to run the model, maximized GPU utilization, and sped up the denoising process needed to prevent error accumulation. LSD v2 employs a causal autoregressive architecture that generates video instantly and continuously, with no limit on output length.

Here’s how:

Infinite generation

For a video model to produce output on a streaming basis, it must behave “causally”: each new frame is generated based solely on the frames that came before it, never on future ones. This also reduces the computational load.

The causal video model uses an “autoregressive” structure to ensure continuity. Although this technique works well for short clips, the quality of the output degrades over time due to “error accumulation”: small defects, such as a slightly out-of-place shadow, are amplified with each new frame, gradually destroying the consistency of the output.

Error accumulation is a major headache for video model developers and is the main reason why mainstream video models can only produce short sequences of a few seconds. To overcome this, we improved on a technique known as “diffusion forcing,” which allows us to remove noise from each frame as it is being generated. We combined diffusion forcing with “history augmentation,” which trains the underlying model to recognize corrupted context and compensate for it.
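A minimal sketch of the history-augmentation side of this idea, with illustrative names and shapes (the real system operates on latent video frames inside a neural network, not raw NumPy arrays):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_history(history, max_noise=0.5):
    """History augmentation (sketch): give every past frame its own
    random noise level, so the model is trained on corrupted context
    and learns to correct it rather than trust it blindly."""
    levels = rng.uniform(0.0, max_noise, size=len(history))
    return [frame + lvl * rng.standard_normal(frame.shape)
            for frame, lvl in zip(history, levels)]

# During training (diffusion forcing), the model conditions on the
# corrupted history but is supervised to predict the clean next frame.
clean = [np.zeros((4, 4)) for _ in range(3)]
noisy = corrupt_history(clean)
```

Because the model never sees a pristine history during training, it learns not to take its own past outputs at face value, which is exactly what breaks the error-accumulation spiral at inference time.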

The result is a causal feedback loop in which, for each new frame, the model takes into account the current input frame and the user’s prompts, in addition to the previous frames it has already generated, allowing it to rapidly predict what the next output in the sequence will be.

This gives the model the ability to identify and correct input artifacts that appear in the output, preventing the accumulation of errors. As a result, it can produce an unlimited amount of high-quality content while continuously adapting as users enter new prompts, enabling real-time editing and transformation.
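The feedback loop described above can be sketched like this; the model and frames are toy stand-ins (the real system conditions on prompts and history inside a neural network):

```python
import numpy as np
from collections import deque

def generate_stream(model, input_frames, prompt, context_len=4):
    """Causal feedback loop (sketch): each new output depends on the
    current input frame, the prompt, and a short window of frames the
    model has already produced, never on anything in the future."""
    history = deque(maxlen=context_len)
    for frame in input_frames:            # frames arrive one at a time
        out = model(frame, prompt, list(history))
        history.append(out)               # feed the output back as context
        yield out

# Toy stand-in model: blends the live frame with the mean of the history.
def toy_model(frame, prompt, history):
    ctx = np.mean(history, axis=0) if history else np.zeros_like(frame)
    return 0.5 * frame + 0.5 * ctx

frames = [np.full((2, 2), float(i)) for i in range(5)]
outputs = list(generate_stream(toy_model, frames, prompt="night city"))
```

The `deque` with a fixed `maxlen` is what keeps the per-frame cost constant: the context window never grows, so neither does the computation, no matter how long the stream runs.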

Sub-second latency

The most difficult problem we faced was not the quality of the video, but how to process the causal feedback loop fast enough for real-time generation.

To use AI video interactively, new frames must be generated with less than 40 ms of latency; otherwise, the delay becomes noticeable to the human eye. However, causal AI models are computationally intensive, and their design is at odds with the architecture of modern GPUs, which favor large batch execution over low latency.
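The arithmetic behind that budget is simple; the denoising step count below is illustrative, not Decart’s actual configuration:

```python
# A new frame every 40 ms corresponds to 25 fps, roughly the threshold
# below which added delay becomes noticeable to a viewer.
budget_ms = 40
fps = 1000 / budget_ms            # 25.0 frames per second

# If each frame needs several denoising steps (count is illustrative),
# every step must fit into a small slice of that budget, including all
# data movement and kernel launch overhead.
steps = 4
per_step_ms = budget_ms / steps   # 10.0 ms per step
```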

We experimented with several new approaches to circumvent these obstacles, starting with optimizing for the underlying NVIDIA Hopper GPUs. We focused on the kernels: small programs that run on the GPU and perform the individual steps of the computation. Running the model typically involves hundreds of these small kernels, which are constantly stopped and started while data is passed back and forth between them. This wastes a great deal of time and leaves much of the GPU idle.

Our solution was to optimize the kernels for Hopper’s behavior. Essentially, we created a single “mega-kernel” that lets the chip process all of the model’s computations in one continuous pass. This eliminates the stops, starts, and intermediate data movement, keeping the GPU busy for far more of the time and speeding up processing by orders of magnitude.
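As a conceptual sketch in NumPy (purely illustrative: the real win comes from fusing GPU kernels, which Python functions cannot do), compare a staged pipeline that materializes every intermediate result with a fused pass:

```python
import numpy as np

rng = np.random.default_rng(0)

def staged(x, w1, w2):
    """Many small 'kernels': each step finishes and writes its result
    out to memory before the next one starts, the pattern that leaves
    the GPU idle between launches."""
    a = x @ w1                # kernel 1: matrix multiply
    b = np.maximum(a, 0.0)    # kernel 2: activation
    return b @ w2             # kernel 3: matrix multiply

def fused(x, w1, w2):
    """The 'mega-kernel' idea: one continuous pass, with intermediates
    kept in fast on-chip storage instead of round-tripping to memory."""
    return np.maximum(x @ w1, 0.0) @ w2

x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 4))
```

Both versions produce identical results; on a GPU, the fused form avoids the launch overhead and memory traffic between steps, which is where the latency savings come from.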

We think of this as similar to how Henry Ford transformed manufacturing by moving cars down an assembly line, dramatically shortening production times. Instead of one team struggling to fit all the components one by one, constantly stopping and starting, the vehicle moves sequentially from one workstation to the next and is completed much faster.

Pruning and distillation

Another important innovation we implemented is “architecture-aware pruning”: a series of system-level optimizations that reduce the amount of computation required to generate the output.

We are able to do this because neural networks tend to be “over-parameterized,” containing a large number of parameters that are not needed to produce the desired output. Removing these unnecessary parameters means less work for the GPU and also helps adapt the model’s structure to that of the underlying hardware.
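A minimal sketch of the idea in NumPy: plain magnitude pruning with a hardware-alignment twist. The `block` size here is a hypothetical stand-in for whatever tile width the target GPU prefers, not Decart’s actual scheme:

```python
import numpy as np

def prune(w, keep_frac=0.5, block=8):
    """Zero out the smallest-magnitude weights, keeping a count that is
    a multiple of `block` so the surviving weights map cleanly onto the
    hardware's preferred tile sizes (the architecture-aware part)."""
    flat = np.abs(w).ravel()
    k = int(len(flat) * keep_frac)
    k -= k % block                        # align the kept count to blocks
    threshold = np.partition(flat, -k)[-k]
    return w * (np.abs(w) >= threshold)

w = np.arange(1.0, 17.0).reshape(4, 4)    # 16 distinct toy weights
pruned = prune(w)                         # keeps the 8 largest
```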

Finally, we came up with a trick called “shortcut distillation”: fine-tuning smaller, lighter models to match the denoising quality of larger models that require far more processing power.

Using a shortcut model for denoising allows you to generate consistent video frames in fewer steps, and these incremental gains add up quickly, significantly reducing the time it takes to create high-quality output.
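As a toy sketch of the distillation idea (the “models” here are scalar gains, purely illustrative): fit a one-step student to reproduce four steps of a heavier teacher.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_step(x):
    """One expensive denoising step (toy: shrink the noise by 10%)."""
    return 0.9 * x

def run(x, step, n):
    for _ in range(n):
        x = step(x)
    return x

# Shortcut distillation (sketch): fit a single-step student to match
# four teacher steps. The student is just a scalar gain here, a toy
# stand-in for fine-tuning a lighter network on the teacher's outputs.
noise = rng.standard_normal(1000)
target = run(noise, teacher_step, 4)
gain = float(np.dot(noise, target) / np.dot(noise, noise))  # least squares

def student_step(x):
    return gain * x   # one cheap step stands in for four expensive ones
```

The student recovers the teacher’s four-step effect in a single application, which is the per-frame step reduction that makes the 40 ms budget reachable.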

A game changer for AI video

Sub-second latency is a major advance for AI video generation, paving the way for its use in interactive scenarios not previously possible. Continuous editing allows you to generate content that evolves during creation, based entirely on the user’s whims.

TikTok influencers and Twitch streamers can start broadcasting live videos and adapt their content while streaming by typing prompts that come to mind or incorporating suggestions from their viewers.

This could have implications for live video games, allowing for interactive AI-generated sequences that transform based on actions taken by the player. For example, a gamer may be presented with a series of doors and asked to select one, and that choice may lead to a unique outcome. The potential use cases in augmented reality, immersive education, and large-scale event marketing are equally exciting.

AI-generated video also acts as a neural rendering engine for engineers, allowing them to use prompts to completely change the style of different products and experiences. Architects and interior designers can quickly iterate through different themes to see what works best before deciding which direction to go in.

What’s even more interesting is that eliminating lag, combined with the ability to generate unlimited video, lets anyone explore the depths of their imagination and create long-form content. You’ll be able to interactively adjust scenes, lighting, camera angles, and character expressions while a video is being generated, opening the door to a more dynamic creative experience that transforms the way stories are made.


Kfir Aberman is a founding member of Decart AI and leads the San Francisco office, driving real-time generated video research-to-product efforts. His work focuses on building interactive, personalized, real-time AI systems that blend research excellence with creative user experiences.


