Bytedance’s open-weight Helios model enables near real-time 1-minute AI video generation



Helios is the first 14B video model to reach 19.5 FPS on a single GPU while generating videos up to a minute long. The code and model weights are public.

Currently, most video generation models only produce clips of 5 to 10 seconds, which can take several minutes to render. Existing real-time approaches to long video rely on much smaller 1.3-billion-parameter models that struggle with quality. Larger models like Krea-RealTime-14B reach up to 6.7 FPS on an H100 but suffer from severe drift artifacts.

Helios is built on Wan-2.1-14B, which on its own takes about 50 minutes to generate a 5-second video on an A100. Training occurs in three stages: Helios-Base (architecture and anti-drift measures), Helios-Mid (token compression, 1.05 FPS), and Helios-Distilled, which maximizes speed by cutting sampling to just 3 steps.

In the developers’ benchmarks, the distilled version of Helios reaches 19.53 FPS, faster than some much smaller models. SANA Video Long has 2 billion parameters, roughly one seventh the size, but manages only 13.24 FPS.

Bar chart comparing throughput in frames per second between video generation models on a single H100 GPU. With 14 billion parameters, Helios-Distilled reaches 19.53 FPS, which is roughly comparable to the 22.13 FPS of some 1.3 billion models such as Reward Forcing. Other 14B models such as Wan 2.1 and LongCat-Video only reach 0.33 FPS.
Helios-Distilled clocked in at 19.53 FPS on a single H100, close to the speed of much smaller 1.3B models, while other 14B models can drop below 1 FPS. | Image: Yuan et al.
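To put those throughput numbers in perspective, the short sketch below converts the FPS figures quoted in this article into wall-clock time for one minute of footage. The playback rate of 16 frames per second is an assumption for illustration only (the article does not state it), and the function name is made up for this example; swap in the real playback rate to get meaningful numbers.

```python
# Throughput figures quoted in the article (generated frames per second).
THROUGHPUT_FPS = {
    "Helios-Distilled (14B)": 19.53,
    "SANA Video Long (2B)": 13.24,
    "Krea-RealTime-14B": 6.7,
    "Wan 2.1 (14B)": 0.33,
}

# ASSUMPTION: playback rate of the finished video; not stated in the article.
ASSUMED_PLAYBACK_FPS = 16.0


def seconds_to_generate(video_seconds: float,
                        throughput_fps: float,
                        playback_fps: float = ASSUMED_PLAYBACK_FPS) -> float:
    """Wall-clock seconds needed to generate `video_seconds` of footage."""
    total_frames = video_seconds * playback_fps
    return total_frames / throughput_fps


if __name__ == "__main__":
    for name, fps in THROUGHPUT_FPS.items():
        wall_clock = seconds_to_generate(60, fps)
        print(f"{name:24s} {wall_clock:8.1f} s for 60 s of video")
```

The calculation is plain division, so the ranking between models does not depend on the assumed playback rate; only the absolute times do.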

When it comes to video quality, Helios received an overall score of 6.00 for short 81-frame videos. The authors state that this score outperforms all distilled models and is on par with most base models. On long videos, Helios scored 6.94, beating previous leader Reward Forcing’s 6.88. A user study with 200 participants supports that finding.

Two bar graphs comparing the quality of video generation models. The left shows scores for long videos. Helios-Base leads with a score of 6.57, while Helios-Distilled scores 6.34, ahead of smaller models like LongLive at 6.22. The right shows scores for short videos. HV Video 1.5 leads with 6.90, and Helios-Distilled outperforms all distilled models with 6.00.
Helios-Base tops the long video quality rankings and holds its own against significantly larger base models when it comes to short video generation. | Image: Yuan et al.

Longer generated videos typically lose quality, color consistency, and content consistency over time. Previous models have approached this problem with complex techniques such as Self-Forcing, where the model feeds its own output back in as input during training to bridge the gap between training and inference. Helios skips them all.

Four image pairs illustrating typical drift problems in video generation. Position shifts distort the spatial layout, and color shifts push colors into unnatural ranges. Restoration shifts occur in two forms: noise breaks up the image into grainy artifacts, while blurring causes a progressive loss of detail. The left side of each pair shows the original image, and the right side shows the degraded result.
Three typical drift patterns in long video generation: position shifts, color shifts, and restoration artifacts that appear as noise or blur. | Image: Yuan et al.

Instead, the authors identify three typical drift patterns and propose simpler fixes. Relative position encoding prevents the model from hitting unseen position indices in long videos, which can lead to repetitive behavior. A first-frame anchor always keeps the starting frame in memory, giving the model a visual reference point that prevents color shifts. Targeted perturbations simulated during training make the model more resilient to its own errors, which otherwise snowball over time.
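The first two fixes can be illustrated with a toy sketch. This is not the Helios implementation: the function names, the window sizes, and the idea of treating frames as plain integers are all assumptions made for illustration. It shows why absolute position indices keep growing with video length while window-relative indices stay bounded, and how a context can always carry frame 0 as an anchor.

```python
from typing import List


def absolute_positions(segment_index: int, frames_per_segment: int,
                       context_frames: int) -> List[int]:
    """Absolute frame indices visible while generating one segment.

    These grow without bound as the video gets longer, so a model can
    meet indices it never saw during training on short clips.
    """
    end = (segment_index + 1) * frames_per_segment
    start = max(0, end - frames_per_segment - context_frames)
    return list(range(start, end))


def relative_positions(segment_index: int, frames_per_segment: int,
                       context_frames: int) -> List[int]:
    """Same window, re-based to start at 0 so indices stay bounded."""
    window = absolute_positions(segment_index, frames_per_segment, context_frames)
    return [p - window[0] for p in window]


def build_context(history: List[int], max_recent: int) -> List[int]:
    """Keep the first frame as a permanent anchor plus the newest frames."""
    if len(history) <= max_recent:
        return list(history)
    return [history[0]] + history[-max_recent:]


if __name__ == "__main__":
    # Hypothetical sizes: 4 frames per segment, 8 frames of context.
    for seg in (0, 10, 100):
        abs_max = max(absolute_positions(seg, 4, 8))
        rel_max = max(relative_positions(seg, 4, 8))
        print(f"segment {seg:3d}: max absolute={abs_max:3d}, max relative={rel_max}")
    print(build_context(list(range(20)), max_recent=3))
```

In this toy setup the largest relative index never exceeds the window size, no matter how long the video runs, which is the property the article attributes to relative position encoding.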

One model handles text, image, and video inputs

Helios uses a unified architecture that supports text-to-video, image-to-video, and video-to-video conversion in a single framework. The model automatically switches tasks based on the contents of the previous context.

If the context is empty, the model generates from text alone. If only the last frame is present, it acts as an image animator. If multiple frames are available, it continues the existing video. Users can also swap out text prompts mid-generation. Gradual crossfades between old and new prompts prevent abrupt visual breaks.
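The switching rule described above can be sketched in a few lines. Everything here is illustrative: the function names are hypothetical, and the linear blend of prompt embeddings is only one assumed way to realize a gradual crossfade, since the article does not say how Helios implements it.

```python
from typing import List, Sequence


def select_task(context_frames: Sequence) -> str:
    """Pick the generation mode from how much visual context exists."""
    if len(context_frames) == 0:
        return "text-to-video"
    if len(context_frames) == 1:
        return "image-to-video"
    return "video-to-video"


def crossfade_weights(num_segments: int) -> List[float]:
    """Weight of the NEW prompt per segment, ramping linearly up to 1.0.

    ASSUMPTION: a linear ramp; the real schedule is not given in the article.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    return [(i + 1) / num_segments for i in range(num_segments)]


def blend_prompts(old: Sequence[float], new: Sequence[float],
                  weight: float) -> List[float]:
    """Linear interpolation between two prompt embeddings."""
    return [(1 - weight) * o + weight * n for o, n in zip(old, new)]


if __name__ == "__main__":
    print(select_task([]))                  # no context at all
    print(select_task(["frame_0"]))         # a single frame
    print(select_task(["f0", "f1", "f2"]))  # several frames
    for w in crossfade_weights(4):
        print(w, blend_prompts([0.0, 2.0], [1.0, 0.0], w))
```

Ramping the weight over several segments instead of switching in one step is what keeps the transition between prompts gradual.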

Structural diagram of Helios. On the left is a hierarchical memory structure that compresses the historical context into long-term, medium-term, and short-term memory. In the center are expression controls for switching between text-to-video, image-to-video, and video-to-video. On the right are DiT blocks with guidance self-attention and guidance cross-attention.
Helios’ architecture compresses historical context across three timescales and automatically switches between text, image, and video inputs. | Image: Yuan et al.

The model was trained in three stages on 800,000 short video clips, each less than 10 seconds long. Resolution currently tops out at 384 x 640 pixels, and flickering artifacts still occur at segment transitions. Because there are no open benchmarks for real-time long video generation, the researchers built their own test dataset, HeliosBench, with 240 prompts.

Aggressive compression reduces computational costs

Helios reaches its speed without common acceleration techniques such as KV caching, sparse attention, or quantization. Instead, the model aggressively compresses the input data at two levels.

A hierarchical memory structure divides the video history into three time scales: newer frames are compressed less, older frames more. This reduces the number of tokens to process by a factor of 8.
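A rough sketch of how tiered compression shrinks the context: the tier sizes, compression factors, and tokens-per-frame value below are hypothetical numbers chosen for easy arithmetic, not the Helios configuration, so the resulting ratio is not expected to match the factor of 8 reported by the authors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryTier:
    name: str
    num_frames: int    # how many history frames fall into this tier
    compression: int   # token reduction factor applied to those frames


# ASSUMPTION: illustrative values only; not taken from the paper.
TOKENS_PER_FRAME = 960
TIERS: List[MemoryTier] = [
    MemoryTier("short-term", num_frames=2, compression=1),
    MemoryTier("medium-term", num_frames=4, compression=4),
    MemoryTier("long-term", num_frames=16, compression=16),
]


def naive_tokens(tiers: List[MemoryTier], tokens_per_frame: int) -> int:
    """Token count if every history frame is kept at full size."""
    return sum(t.num_frames for t in tiers) * tokens_per_frame


def compressed_tokens(tiers: List[MemoryTier], tokens_per_frame: int) -> int:
    """Token count when older tiers are compressed more strongly."""
    return sum(t.num_frames * (tokens_per_frame // t.compression) for t in tiers)


if __name__ == "__main__":
    full = naive_tokens(TIERS, TOKENS_PER_FRAME)
    small = compressed_tokens(TIERS, TOKENS_PER_FRAME)
    print(f"naive: {full} tokens, compressed: {small} tokens, "
          f"reduction: {full / small:.2f}x")
```

Because the oldest tier holds the most frames and gets the strongest compression, most of the savings come from the long-term history.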

The multi-stage sampling process reduces the tokens of the generated video segments by a factor of 2.29. Early steps are performed at low resolution, and only later steps fill in details. Combining these two techniques reduces the computing cost to about the same level as producing a single image.
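As a back-of-the-envelope check, the sketch below averages the token count over a coarse-to-fine schedule. The schedule itself is an assumption, since the article does not spell it out: three stages at one quarter, one half, and full resolution per side, with the same number of steps in each. Under that assumption the average reduction comes out close to the 2.29 figure quoted above.

```python
from typing import Sequence


def average_token_fraction(downscale_per_side: Sequence[int],
                           steps_per_stage: Sequence[int]) -> float:
    """Average share of full-resolution tokens processed per sampling step.

    Downscaling each side by a factor s cuts the token count by s * s.
    """
    if len(downscale_per_side) != len(steps_per_stage):
        raise ValueError("need one step count per stage")
    total_steps = sum(steps_per_stage)
    weighted = sum(steps / (scale * scale)
                   for scale, steps in zip(downscale_per_side, steps_per_stage))
    return weighted / total_steps


# ASSUMPTION: hypothetical three-stage schedule, one step per stage.
fraction = average_token_fraction([4, 2, 1], [1, 1, 1])
reduction = 1 / fraction

if __name__ == "__main__":
    print(f"average token fraction: {fraction:.4f}")
    print(f"reduction factor: {reduction:.2f}")
```

The arithmetic is (1/16 + 1/4 + 1) / 3 = 0.4375, so each step processes on average well under half of the full-resolution tokens.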

Three line graphs comparing the naive approach and Helios as the context length increases. On the left, the naive approach grows to over 17,000 tokens, while Helios remains below 2,500. The center shows GPU memory usage, where the naive approach runs into an out-of-memory error at context length 6. On the right, the inference time per step reaches 20 seconds for the naive approach, while Helios stays under 5 seconds.
As context length increases, the number of tokens, memory usage, and inference time increase linearly for the naive approach, while Helios remains roughly flat. The naive approach runs out of memory at a context length of 6. | Image: Yuan et al.

The dedicated distillation technique also reduces the number of sampling steps required per video segment from 50 to 3. Unlike previous approaches, Helios only uses real video data as context and generates only one segment per training step. GAN-like adversarial training objectives improve quality beyond the limitations of supervised models.

Token compression allows Helios to be trained through the first two stages on a single GPU. The third stage requires four complete models to run simultaneously, but thanks to various memory optimizations, these fit into 80 GB of GPU memory. According to the researchers, custom compute kernels for common operations speed up both training and inference by about 14 percent over standard implementations.

Helios is available as an open-weight model on GitHub and Hugging Face, which also hosts a live demo. Sample videos are available on the project page. The project is strictly for research purposes, and there are no plans to integrate it into Bytedance products.

Bytedance recently made waves with Seedance 2.0, a multimodal video generation model that processes images, video, audio, and text all at once. Seedance requires significantly more compute and is limited to 15-second clips, but provides much higher visual quality. The quality was so high that it set off alarm bells in Hollywood over the possibility of large-scale copyright infringement.


