Helios is the first 14B video model to reach 19.5 FPS on a single GPU while creating one minute of video. The code and model weights are public.
Currently, most video generation models only produce 5-10 second clips, which can take several minutes to render. The long video real-time approach relies on a much smaller 1.3 billion model that struggles with quality. Larger models like the Krea-RealTime-14B reach up to 6.7 FPS with H100, but suffer from severe drift artifacts.
Helios is built on Wan-2.1-14B and takes about 50 minutes to generate a 5-second video on A100. Training occurs in three stages: Helios-Base (architecture and anti-drift), Helios-Mid (token compression, 1.05 FPS), and Helios-Distilled, which maximizes speed by reducing computation to just 3 steps.
In developer benchmarks, the refined version of Helios reaches 19.53 FPS, which is even faster than some much smaller models. SANA Video Long has 2 billion parameters and is roughly 1/7th the size, but can only manage 13.24 FPS.

When it comes to video quality, the Helios received an overall score of 6.00 for short 81-frame videos. The authors state that this size outperforms any distillation model and is on par with most basic models. In the long video, Helios scored a 6.94, beating previous leader Reward Forcing’s 6.88. A user study with 200 participants supports that finding.

Longer produced videos typically lose quality, color consistency, and content consistency over time. Previous models have approached this problem with complex techniques such as self-enforcement, where the model feeds back its own output as input during training to bridge the gap between training and inference. Helios skips them all.

Instead, the authors identify three typical drift patterns and suggest simpler fixes. Relative position coding prevents the model from hitting unknown position indices in long videos, which can lead to repetitive behavior. The first frame anchor always keeps the starting frame in memory and provides the model with a visual reference point to prevent color shifts. Targeted perturbation simulations during training make the model more resilient to its own errors that snowball over time.
One model handles text, image, and video inputs
Helios uses a unified architecture that supports text-to-video, image-to-video, and video-to-video conversion in a single framework. The model automatically switches tasks based on the contents of the previous context.
If the context is empty, the model is generated from text. If only the last frame is present, it acts as an image animator. If multiple frames are available, the existing video will continue. Users can also swap out text prompts mid-generation. Gradual crossfades between old and new prompts prevent unpleasant visual interruptions.

The model was trained in three stages using 800,000 short video clips, each less than 10 seconds long. The resolution is currently the highest at 384 x 640 pixels, but flickering artifacts still occur during segment transitions. There are no open benchmarks for real-time long videos, so the researchers built their own test dataset called HeliosBench with 240 prompts.
Aggressive compression reduces computational costs
Helios achieves speed goals without using common acceleration techniques such as KV caching, sparse attention, or quantization. Instead, the model aggressively compresses the input data at two levels.
A hierarchical memory structure divides the video history into three time scales. Newer frames have less compression, older frames have more compression. This reduces the number of tokens to process by a factor of 8.
The multi-stage sampling process reduces the tokens of the generated video segments by a factor of 2.29. Early steps are performed at low resolution, and only later steps fill in details. Combining these two techniques reduces the computing cost to about the same level as producing a single image.

The dedicated distillation technique also reduces the number of computational steps required per video segment from 50 to 3. Unlike previous approaches, Helios only uses real video data as context and generates only one segment per training step. GAN-like adversarial training objectives improve quality beyond the limitations of supervised models.
Token compression allows Helios to be trained through the first two stages on a single GPU. The third stage requires four complete models to run simultaneously, but thanks to various memory optimizations, these fit into 80 GB of GPU memory. According to the researchers, custom compute kernels for common operations speed up both training and inference by about 14 percent over standard implementations.
Helios is available as an open weight model on GitHub and Hugging Face, which also hosts a live demo. A sample of the generated video is available on the project page. This project is strictly for research purposes and has no plans to integrate into Bytedance products.
Bytedance recently made waves with Seedance 2.0, a multimodal video generation model that processes images, video, audio, and text all at once. Seedance requires significantly more computing and is limited to 15 second clips, but provides much higher visual quality. The quality was so high that alarm bells were rung in Hollywood about the possibility of large-scale copyright infringement.
