Google Veo 3 is an impressive video generation model recently announced by Google, and it has sparked widespread excitement across the Internet. Its capabilities have left many in awe, and some have even called it scary good. The model features audio synthesis and cinematic tools, setting new benchmarks for AI-powered video generation.
The tech world has celebrated the launch of Google's Veo 3, but ByteDance has quietly released something even better. TikTok's parent company recently published a research paper on Seedance 1.0, a bilingual video generation model that tops independent leaderboards in both text-to-video and image-to-video generation.
ByteDance did not launch the model with events or demos. Instead, its technical benchmarks put the company in the spotlight without any serious marketing effort. The model is built to support high-resolution, multi-shot generation while maintaining fast inference and strict instruction following.
How Seedance 1.0 crushes Veo 3
In its research paper, the company describes the architecture as decoupling spatial and temporal layers with an interleaved multimodal positional encoding. According to the paper, this lets a single model jointly learn both text-to-video and image-to-video tasks and natively support multi-shot video generation.
This approach allows the model to handle complex scene transitions and multi-shot storytelling while keeping the subject consistent across shots.
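To make the idea of separated spatial and temporal layers concrete, here is a minimal Python sketch of a generic factorized attention pattern. It is an illustration only, not ByteDance's implementation: the function names, the toy grid size, and the choice to let each temporal group span every frame at one spatial position are all assumptions made for this example. It simply counts how many token pairs each kind of layer compares, next to full attention over every token in the clip.

```python
def token_index(t, y, x, height, width):
    """Flatten a (frame, row, column) position into a single token index."""
    return (t * height + y) * width + x


def spatial_groups(frames, height, width):
    """One group per frame: a spatial layer attends only within a frame."""
    return [
        [token_index(t, y, x, height, width)
         for y in range(height) for x in range(width)]
        for t in range(frames)
    ]


def temporal_groups(frames, height, width):
    """One group per spatial position: a temporal layer attends across frames.

    Assumption for this sketch: every frame is visible to the temporal layer.
    """
    return [
        [token_index(t, y, x, height, width) for t in range(frames)]
        for y in range(height) for x in range(width)
    ]


def attention_pairs(groups):
    """Number of query-key pairs when attention is restricted to each group."""
    return sum(len(group) ** 2 for group in groups)


if __name__ == "__main__":
    frames, height, width = 4, 2, 3  # toy sizes, chosen only for illustration
    total_tokens = frames * height * width

    full = total_tokens ** 2
    spatial = attention_pairs(spatial_groups(frames, height, width))
    temporal = attention_pairs(temporal_groups(frames, height, width))

    print("full attention pairs:    ", full)
    print("spatial + temporal pairs:", spatial + temporal)
```

For this toy grid of 4 frames with 2 by 3 tokens each, the factorized layers compare 240 token pairs against 576 for full attention, and the gap widens as resolution and clip length grow.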
An important part of the model's performance comes from ByteDance's data pipeline. The team curated a large multi-source dataset with detailed bilingual captions and dense annotations of motion and static features. Caption accuracy was prioritized to improve prompt adherence during generation. This was combined with a new reinforcement learning setup using three reward models that focus on basic alignment, motion quality and aesthetics.
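The article does not say how the three reward signals are combined, so the following Python sketch is purely hypothetical. It assumes each reward model returns one score per generated video and that the scores are merged with a simple weighted sum; the class name, function names, weights and scores are all invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RewardScores:
    """Scores from three hypothetical reward models for one generated video."""
    alignment: float   # how well the video matches the prompt
    motion: float      # how plausible the motion looks
    aesthetics: float  # how visually pleasing the frames are


def combined_reward(scores, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three scores (an assumed combination rule)."""
    w_align, w_motion, w_aes = weights
    return (w_align * scores.alignment
            + w_motion * scores.motion
            + w_aes * scores.aesthetics)


def pick_best(candidates, weights=(1.0, 1.0, 1.0)):
    """Return the name of the candidate with the highest combined reward."""
    return max(candidates, key=lambda name: combined_reward(candidates[name], weights))


if __name__ == "__main__":
    # Made-up scores for two candidate videos generated from the same prompt.
    candidates = {
        "sample_a": RewardScores(alignment=0.5, motion=0.25, aesthetics=0.25),
        "sample_b": RewardScores(alignment=0.75, motion=0.5, aesthetics=0.25),
    }
    print(pick_best(candidates))
```

In a setup like this, changing the weights shifts what the model is pushed towards, for example favouring motion quality over aesthetics.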
In evaluation, Seedance 1.0 outperformed Veo 3 in multiple dimensions. On the SeedVideoBench benchmark, designed in collaboration with film directors, the model scored higher for prompt following and motion realism.
In particular, in the image-to-video task, Seedance retained more visual consistency with the input frame, while Veo 3 showed occasional changes in lighting and texture, the research paper argues.
Inference performance is another standout. In terms of speed, Seedance 1.0 leaves the rest behind. The company claims that it produces a 5-second video at 1080p in just 41.4 seconds on a single NVIDIA L20, which it says makes inference an order of magnitude faster than rivals such as Sora, Runway Gen-4 and, of course, Veo 3.
ByteDance also said it reduced costs and latency in a way that could push video generation towards real-time use cases.
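As a rough sanity check on how far that is from real time, the quoted figure of 41.4 seconds for a 5-second clip can be turned into a ratio of compute time to clip length. The short Python sketch below does only that arithmetic; the function name is made up for this example, and the calculation says nothing about other hardware or settings.

```python
def real_time_factor(generation_seconds, clip_seconds):
    """Seconds of compute needed per second of generated video.

    A value of 1.0 or less would mean generation keeps up with playback.
    """
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    return generation_seconds / clip_seconds


if __name__ == "__main__":
    # Figures quoted in the article: a 5-second 1080p clip in 41.4 seconds.
    factor = real_time_factor(41.4, 5.0)
    print(f"{factor:.2f} seconds of compute per second of video")
```

By this measure the quoted figure works out to roughly 8.3 seconds of compute per second of video, so real-time use would still need a further speedup of about that size.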
Furthermore, the model has topped the Artificial Analysis leaderboards for both text-to-video and image-to-video generation tasks.
Revisiting Veo 3 for comparison
Veo 3 remains a technically ambitious system. It introduced video generation with integrated audio and lets users control camera movement and shot composition via Google's Flow tool. Early user responses emphasized the novelty of synchronized dialogue and dynamic environments, placing it at the forefront of audiovisual production.
However, in a direct comparison, Veo 3 appears to fall short on visual alignment and frame consistency. The Seedance 1.0 research paper noted that Veo's image-to-video results can change the subject's appearance and the scene lighting, which hurts overall quality. Veo managed to expand the modalities of generated video, but it lags on traditional benchmarks.
In contrast, Seedance 1.0 focuses on visual consistency and motion plausibility, using structured reinforcement learning and curated fine-tuning data. Its strength lies in reliability and controllability, which matter most for multi-shot or long-form sequences and for producing professional or semi-automated content.
Seedance 1.0, scheduled for integration into platforms such as Doubao and Jimeng in June 2025, is poised to become an important productivity tool. Its aim is to significantly improve both professional workflows and everyday creative tasks.
Veo 3 attracted attention as the first model to combine realistic video with ambient sound and dialogue, while Seedance 1.0 achieves better visual fidelity, motion stability and narrative consistency but has no audio capabilities.
