ByteDance has released Seedance 2.0 to a limited group of users. The previous model was already one of the most powerful AI video generators available, and the new version is even more advanced.
The multimodal video generation model processes up to four types of input at once: image, video, audio, and text. Users can combine up to nine images, three videos, and three audio files, capped at 12 files in total. The generated video is between 4 and 15 seconds long and automatically includes sound effects or music.
The demo videos come directly from ByteDance and are almost certainly hand-picked from a large batch of generated clips. No one yet knows how consistently the model will reach this quality in real-world use, how much it will cost, or how long generation will take, so what we're looking at is probably a best-case scenario. And while these features look impressive on paper, there are still major hurdles to integrating them into professional workflows, such as consistency. Still, the quality on display is genuinely impressive.
Prompt: The camera follows a man in black who is fleeing quickly, with a large crowd chasing him from behind. The camera switches to a horizontal tracking shot. The panicked man knocks over a roadside fruit stand, gets up, and keeps running. The excited cries of the crowd can be heard in the background.
Prompt: A girl gracefully hangs up laundry. After hanging one piece, she takes the next item out of the bucket and shakes it out vigorously.
According to ByteDance, the standout new feature is the reference function: the model can take camerawork, movement, and special effects from uploaded reference videos, swap out characters, and seamlessly extend existing clips. Video editing tasks such as replacing or adding characters also work.
The user writes a simple text command such as: "Use @image1 as the first frame of the scene. First-person view. Take the camera movement from @Video1. The scene above is based on @Frame2, the scene on the left on @Frame3, and the scene on the right on @Frame4."
The user records the camera movement…
…and the AI model transfers it, along with other elements, to the generated video.
For compliance reasons, ByteDance currently blocks realistic human faces in uploaded material. Seedance 2.0 is so far available only as a beta on Jimeng's official website (jimeng.jianying.com).
Prompt: The person in the photo has a guilty expression, her eyes darting from side to side; then she leans out of the picture frame. She quickly reaches out of the frame, grabs the Coke, and takes a sip, a satisfied expression on her face. Footsteps can be heard. The person in the photo quickly returns the Coke to its original spot. A western cowboy walks in, picks up the Coke, and leaves. Finally, the camera moves forward and the background slowly fades to black, with only a spotlight from above illuminating the Coke can. Cleverly designed subtitles appear at the bottom of the screen along with the narrator's voice: "Yikou Cola – Must try!"
The release comes just days after competitor Kuaishou announced its Kling 3.0 model, which also supports multimodal input and output. The AI video race is also playing out on China's stock market: according to the South China Morning Post, the launch of these powerful video models has pushed the stock prices of Chinese media and AI companies up by as much as 20 percent.