STIV: Scalable Text and Image Conditional Video Generation

The field of video generation has made remarkable advances, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method named STIV. Our framework integrates the image condition into a Diffusion Transformer (DiT) through frame replacement and incorporates text conditioning via joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. Additionally, STIV can be easily extended to a variety of applications such as video prediction, frame interpolation, multi-view generation, and long video generation. Comprehensive ablation studies on text-to-image (T2I), T2V, and TI2V demonstrate STIV's strong performance despite its simple design. At 512 resolution, the 8.7B model achieves 83.1 on VBench T2V, surpassing leading open- and closed-source models such as CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on the VBench I2V task at 512 resolution. By providing transparent, scalable recipes for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.
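The two conditioning mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `replace_first_frame` and `joint_cfg` are hypothetical, latents are plain Python lists rather than tensors, and the guidance rule is assumed to take the standard classifier-free form, uncond + scale * (cond - uncond), where the conditional branch sees both text and image and the unconditional branch drops both.

```python
from typing import List, Optional

Frame = List[float]   # one flattened latent frame (toy stand-in for a tensor)
Video = List[Frame]   # sequence of latent frames


def replace_first_frame(noisy: Video, image_latent: Optional[Frame]) -> Video:
    """Return a copy of `noisy` whose first frame is the clean image latent.

    Passing None models dropping the image condition, which leaves a
    plain text-to-video input. (Hypothetical helper, for illustration.)
    """
    if image_latent is None:
        return [list(f) for f in noisy]
    if len(image_latent) != len(noisy[0]):
        raise ValueError("image latent must match the frame size")
    return [list(image_latent)] + [list(f) for f in noisy[1:]]


def joint_cfg(cond: Video, uncond: Video, scale: float) -> Video:
    """Combine two model predictions as uncond + scale * (cond - uncond).

    `cond` is assumed to come from a pass with text and image conditions,
    `uncond` from a pass with both conditions dropped.
    """
    return [
        [u + scale * (c - u) for c, u in zip(cond_frame, uncond_frame)]
        for cond_frame, uncond_frame in zip(cond, uncond)
    ]
```

In this sketch a single guidance scale steers both conditions at once, and with `scale = 1.0` the result reduces to the conditional prediction.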