CraftStory, a pioneer in human-centric AI-generated video, today announced the release of its first image-to-video model, which allows users to generate videos up to five minutes long.
This new feature expands on the company's existing video-to-video model, called Model 2.0, which was launched in November 2025.
As more companies leverage video as a form of communication, image-to-video workflows have the potential to drive use cases such as marketing and advertising, business communications, and educational content, enabling teams to create consistent “on-camera” human performances without traditional production.
Currently, most video generation models struggle to produce consistent footage longer than 10 to 30 seconds. To create longer videos, users often stitch together short clips into a longer story, but successive generations can drift, producing slightly different faces, costumes, lighting, or motion and creating continuity problems. With advanced AI workflows and tools, videos can run longer than two minutes, but longer stories can quickly dissolve into algorithmic chaos.
CraftStory achieves long-form generation using a parallel diffusion pipeline that processes different segments simultaneously. This approach allows the platform to enforce continuity between clips and maintain visual consistency across minutes of footage.
“Image-to-video conversion is a big step toward fully script-driven video creation,” says Victor Erkhimov (pictured), founder and CEO of CraftStory, who previously founded computer vision startup Itseez Inc. and sold it to Intel. “You no longer need to record video to get a realistic human performance.”
Erkhimov said Model 2.0 allows users to start with just an image and achieve a lifelike human presence in longer videos, with gestures and expressiveness that match the message.
The model was trained using high frame rate footage of real actors, capturing the dynamics of facial expressions, hand movements, and body language. According to CraftStory, this allows for the creation of faithful human “actors” that feel fluid and authentic, rather than static or robotic.
Videos can be produced in both portrait and landscape formats at 480p or 720p, and can be upscaled to 1080p for high-end output. The company also introduced support for moving cameras, allowing users to create walk-and-talk videos of up to 80 seconds with natural movement throughout the scene.
Users can create videos from a single image and a script or audio track. The system generates scenes according to a script, with AI actors lip-syncing while built-in gesture adjustments aim to keep body movements natural and match the rhythm and emotion of speech.
Image: CraftStory
About SiliconANGLE Media
Founded by technology visionaries John Furrier and Dave Vellante, SiliconANGLE Media has built a dynamic ecosystem of industry-leading digital media brands that reach more than 15 million elite technology professionals. Our new, proprietary theCUBE AI Video Cloud leverages theCUBEai.com neural networks to deliver breakthrough advances in audience interaction, helping technology companies make data-driven decisions and stay at the forefront of industry conversations.
