Logically named, San Francisco-based CraftStory specializes in AI-generated videos of realistic humans. The latest iteration of the company's platform, CraftStory Model 2.0, produces lifelike, studio-quality, long-form videos featuring “humans” from a single image and a script. In other words, given a written script and a single image file, the AI lets users generate a five-minute video of a human speaking and moving.
The company unveiled its first video-to-video model in November 2025. That model let users generate up to five minutes of video by animating still images with motion captured from “driving” videos (i.e., source footage that supplies the movement).
Model 2.0 builds on CraftStory's existing model suite and introduces features that eliminate the need for source footage. Companies can now create expressive, long-form videos from just photos and text while maintaining the realism, continuity, and performance quality previously possible only through video-to-video workflows.
Great technology, but who would want a service like this? The company says there is a huge market for training and demonstration videos that showcase people, places, and products in situations where getting into a studio is impractical and source footage (including the models themselves) is hard to obtain.
Video as a primary communication channel
“As video becomes the primary communication channel for businesses, [creative and commercial] teams face common bottlenecks. Creating consistent, human-driven content at scale remains slow, expensive, and difficult to update. Short AI clips exist, but they often lack expressive movement, break down over time, or fail to maintain realism beyond a few seconds,” says CraftStory in its technical documentation.
The company claims its Image-to-Video model addresses this gap by turning a single image into a complete performance driven solely by a script or audio.
The system produces “natural facial expressions” along with natural-looking body language and gestures that “consistently evolve” over time. These qualities make the technology suitable for creating product explainers (i.e., how-to videos and demonstrations), training videos, customer communications, and educational content.
Script-driven video creation
“Image-to-video conversion is a huge step towards fully script-driven video creation,” said Victor Erukhimov, founder and CEO of CraftStory. “You no longer need to record video to get a realistic human performance. If you have an image and something to say, Model 2.0 can transform it into a reliable, long-form video with gestures and expressions that fit your message.”
In image-to-video conversion, users upload a single image of a person plus a script or audio track. CraftStory Model 2.0 then synthesizes a complete video performance, animating both the person and the environment with realistic lip-syncing, expressive gestures, and scene movement that match the rhythm and emotional tone of the speech.
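CraftStory has not published its API here, so the following is a purely hypothetical sketch of what an image-plus-script request could look like over HTTP. The endpoint, field names, and the generate_video helper are all invented for illustration; only the inputs (one image, one script) and output formats come from the article.

```python
# Hypothetical image-to-video request; the endpoint and every field name
# below are illustrative assumptions, not CraftStory's documented API.
import requests

API_URL = "https://api.example.com/v2/generate"  # placeholder endpoint

def generate_video(image_path: str, script: str, api_key: str) -> str:
    """Submit one source photo plus a script; return a job ID to poll."""
    with open(image_path, "rb") as image_file:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"image": image_file},            # the single source photo
            data={
                "script": script,                   # text the avatar will speak
                "resolution": "720p",               # 480p or 720p, per the article
                "orientation": "portrait",          # portrait or landscape
            },
            timeout=60,
        )
    response.raise_for_status()
    return response.json()["job_id"]                # poll this ID for the result
```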
This model shares the same core architecture as CraftStory's video-to-video system, including advanced gesture-generation algorithms that infer appropriate hand and body movements directly from audio, a high-fidelity lip-sync feature that produces natural speech intelligibility over long sequences, and identity preservation that maintains a consistent look, emotion, and nuance across minutes of video.
According to CEO Erukhimov, “Model 2.0 also includes an advanced lip-sync system that turns any script or audio track into a realistic performance. Built-in gesture adjustment algorithms ensure that body movements naturally match the rhythm and emotion of speech, bringing human expressiveness to AI-generated content.”
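The internals of those three components are proprietary, so the sketch below only illustrates how such stages might be chained: gestures and visemes inferred from audio per frame, then every frame re-anchored to the reference image. All function names, the Frame type, and the fixed 100-frame length are invented stand-ins.

```python
# Illustrative staging of the three components the article names; each
# model is proprietary, so every stage here is a labeled stub.
from dataclasses import dataclass

@dataclass
class Frame:
    pixels: bytes       # placeholder for rendered image data
    mouth_shape: str    # viseme chosen by the lip-sync stage
    pose: str           # body pose chosen by the gesture stage

def infer_gestures(audio: bytes) -> list[str]:
    """Stub: map speech rhythm and emotion to one pose per frame."""
    return ["neutral"] * 100

def lip_sync(audio: bytes) -> list[str]:
    """Stub: produce one viseme per frame, stable over long sequences."""
    return ["rest"] * 100

def preserve_identity(reference_image: bytes, frames: list[Frame]) -> list[Frame]:
    """Stub: keep every frame consistent with the reference face."""
    return frames

def render(reference_image: bytes, audio: bytes) -> list[Frame]:
    poses = infer_gestures(audio)
    visemes = lip_sync(audio)
    frames = [Frame(pixels=b"", mouth_shape=v, pose=p)
              for v, p in zip(visemes, poses)]
    return preserve_identity(reference_image, frames)
```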
CraftStory also introduces support for moving cameras. Model 2.0 can now generate walk-and-talk videos of up to 80 seconds, in which a person moves naturally through the scene while speaking and the camera tracks their movement. This allows for dynamic, cinematic shots that stand out from static, locked-off footage. The feature is currently in beta and will be rolled out gradually to existing accounts.
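As a hedged illustration of that constraint, here is how a client might validate a walk-and-talk request before submitting it. Only the 80-second cap comes from CraftStory's description; the option names are hypothetical.

```python
# Hypothetical request options for the walk-and-talk beta; the 80-second
# cap is from the article, the field names are invented for illustration.
MAX_WALK_AND_TALK_SECONDS = 80  # stated limit for moving-camera shots

def build_walk_and_talk_options(duration_s: int) -> dict:
    """Return request options for a moving-camera clip, enforcing the cap."""
    if duration_s > MAX_WALK_AND_TALK_SECONDS:
        raise ValueError(
            f"walk-and-talk clips are capped at {MAX_WALK_AND_TALK_SECONDS}s"
        )
    return {
        "camera": "tracking",        # camera follows the subject
        "subject_motion": "walking", # person moves through the scene
        "duration_s": duration_s,
        "beta": True,                # feature is rolling out gradually
    }
```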
Unique parallelized diffusion pipeline
At the core of Model 2.0 is a unique parallelized diffusion pipeline designed to extend human video generation beyond short clips. By processing different time segments simultaneously while enforcing global consistency, the system maintains visual coherence over minutes of footage, a key challenge in long-duration video synthesis.
Technical description: The specific implementation, source code, and algorithms are the proprietary intellectual property of CraftStory. The pipeline is parallelized, meaning the computational tasks of the diffusion process are divided among multiple processing units, such as graphics processing units (GPUs), Tensor Processing Units (TPUs, the application-specific integrated circuits, or ASICs, that Google developed for neural-network workloads), or other neural processing units (NPUs), and executed simultaneously rather than sequentially. The central goal of this parallelized diffusion pipeline is to accelerate inference or training for large-scale diffusion models and reduce the time required to produce high-quality output.
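Since the actual pipeline is proprietary, the sketch below shows only the general technique: denoise overlapping time segments in parallel, then cross-fade the overlaps so neighboring segments stay visually consistent. The denoise_segment stand-in and the segment/overlap constants are assumptions, not CraftStory's method.

```python
# Generic sketch of segment-parallel video denoising with overlap blending;
# CraftStory's real pipeline is proprietary, so the denoiser is a stand-in.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

SEGMENT = 64  # frames per segment, processed independently in parallel
OVERLAP = 8   # shared frames blended to keep neighboring segments consistent

def denoise_segment(noisy: np.ndarray) -> np.ndarray:
    """Stand-in for a diffusion denoiser running on one GPU/worker."""
    return noisy * 0.5  # placeholder computation

def generate_long_video(noisy_video: np.ndarray) -> np.ndarray:
    """Denoise a (frames, H, W, C) array segment-by-segment, in parallel."""
    # Slice the timeline into overlapping segments.
    starts = range(0, len(noisy_video) - OVERLAP, SEGMENT - OVERLAP)
    segments = [noisy_video[s:s + SEGMENT] for s in starts]

    # Process all segments simultaneously rather than sequentially.
    with ProcessPoolExecutor() as pool:
        denoised = list(pool.map(denoise_segment, segments))

    # Cross-fade each overlap so segment boundaries stay consistent.
    out = denoised[0]
    for seg in denoised[1:]:
        weights = np.linspace(0, 1, OVERLAP).reshape(-1, 1, 1, 1)
        blended = out[-OVERLAP:] * (1 - weights) + seg[:OVERLAP] * weights
        out = np.concatenate([out[:-OVERLAP], blended, seg[OVERLAP:]])
    return out
```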
The model was trained on high-frame-rate footage of real actors to capture subtle facial dynamics and expressive hand and body movements. As a result, the image-to-video output feels fluid and human rather than static or robotic. Video can be produced in 480p and 720p, in both portrait and landscape formats, with optional upscaling to 1080p.
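Those output options can be summarized in a small table of presets. Only the named formats come from the article; the preset layout and the 16:9 pixel dimensions are assumptions.

```python
# Output formats the article lists; the dictionary layout and the assumed
# 16:9 pixel dimensions are illustrative, not CraftStory's config schema.
RENDER_PRESETS = {
    "480p-portrait":  {"width": 480,  "height": 854},
    "480p-landscape": {"width": 854,  "height": 480},
    "720p-portrait":  {"width": 720,  "height": 1280},
    "720p-landscape": {"width": 1280, "height": 720},
}
UPSCALE_TARGET = {"width": 1920, "height": 1080}  # optional 1080p upscaling
```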
Coming soon to a screen near you
Looking ahead, CraftStory says it is evolving Model 2.0 toward a “fully automated” text-to-video workflow, with a focus on making marketing video creation faster, simpler, and more scalable for everyday use. The company's YouTube page provides entertaining viewing.

