AI video tools and how they’re changing business communications [Q&A]

The use of AI video has increased explosively in the past year. But while it's deepfakes that grab the headlines, the technology also has the potential to change the way companies create and use video content in their messaging.

To learn more about AI video and how businesses can benefit from it, we spoke to Victor Erukhimov, CEO of CraftStory.

BN: How do you think AI video tools will change the creation and use of video content for businesses?

VE: AI video tools allow businesses to significantly reduce the effort and time required to create video content. This covers everything from ads and launch videos to explainers, tutorials, and educational content. What once required a dedicated team, motion graphics, and custom animation can now be created in hours and updated on demand.

At the same time, the demand for video has exploded. With mobile-first consumption and social media shaping the way people learn and buy, businesses need to communicate through video now more than ever.

But there's a problem. Most on-camera videos are boring, repetitive, and look the same at every company; they all blend together. More dynamic formats, such as animated sequences, product walkthroughs, and cinematic shots, are usually too expensive and time-consuming to produce at scale.

This is exactly what CraftStory changes. We enable businesses to create rich, expressive, human-centered videos that go far beyond talking-head content, without the high cost and complexity of traditional production. As a result, brands can stand out, publish more often, and build video channels that actually grow.

AI video doesn't just speed up production. It enables entirely new forms of storytelling that were previously out of reach for most businesses.

BN: Most current AI models are still focused on short clips. Why are five-minute-long videos a difficult problem for the industry to solve?

VE: Long-form video is fundamentally difficult because diffusion models struggle to maintain consistency over long timelines. If you try to generate several minutes of footage in a single pass, the model needs an enormous amount of training data, memory, and compute just to keep the character's appearance, gestures, and environment stable. Beyond a certain duration the video starts to wobble: faces shift, lighting changes, and movements become inconsistent. That's why most models today max out at short clips.

Our research team solved this by rethinking how video is generated over time. Rather than forcing a single diffusion process to cover a long interval, we split the video into short segments and run multiple diffusion processes in parallel, while maintaining character identity, movement, and visual consistency across all segments. This lets us scale to minutes instead of seconds without losing quality.
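To make the segment idea concrete, here is a minimal sketch of long-form generation built from short segments that share identity conditioning and overlap frames. It is not CraftStory's pipeline: the resolution, segment length, overlap size, and the `generate_segment` stub (which returns random frames rather than running a real diffusion model) are all illustrative assumptions, and the segments are chained sequentially here for simplicity rather than generated in parallel as described above.

```python
# Sketch only: long video = many short, overlapping segments that share
# the same identity conditioning, stitched together at the overlaps.
import numpy as np

FPS = 24
SEG_SECONDS = 4          # each short diffusion pass covers only a few seconds
OVERLAP_FRAMES = 12      # frames shared by neighbouring segments

def generate_segment(identity, prompt, anchor_frames, n_frames, rng):
    """Stand-in for one short diffusion run.

    A real model would denoise n_frames of latent video conditioned on the
    shared identity embedding, the text prompt, and the anchor frames from
    the previous segment; here we just return random frames of the right shape.
    """
    h, w = 64, 64
    frames = rng.random((n_frames, h, w, 3)).astype(np.float32)
    if anchor_frames is not None:
        # Start exactly where the previous segment ended so pose and identity
        # carry over; a real model would use these frames as conditioning.
        frames[: len(anchor_frames)] = anchor_frames
    return frames

def generate_long_video(identity, prompts, total_seconds, seed=0):
    rng = np.random.default_rng(seed)
    seg_frames = SEG_SECONDS * FPS
    step = seg_frames - OVERLAP_FRAMES        # new frames contributed per segment
    n_segments = int(np.ceil(total_seconds * FPS / step))
    video, anchor = [], None
    for i in range(n_segments):
        prompt = prompts[min(i, len(prompts) - 1)]
        seg = generate_segment(identity, prompt, anchor, seg_frames, rng)
        # Drop the overlap region that duplicates the previous segment's tail.
        video.append(seg if i == 0 else seg[OVERLAP_FRAMES:])
        anchor = seg[-OVERLAP_FRAMES:]
    return np.concatenate(video, axis=0)

clip = generate_long_video("presenter_a",
                           ["walk-and-talk intro", "product close-up"],
                           total_seconds=60)
print(clip.shape)  # (frames, H, W, 3) for roughly one minute of video
```

The key point the sketch illustrates is that each diffusion pass only ever has to stay consistent over a few seconds, while the shared identity conditioning and overlap frames carry continuity across the full video.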

There's also the practical challenge of rendering time. A 5-minute video is orders of magnitude more computationally intensive than a 20-second clip. Long render times make iterative work very slow for creators.

We've put a lot of effort into optimizing our pipeline so creators can actually iterate. Today, a one-minute video can be generated in about 30 minutes, which makes it practical to adjust scripts, shots, or gestures and try again.

So long-form video is difficult because it requires both algorithmic breakthroughs and extensive engineering optimization, and that's exactly where we've focused our goals and innovation.

BN: There is a big debate about data sources for AI. How important is it to train a video model using curated footage or your own footage rather than scraping internet content?

VE: For us, it has made a big difference. We built a unique multi-camera capture system that records synchronized high frame rate (HFR) footage from multiple angles. This lets us capture the subtle dynamics of human movement that standard 30 fps internet video misses.

For example, human fingers move incredibly fast and appear blurry at 30 fps, which means a model trained on that footage cannot learn the correct motion. High frame rate, synchronized capture produces crisp, detailed hand and facial movement, dramatically increasing motion realism.
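A rough back-of-the-envelope calculation shows why frame rate matters so much here. The fingertip speed below is an assumed figure for a quick gesture, not a measurement from the interview:

```python
# Illustrative numbers only: how far a fast-moving fingertip travels during a
# single frame at different capture rates. More travel per frame means more
# motion blur and less usable detail for a model learning hand dynamics.
FINGERTIP_SPEED_M_S = 2.0   # assumed peak speed during a quick gesture

for fps in (30, 60, 120, 240):
    travel_cm = FINGERTIP_SPEED_M_S / fps * 100
    print(f"{fps:>3} fps: ~{travel_cm:.1f} cm of travel per frame")
```

Under that assumption, a fast gesture smears across several centimetres within a single 30 fps frame, while at 120 to 240 fps the per-frame travel shrinks to around a centimetre or less, keeping fingers sharp.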

As a result, our training data is clean, consistent, and physically accurate, which lets us train much better models with far less data.

BN: Many companies are turning to AI video for training, marketing, and product demos. What use cases do you think will see real adoption first?

VE: Adoption is already underway in training and educational videos. AI avatars have become popular in L&D because they let teams update content easily and instantly without reshooting, and consistency is critical in a training environment.

The next wave is product demos, explainers, and lightweight marketing videos. These are high-volume, repetitive formats that companies need to update quickly, localize across markets, and keep consistent in messaging. AI video is a perfect fit.

Early experiments are also happening in advertising. The most notable example was Coca-Cola's controversial Christmas ad, which showed the industry that brands are actively exploring AI for high-stakes creative work, even if the execution is still evolving.

BN: The AI video market already feels crowded, with both big tech and startups entering the space. How do you expect this field to evolve in the coming years?

VE: The AI video market seems crowded right now, but most of what we're seeing is an explosion of tools, not complete solutions for brands. Generating clips is easy. Producing a real marketing video still requires multiple iterations, creative decisions, and a team that knows how to put all the pieces together. The gap between the “model” and the “finished video” is where most current products fall short.

The winners over the next few years will be the platforms that truly simplify end-to-end video creation. Brands want to go from idea to script to finished video without the need for a motion graphics team, director, or someone to stitch the assets together in post-production.

That's exactly what we're focused on. We're building a system where the models follow both the script and high-level directorial instructions, including dynamic camera movement in shots such as walk-and-talks. Our upcoming text-to-video models will let creators specify tone, pacing, framing, gestures, and camera choreography in natural language.

Image credit: Phong Lamai/depositphotos.com




