D-ID embeds AI agents in videos to make them interactive, while companies like Higgsfield AI are building agentic video infrastructure for content generation.
There has always been a quiet imbalance in video. It delivers information with visual depth, tonal nuance, and narrative clarity, but the terms of that delivery are fixed, and the burden of interpretation falls on the viewer. No matter how sophisticated the surrounding ecosystem of recommendation systems, autoplay loops, and short-form formats becomes, the underlying interaction remains the same: press play, watch, exit.
The rise of AI is beginning to upend that model. Embedding AI across digital products introduces the ability to respond, clarify, and adapt in real time. Text has already undergone this transformation through conversational AI; until recently, video was the exception.
At the consumer level, AI can turn video into a near-conversation, reshaping the flow of information as viewers question, request context, and engage with content. At the production layer, AI compresses and reorganizes the creative process itself, replicating capabilities that once required a full-fledged studio (camera systems, editing workflows, visual effects) and folding them into programmable, iterative pipelines.
Video is now moving beyond its role as a delivery format to serve as an operational layer where interaction, creation, and feedback are tightly coupled.
D-ID, a New York-based video creation and real-time interaction technology company, is tackling that constraint by redesigning how video works at its core. The company is introducing what it calls “Agentic Video,” which embeds real-time AI agents directly into the viewing experience. The agent resides within the video layer itself: it is anchored to the content, aware of its context, and designed to respond as part of the experience rather than alongside it.
Viewers can interrupt and ask questions at any time. The agent processes queries in real time, drawing from the video’s script and connected knowledge sources to generate responses that are accurate and consistent with the original message. The interaction does not end when the video does: agents are persistent, allowing viewers to keep exploring the topic after playback. This seemingly simple shift restructures the experience. The flow of information is no longer dictated by a fixed sequence, but by the viewer.
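For a rough sense of that architecture (an illustrative sketch, not D-ID’s actual API or schema), the snippet below grounds a viewer’s question in the video’s own script first, treats connected knowledge sources as additive, and keeps a transcript so the conversation can persist after playback. The keyword-overlap retrieval is a placeholder for whatever retrieval the real system uses.

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    """Context an in-video agent can draw on (illustrative, not D-ID's schema)."""
    script_segments: list[str]                                        # the video's own narration, primary source
    knowledge_docs: list[str] = field(default_factory=list)           # connected sources, additive only
    transcript: list[tuple[str, str]] = field(default_factory=list)   # persists after playback ends

def answer(ctx: AgentContext, question: str) -> str:
    """Ground a viewer question in the script first, falling back to supporting documents."""
    terms = set(question.lower().split())

    def overlap(segment: str) -> int:
        # Naive keyword overlap stands in for real retrieval.
        return len(terms & set(segment.lower().split()))

    best = max(ctx.script_segments, key=overlap, default="")
    if overlap(best) == 0:  # nothing in the script matches, so lean on the additive knowledge
        best = max(ctx.knowledge_docs, key=overlap, default=best)

    reply = f"Based on the video: {best}"
    ctx.transcript.append((question, reply))  # the interaction survives the end of playback
    return reply

ctx = AgentContext(
    script_segments=["Our platform syncs contacts with popular CRM systems."],
    knowledge_docs=["Integration guide: Salesforce and HubSpot are supported."],
)
print(answer(ctx, "Will this integrate with my CRM system?"))
```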
“A creator’s instinct is always to protect the narrative arc. Agents extend the story rather than interrupt it. The interaction layer activates when the viewer chooses, whether by asking a question mid-video or continuing the conversation after the video ends. So the creator’s intent is maintained and the viewer’s need for clarity is also met,” Gil Perry, co-founder and CEO of D-ID, told me. “What we’re actually seeing is that the questions viewers ask reveal where the story wasn’t landing, which provides valuable feedback for creators.”
Turning video plays into engagement engines
The system is built on D-ID’s V4 expressive visual agent, which combines sub-second latency with human-like avatars capable of natural real-time conversation. In this model, the avatar becomes the interface itself, not just the presenter. Perry said the real change is conceptual, not just technical: for years, video success has been measured by views and completion rates, yet those metrics say little about whether the content actually resonated, influenced understanding, or prompted action. “The presenter in the video can now actually respond and use the questions that arise as a hook to deepen that initial level of interest.”
D-ID claims the disconnect is already visible at scale. Businesses spend millions of dollars annually on video-based communications, yet engagement remains structurally broken: comprehension and recall are inconsistent, and even short-form videos often capture only fragmented attention. To Perry, video’s one-way nature is a structural limitation.
D-ID’s work on agentic video aims to fill that gap by reimagining video as a responsive system, one where interactions ultimately lead to impact. The approach is already resonating with large enterprise customers, including Tata Group and Microsoft, which are experimenting with interactive, avatar-driven engagement. The company also introduced a new analytics layer that captures user questions and engagement, transforming video from a static asset into a queryable, data-generating system.
“You’re capturing intent, not just behavior. A viewer who asks, ‘Will this integrate with my CRM system?’ is telling you something qualitatively different from a viewer who watched 87% of the video. It’s a buying signal, a readiness signal, or a confusion signal, depending on when it appears in the experience and what happened before it,” Perry said. “Agentic Video aggregates intent signals across all viewer interactions, grouping them by theme, emotion, or moment in the experience, and can surface patterns that are otherwise invisible. You’re no longer guessing what will resonate; you can read it directly from the questions people can’t help but ask.”
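As a toy example of how such intent signals might be rolled up (the event shape and theme labels here are hypothetical, not D-ID’s analytics schema), questions can be bucketed by theme and by the minute of playback at which they were asked, so a spike points at the exact moment in the video that triggered it.

```python
from collections import defaultdict

# Hypothetical viewer-question events: (seconds into the video, theme, question text).
events = [
    (42.0, "integration", "Will this integrate with my CRM system?"),
    (43.5, "pricing", "Is there a per-seat price?"),
    (118.0, "integration", "Does it support Salesforce?"),
]

# Bucket questions by theme and by the minute of playback at which they were asked.
buckets: dict[tuple[str, int], list[str]] = defaultdict(list)
for ts, theme, question in events:
    buckets[(theme, int(ts // 60))].append(question)

for (theme, minute), questions in sorted(buckets.items()):
    print(f"{theme} @ minute {minute}: {len(questions)} question(s)")
```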
Interactive AI avatar market is heating up
The interactive AI avatar market is set to become significantly more competitive in 2026 as startups and platform giants alike converge on the core enterprise segment. According to a study by Precedence Research, the AI avatar market is expected to reach approximately $142 billion by 2034, growing at a CAGR of 31.95%. Major players in this category include D-ID, HeyGen, Synthesia, DeepBrain AI, Soul Machines, UneeQ, and Microsoft.
Competition is segmented along functional areas. Companies like Tavus, HeyGen, and DeepBrain AI are advancing real-time conversational avatars designed for live interaction, while Synthesia continues to dominate scripted, enterprise-grade video production. Each of these approaches captures a different layer of the content stack. Similarly, major platform players such as Microsoft and NVIDIA are increasing their investments in digital human and AI infrastructure, indicating the category is moving from niche to foundational.
DeepBrain AI is narratively closest to D-ID, pushing real-time AI video agents into enterprise environments such as financial services and large organizations. Yet rather than redefining video itself as an interactive medium, its offering is structured around avatars as interactive assistants. Other players differentiate along narrower dimensions: Beyond Presence emphasizes rendering fidelity and low latency, while Life Inside focuses on reliability and analytics, combining real employee footage with conversational AI to extract engagement insights.
D-ID’s key differentiator is that it unifies these modes into a single continuous experience. The “observe-to-interact” continuum, where the presenter and the agent are one and the same, eliminates the traditional handoff between content and chatbot and creates a more consistent, context-aware experience. This positioning is strengthened by D-ID’s integration with simpleshow following its 2025 acquisition, allowing the product to plug directly into corporate training, internal communications, and customer education workflows, an advantage over API-first competitors.
“We are at a point where the interaction layer is of paramount strategic importance because it is where intent is expressed and decisions are made,” Perry said. “The benefit is that people have information that actually corresponds to their situation. The risk is that the same functionality may be used to narrow understanding rather than broaden it. That’s the mandate that comes with building an interface layer.”
Rebuilding the video production model
While D-ID focuses on interaction, Higgsfield AI is reimagining the production side: how videos are made, distributed, and tested. The agentic, generative AI-powered video platform, which gained early attention on Instagram and TikTok last year, integrates multiple generative models, both proprietary and third-party, such as Sora, Veo, Kling, WAN, and Seedance, into a single workflow. Within that system, users can control camera movement, lenses, shot composition, color grading, and character consistency in one place.
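As a loose illustration of what such a programmable, model-agnostic pipeline could look like (the ShotSpec fields, model names, and render helper below are assumptions for the sketch, not Higgsfield’s actual API), each shot carries its own camera, lens, grading, and character parameters and can be routed to a different underlying model:

```python
from dataclasses import dataclass

@dataclass
class ShotSpec:
    """One shot in a programmable pipeline; the fields are illustrative, not Higgsfield's schema."""
    prompt: str
    model: str          # e.g. "veo", "kling", "seedance": the backend generator for this shot
    camera_move: str    # e.g. "dolly-in", "orbit", "static"
    lens_mm: int        # focal length, part of shot composition
    color_grade: str
    character_id: str   # reused across shots to keep the same character consistent

def render(shot: ShotSpec) -> str:
    # A real pipeline would dispatch this spec to the chosen model's API; here we only format the request.
    return (f"[{shot.model}] {shot.prompt} | {shot.camera_move}, {shot.lens_mm}mm, "
            f"{shot.color_grade}, character={shot.character_id}")

shots = [
    ShotSpec("hero walks into the lab", "veo", "dolly-in", 35, "teal-orange", "char_01"),
    ShotSpec("close-up on the hero's reaction", "kling", "static", 85, "teal-orange", "char_01"),
]
for shot in shots:
    print(render(shot))
```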
Higgsfield co-founder and CEO Alex Mashrabov said AI is closing the gap between creation and audience feedback. “The interface layer is where that abstraction becomes reality for the user, and is where most AI video platforms fundamentally underinvest. The common design assumption in this space was to expose the functionality of the model and let sophisticated users understand the workflow. We took the opposite position,” he told me. “40% of the Higgsfield team are filmmakers, producers, and creatives who define our product roadmap and work alongside our ML engineers in a constant feedback loop.”
Mashrabov said the platform’s AI-powered inference engine has collected preference signals from more than 700 million real user generations to date. “Over time, this will allow us to fine-tune and optimize for specific creative use cases in ways that cannot simply be replicated by generic providers. The feedback loop between production behavior and model performance is the deepest moat in this space, but it takes time to accumulate and compound,” he said.
The platform’s level of control aims to address the persistent problem of inconsistency in AI video tools, helping creators move closer to reproducible, production-level output through more deterministic workflows and persistent character systems. Higgsfield has also introduced a crowdsourced model for content development through its “Original Series” initiative: rather than relying on an internal green light, viewers review pilot concepts and decide which ones move forward. Creators generate ideas, audiences evaluate them, and the strongest concepts advance to further production and distribution.
“What’s emerging is the ability to hold a complete creative vision, break it down across character, tone, optics, pacing, worldbuilding, and use these (AI-powered) tools to execute with precision. In some ways, it becomes more demanding because the abstraction layer removes any excuse for technical limitations. You can’t blame budget or equipment anymore. The work is a direct expression of your creative decisions,” Mashrabov said.
The platform claims to have expanded to more than 240 regions within a year of launch and reached a $300 million annualized revenue run rate. “With 24 million users generating 5 million videos per day, the scarcity of production access, technical skills, and reach that once defined creative value has virtually disappeared,” Mashrabov said. “The same platforms that independent filmmakers use to create original series pilots are used by Fortune 500 marketing teams to create campaign content at scale. The same tools serve both.”
When interaction and creation meet
When viewed together, D-ID and Higgsfield represent two aspects of the same transformation. D-ID redefines the way users engage with video, turning it into an interactive interface. Meanwhile, platforms like Higgsfield are turning video generation into a programmable system that evolves based on data and feedback.
As video becomes more adaptive, new questions arise around accuracy, transparency, and control. Ensuring that an agent’s answers are grounded in verified content is essential, and making the logic behind those responses visible, through citations, assumptions, or validation layers, will be equally important. D-ID addresses part of this challenge by anchoring responses to the original script and a managed knowledge source.
“The responses are anchored first to the video script, so the agent is not free-associating. External knowledge sources are additive, not primary; think of a subject matter expert who has studied a particular document deeply and refers to the broader context as needed,” Perry said. “While no system can completely eliminate drift, this architecture lets authors deliberately add relevant information as knowledge, so they can set the boundaries themselves and decide how much broader context, or how narrow a limit, the agent works within.”
The ongoing transformation is less about improving video and more about repositioning it within the digital stack. As AI integrates both the consumption and creation layers, video is quietly beginning to operate as a living system that is responsive, adaptive, and continuously evolving.

