ShengShu Technology launches Vidu S1, bringing real-time interactive generation to AI videos



SINGAPORE, July 3, 2026 /PRNewswire/ -- ShengShu Technology today announced Vidu S1 at the 2026 Global Digital Economy Conference. Vidu S1 is a next-generation video foundation model that takes AI video from single-clip creation to real-time interactive video generation with continuous live interaction.

Vidu S1 supports real-time video conversations with voice-guided character control, allowing users to control AI avatars naturally through voice input with unlimited continuous interaction. The model delivers 540P (960×540) resolution at 25 FPS (up to 42 FPS). Users can instantly create personalized interactive characters from a single image, including real people, animated characters, pets, and more, paired with customizable voices. Together, these features create a more natural, fluid, and immersive real-time interactive experience. Notably, the entire system runs on consumer-grade GPUs, significantly reducing the hardware requirements for real-time interactive video generation.

From offline video generation to real-time interaction

Most existing video generation models follow an offline workflow: the user submits a prompt, waits for the video to be generated, and then views the completed result. Once generated, the content remains fixed. Changing an AI avatar’s actions or storyline typically requires generating a new video, and interaction is limited to a one-way experience of creating and viewing.

Vidu S1 introduces a real-time interactive video generation framework that lets users provide continuous voice input through real-time video conversations. The model processes audio input together with the conversational context and the current visual context, so that subsequent video content is generated and updated in real time.

Beyond real-time generation, Vidu S1 advances voice interaction from simple lip-sync to full AI avatar control. Rather than relying on audio-driven lip movements or predefined animation libraries, the model interprets the semantic meaning, intent, and emotional context of audio input to generate synchronized lip movements, facial expressions, eye movements, gestures, body posture, and whole-body movements in real time.

Together, these features enable AI avatars to understand user instructions, respond naturally during conversations, and support continuous real-time interactions.

Unlimited real-time video generation

Most current video generation models produce fixed-length clips, typically ranging from a few seconds to tens of seconds. Once generation begins, users cannot influence how the video evolves.

Vidu S1 uses an autoregressive diffusion (AR + Diffusion) architecture. Rather than generating the entire video in advance, it continually predicts and generates subsequent video content based on previously generated frames, current voice instructions, and conversational context. As users give new instructions, the model can update the character’s facial expressions, movements, and subsequent actions in real time, allowing the interaction to continuously evolve through conversation.
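The generation loop described above can be pictured with a small sketch. This is an illustration only, not ShengShu Technology's implementation: `StreamState`, `denoise_chunk`, and `stream` are hypothetical names, the "chunks" are plain strings standing in for groups of frames, and the bounded history length of 4 is an arbitrary assumption.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Iterator, List, Optional


@dataclass
class StreamState:
    """Rolling context the generator conditions on (hypothetical structure)."""
    # Only the most recent chunks are kept, so memory stays bounded
    # no matter how long the conversation runs.
    history: Deque[str] = field(default_factory=lambda: deque(maxlen=4))
    dialogue: List[str] = field(default_factory=list)


def denoise_chunk(history: Deque[str], instruction: Optional[str],
                  dialogue: List[str], index: int) -> str:
    """Stand-in for a diffusion pass that produces the next chunk of frames."""
    return (f"chunk{index}|prev={len(history)}|turns={len(dialogue)}"
            f"|instr={instruction or 'idle'}")


def stream(instructions: List[Optional[str]]) -> Iterator[str]:
    """Yield one chunk per step; None means no new voice input at that step."""
    state = StreamState()
    current: Optional[str] = None
    for index, new_instruction in enumerate(instructions):
        if new_instruction is not None:
            # A new instruction takes effect on the very next chunk.
            current = new_instruction
            state.dialogue.append(new_instruction)
        chunk = denoise_chunk(state.history, current, state.dialogue, index)
        state.history.append(chunk)  # the new chunk becomes future context
        yield chunk


if __name__ == "__main__":
    for chunk in stream([None, "wave", None, "smile", None, None]):
        print(chunk)
```

Because each chunk depends only on a bounded window of earlier chunks plus the running dialogue, the loop has no built-in end point, which is the property that makes open-ended sessions possible in this kind of design.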

ShengShu Technology positions Vidu S1 as a leading model for both real-time interaction and real-time video generation of unlimited duration. Continuous generation alone is not enough to achieve this: the model must also retain the character’s identity, maintain natural and consistent movements, continuously process user input, and respond in real time throughout long conversations.

Together, these features allow Vidu S1 to sustain persistent generative video interactions, maintaining character responsiveness, visual consistency, and continuous interactivity over long periods of time.

Video-call-quality interaction at 540P and 25 FPS

Delivering real-time interactive video requires not only streaming generation, but also the resolution and frame rate necessary to support natural, responsive conversations.

To meet these requirements, ShengShu Technology has optimized Vidu S1 across model acceleration, inference, and system deployment. The result is real-time interactive video generation at 540P (960×540) resolution and 25 FPS, with support for up to 42 FPS.
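To put the stated frame rates in perspective, the short calculation below converts them into a per-frame time budget and a pixel throughput. It relies only on the figures quoted above (960×540, 25 FPS, 42 FPS). The function names are illustrative, and the budget assumes frames are produced one at a time at a steady pace, which is a simplification.

```python
def frame_budget_ms(fps: float) -> float:
    """Average time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Number of output pixels that must be generated each second."""
    return width * height * fps


if __name__ == "__main__":
    for fps in (25, 42):
        print(f"{fps} FPS: {frame_budget_ms(fps):.2f} ms per frame, "
              f"{pixels_per_second(960, 540, fps):,} pixels per second")
```

At 25 FPS the average budget is 40 ms per frame. At 42 FPS it falls to roughly 23.8 ms, which is why the inference-level optimizations matter for sustaining the higher rate.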

At the model level, Vidu S1 incorporates ShengShu Technology’s inference acceleration technologies, including TurboDiffusion [1], low-bit SageAttention [2], and sparse attention methods such as SLA [3] and SpargeAttention [4]. Through few-step generation, model quantization, and optimized inference kernels, Vidu S1 supports high-frame-rate output while significantly reducing the computational cost of generating each frame. This efficiency allows Vidu S1 to perform real-time interactive generation on consumer-grade GPUs rather than the large server clusters that such workloads typically require.
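For readers unfamiliar with the low-bit quantization mentioned above, the sketch below shows the general idea using generic symmetric 8-bit quantization of a list of floats. It is a textbook illustration under simplifying assumptions (one scale for the whole list, pure Python) and is not the scheme used by SageAttention or by Vidu S1. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero (or empty) input, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the 8-bit representation."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    original = [0.5, -1.0, 0.25]
    q, scale = quantize_int8(original)
    restored = dequantize_int8(q, scale)
    print(q, scale, restored)
```

Each value is stored in 8 bits instead of 16 or 32, at the cost of a rounding error of at most about half the scale per value. Production kernels use finer-grained scales and hardware integer arithmetic to keep that error small.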

At the system level, TurboServe [5], ShengShu Technology’s inference serving engine, efficiently schedules inference workloads while preserving user input, character state, and visual context throughout the interaction. Computing resources are dynamically allocated based on interaction status to support stable, low-latency, real-time interactive video generation.
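One simple way to picture allocation based on interaction status is a weighted split of a fixed compute pool, sketched below. This is a hypothetical illustration and not TurboServe's design: the `Session` type, the `allocate` function, the 3:1 weighting of active to idle sessions, and the abstract "units" of compute are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Session:
    session_id: str
    active: bool  # True while the user is speaking or the avatar is responding


def allocate(sessions: List[Session], total_units: int,
             active_weight: int = 3, idle_weight: int = 1) -> Dict[str, int]:
    """Split total_units across sessions in proportion to their weights."""
    if not sessions:
        return {}
    weights = {s.session_id: (active_weight if s.active else idle_weight)
               for s in sessions}
    weight_sum = sum(weights.values())
    shares = {sid: total_units * w // weight_sum for sid, w in weights.items()}
    # Hand out units lost to integer rounding, highest-weight sessions first.
    leftover = total_units - sum(shares.values())
    by_priority = sorted(weights, key=lambda sid: -weights[sid])
    for sid in by_priority[:leftover]:
        shares[sid] += 1
    return shares


if __name__ == "__main__":
    sessions = [Session("a", True), Session("b", False), Session("c", False)]
    print(allocate(sessions, 10))
```

Re-running `allocate` whenever a session's status changes shifts capacity toward whoever is currently interacting, while idle sessions keep a smaller share so their state stays warm.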

These model-level and system-level optimizations enable Vidu S1 to deliver continuous, stable, and responsive real-time interactive video generation over long interactions.

These features provide the technical foundation for applications such as real-time video conversation, interactive live streaming, AI companionship, interactive gaming, and XR experiences.

Create interactive characters from a single image

Creating a traditional AI avatar typically requires multiple image or video assets, followed by character modeling, rigging, lip-sync configuration, and dedicated training before the character can be used for interaction.

Vidu S1 introduces a fully generative workflow that eliminates the need for character-specific modeling and training. Users simply upload a single image, and the model captures the character’s identity, appearance, and visual style, generating synchronized lip movements, facial expressions, gestures, and full-body movements in real-time.

A single image can be turned into a real-time interactive character, whether it is based on a real person, an animated character, or a pet. Vidu S1 also supports customizable voices, allowing each character to have a consistent visual and audio identity.

By reducing character creation from a multi-step production pipeline to a single-image workflow, Vidu S1 greatly simplifies the creation of personalized, real-time interactive characters.

A new chapter in interactive AI video

As video foundation models continue to evolve, industry competition is expanding beyond image quality, generation speed, and video length toward broader capabilities in real-time responsiveness, continuity, control, and interaction.

With Vidu S1, real-time interactive video generation allows AI video to move beyond pre-generated content to dynamic, responsive experiences where AI can understand user input, respond in real-time, and continuously evolve through interaction.

In the future, Vidu S1 is likely to support a wide range of applications, including AI companions, AI virtual influencers, interactive live streaming, gaming NPCs, branded AI avatars, intelligent customer service, online education, and XR experiences. These capabilities enable AI avatars to evolve from one-time content assets to persistent, always-on, conversational agents.

From generating individual video clips to enabling continuous interactions, and from one-way content creation to real-time two-way engagement, Vidu S1 extends the capabilities of the video foundation model and lays the foundation for the next generation of interactive AI experiences.

Availability

Vidu S1 is now generally available and allows users to create and interact with AI avatars from their own custom images in real-time. An API platform is also available for developers and corporate partners to build real-time interactive applications.

Global experience: https://www.vidu.com/vidu-stream

API: https://platform.vidu.com/live/landing

References
[1] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times.
[2] SageAttention: Accurate 8-Bit Attention for Plug-and-Play Inference Acceleration.
[3] SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention.
[4] SpargeAttention: Accurate and Training-Free Sparse Attention Accelerating Any Model Inference.
[5] TurboServe: Efficient and economical serving of streaming video generation.

Source ShengShu Technology


