Google announced Gemini Omni, a new family of generative AI models that can create and edit videos by combining text, images, audio, and existing clips.
Gemini Omni marks a significant expansion of Google’s consumer AI tools, moving beyond last year’s image-focused Nano Banana to full multimodal video generation, according to the tech giant. Users can supply different types of input, including photos, sketches, audio recordings, and written prompts, and Omni stitches them together into one coherent video. Editing is conversational: users can ask the system to change the background, add objects, or alter the action, and the scene keeps its characters consistent and its physics plausible across multiple turns.
Google emphasized that Omni is built on Gemini’s real-world knowledge, allowing it to infer historical, scientific, and cultural context rather than simply producing visually compelling but meaningless footage. According to the company, the model has a better intuitive understanding of physical phenomena such as gravity and fluid dynamics, which allows for more realistic motion.
The Digital Avatar feature lets users generate versions of themselves that look and sound like them, but Google is testing safeguards around audio editing more broadly. All output includes an imperceptible SynthID watermark, and verification tools are available through the Gemini app, Chrome, and Google Search.
The first release, Gemini Omni Flash, is now available worldwide to Google AI Plus, Pro, and Ultra subscribers and through Google Flow, at no additional cost to paid subscribers and to YouTube Shorts users. It will also appear in the YouTube Create app this week. Developer and enterprise access via the API is planned for the coming weeks. Google also plans to extend the Omni family to image and audio generation by building a native multimodal architecture from the ground up.
