Google's DeepMind AI can generate music for videos

Google has rolled out updates to its DeepMind AI, adding the ability to generate music to accompany videos and create professional soundtracks.

The video-to-audio process combines video pixels with natural language text prompts to generate a soundscape for your video. Google combines its V2A technology with video generation models such as Veo to create shots with dramatic scores, realistic sound effects, and dialogue that match the character and tone of the video. The models can also generate soundtracks for traditional footage, including archival material, silent films, and more.

Google says the new process gives audio engineers more creative control by allowing them to generate an unlimited number of soundtracks from any video input. Engineers can change the mood of the music using positive and negative prompts: positive prompts steer the model toward a desired sound, while negative prompts steer it away from undesirable sounds.
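Google has not published the interface for this, but the effect of pairing a positive and a negative prompt can be illustrated with a small sketch in the style of classifier-free guidance, where the model's prediction is pushed toward the positive prompt and away from the negative one. Every name here (the toy model, the embeddings, the guidance scale) is a hypothetical placeholder, not Google's API.

```python
import numpy as np

def toy_model(latent, video_emb, prompt_emb):
    # Stand-in for a learned denoiser: any function mapping
    # (noisy audio latent, conditioning) -> predicted noise.
    return 0.5 * latent + 0.1 * video_emb + 0.1 * prompt_emb

def guided_estimate(model, latent, video_emb, pos_emb, neg_emb, scale=4.0):
    eps_pos = model(latent, video_emb, pos_emb)  # pull toward the desired sound
    eps_neg = model(latent, video_emb, neg_emb)  # push away from undesired sounds
    # Guidance: amplify the direction separating the two conditional predictions.
    return eps_neg + scale * (eps_pos - eps_neg)

latent, video_emb = np.random.randn(16), np.random.randn(16)
pos_emb, neg_emb = np.random.randn(16), np.random.randn(16)
print(guided_estimate(toy_model, latent, video_emb, pos_emb, neg_emb))
```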

How does DeepMind AI's video-to-audio technology work?

Google says it experimented with autoregressive and diffusion approaches to find the most scalable AI architecture. A diffusion-based approach to audio generation produced the most realistic and convincing results for synchronizing video and audio information. The V2A system starts by encoding the video input into a compressed representation. Google's diffusion model then iteratively refines the audio from random noise, guided by the visual input from the video and the natural language prompts supplied by engineers.
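In outline, a pipeline like the one described would look roughly like the sketch below: start from random noise and repeatedly denoise an audio latent while conditioning on the compressed video representation and the text prompt. The function names, step count, and update rule are simplified assumptions for illustration, not DeepMind's actual model.

```python
import numpy as np

def denoise_step(latent, video_emb, prompt_emb, t):
    # Placeholder for the learned diffusion denoiser: predicts the noise to
    # remove at step t, guided by the video encoding and the text prompt.
    return 0.1 * latent + 0.02 * video_emb + 0.02 * prompt_emb

def generate_audio_latent(video_emb, prompt_emb, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(video_emb.shape)  # start from pure random noise
    for t in range(steps, 0, -1):
        eps = denoise_step(latent, video_emb, prompt_emb, t)
        latent = latent - eps                      # iteratively refine toward clean audio
    return latent  # in the real system, this latent would be decoded to a waveform

audio_latent = generate_audio_latent(np.random.randn(128), np.random.randn(128))
```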

The result is synchronized, lifelike audio that closely matches the prompt's instructions and video content. “To generate higher quality audio and add the ability to guide the model to produce specific sounds, we added more information to the training process, including AI-generated annotations with detailed descriptions of the audio and a transcript of the conversation,” Google said.

Training the model with video, audio, and additional annotations means that the technology learns to associate specific audio events with different visual scenes while responding to the information provided in the annotations or transcript. Imagine, for example, a soaring score that swells as the camera crests a mountain peak, evoking a sense of majesty.
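As a rough illustration of what a single training example might bundle together, a sketch like the following captures the pairing of visuals, ground-truth audio, annotations, and transcripts described above; the field names are invented for this example, not drawn from Google's dataset.

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    video_frames: list        # the visual scene the audio must stay synchronized with
    audio_waveform: list      # ground-truth soundtrack aligned to those frames
    audio_annotation: str     # e.g. "soaring orchestral score cresting at a mountain peak"
    dialogue_transcript: str  # spoken lines, if any, so speech can match on-screen dialogue
```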

According to Google, the model relies heavily on high-quality video footage to generate high-quality audio. Any artifacts or distortions in the video can significantly degrade the audio quality. The company is also working on lip-syncing technology for videos featuring characters, but the model can produce inconsistencies, resulting in unnatural lip-syncing, such as when a character is speaking but their lips are not moving.




