Researchers at Google DeepMind have developed an AI-powered model called Video-to-Audio (V2A) that can generate audio and dialogue for videos, a development that marks a significant step towards using AI to create complete audio-visual experiences.
How Google's V2A AI model works
Video-to-audio (V2A) technology can be paired with videos generated by AI models such as Google's Veo, which was announced at Google I/O 2024. It works by combining video information with text prompts.
Users can provide additional instructions to guide the V2A system towards the specific sound they want to create for their video, giving them creative control over the generated soundtrack.
“Today, we are announcing progress on our video-to-audio (V2A) technology, which enables synchronized audio-visual generation. V2A combines video pixels with natural language text prompts to generate a rich soundscape for the on-screen action,” the company said.
“Our V2A technology can be combined with generative video models like Veo to create shots with dramatic music, realistic sound effects and dialogue that fit the character and tone of your video,” the company added.
V2A first encodes the video, then uses a diffusion model to refine random noise into lifelike audio that matches the video and the provided text prompts, and finally decodes the audio and combines it with the video data.
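The three stages described above (encode, iteratively refine noise, decode) can be illustrated with a toy Python sketch. This is not Google DeepMind's implementation: every function name here is hypothetical, the "encoders" are simple arithmetic stand-ins, and the denoising step is a fixed update rule rather than a trained diffusion model. It only shows the shape of the loop, in which random noise is gradually pulled towards a signal conditioned on the video and the text prompt.

```python
import numpy as np


def encode_video(frames: np.ndarray) -> np.ndarray:
    """Stand-in video encoder: one feature per frame (mean pixel intensity)."""
    return frames.reshape(frames.shape[0], -1).mean(axis=1)


def encode_prompt(prompt: str, size: int) -> np.ndarray:
    """Stand-in text encoder: map character codes to a vector in [0, 1)."""
    codes = np.array([ord(c) for c in prompt] or [0], dtype=float)
    return np.resize(codes, size) % 32 / 32.0


def toy_denoise_step(latent: np.ndarray, condition: np.ndarray,
                     strength: float = 0.2) -> np.ndarray:
    """Stand-in for a learned denoiser: nudge the latent towards the condition."""
    return latent + strength * (condition - latent)


def decode_audio(latent: np.ndarray, samples_per_frame: int = 4) -> np.ndarray:
    """Stand-in decoder: stretch the latent into a longer 'waveform'."""
    return np.repeat(latent, samples_per_frame)


def generate_audio(frames: np.ndarray, prompt: str, steps: int = 50, seed: int = 0):
    rng = np.random.default_rng(seed)
    video_features = encode_video(frames)                # 1. encode the video
    condition = video_features + encode_prompt(prompt, video_features.size)
    latent = rng.standard_normal(video_features.size)    # start from random noise
    for _ in range(steps):                               # 2. iterative refinement
        latent = toy_denoise_step(latent, condition)
    return decode_audio(latent), condition               # 3. decode to audio


# Example: 8 random 4x4 "frames" and a text prompt.
frames = np.random.default_rng(1).random((8, 4, 4))
audio, condition = generate_audio(frames, "footsteps on gravel")
```

In a real system the update rule would be a neural network trained to predict and remove noise, and the decoder would produce an actual waveform; the sketch keeps only the control flow.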
Use cases include generating soundtracks for silent footage, including traditional material such as archival footage and silent films.
“To generate higher quality speech and add the ability to guide the model to produce specific sounds, we added more information to the training process, including AI-generated annotations with detailed descriptions of the audio and transcripts of the conversation,” Google DeepMind said.
Training on video, audio and these additional annotations is said to help the model learn to associate specific audio events with different visual scenes, while also responding to the information provided in the transcripts.
Limitations of the V2A model
According to the researchers, the quality of the generated audio depends on the quality of the video input. In addition, lip movements in videos produced by other models may not perfectly match the speech generated by V2A.
