AI video generators like OpenAI's Sora, Luma AI's Dream Machine, and Runway Gen-3 Alpha have been gaining attention recently, but a new tool from Google DeepMind could fix one weakness they all share: the lack of accompanying audio.
A post from Google DeepMind has revealed a new video-to-audio (or “V2A”) tool that uses a combination of pixels and text prompts to automatically generate soundtracks and soundscapes for AI-generated videos. That makes it another big step towards the creation of fully automated movie scenes.
As you can see in the video below, this V2A technology can be combined with AI video generators (including Google's own Veo) to create atmospheric music, timely sound effects, and even dialogue that Google DeepMind describes as “matching the character and tone of the video.”
Creators don't have to limit themselves to just one audio option: DeepMind's new V2A tool can apparently generate “an unlimited number of soundtracks for any video input,” meaning you can nudge it towards your desired result with a few simple text prompts.
Google says the tool is better than competing technologies because it can generate audio based solely on pixels, with guiding text prompts being entirely optional. But DeepMind is also well aware of the huge potential for the tool to be misused to create deepfakes, which is why the V2A tool is currently limited to being a research project.
DeepMind says that “our V2A technology will undergo rigorous safety evaluation and testing before considering it for broader public release.” It will certainly need to be rigorous, as the 10 short video examples show that this technology has explosive potential, for better and for worse.
The potential for amateur filmmaking and animation is huge, as can be seen in the “horror” clip below, as well as the cartoon baby dinosaur clip. There's also a Blade Runner-esque sci-fi scene (below) where cars glide through a city accompanied by an electronic music soundtrack, showing how the budgets of sci-fi movies could be significantly slashed.
Worried creators will at least take some solace in the dialogue limitations revealed in the “Clay Animation Family” video, but if the last year has taught us anything, it's that DeepMind's V2A technology is only going to improve dramatically in the future.
Will voice actors no longer be necessary?
The combination of AI-generated video and AI-created soundtracks and sound effects is a game-changer in many ways, adding a new dimension to an already heated arms race.
OpenAI has already revealed plans to add audio to its Sora video generator, due for release later this year, but DeepMind's new V2A tool shows that the technology is already at an advanced stage where it can create audio purely based on video, without the need for endless prompting.
DeepMind's tool works using a diffusion model that combines information taken from the pixels of the video with the user's text prompts and spits out compressed audio that is then decoded into an audio waveform. The tool appears to have been trained on a combination of video, audio, and AI-generated annotations.
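To make that pipeline more concrete, here is a minimal toy sketch in Python of the three stages described above: conditioning on video pixels plus an optional text prompt, iteratively refining random noise into a compressed audio representation, and decoding that into a waveform. Everything in it is an assumption for illustration only. The function names (`encode_video`, `encode_prompt`, `denoise_step`, `generate_audio_latent`, `decode_to_waveform`) are hypothetical, and the fixed arithmetic rule in `denoise_step` stands in for the learned neural network a real diffusion model would use. It is not DeepMind's code or method.

```python
import math
import random


def encode_video(frames):
    """Toy stand-in for a video encoder: one mean-brightness value per frame."""
    return [sum(frame) / len(frame) for frame in frames]


def encode_prompt(prompt, dim):
    """Toy stand-in for a text encoder: fold character codes into `dim` slots."""
    vec = [0.0] * dim
    for i, ch in enumerate(prompt):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def denoise_step(latent, condition, strength):
    """Hypothetical refinement rule: move the latent part-way toward the condition.

    A real diffusion model would use a trained network to predict and remove noise.
    """
    return [x + strength * (c - x) for x, c in zip(latent, condition)]


def generate_audio_latent(frames, prompt="", steps=20, seed=0):
    """Start from random noise and refine it, guided by video and optional text."""
    rng = random.Random(seed)
    video_features = encode_video(frames)
    dim = len(video_features)
    text_features = encode_prompt(prompt, dim)  # all zeros when no prompt is given
    condition = [v + t for v, t in zip(video_features, text_features)]
    latent = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(steps):
        latent = denoise_step(latent, condition, strength=0.3)
    return latent


def decode_to_waveform(latent, samples_per_value=4):
    """Toy decoder: expand each compressed value into a few sine-wave samples."""
    waveform = []
    for i, value in enumerate(latent):
        for k in range(samples_per_value):
            t = i * samples_per_value + k
            waveform.append(math.tanh(value) * math.sin(0.5 * t))
    return waveform


if __name__ == "__main__":
    # Three tiny "frames" of pixel brightness values between 0 and 1.
    frames = [[0.1, 0.2, 0.3], [0.5, 0.5, 0.5], [0.9, 0.8, 1.0]]
    latent = generate_audio_latent(frames, prompt="rain on a window")
    print(len(decode_to_waveform(latent)), "audio samples generated")
```

Calling `generate_audio_latent(frames)` with no prompt still works, which mirrors the claim that text guidance is optional, while supplying a prompt shifts the conditioning signal and therefore the result.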
It's unclear what content the V2A tool was trained on, but Google's ownership of YouTube, the world's largest video-sharing platform, gives it a potentially huge advantage. Neither YouTube nor its terms of service fully disclose how its videos can be used to train AI, but YouTube CEO Neal Mohan recently told Bloomberg that some creators have agreements that allow their content to be used to train AI models.
Obviously, the technology is still limited when it comes to dialogue, and is still a long way from producing Hollywood-worthy finished works, but it can already be a powerful tool for storyboarders and amateur filmmakers, and with stiff competition from the likes of OpenAI, it's likely to improve rapidly from here.
