Google's V2A video-to-audio AI technology lets you add sound to any clip.



The craziest AI development we've seen this year is Microsoft's VASA-1 technology. The company has developed an AI model that can turn a single image of a person, combined with an audio file, into a video of that person speaking. While the demo was impressive, VASA-1 is not yet available as a commercial product, and we doubt it ever will be, as this kind of AI tool can easily be misused.

VASA-1 was revealed to the public in mid-April. Now, nearly two months later, Google DeepMind has announced a similar AI technology. It doesn't have a commercial name; Google simply describes it as video-to-audio (V2A) technology, and it is not a commercial AI product that you can try for yourself.

V2A generates audio to go along with a silent video clip, guided by a single text prompt. Google's demo is amazing.

As Google explains in its blog, the video-to-audio tool “enables the generation of synchronized audio-visuals.” Google provided a number of examples to showcase its V2A technology. Some of them are listed below, along with the prompts Google used to generate the audio for each video:

Audio prompts: Movies, thrillers, horror movies, music, tension, atmosphere, footsteps on concrete

“V2A combines video pixels with natural language text prompts to create a rich soundscape for the on-screen action,” Google said, noting that V2A can be paired with Veo, the video generation model Google unveiled at I/O 2024. Veo is a direct competitor to OpenAI's Sora and other similar products.

According to Google, the V2A technology can deliver “dramatic music, realistic sound effects, or dialogue that matches the character and tone of your video.” The technology can be used to create soundtracks, but Google suggests one very exciting potential use: adding sound to silent films through video-to-audio conversion, which is awesome.

Audio prompt: Drummer on stage at a concert surrounded by flashing lights and a cheering crowd

But as Google explains later in the blog, voice generation isn't perfect: While V2A eliminates the need for manual adjustments to audio and video, it does have limitations, especially when it comes to voice.

Google says it is also working to improve lip sync for videos that include speech. V2A generates speech from an input transcript and attempts to synchronize it with the character's lip movements. However, the paired video generation model may not be conditioned on that transcript. The mismatch often results in unnatural lip sync, because the video model does not generate mouth movements that match the transcript.

Audio prompt: Music, Transcript: “This turkey is amazing, it makes me so hungry.”

Google also said it is seeking feedback from the creative community on the video-to-audio technology to ensure V2A has a positive impact. To prevent misuse, Google is adding its SynthID toolkit, which watermarks AI-generated content, to its V2A research.

It's unclear when V2A will be available to the public, but Google says the new technology will undergo rigorous testing. To get an idea of what V2A can do at its current stage of development, check out some more demo clips at this link.


