Google DeepMind's new AI tool uses video pixels and text prompts to generate soundtracks

Google DeepMind has announced a new AI tool for generating soundtracks for videos. DeepMind's tool not only generates audio using text prompts, but also takes into account the content of the video.

By combining the two, DeepMind says, the tool lets users create scenes with “dramatic music, realistic sound effects, or dialogue that match the character and tone of your video.” You can check out some of the examples on DeepMind's website, and they sound pretty good.

In one video of a car driving through a cyberpunk cityscape, DeepMind generated audio using the prompts “car skidding, car engine throttling, angelic electronic music.” Notice how the skidding sounds match the car's movement. In another example, it used the prompts “jellyfish, sea life, ocean pulsating underwater” to create an underwater soundscape.

Users can add text prompts, but DeepMind says that's optional, and users don't have to meticulously align the generated audio with the right scenes. DeepMind also says the tool can generate “unlimited” soundtracks for a video, giving users an endless number of audio options to choose from.

This could help it stand out from other AI tools like ElevenLabs' sound effects generator, which uses text prompts to generate audio. It could also make it easier to combine audio with AI-generated video from tools like DeepMind's Veo and OpenAI's Sora (the latter of which will eventually incorporate audio).

DeepMind says it trained its AI tool on video, audio, and annotations that include “detailed audio descriptions and dialogue transcripts,” which allows the video-to-audio generator to match audio events with visual scenes.

The tool still has some limitations. For example, DeepMind is trying to improve its ability to synchronize lip movements with speech, as seen in a video of a clay-animated family. DeepMind also notes that its video-to-audio system relies on the quality of the video, so anything that's grainy or distorted “can lead to a noticeable degradation in audio quality.”
