Google's artificial intelligence lab, DeepMind, has taken AI-created video content a step further, bringing traditional film and TV production (not to mention sync licensing) one step closer to becoming obsolete.
DeepMind said in a blog post published on Monday (June 17) that it is developing “video-to-audio” (V2A) technology that would combine AI-created music, sound effects, and even dialogue with AI-generated video.
“Video generation models are advancing at an incredible pace, but many current systems are only capable of producing silent output,” DeepMind wrote.
“One of the next big steps in bringing generated films to life is creating soundtracks for these silent videos.”
DeepMind says its technology has an advantage over other projects that add sound to AI-generated videos because it “understands the raw pixels.” Users can provide text prompts, but they are not required: the AI can determine on its own which sounds are appropriate for a given video.
DeepMind says the technology can also automatically sync audio and images (no more audio editors needed).
DeepMind's blog features a number of text-prompted video clips where sound has been added to the video, including a movie score (prompt: “movie, thriller, horror movie, music, tension, atmosphere, footsteps on concrete”), an underwater scene (prompt: “jellyfish pulsating underwater, marine life, ocean”), and even a person playing a guitar (see below).
“Early results indicate that this technology could be a promising approach for bringing generated movies to life,” the DeepMind blog said.
The lab says the technology is trained on audio, video and transcripts of conversations and is enhanced with “AI-generated annotations that provide detailed descriptions of sounds.”
Notably, the lab did not say whether the audio, video, or transcripts were copyrighted or whether the materials had been licensed for use in AI training, stating only that DeepMind is “committed to developing and deploying AI technologies responsibly.”
Google's approach to AI training and copyright has been difficult to interpret: The company's YouTube division partners with major record labels and artists to develop its AI music tools with their approval, but Google told the U.S. Copyright Office last year that using copyrighted material to train its AI should be considered fair use.
At the moment, it appears that V2A technology has not yet reached full deployment, meaning it is not available to the general public.
“There are many other limitations we are trying to address, and further research is ongoing,” DeepMind said.
One area where the lab says it needs to improve is the generation of spoken dialogue. With the current V2A technology, “the video model does not produce mouth movements that match the transcript, resulting in eerily erratic lip syncing,” DeepMind said.
DeepMind also said that audio quality will degrade if the video input contains “artifacts or distortions” that the V2A technology was not trained to detect.
Still, such video-to-audio technology is clearly the missing link to creating complete audiovisual content in real time using AI.
As the AI boom continues, many developers are working on audio generation technology. For example, earlier this month Stability AI launched Stable Audio Open, a free, open-source model that allows users to create high-quality audio samples.
It's not meant to create full-length music tracks, but rather allows you to create snippets of up to 47 seconds that contain sound effects, drum beats, instrumental riffs, ambience, and other production elements commonly used in music and sound design.
Over the past few months, AI video creation tools that can produce strikingly realistic videos have also been released. OpenAI's Sora became a hot topic this spring for its realistic depictions of people, animals and landscapes.
Other AI video generators soon emerged, each vying for the title of “Sora killer” and hailed as the best yet: Luma Labs' Dream Machine, Runway's Gen-3 Alpha, and, more recently, Kling from Chinese video platform Kuaishou.
With realistic AI video generation now in users' hands, the issue of deepfakes has become all the more urgent, which is perhaps part of the reason why Google's DeepMind has been hesitant to release its latest technology, which, when perfected, will be able to add realistic sound effects and voices to AI-generated videos.
DeepMind said in its blog post that it has integrated its SynthID tool into V2A output. SynthID is a technology that adds a digital watermark to AI-created content, making it identifiable as the product of an AI tool.
DeepMind also took into consideration audiovisual creators who are at risk of losing their jobs to these new AI tools.
“To ensure that our V2A technology can have a positive impact on the creative community, we are gathering diverse perspectives and insights from leading creators and filmmakers, and using this valuable feedback to inform our ongoing research and development,” the blog read.
