
Google / Benji Edwards
At Google I/O 2024 on Tuesday, Google announced Veo, a new AI video synthesis model that, similar to OpenAI's Sora, can create HD videos from text, image, or video prompts. Veo can generate 1080p videos longer than a minute and edit videos from written instructions, but it has not yet been released for widespread use.
Veo can reportedly edit existing video using text commands, maintain visual consistency between frames, and generate video sequences of 60 seconds or more from a single prompt or from a series of prompts that form a narrative. The company says it can generate detailed scenes and apply cinematic effects such as time-lapses, aerial shots, and different visual styles.
Since the launch of DALL-E 2 in April 2022, we've seen a parade of new image and video synthesis models that aim to let anyone who can type a written description create detailed images and videos. While neither technology is fully mature, both AI image and video generators have been steadily improving in capability.
In February, we covered a preview of OpenAI's Sora video generator, which many at the time believed represented the best AI video synthesis the industry had to offer. It impressed Tyler Perry enough that he put his movie studio's expansion on hold. So far, however, OpenAI has not provided public access to the tool, instead limiting its use to a select group of testers.
At first glance, Google's Veo appears to offer video generation capabilities similar to Sora's. Since we haven't tried it ourselves, we can only rely on the carefully selected demonstration videos the company provides on its website. That means viewers should take Google's claims with a grain of salt, since the selected results may not be representative.
Veo's sample videos include a cowboy on horseback, a high-speed chase shot through suburban streets, kebabs on a grill, and a time-lapse of a sunflower opening. Notably absent are any detailed depictions of humans, which have historically been difficult for AI image and video models to generate without obvious deformations.
Google says Veo builds on its previous video generation models, including Generative Query Network (GQN), DVD-GAN, Imagen-Video, Phenaki, WALT, VideoPoet, and Lumiere. To improve quality and efficiency, Google added more detailed captions to the videos used to train Veo, allowing the AI to interpret prompts more accurately, and the model leverages compressed "latent" video representations.
Veo also appears notable for its support of filmmaking commands. "Given both an input video and an editing command, like adding kayaks to an aerial shot of a coastline, Veo can apply this command to the initial video and create a new, edited video," the company says.
While the demo looks impressive at first glance (especially compared to Will Smith eating spaghetti), Google admits that generating videos using AI is difficult. “Maintaining visual consistency can be a challenge for video generation models,” the company writes. “Characters, objects, and even entire scenes can flicker, jump, or deform unexpectedly between frames, disrupting the viewing experience.”
Google says it is trying to mitigate these shortcomings with "cutting-edge latent diffusion transformers," which is largely marketing talk without specifics to back it up. But the company is confident enough in the model that it has teamed up with actor Donald Glover and his studio, Gilga, to produce an AI-generated demonstration film that is expected to debut soon.
Initially, Veo will be accessible to select creators through VideoFX, a new experimental tool available on Google's AI Test Kitchen website, labs.google. Creators can join the VideoFX waitlist for a chance to gain access to Veo's features in the coming weeks. Google plans to integrate some of Veo's capabilities into YouTube Shorts and other products in the future.
There's no word yet on where Google got its Veo training data (if we had to guess, YouTube was likely involved). But Google says it is taking a "responsible" approach with Veo. According to the company, "Videos created by Veo are watermarked using SynthID, our state-of-the-art tool for watermarking and identifying AI-generated content, and passed through safety filters and memory-checking processes that help mitigate privacy, copyright, and bias risks."
