From left to right: Alejandro Matamala Ortiz, Cristobal Valenzuela and Anastasis Germanidis in their New York office. (Photo: Justin J. Wee/New York Times)
Ian Sansabella, a software architect at a New York startup called Runway AI, typed a short description of what he wanted to see in a video. “Quiet river in the woods,” he wrote.
In less than two minutes, an experimental internet service generated a short video of a quiet river in the forest. The river’s running water glistened in the sun, passed between trees and ferns, turned corners, and gently splashed over rocks.
Runway, which is about to open its service to a small group of testers, is one of several companies developing artificial intelligence technology that lets people generate a video almost instantly, simply by typing a few words into a box on a computer screen.
These companies represent the next stage of an industry race, one that involves giants such as Microsoft and Google as well as much smaller startups, to create a new kind of artificial intelligence system that some believe could be the next big thing in technology, as important as the web browser or the iPhone.
New video generation systems may speed up the work of filmmakers and other digital artists, but they also represent a new and rapid way to create hard-to-detect online misinformation, making it even more difficult to determine what is real on the internet.
These systems are examples of what is known as generative AI, which can create text, images and sounds on the fly. Another example is ChatGPT, the online chatbot created by OpenAI, a San Francisco startup, which surprised the tech industry with its capabilities late last year.
Google and Meta, the parent company of Facebook, unveiled their first video-generating systems last year, but did not release them to the public because they feared the systems could eventually be used to spread disinformation with new speed and efficiency.
But Runway CEO Cris Valenzuela said he believes the technology is too important to keep in research labs, despite the risks. “This is one of the most impressive pieces of technology we’ve built in the last 100 years,” he said. “We need people who are actually using it.”
Of course, the ability to edit and manipulate film and video is nothing new. Filmmakers have been doing it for over a century. In recent years, researchers and digital artists have used a variety of AI technologies and software programs to create and edit videos that are often referred to as deepfakes.
But systems like the one Runway has built could eventually replace editing skills with the push of a button.
Runway’s technology generates videos from short descriptions. To start, you type a description much as you would type a quick note.
The system works best when the scene has some action, but not too much, such as “a rainy day in a big city” or “a dog with a cell phone in the park.” Press Enter, and the system generates a video in a minute or two.
The technology can reproduce common images, such as a cat sleeping on a carpet. Or it can combine different concepts to produce weirdly funny videos, like cows at a birthday party.
The videos are only four seconds long, and they look choppy and blurry if you look closely. At times the images are strange, distorted and disturbing. The system has a way of fusing animals such as dogs and cats with inanimate objects such as balls and mobile phones. But given the right prompt, it produces videos that show where the technology is headed.
Phillip Isola, an AI professor at the Massachusetts Institute of Technology, said: “But that will change soon.”
Like other generative AI technologies, Runway’s system learns by analyzing digital data: in this case, photos, videos and captions that describe what those images contain. Researchers believe that by training this type of technology on ever-increasing amounts of data, they will be able to rapidly improve and expand its skills. Experts believe it will soon be able to create professional-looking mini-movies, complete with music and dialogue.
It is difficult to define what the system currently creates. It is not a photo. It is not a cartoon. It is a collection of many pixels blended together to create a realistic video. The company plans to offer its technology along with other tools it believes will speed up the work of professional artists.
Several startups, including OpenAI, have released similar technology that can generate still images from short prompts like “picture of a teddy bear riding a skateboard in Times Square.” And the rapid advances in AI-generated photos could hint at where new video technology is headed.
Last month, social media services were flooded with images of Pope Francis in a white Balenciaga puffer coat. But the images were not real. A 31-year-old construction worker in Chicago had created the viral sensation using a popular AI tool called Midjourney.
Isola has spent years building and testing this kind of technology, first as a researcher at UC Berkeley and OpenAI, then as a professor at MIT. Still, he was fooled by a crisp, high-definition, and completely fake image of Pope Francis.
“There was a time when people would post deepfakes, but they didn’t fool me because they were too outlandish or not very realistic,” he said. “Nowadays, you can’t take the images you see on the internet at face value.”
Midjourney is one of many services that can generate realistic still images from short prompts. Others include Stable Diffusion and DALL-E, the OpenAI technology that sparked this wave of photo generators when it was announced a year ago.
Midjourney relies on neural networks that learn skills by analyzing vast amounts of data. They look for patterns as they comb through millions of digital images and the text captions that describe what each image depicts.
When someone describes an image to the system, it generates a list of features that the image might include. One feature might be the curve at the top of a dog’s ear. Another might be the edge of a mobile phone. A second neural network, called a diffusion model, then builds the image, generating the pixels needed for those features. Ultimately it turns the pixels into a coherent image.
Companies like Runway, which has about 40 employees and has raised $95.5 million, are using the same technique to generate videos. By analyzing thousands of videos, their technology can learn to stitch many still images together in a similarly coherent way.
“A video is a series of frames (still images) that are combined in a way that gives the illusion of motion,” says Valenzuela. “The trick is to train the model to understand the relationships and consistency between each frame.”
Like earlier versions of tools such as DALL-E and Midjourney, this technology sometimes combines concepts and images in curious ways. If you ask for a teddy bear playing basketball, you might get something like a mutant stuffed animal with a basketball in its hand. If you ask for a dog with a mobile phone in the park, you might get a phone-wielding puppy with a strangely human body.
However, experts believe that training the system on more and more data could fix the deficiencies. They believe the technology will ultimately make creating videos as easy as writing.
“In the old days, to do something remotely like this, you had to have a camera. You had to have props. You had to have a place,” said Susan Bonser, a writer and publisher from Pennsylvania.
“You don’t have to have anything now. Sit back and imagine.”