Just two months after the first text-to-video tools were made public, AI is already producing high-quality advertisements and short films. Satyen K. Bordoloi explores the field and argues that we are witnessing filmmaking's next Lumiere Brothers moment.
On March 22, 1895, the Lumiere brothers showed the world's first “moving picture” to an audience. Within a few years, short films were being made and screened with the technology, and about 20 years later, feature films followed.

On February 15th of this year, OpenAI debuted its text-to-video AI model, Sora. On June 10th, Kuaishou Technology launched Kling in China. Two days later, on June 12th, Luma AI became the first to make a text-to-video (TTV) tool available to everyone in the world. A month and a half later, several AI-only ads, short films, and trailers have already been released, and the first AI-only feature film is expected this year.
It took almost 20 years for the Lumiere brothers' invention to become widespread, but an equally monumental leap in filmmaking, text-to-video, will get there within a year. The juggernaut of mass-produced logic (artificial intelligence) that now creates a certain kind of art will keep rolling, AI apocalypse notwithstanding.
Text to video conversion examples:
Leonardo Da Vinci used paint. Emily Dickinson and Walt Whitman sculpted words into poetry, and Akira Kurosawa painted poetry on celluloid. Anyone can dabble in any art form. But true mastery takes years of specific study and, as every artist will attest, decades of dedicated practice. What if you’re not a specialist but a generalist? Maybe you have a little bit of Da Vinci, Dickinson, Whitman, and Kurosawa mixed in you, and you know all the technical stuff, but you hold back from creating art because you’re not a master of any one thing, or because you’re an introvert. Can a generalist be an artist? Especially in film, which requires a specific skill set? It turns out that with AI, you can command skills that rival even Kurosawa's or Spielberg's.
Words can paint pictures, not just in your head, but on a screen too. This is the realm of text-to-video, a technological marvel that, as we saw in the examples above, is terraforming the very way our ancient planet creates and consumes visual content, from social media reels to movies. For ordinary people, it's like having a personal director in your pocket, able to transform your wildest imaginations into a two-dimensional reality on any screen, with just a few keystrokes.
The technologies behind TTV:
Text-to-video conversion, as the name suggests, is the process of converting written text into video. When you type in a prompt, an algorithm deciphers the meaning of your words, generates corresponding images (some tools also let you supply images directly), and seamlessly splices them together, with audio if the software supports it, to create a moving image, or video.
At the heart of this technology are complex models such as transformers and diffusion models, which enable machines to understand and generate human-like text and images. Known for their prowess in natural language processing, transformers break text down into meaningful units and capture the relationships between words. Diffusion models, meanwhile, excel at generating images by starting from random noise and gradually refining it, step by step, until a detailed picture emerges. By combining these powerful tools, text-to-video systems can bring written descriptions to life with incredible accuracy and creativity.
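For the curious, the idea of refining a noisy starting point step by step can be sketched in a few lines of Python. This is only a toy under loud assumptions: the five-number "image", the function names, and the step count are all made up for illustration, and the sketch cheats by already knowing the finished picture. A real diffusion model has to predict that picture with a trained neural network guided by your text prompt.

```python
import random


def toy_denoise(target, steps=50, seed=0):
    """Toy stand-in for a diffusion sampler: start from pure noise and
    refine it a little at every step.

    ASSUMPTION: a real model predicts the clean image with a trained
    network conditioned on the prompt. Here the known `target` plays
    the role of that prediction, so this shows the loop, not the learning.
    """
    rng = random.Random(seed)
    # Start from random Gaussian noise, one value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    history = [list(x)]
    for step in range(steps):
        remaining = steps - step
        # Move each pixel a fraction of the way toward the predicted
        # clean value; the fraction grows as fewer steps remain.
        x = [xi + (ti - xi) / remaining for xi, ti in zip(x, target)]
        history.append(list(x))
    return x, history


def mean_abs_error(a, b):
    """Average absolute difference between two equally long lists."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b)) / len(a)


# Hypothetical 5-pixel "image": a simple dark-to-light gradient.
target = [0.0, 0.25, 0.5, 0.75, 1.0]
final, history = toy_denoise(target)
errors = [mean_abs_error(h, target) for h in history]

# Print how far the picture is from the target at a few checkpoints.
for step in (0, 10, 25, 50):
    print(f"step {step:2d}: error {errors[step]:.4f}")
```

Run it and the error shrinks at every checkpoint: the noise settles into the gradient. Video models repeat the same kind of refinement across many frames at once, which is part of why they need so much computing power.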
The best players on the market:
At the moment, the best models already available to the public are Luma, Runway ML's Gen 3 Alpha, and Kling. OpenAI was the first to announce, with Sora, but has yet to release a mass-market product. Google's Veo is on the way, and Pika is making progress, as are many other companies emerging in this new field around the world.
I found dozens (probably hundreds) of “AI companies” that call themselves text-to-video generators. But all they do is repackage the APIs of other text-to-video models, and most of them are substandard. They are effectively ripping off paying customers, because in the land of the blind, the one-eyed man is king. Some allow the creation of deepfakes, which is dangerous because it can lead to non-consensual pornography and other harmful content.
But the three I mentioned, Luma, Runway Gen 3 Alpha and Kling, are the best out there right now. These platforms let you not only experiment with the technology but also create videos with different levels of control and customization. Seeing your words, or images you've taken or sourced from elsewhere, transformed into a high-quality, sometimes cinematic video brings back nothing less than the overwhelming awe you felt watching your first magic show.
Examples of using TTV:
Simply put, anywhere video is needed, there is a use case for text-to-video. From short social media content like reels to, initially, SFX, VFX, and establishing shots in movies, the use cases for TTV are as extensive as the imagination of creators. But those are only the obvious ones. There are some really creative uses as well. In education, for example, teachers can create quick and cheap TTV videos to explain scientific principles. TTV can also be used for film restoration. Two lost minutes of a two-hour old film can be recreated using AI, the surviving frames, and your imagination. Given the shooting script of a lost film, full of camera directions, and a few people (ideally the director) who saw the film before it was lost to time, you could use TTV to recreate the entire movie.
How TTV is changing the world:
The most obvious way TTV achieves this is by democratizing filmmaking. You can call this the advent of camera-less filmmaking, something I have said repeatedly in my previous Sify articles. Today, to make a film, you need at least a camera and a laptop. You can make a very low-budget film by getting friends and family to act, shooting with a cheap camera or a mobile phone, and editing on a laptop. That alone, however, doesn't let you make a film loaded with special effects. With TTV, even the camera is eliminated. You don't need one to make a film whose quality is almost comparable to that of a popular Hollywood action movie. All you need is a few hundred dollars, unlimited creativity, and a flexible imagination to turn your ideas into movies.
Hollywood, Bollywood collapse:
Filmmaking will be transformed the way the music industry was in the early 2000s. The rise of the internet took away the record companies' power to decide what “good music” was and to regulate listeners' choices. The “long tail” emerged, people listened to the music they liked, and even small creators could become very successful. Of course, the overall quality of lyrics and songs declined as utterly crappy music was made in the name of music. Today, all you need to make a song is a phone. But everyone now has the tools to create and experiment with what they like, and that has given rise to some amazing musicians who wouldn't have existed in the old world order.
With TTV, the same thing will happen with movies. Anyone with a phone will be able to make a feature film, and the world will be flooded with movies that look good but might not be good. Moviemaking will become as accessible to everyone as writing a blog post or a song. There will be tools that use AI to automate the entire process, making it possible for you or me to make a feature film with little to no technical skill.
This will dramatically change the old film industry. If you thought technology companies turning into film producers, as Netflix and Amazon Prime did, was the biggest change the industry would see, just watch what happens when AI on steroids is injected into the mix. The heat of this new technology will melt the old hierarchical system like a chain made of wax.
Indian cinema, especially Bollywood, is dominated by talentless stars who demand exorbitant salaries at the expense of the thousands of technicians and actors involved in the production of a film. Just as Web 2.0 gave birth to influencers and YouTube stars, Web 3.0, powered by AI, especially TTV, will give birth to the next superstar of cinema. And just as in Andrew Niccol's prophetic film “S1m0ne,” this superstar may be fictional and nonexistent.
When the Lumière brothers traveled the world in the 1890s demonstrating the camera and its magic, they could never have imagined that little more than a century later, their magic would be on the verge of being eclipsed by an even greater one. The world was enriched by the Lumière brothers, and it will be enriched even more by the conversion of text to video.