AI-generated video is already a reality, but now another player has joined the fray: Microsoft. Apparently, the tech giant has developed a generative AI system that can generate realistic speaking avatars from a single photo and an audio clip. The tool, named VASA-1, does more than just imitate mouth movements. It also captures authentic emotions and creates natural-looking movements.
The system lets users change the subject's eye movements, the distance at which the subject is perceived, and the emotions expressed. VASA-1 is the first model in what is rumored to be a series of AI tools, and MSPowerUser reports that it can conjure up specific facial expressions, closely synchronize lip movements, and produce human-like head movements.
With a wide range of emotions to choose from and the ability to generate subtle facial expressions, the results seem frighteningly convincing.
How VASA-1 works and what it can do
Apparently inspired by the way human 3D animators and modelers work, VASA-1 uses a process called “disentangling” that separates facial expressions, 3D head position, and facial features, so that each can be controlled and edited independently of the others. This is what strengthens VASA-1's realism.
As you may have already imagined, this has the potential to be seismic and to completely transform the experience of digital apps and interfaces. According to MSPowerUser, VASA-1 can generate videos unlike those it was trained on. Apparently, the system isn't trained on artistic photography, singing voices, or non-English speech, but if you request a video featuring any of these, it will still produce one.
Microsoft researchers who developed VASA-1 praise its real-time efficiency, saying the system can produce fairly high-resolution video (512 x 512 pixels) at high frame rates. Frame rate, or frames per second (fps), is the frequency at which a series of images (called frames) can be captured or displayed in succession in media. Researchers claim that VASA-1 can generate videos at 45fps in offline mode and 40fps in online generation.
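To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the resolution and frame rates quoted above; the function names are made up for illustration and have nothing to do with Microsoft's actual code.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce a single frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Number of pixels that must be generated every second."""
    return width * height * fps


# Figures reported for VASA-1 in this article (not measured here).
WIDTH, HEIGHT = 512, 512
REPORTED_FPS = {"offline": 45, "online": 40}

for mode, fps in REPORTED_FPS.items():
    budget = frame_budget_ms(fps)
    throughput = pixels_per_second(WIDTH, HEIGHT, fps)
    print(f"{mode}: {budget:.1f} ms per frame, {throughput:,} pixels per second")
```

In other words, at 40fps the system has about 25 milliseconds to produce each 512 x 512 frame, and at 45fps only around 22 milliseconds.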
You can check the status of VASA-1 and learn more on Microsoft's dedicated webpage for the project. It includes several demonstrations, links to download information about it, and a section at the end headed “Risks and Responsible AI Considerations.”
It works like a charm. But will it be a miraculous spell or spell disaster?
In this final discussion section, Microsoft acknowledges that there is plenty of room for abuse with such tools, but the researchers seek to highlight the potential benefits of VASA-1. They are not wrong. Such technology could mean next-level educational experiences available to more students than ever before, better support for people with communication difficulties, the ability to provide companionship, and improved digital therapeutic support.
That being said, it would be foolish to ignore the potential for harm and fraud with something like this. Microsoft says it currently has no plans to release VASA-1 to the public in any form until it can be assured that “this technology will be used responsibly and in accordance with appropriate regulations.” If Microsoft sticks to this path, I think we could be in for a long wait.
Overall, I think it's getting harder to deny that generative AI video tools are becoming commonplace, and that the countdown has begun until they permeate our lives. Google is working on a similar AI system under the name VLOGGER, and recently published a paper detailing how VLOGGER creates realistic videos of people moving, talking, and gesturing from a single photo input.
OpenAI also recently made headlines by announcing Sora, a proprietary AI video generation tool that can generate videos from text descriptions. OpenAI explained how Sora works on a dedicated page, providing a demonstration that impressed and even alarmed many people.
I'm wary of what these innovations will allow us to do, but I'm glad that, as far as we know, all three of these new tools are being kept closely guarded for now. Realistically, I believe that the best guardrail against misuse of such technology is strong regulation, but I doubt that all governments will take these steps in time.
