Google DeepMind's robotics team is teaching robots to learn the way human interns do: by watching videos. The team has published a new paper showing how Google's RT-2 robot, which incorporates a Gemini 1.5 Pro generative AI model, absorbs information from video to learn how to navigate an environment and carry out requests once it arrives at a destination.
Thanks to the Gemini 1.5 Pro model's long context window, it's possible to train the robot like a new trainee. This window allows the AI to process large amounts of information simultaneously. Researchers film a video tour of a designated area, such as a home or office. The robot then watches the video and learns about the environment.
The demonstration video shows how the robot uses both audio and visual input to complete tasks based on what it has learned. It's an impressive display of a robot interacting with its environment in a way reminiscent of human behavior. You can see how it works, along with examples of the kinds of tasks the robot might perform, in the video below.
"Limited context length makes it hard for many AI models to remember their environment. Gemini 1.5 Pro's 1 million token context length allows our robots to successfully find their way in space using human instructions, video tours, and common sense reasoning." pic.twitter.com/eIQbtjHCbW — July 11, 2024
Robot AI Expertise
These demonstrations aren't one-off happenings, either: in real-world tests, the Gemini-powered robot operated within a 9,000-square-foot area and followed more than 50 user commands with a 90 percent success rate. That level of accuracy opens up a wide range of practical applications for AI-enabled robots, from household chores to simple office tasks and, eventually, more complex work.
That's because one of the Gemini 1.5 Pro model's most notable features is its ability to complete multi-step tasks. DeepMind research has found that the robot can understand how to navigate to a fridge, visually process what's inside, and then walk back to answer questions, like whether a particular drink is available.
Planning and executing an entire sequence of movements demonstrates a level of understanding that goes beyond the single-step instructions that are the current standard for most robots.
But don't expect this robot to go on sale anytime soon. For starters, it takes up to 30 seconds to process each instruction, which in most cases is far slower than simply doing the task yourself. And no matter how advanced the AI models become, navigating the chaotic environment of a real home or office remains far harder for a robot than operating in a controlled setting.
Still, integrating AI models like Gemini 1.5 Pro into robotics marks a major leap forward for the field. Robots equipped with Gemini and competing models could transform healthcare, delivery, and even cleaning jobs.
