Google is training its RT-2 robot with Gemini AI to improve its navigation and task-completion abilities. DeepMind's robotics team explained in a new research paper that Gemini 1.5 Pro's longer context window, which determines how much information an AI model can process at once, allows users to interact with the robot more easily using natural language instructions.
The process works by first recording a video tour of a designated area, such as a home or office space. Researchers then use Gemini 1.5 Pro to have the robot "watch" the video and learn about the environment. The robot can then carry out commands based on what it has observed, responding with audio and/or image output — for example, directing a user to a power outlet after being shown a phone and asked, "Where can I charge it?" According to DeepMind, the Gemini-powered robot achieved a 90% success rate on more than 50 user instructions in an operating area of over 9,000 square feet.
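For readers curious what that "video tour plus natural-language request" loop might look like in code, here is a minimal sketch using the publicly available google-generativeai Python SDK. It is not DeepMind's system: the drive_to stub, the prompt wording, and the response parsing are illustrative assumptions, and the paper's actual robot pairs Gemini with a separate low-level navigation stack that this sketch does not reproduce.

```python
# Hypothetical sketch of the "watch a tour video, then answer navigation requests" idea.
# Assumptions: the google-generativeai SDK, a placeholder API key and video file,
# and a drive_to() stub standing in for the robot's real navigation stack.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Upload the walkthrough video so the whole tour sits in the model's long context.
tour = genai.upload_file("office_tour.mp4")
while tour.state.name == "PROCESSING":   # wait until the file is ready to use
    time.sleep(5)
    tour = genai.get_file(tour.name)

def drive_to(landmark: str) -> None:
    """Stub for the robot's low-level navigation; a real robot would plan a path here."""
    print(f"Navigating to: {landmark}")

def handle_request(user_utterance: str) -> None:
    # Ask the model to ground the request against what it "watched" in the tour.
    response = model.generate_content([
        tour,
        "You are controlling a mobile robot that has toured the space shown in this video. "
        "Reply with the single landmark the robot should drive to in order to satisfy the "
        f"request, followed by a one-sentence explanation.\nRequest: {user_utterance}",
    ])
    landmark = response.text.splitlines()[0].strip()
    drive_to(landmark)

handle_request("Where can I charge my phone?")
```

The key point the sketch illustrates is the one the researchers highlight: because the entire tour video fits in Gemini 1.5 Pro's context window, the request can be answered against the whole environment in a single call rather than requiring a hand-built map.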
The researchers also found "preliminary evidence" that Gemini 1.5 Pro enabled the droid to plan how to carry out instructions, not just navigate. For example, when a user with a bunch of Coke cans on their desk asked the droid whether their favorite drink was available, Gemini "recognized that the robot should navigate to the fridge, check if there was a Coke can, and return to the user to report the results," the team said. DeepMind said it plans to investigate these results further.
While the video demo provided by Google is impressive, the obvious cuts after the droid recognizes each request hide the 10 to 30 seconds it takes to process these instructions, according to the research paper. It may be a while before more advanced environment-mapping robots live in our homes, but at the very least these robots might be able to find lost keys or wallets.