In a cluttered, open-plan office in Mountain View, California, a tall, slender, wheeled robot has been busy working as a tour guide and informal office helper, thanks to a large language model upgrade, Google DeepMind revealed today. The robot uses the latest version of Google's Gemini large language model to both parse commands and find its way around.
For example, a person can say, “Find me a place to write,” and the robot will dutifully trundle off and lead them to a clean whiteboard somewhere in the building.
Gemini can process video as well as text, and it can ingest large amounts of information in the form of previously recorded video tours of the office. This lets the robot make sense of its surroundings and navigate correctly when given commands that require common-sense reasoning. The robot combines Gemini with algorithms that generate specific actions for it to take (such as turning) in response to the command and what it sees in front of it.
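As a rough illustration of that two-stage split (and not DeepMind's actual code), the sketch below first picks a goal location from a recorded tour and then plans a step toward it. Everything in it is hypothetical: the tour captions, the hint table, the connectivity graph, and the function names are invented for the example, and simple keyword matching stands in for the call to a multimodal model such as Gemini.

```python
from collections import deque

# Hypothetical tour: each frame has a caption standing in for what the camera saw.
TOUR = {
    0: "entrance lobby",
    1: "hallway with desks",
    2: "kitchen with fridge",
    3: "meeting room with clean whiteboard",
}

# Hypothetical connectivity: which tour frames are directly reachable from each other.
GRAPH = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

# Hypothetical stand-in for common-sense reasoning: maps a need to an object.
HINTS = {"write": "whiteboard", "drink": "fridge"}


def pick_goal_frame(command, tour=TOUR):
    """Stage 1 (stand-in for the model): choose the tour frame that best matches the command."""
    cleaned = command.lower().replace("?", "").replace(",", "").replace(".", "")
    words = set(cleaned.split())
    # Expand the command with objects implied by the hint table.
    words |= {HINTS[w] for w in words if w in HINTS}

    def score(frame_id):
        return len(words & set(tour[frame_id].split()))

    return max(tour, key=score)


def shortest_path(graph, start, goal):
    """Stage 2a: breadth-first search over the tour graph; returns a list of frame ids."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # goal is unreachable


def next_action(path):
    """Stage 2b: turn the planned path into the next low-level action."""
    if path is None or len(path) < 2:
        return "stop"
    return f"move_to:{path[1]}"


if __name__ == "__main__":
    goal = pick_goal_frame("Find me a place to write")
    path = shortest_path(GRAPH, start=0, goal=goal)
    print(goal, path, next_action(path))
```

In a real system the first stage would be a model call over the tour video and the second stage would emit motor commands such as turns. The point of the sketch is only the division of labor between choosing a destination and getting there.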
When Gemini was announced in December, Google DeepMind CEO Demis Hassabis told WIRED that its multimodal capabilities would likely unlock new abilities for robots, adding that his company's researchers were hard at work testing the model's robotic potential.
In a new paper outlining the project, the researchers say the robot navigated correctly up to 90% of the time, even when given tricky commands such as “Where did I leave my coaster?” DeepMind's system “significantly improved the naturalness of human-robot interaction and significantly enhanced the robot's ease of use,” the team wrote.
The demo neatly illustrates the potential for large language models to reach into the real world and do useful work. Gemini and other chatbots mostly operate within the confines of a web browser or app, but they are increasingly able to process visual and auditory input, as both Google and OpenAI have demonstrated recently. In May, Hassabis showed off an upgraded version of Gemini that could understand the layout of an office viewed through a smartphone camera.
Academic and industrial research labs are racing to figure out how to use language models to make robots more capable. The May program of the International Conference on Robotics and Automation, a popular event for robotics researchers, listed nearly two dozen papers involving vision language models.
Investors are pouring money into startups aiming to apply AI advances to robotics. Some of the researchers involved in the Google project have since left the company to form a startup called Physical Intelligence, which has raised $70 million in initial funding and is working to combine large language models with real-world training to give robots general problem-solving abilities. Skild AI, founded by roboticists at Carnegie Mellon University, has a similar goal and announced $300 million in funding this month.
Just a few years ago, robots needed a map of their surroundings and carefully chosen commands to navigate successfully. Large language models contain useful information about the physical world, and newer versions trained on images and video as well as text, known as vision language models, can answer questions that require perception. Gemini allows Google's robot to parse visual instructions as well as spoken ones, for example by following a route sketched on a whiteboard to a new destination.
The researchers say in their paper that they plan to test the system with different kinds of robots, adding that Gemini should also be able to understand more complex questions, such as “Do you have my favorite drink today?” from a user who has a bunch of empty Coca-Cola cans on their desk.