Imagine that your neighbor calls to ask a favor: Could you please feed their pet rabbit some carrot slices? Easy enough, you’d think. You can imagine their kitchen, even if you’ve never been there — carrots in a fridge, a drawer holding various knives. It’s abstract knowledge: You don’t know what your neighbor’s carrots and knives look like exactly, but you won’t take a spoon to a cucumber.
Artificial intelligence programs can’t compete. What seems to you like an easy task is a huge undertaking for current algorithms.
An AI-trained robot can find a specified knife and carrot hiding in a familiar kitchen, but in a different kitchen it will lack the abstract skills to succeed. “They don’t generalize to new environments,” said Victor Zhong, a graduate student in computer science at the University of Washington. The machine fails because there’s simply too much to learn, and too vast a space to explore.
The problem is that these robots — and AI agents in general — don’t have a foundation of concepts to build on. They don’t know what a knife or a carrot really is, much less how to open a drawer, choose one and cut slices. This limitation is due in part to the fact that many advanced AI systems get trained with a method called reinforcement learning that’s essentially self-education through trial and error. AI agents trained with reinforcement learning can execute the job they were trained to do very well, in the environment they were trained to do it in. But change the job or the environment, and these systems will often fail.
To get around this limitation, computer scientists have begun to teach machines important concepts before setting them loose. It’s like reading a manual before using new software: You could try to explore without it, but you’ll learn far faster with it. “Humans learn through a combination of both doing and reading,” said Karthik Narasimhan, a computer scientist at Princeton University. “We want machines to do the same.”
New work from Zhong and others shows that priming a learning model in this way can supercharge learning, both in simulated environments online and in the real world with robots. And it doesn’t just make algorithms learn faster; it guides them toward skills they’d otherwise never learn. Researchers want these agents to become generalists, capable of learning anything from chess to shopping to cleaning. And as demonstrations become more practical, scientists think this approach might even change how humans can interact with robots.
“It’s been a pretty big breakthrough,” said Brian Ichter, a research scientist in robotics at Google. “It’s pretty unimaginable how far it’s come in a year and a half.”
Sparse Rewards
At first glance, machine learning has already been remarkably successful. Many models use reinforcement learning, in which algorithms learn by getting rewards. They begin totally ignorant, but trial and error eventually becomes trial and triumph. Reinforcement learning agents can easily master simple games.
Consider the video game Snake, where players control a snake that grows longer as it eats digital apples. You want your snake to eat the most apples, stay within the boundaries and avoid running into its increasingly bulky body. Such clear right and wrong outcomes give a well-rewarded machine agent positive feedback, so enough attempts can take it from “noob” to High Score.
But suppose the rules change. Perhaps the same agent must play on a larger grid and in three dimensions. While a human player could adapt quickly, the machine can’t, because of two critical weaknesses. First, the larger space means it takes longer for the snake to stumble upon apples, and learning slows exponentially when rewards become sparse. Second, the new dimension provides a totally new experience, and reinforcement learning struggles to generalize to new challenges.
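The first weakness, sparse rewards, can be illustrated with a toy experiment. The sketch below is not any researcher's actual setup: it assumes an untrained agent that moves at random on a grid with a single apple in the far corner, and it counts how many moves pass before the first reward. The function names (`steps_to_first_reward`, `mean_steps`) and the grid layout are invented for illustration.

```python
import random

def steps_to_first_reward(size, dims, rng, max_steps=1_000_000):
    """Random-walk an agent on a size**dims grid until it lands on the apple."""
    agent = [0] * dims             # start in one corner
    apple = [size - 1] * dims      # assumed: one apple in the opposite corner
    for step in range(1, max_steps + 1):
        axis = rng.randrange(dims)         # pick a direction at random
        delta = rng.choice((-1, 1))
        # Clamp the move so the agent stays within the boundaries.
        agent[axis] = min(size - 1, max(0, agent[axis] + delta))
        if agent == apple:
            return step                    # first (and only) reward found
    return max_steps

def mean_steps(size, dims, trials=100, seed=0):
    """Average number of moves before the first reward over several trials."""
    rng = random.Random(seed)
    total = sum(steps_to_first_reward(size, dims, rng) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    for size, dims in [(4, 2), (8, 2), (4, 3)]:
        print(f"{size}^{dims} grid: {mean_steps(size, dims):.0f} moves on average")
```

Comparing the small two-dimensional grid with the larger one, or with the three-dimensional one, should show the wait for a first reward growing as the space grows. An agent that learns only from rewards has nothing to learn from during that wait.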
Zhong says we don’t need to accept these obstacles. “Why is it that when we want to play chess” — another game that reinforcement learning has mastered — “we train a reinforcement learning agent from scratch?” Such approaches are inefficient. The agent wanders around aimlessly until it stumbles upon a good situation, such as a checkmate, and Zhong says it requires careful human design to get the agent to know what it means for a situation to be good. “Why do we have to do this when we already have so many books on how to play chess?”
Partly, it’s because machines have struggled to understand human language and decipher images in the first place. For a robot to complete vision-based tasks like finding and slicing carrots, for example, it must know what a carrot is — the image of a thing must be “grounded” in a more fundamental understanding of what that thing is. Until recently, there was no good way of doing that, but a boom in the speed and scale of language and image processing has made the new successes possible.
New natural language processing models allow machines to essentially learn the meaning behind words and sentences — to ground them in things in the world — rather than just store a simple (and limited) meaning like a digital dictionary.
Computer vision has seen a similar digital explosion. Around 2009, ImageNet debuted as a database of annotated images for computer vision research. Today it hosts over 14 million images of objects and places. And programs like OpenAI’s DALL·E generate new images upon command that look human-made, despite having no exact comparison to draw from.
This progress shows that machines only now have access to enough online data to really learn about the world, according to Anima Anandkumar, a computer scientist at the California Institute of Technology and Nvidia. And it’s a sign that they can learn from concepts as we do and use them for generation. “We are in such a great moment now,” she said. “Because once we can get generation, there is so much more we can do.”
Gaming the System
Researchers like Zhong decided machines didn’t have to embark on their explorations wholly uninformed anymore. Armed with sophisticated language models, the researchers could add a pre-training step where a program learned from online information before its trials and errors.
To test the idea, Zhong and his colleagues compared pre-training with traditional reinforcement learning in five different game-like settings where machine agents interpreted language commands to solve problems. Each simulated environment challenged the machine agent uniquely. One asked the agent to manipulate items in a 3D kitchen; another required reading text to learn a precise sequence of actions to fight monsters. But the most complicated setting was a real game, the 35-year-old NetHack, where the goal is to navigate a sophisticated dungeon to retrieve an amulet.
For the simple settings, automated pre-training meant simply grounding the important concepts: This is a carrot, that is a monster. For NetHack, the agent trained by watching playthroughs that human players had uploaded to the internet. These playthroughs didn’t even have to be that good, because the agent wasn’t meant to become an expert, just a regular player. It only needed to build intuition for how humans behave: What would a human do in a given scenario? The agent would then decide for itself which moves were successful, formulating its own carrot and stick.
“Through pre-training, we form good priors for how to associate language descriptions with things that are happening in the world,” Zhong said. The agent would play better from the start and learn more quickly during subsequent reinforcement learning.
As a result, the pre-trained agent did outperform the traditionally trained one. “We get gains across the board in all five of these environments,” Zhong said. Simpler settings only showed a slight edge, but in NetHack’s complicated dungeons, the agent learned many times faster and reached a skill level that the classic approach couldn’t. “You might be getting a 10x performance because if you don’t do this, then you just don’t learn a good policy,” he said.
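The effect of a good prior can be sketched in a few lines, with heavy caveats. What follows is not the researchers' method, which relied on language models and human playthroughs. It is a toy stand-in that assumes a 20-cell corridor with a reward only at the far end, a mediocre "demonstrator" that steps right 70% of the time, and ordinary tabular Q-learning. Every name and number here (`prior_from_demos`, `q_learning`, `compare`, the corridor itself) is hypothetical.

```python
import random

N_STATES = 20        # corridor cells 0..19; the only reward is at the right end
MAX_STEPS = 60       # episode cutoff
GOAL = N_STATES - 1

def move(state, action):
    """Action 0 steps left, action 1 steps right; walls clamp the position."""
    return min(GOAL, max(0, state + (1 if action == 1 else -1)))

def prior_from_demos(n_demos, rng, p_right=0.7):
    """'Pre-training': count what an imperfect demonstrator does in each cell."""
    counts = [[1.0, 1.0] for _ in range(N_STATES)]   # add-one smoothing
    for _ in range(n_demos):
        s = 0
        for _ in range(MAX_STEPS):
            a = 1 if rng.random() < p_right else 0
            counts[s][a] += 1
            s = move(s, a)
            if s == GOAL:
                break
    # Scale the action frequencies into small initial preferences.
    return [[0.1 * c / sum(row) for c in row] for row in counts]

def q_learning(q, episodes, rng, alpha=0.5, gamma=0.95, eps=0.1):
    """Epsilon-greedy tabular Q-learning; returns how many episodes hit the goal."""
    successes = 0
    for _ in range(episodes):
        s = 0
        for _ in range(MAX_STEPS):
            if rng.random() < eps or q[s][0] == q[s][1]:
                a = rng.randrange(2)                 # explore, or break a tie
            else:
                a = 0 if q[s][0] > q[s][1] else 1    # follow current estimates
            s2 = move(s, a)
            done = s2 == GOAL
            target = 1.0 if done else gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
            if done:
                successes += 1
                break
    return successes

def compare(episodes=50, seed=0):
    """Run the same learner from a demo-based prior and from scratch."""
    pretrained_q = prior_from_demos(20, random.Random(seed))
    scratch_q = [[0.0, 0.0] for _ in range(N_STATES)]
    pre = q_learning(pretrained_q, episodes, random.Random(seed + 1))
    scratch = q_learning(scratch_q, episodes, random.Random(seed + 1))
    return pre, scratch
```

Calling `compare()` returns the number of successful episodes for each agent. The demonstrator is far from expert, but its habits point the learner toward the reward from the very first episode, while the from-scratch agent has to stumble on it by chance.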
“These generalist agents are a big leap from what standard reinforcement learning does,” Anandkumar said.
Her team also pre-trains agents to get them to learn more quickly, achieving significant progress on the world’s bestselling video game, Minecraft. It’s known as a “sandbox” game, meaning it gives players a virtually infinite space in which to interact and create new worlds. It’s futile to program a reward function for thousands of tasks individually, so instead the team’s model (“MineDojo”) built its understanding of the game by watching captioned playthrough videos. No need to codify good behavior.
“We are getting automated reward functions,” Anandkumar said. “This is the first benchmark with thousands of tasks and the ability to do reinforcement learning with open-ended tasks specified through text prompts.”
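One way to picture an automated reward function is as a score for how well what the agent is doing matches a text prompt. The sketch below is only a toy under stated assumptions: it compares a prompt with a written caption of the current scene using word overlap, whereas a real system would rely on a learned model of video and language. The functions `embed`, `cosine` and `text_reward` are hypothetical and are not part of MineDojo.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': bag-of-words counts (a stand-in for a learned model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def text_reward(prompt, scene_caption):
    """Reward is higher when the scene description matches the task prompt."""
    return cosine(embed(prompt), embed(scene_caption))

if __name__ == "__main__":
    prompt = "collect wood from a tree"
    print(text_reward(prompt, "player punches a tree to collect wood"))
    print(text_reward(prompt, "player swims across a river"))
```

The appeal of this design is that one scoring function covers every task: changing the prompt changes the reward, with no need to hand-write a new reward function per task.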
Beyond Games
Games were a great way to show that pre-training models could work, but they’re still simplified worlds. Training robots to handle the real world, where the possibilities are practically endless, is much harder. “We asked the question: Is there something in between?” Narasimhan said. So he decided to do some online shopping.
His team created WebShop. “It’s basically like a shopping butler,” Narasimhan said. Users can say something like “Give me a Nike shoe that’s white and under $100, and I want the reviews to state that they’re very comfortable for toddlers,” and the program finds and buys the shoe.
As with Zhong’s and Anandkumar’s games, WebShop developed an intuition by training with images and text, this time from Amazon pages. “Over time, it learns to understand the language and map it to actions it has to take on the website.”
At first glance, a shopping butler may not seem that futuristic. But while a cutting-edge chatbot can link you to a desired sneaker, interactions like placing the order require a wholly different skill set. And even though your bedside Alexa or Google Home speakers can place orders, they rely on proprietary software that carries out preordained tasks. WebShop navigates the web the way people do: by reading, typing and clicking.
“It’s a step closer toward general intelligence,” Narasimhan said.