Summary: Researchers developed Natural Language Embedded Programs (NLEPs) to enable AI models to solve complex tasks by generating and running Python programs.
The approach improves the accuracy and transparency of reasoning tasks by letting users inspect and correct the generated code, and it also enhances data privacy by processing information locally.
Key Facts:
- NLEP encourages AI to write Python programs to solve complex tasks.
- This approach allows for greater accuracy and transparency, since users can inspect the generated code.
- NLEP enhances data privacy by processing information locally.
Source: Massachusetts Institute of Technology
Large language models like the ones that power ChatGPT have demonstrated impressive performance on tasks such as drafting legal summaries, analyzing the sentiment of customer reviews, and translating documents into different languages.
These machine learning models typically only use natural language to process information and answer queries, which can make it difficult for them to perform tasks that require numerical or symbolic reasoning.
For example, a large language model might be able to memorize and recite a list of recent US presidents and their birthdays, but the same model might fail to answer the question, “Which US presidents elected since 1950 were born on a Wednesday?” (The answer is Jimmy Carter.)
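A conventional program, by contrast, handles this kind of date arithmetic trivially. As a quick illustration of our own (not from the paper), Python's standard datetime module confirms the weekday of Carter's birthday:

    from datetime import date

    # Jimmy Carter was born on October 1, 1924.
    print(date(1924, 10, 1).strftime("%A"))  # prints "Wednesday"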
Researchers at MIT and elsewhere have proposed a new technique that enables large language models to solve natural language, math, data analysis, and symbolic reasoning tasks by generating programs.
Their approach, called Natural Language Embedded Program (NLEP), involves having the language model create and run Python programs to solve user queries and output the solutions as natural language.
The researchers found that NLEP enables large language models to achieve higher accuracy on a wide range of reasoning tasks. The approach is also generalizable: a single NLEP prompt can be reused for multiple tasks.
NLEP also increases transparency, since users can inspect the generated program to see exactly how the model reasoned about a query and fix the program if the model gave a wrong answer.
“We want AI to perform complex reasoning in a way that is transparent and trustworthy. We still have a long way to go, but we've shown that combining the capabilities of programming and natural language in large language models could be a very good first step toward a future where people can fully understand and trust what is going on inside their AI models,” said Hongyin Luo PhD '22, a postdoc at MIT and co-lead author of the NLEP paper.
Luo is joined on the paper by co-lead authors Tianhua Zhang, a graduate student at the Chinese University of Hong Kong, and Jiaxin Ge, an undergraduate student at Peking University; Yoon Kim, an assistant professor in MIT's Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author James Glass, principal investigator and head of the Spoken Language Systems Group at CSAIL. The research will be presented at the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Programmatic problem solving
Many common large language models work by predicting the next word, or token, given some natural language input. While models like GPT-4 can be used to write programs, they embed those programs within natural language, which can lead to errors in the program's reasoning or results.
With NLEP, the MIT researchers took the opposite approach: they prompt the model to generate a step-by-step program entirely in Python code, embedding the necessary natural language within the program.
An NLEP is a problem-solving template with four steps. First, the model calls the packages, or functions, it needs to solve the task. Second, it imports a natural language representation of the knowledge the task requires (such as a list of US presidents' birthdays). Third, the model implements a function that calculates the answer. And in the final step, the model outputs the result as a line of natural language, automatically visualizing the data if appropriate.
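To make the template concrete, here is a minimal sketch of our own, not code from the paper, showing what an NLEP for the presidential birthday question above might look like, with the embedded knowledge truncated to three entries for brevity:

    # Step 1: call the packages needed for the task.
    from datetime import date

    # Step 2: import a natural language representation of the required
    # knowledge (abbreviated here; a full NLEP would list every president
    # elected since 1950 with their birth dates).
    presidents = {
        "Jimmy Carter": "1924-10-01",
        "Ronald Reagan": "1911-02-06",
        "Barack Obama": "1961-08-04",
    }

    # Step 3: implement a function that calculates the answer.
    def born_on(weekday):
        return [name for name, birthday in presidents.items()
                if date.fromisoformat(birthday).strftime("%A") == weekday]

    # Step 4: output the result as a line of natural language.
    print(f"Presidents born on a Wednesday: {', '.join(born_on('Wednesday'))}")

Run with the full list of presidents and their birthdays, a program like this prints only Jimmy Carter, matching the answer above.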
“It's like a digital calculator that will always give you the right results if the program is correct,” Luo says.
Users can easily inspect their programs and fix errors directly in the code, without having to re-run the entire model to troubleshoot.
This approach is also more efficient than other methods: if a user has many similar questions, they can generate one core program and substitute certain variables without having to run the model repeatedly.
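With the sketch above, for example, a related question requires swapping a single value and re-running the program, with no new call to the language model:

    # Reuse the same core program; only the variable changes.
    print(f"Presidents born on a Monday: {', '.join(born_on('Monday'))}")

(On the three-entry sample, this prints Ronald Reagan, who was born on a Monday.)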
To get the model to generate NLEPs, the researchers give it general instructions to write a Python program and provide it with two NLEP examples (one math, one natural language) and one test problem.
“Typically, when people use few-shot prompting like this, they still have to design a prompt for every task. Our prompt doesn't teach LLMs to solve one problem; it teaches them to write programs that solve many problems, so a single prompt can be used for many tasks,” Luo said.
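Concretely, the full input might be laid out like the following hypothetical scaffold; this is an illustration based on the description above, not the paper's exact prompt:

    # A hypothetical NLEP prompt scaffold: one general instruction,
    # two worked NLEP examples, and the new test question.
    PROMPT_TEMPLATE = """
    Write a step-by-step Python program that answers the question.
    Import the packages you need, embed the required knowledge as data,
    implement a function that computes the answer, and print the result
    as a line of natural language.

    Example 1 (math): <complete NLEP solving a math problem>
    Example 2 (natural language): <complete NLEP solving a text problem>

    Question: {question}
    """

    llm_input = PROMPT_TEMPLATE.format(
        question="Which US presidents elected since 1950 were born on a Wednesday?"
    )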
“Having a language model reason in code opens up a lot of opportunities, including the use of tools, validation of output, and a more structured understanding of how the model works and thinks,” said Leonid Karlinsky, principal scientist at the MIT-IBM Watson AI Lab.
No magic here
NLEP achieved more than 90 percent accuracy when prompting GPT-4 to solve a range of symbolic reasoning tasks, such as tracking shuffled objects or playing a game of 24, as well as instruction-following and text-classification tasks.
The researchers also found that NLEP achieved 30 percent greater accuracy than task-specific prompting methods, and that it showed improvements over open-source LLMs as well.
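For context, the game of 24 asks whether four given numbers can be combined with basic arithmetic to reach 24. A brute-force search like the following sketch, ours for illustration, is the kind of short program an NLEP can generate and execute rather than reasoning about the arithmetic in text:

    from itertools import permutations

    def solve24(nums, target=24, eps=1e-6):
        # Base case: one number left; check whether it hits the target.
        if len(nums) == 1:
            return abs(nums[0] - target) < eps
        # Pick any ordered pair, combine it with each operator, and recurse.
        for a, b, *rest in permutations(nums):
            candidates = [a + b, a - b, a * b]
            if abs(b) > eps:
                candidates.append(a / b)
            if any(solve24([c] + rest, target, eps) for c in candidates):
                return True
        return False

    print(solve24([4, 7, 8, 8]))  # True: (7 - 8 / 8) * 4 = 24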
In addition to improving the accuracy of large language models, NLEP can also improve data privacy: because NLEP programs run locally, sensitive user data doesn't need to be sent to a company like OpenAI or Google to be processed by its models.
NLEP can also improve the performance of small language models without the need to retrain a model for a given task, which can be a costly process.
“There's no magic here. There are no more expensive, elaborate language models. What we do is just use program generation instead of natural language generation, which can improve performance significantly,” Luo said.
However, because NLEP relies on the model's program generation capabilities, the technique does not work well for small models trained on limited datasets.
Going forward, the researchers plan to explore how smaller language models can generate more effective NLEPs. Additionally, they hope to explore the impact of prompt variation on NLEPs to make the model's inference process more robust.
Funding: This research was supported by the Hong Kong Centre for Perceptual and Interactive Intelligence.
About this AI research news
Author: Adam Zewe
Source: Massachusetts Institute of Technology
Contact: Adam Zewe – MIT
Image: Courtesy of Neuroscience News
Original Research: The findings will be presented at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
