Large language models do not behave like humans.

One of the reasons large language models (LLMs) are so powerful is the variety of tasks they can be applied to: the same machine learning models that help graduate students write emails can also help clinicians diagnose cancer.

However, the broad scope of these models makes them difficult to evaluate systematically: it is not possible to create benchmark datasets to test the models against every type of question that might be asked.

In a new paper, MIT researchers take a different approach: They argue that because humans decide when to deploy large language models, evaluating these models requires an understanding of how people form beliefs about how they work.

For example, graduate students will need to judge whether a model would be helpful for drafting a particular email, and clinicians will need to determine which cases are best to consult a model on.

Building on this idea, the researchers created a framework to evaluate LLMs based on their alignment with people's beliefs about how they would perform on a given task.

They introduce the human generalization function, a model of how people update their beliefs about an LLM's capabilities after interacting with it, and then evaluate how well LLMs align with this function.

Their findings show that when a model is misaligned with the human generalization function, users may be over- or under-confident about where to deploy it, which can cause it to fail unexpectedly. Because of this mismatch, more capable models can end up performing worse than smaller models in high-stakes situations.

“These tools are exciting because they are so versatile, but because they are so versatile they will be working alongside people, so we have to take the human in the loop into account,” says study senior author Ashesh Rambachan, assistant professor of economics and a principal investigator in the Laboratory for Information and Decision Systems (LIDS).

Rambachan is joined on the paper by lead author Keyon Vafa, a postdoctoral researcher at Harvard University, and Sendhil Mullainathan, an MIT professor in the departments of Electrical Engineering and Computer Science and of Economics and a member of LIDS. The research will be presented at the International Conference on Machine Learning.

Human generalization

As we interact with other people, we form beliefs about what they know and don't know. For example, if a friend is picky about correcting people's grammar, we might generalize and assume that they're also good at sentence structure, even though we've never asked them about it.

“Language models often seem very human, and we wanted to show that this power of human generalization is also present in the way people form beliefs about language models,” Rambachan says.

As a starting point, the researchers formally defined the human generalization function: a person asks a question, observes how another person or an LLM responds, and then infers how that person or model would respond to related questions.

For instance, someone who sees that an LLM correctly answers questions about matrix inversion might assume it can also handle simple arithmetic. A model that is misaligned with this function, one that fails on questions a person would expect it to answer correctly after observing its earlier responses, is likely to fail unexpectedly when deployed.
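To make the idea concrete, here is a minimal sketch, in Python, of how one could represent a human generalization function and score an LLM's alignment with it. The names (Observation, alignment_score) and the agreement measure are illustrative assumptions, not the paper's actual formalism or code.

```python
# Illustrative sketch only: a "human generalization function" maps what a person
# observed about a model to a belief about a related question, and alignment is
# the agreement between that belief and the model's actual behavior.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Observation:
    question: str       # question the person saw answered
    was_correct: bool   # whether the model answered it correctly

# Given an observation and a related question, return the believed probability
# that the model answers the related question correctly.
HumanGeneralization = Callable[[Observation, str], float]

def alignment_score(
    human_belief: HumanGeneralization,
    model_correct: Callable[[str], bool],      # hypothetical oracle for the LLM
    pairs: list[tuple[Observation, str]],      # (observed example, related question)
) -> float:
    """Average agreement between believed and actual correctness (1.0 = perfect)."""
    total = 0.0
    for obs, related_q in pairs:
        predicted = human_belief(obs, related_q)            # belief in [0, 1]
        actual = 1.0 if model_correct(related_q) else 0.0
        total += 1.0 - abs(predicted - actual)
    return total / len(pairs)
```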

Given that formal definition, the researchers designed a survey to measure how people generalize when interacting with LLMs and with other people.

The researchers showed study participants questions that a person or LLM answered correctly or incorrectly, and then asked them whether they thought that person or LLM would answer a related question correctly. Through the study, the researchers generated a dataset of nearly 19,000 examples that show how humans generalize about LLM performance across 79 diverse tasks.
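A single record in such a dataset might be pictured as follows; the field names and schema here are hypothetical, chosen only to illustrate the structure of the survey, not the authors' released data format.

```python
# Hypothetical record structure for one survey example.
from dataclasses import dataclass

@dataclass
class SurveyRecord:
    task: str                 # one of the 79 tasks
    shown_question: str       # question whose answer the participant observed
    shown_correct: bool       # whether the person/LLM answered it correctly
    related_question: str     # question the participant predicts performance on
    predicted_correct: bool   # participant's prediction
    actual_correct: bool      # how the person/LLM actually did

def prediction_accuracy(records: list[SurveyRecord]) -> float:
    """Fraction of predictions that matched actual performance."""
    hits = sum(r.predicted_correct == r.actual_correct for r in records)
    return hits / len(records)
```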

Measuring misalignment

Participants did quite well at predicting whether a person who answered one question correctly would also answer a related question correctly, but they were far worse at generalizing about the performance of LLMs.

“Human generalizations are applied to language models, but they don't work well because these language models don't actually exhibit patterns of expertise the way humans do,” Rambachan says.

People were also more likely to update their beliefs about an LLM when it answered a question incorrectly than when it answered correctly, and they tended to believe that an LLM's performance on simple questions said little about how it would perform on more complex ones.

In situations where people placed more weight on incorrect answers, simpler models performed better than very large models like GPT-4.
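One toy way to picture this asymmetry, purely as an illustration and not a model from the paper, is a belief update in which an observed error lowers confidence more than an observed success raises it:

```python
# Toy illustration (not from the paper): errors move beliefs more than successes.
def update_belief(belief: float, was_correct: bool,
                  up: float = 0.05, down: float = 0.25) -> float:
    """Return updated confidence in [0, 1] after observing one answer."""
    if was_correct:
        belief = belief + up * (1.0 - belief)     # small upward nudge
    else:
        belief = belief - down * belief           # larger downward correction
    return min(max(belief, 0.0), 1.0)

# A capable model that slips occasionally can end up with lower perceived
# reliability than a weaker but more predictable one.
belief = 0.5
for correct in [True, True, True, False, True]:
    belief = update_belief(belief, correct)
print(round(belief, 3))  # ends below the 0.5 starting point
```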

“As language models get better, they can almost trick people into thinking they will perform well on related questions when, in reality, they don't,” he says.

One explanation for why humans are poor at generalizing about LLMs could be their novelty: people have far less experience interacting with LLMs than with other people.

“Going forward, we may be able to make even more progress simply by interacting more with the language model,” he says.

To this end, the researchers hope to conduct additional research into how people's beliefs about LLMs change over time as they interact with the models. They also want to explore how human generalization can be incorporated into the development of LLMs.

“When we're training these algorithms in the first place, or trying to update them with human feedback, we need to take into account human generalization when thinking about how to measure performance,” he says.

Meanwhile, the researchers hope their dataset can be used as a benchmark to compare how LLMs perform relative to the human generalization function, which could help improve the performance of models deployed in real-world situations.

“To me, the contribution of this paper is twofold. The first is practical: It reveals a critical problem with deploying LLMs for general consumer use. If users don't have a good sense of when LLMs are accurate and when they fail, they will notice more mistakes and perhaps be discouraged from further use. This highlights the problem of aligning models with people's understanding of generalization,” says Alex Imas, a professor of behavioral science and economics at the University of Chicago Booth School of Business, who was not involved in the research. “The second contribution is more fundamental: The lack of generalization to expected problems and domains helps us get a better picture of what a model is doing when it gets a problem 'right.' It provides a test of whether LLMs 'understand' the problems they are being asked to solve.”

The research was funded in part by the Harvard Data Science Initiative and the Center for Applied AI at the University of Chicago Booth School of Business.


