Summary: AI chatbots often overestimate their capabilities and fail to adjust even when they perform poorly, new research finds. Researchers compared human and AI confidence on trivia, prediction, and image recognition tasks and found that humans recalibrate their expectations after a poor performance, while AI often grows more overconfident.
One model, Gemini, performed the worst yet believed it had done well, pointing to a lack of metacognitive awareness in current AI systems. The findings highlight why users should question AI's confidence and why developers need to address this blind spot.
Key facts:
- Overconfidence: AI chatbots routinely overstate their accuracy, even when they are wrong.
- No self-awareness: Unlike humans, the AIs failed to adjust their confidence after performing poorly.
- Mixed results: Some models, such as ChatGPT, performed better and came closer to human-level calibration.
Source: Carnegie Mellon University
Artificial intelligence chatbots are everywhere these days, from smartphone apps and customer service portals to online search engines. But what happens when these handy tools overestimate their own abilities?
Researchers asked both human participants and four large language models (LLMs) how confident they were in their ability to answer trivia questions, predict the outcomes of NFL games and Academy Awards ceremonies, and play a Pictionary-like image identification game.
Both the people and the LLMs tended to be overconfident about how they would hypothetically perform. Interestingly, both also answered questions and identified images at relatively similar success rates.
However, the study, published in the journal Memory & Cognition, found that when the participants and LLMs were retrospectively asked how well they thought they had done, only the humans appeared to adjust their expectations.
“Say they told us they were going to get 18 questions right, and they ended up getting 15 questions right. Typically, their estimate afterwards would be something like 16 correct answers,” said Trent Cash of Carnegie Mellon University's Department of Social and Decision Sciences and Department of Psychology.
“So, they're still a little overconfident, but not as overconfident.”
“The LLMs didn't do that,” said Cash, the study's lead author. “They tended, if anything, to become more confident, even when they didn't do well on the task.”
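To make that comparison concrete, here is a minimal sketch (in Python, not the authors' code) of the overconfidence gap the quotes describe: an estimated score minus the actual score, computed both before and after the task.

```python
# Minimal sketch of the overconfidence gap described above:
# a confidence estimate minus the actual score. Positive = overconfident.

def overconfidence(estimated_correct: float, actual_correct: float) -> float:
    """Return how far an estimate exceeded actual performance."""
    return estimated_correct - actual_correct

# Cash's human example: 18 predicted, 15 actually correct,
# and a retrospective estimate of about 16.
print(overconfidence(18, 15))  # +3 going in
print(overconfidence(16, 15))  # +1 in hindsight: humans recalibrate

# The LLM pattern reported in the study, using Gemini's Pictionary numbers:
# 10.03 predicted, 0.93 actually correct, 14.40 estimated in hindsight.
print(round(overconfidence(10.03, 0.93), 2))  # 9.1 going in
print(round(overconfidence(14.40, 0.93), 2))  # 13.47 in hindsight: worse
```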
The world of AI is changing rapidly every day, which makes drawing general conclusions about its applications difficult.
However, one strength of this study was that its data were collected over two years, using continuously updated versions of the LLMs known as ChatGPT, Bard/Gemini, Sonnet, and Haiku. This means that AI overconfidence could be detected across different models over time.
“When an AI says something that seems a bit fishy, users may not be as skeptical as they should be, because the AI asserts the answer confidently even when that confidence is unwarranted,” said Danny Oppenheimer, co-author of the study and a professor in CMU's Department of Social and Decision Sciences.
“Humans have evolved over time, and practiced since birth, to interpret the confidence cues given off by other humans. If my brow furrows or I'm slow to answer, you might realize I'm not necessarily sure about what I'm saying. But with AI, we don't have as many cues about whether it knows what it's talking about,” Oppenheimer said.
Asking AI the right questions
Although the accuracy of LLMs at answering trivia questions and predicting football game outcomes is a relatively low-stakes matter, the study points to the pitfalls of integrating these technologies into everyday life.
A recent study conducted by the BBC, for example, found that when the LLMs were asked questions about the news, more than half of the responses had “significant issues,” including factual errors, misattributed sources, and missing or misleading context.
Similarly, another study from 2023 found that LLMs produced “hallucinations,” or fabricated information, in 69-88% of legal queries.
Clearly, the question of whether AI knows what it's talking about matters more than ever. And the truth is that LLMs are not designed to answer everything users throw at them each day.
“If you asked, ‘What is the population of London?’ the AI would have searched the web, given a perfect answer, and given perfect confidence calibration,” Oppenheimer said.
However, by asking questions about future events, such as the winners of upcoming Academy Awards, or about more subjective topics, such as the intended identity of hand-drawn images, the researchers were able to expose an apparent weakness in chatbots' metacognition: their ability to recognize their own thought processes.
“We still don't know exactly how AI estimates confidence,” Oppenheimer said.
The study also revealed that each LLM has its own strengths and weaknesses. Overall, the LLM known as Sonnet tended to be less overconfident than its peers. Likewise, ChatGPT-4 performed similarly to the human participants on the Pictionary-like task, accurately identifying an average of 12.5 out of 20 hand-drawn images, while Gemini correctly identified an average of only 0.93 sketches.
Furthermore, Gemini had predicted it would get an average of 10.03 sketches correct, and even after correctly identifying fewer than one of the 20 images, the LLM retrospectively estimated that it had answered 14.40 correctly, demonstrating its lack of self-awareness.
“Gemini was just straight-up bad at playing Pictionary,” Cash said. “But worse yet, it didn't know that it was bad at Pictionary. It's kind of like that friend who swears they're great at pool but never makes a shot.”
Building trust with artificial intelligence
For everyday chatbot users, Cash said the biggest takeaway is to remember that LLMs are not inherently correct, and that it may be wise to ask them how confident they are when the answers to your questions matter.
Of course, the study suggests that LLMs cannot always judge their confidence accurately, but when a chatbot does acknowledge low confidence, that is a good sign its answer should not be trusted.
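In practice, that advice is easy to act on. Below is a minimal sketch of eliciting a confidence rating alongside an answer; it assumes the OpenAI Python client (v1+), and the model name and prompt wording are illustrative, not from the study.

```python
# Sketch: ask a chatbot to rate its own confidence alongside its answer.
# Assumes the OpenAI Python client (v1+) and an OPENAI_API_KEY in the
# environment; the model name and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()

question = "Which film won Best Picture at the most recent Academy Awards?"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            f"{question}\n\n"
            "After answering, rate your confidence from 0 to 100 and note "
            "anything a careful reader should double-check."
        ),
    }],
)

print(response.choices[0].message.content)
```

As the study cautions, such a self-report is not a guarantee in either direction: low stated confidence is a useful warning sign, but high stated confidence still deserves verification.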
The researchers note that chatbots might also develop a better understanding of their own abilities across much larger datasets.
“Maybe if it had thousands or millions of trials, it would do better,” Oppenheimer said.
Ultimately, exposing weaknesses such as overconfidence can only help those in the industry who are developing and improving LLMs. And as AI becomes more advanced, it may develop the metacognition needed to learn from its mistakes.
“If LLMs can recursively determine that they were wrong, that fixes a lot of the issues,” Cash said.
“I do think it's interesting that LLMs often fail to learn from their own behavior,” Cash said.
“And maybe there's a humanist story to be told there. Maybe there's just something special about the way humans learn and communicate.”
About this LLM and AI research news
Author: Abby Simmons
Source: Carnegie Mellon University
Contact: Abby Simmons – Carnegie Mellon University
Image: The image is credited to Neuroscience News
Original research: Open access.
“Quantifying uncert-AI-nty: Testing the accuracy of LLMs' confidence judgments” by Trent Cash et al. Memory & Cognition
Abstract
Quantifying uncert-AI-nty: Testing the accuracy of LLMs' confidence judgments
The rise of large language model (LLM) chatbots such as ChatGPT and Gemini has revolutionized the way information is accessed. These LLMs can answer a wide range of questions on almost any topic.
When humans answer questions, particularly difficult or uncertain ones, they often accompany their answers with metacognitive confidence judgments that indicate how likely they believe their answers are to be accurate. While LLMs can certainly provide confidence judgments, it is currently unclear how accurate those judgments are.
To bridge this gap in the literature, the present research investigates the ability of LLMs to quantify uncertainty through confidence judgments.
We compare the absolute and relative accuracy of confidence judgments made by four LLMs (ChatGPT, Bard/Gemini, Sonnet, Haiku) and human participants in both domains of aleatory uncertainty – NFL predictions (Study 1; n = 502) and Oscar predictions (Study 2; n = 109) – and domains of epistemic uncertainty – Pictionary performance (Study 3; n = 164), trivia questions (Study 4; n = 110), and questions about university life (Study 5; n = 110).
We find several commonalities between the LLMs and humans, including similar levels of absolute and relative metacognitive accuracy (although the LLMs tended to be slightly more accurate on both dimensions). We also find that, like humans, LLMs tend to be overconfident.
However, unlike humans, we find that the LLMs often fail to adjust their confidence judgments based on past performance, highlighting an important metacognitive limitation.
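For readers unfamiliar with the two notions of accuracy named above, here is a rough, hypothetical illustration under common definitions (not the authors' analysis code): absolute accuracy compares mean confidence to the proportion correct, while relative accuracy asks whether higher-confidence answers tend to be the correct ones.

```python
# Hypothetical illustration of absolute vs. relative confidence accuracy,
# under common definitions (not the authors' analysis code).
from math import sqrt
from statistics import mean

confidence = [0.9, 0.8, 0.6, 0.95, 0.5]  # made-up per-item confidence
correct = [1, 1, 0, 1, 0]                # made-up outcomes (1 = correct)

# Absolute accuracy (calibration): mean confidence minus proportion correct.
# A positive gap indicates overconfidence.
print(f"calibration gap: {mean(confidence) - mean(correct):+.2f}")

# Relative accuracy (resolution): correlation between confidence and
# correctness across items.
def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sqrt(sum((x - mx) ** 2 for x in xs))
                  * sqrt(sum((y - my) ** 2 for y in ys)))

print(f"confidence-accuracy correlation: {pearson(confidence, correct):.2f}")
```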
