A technical report released last month by artificial intelligence (AI) research organization OpenAI found that the company’s latest models, the o3 and o4-mini, generate more errors than older models. Computer scientists refer to errors caused by chatbots as “hallucinations.”
The report revealed that OpenAI’s most powerful system, o3, hallucinated 33% of the time when running PersonQA, a benchmark test that involves answering questions about famous people. The o4-mini model hallucinated 48% of the time.
Worse, OpenAI said it does not even know why these models hallucinate more than previous ones.
Here we take a look at what AI hallucinations are, why they occur, and why a new report on OpenAI’s models is important.
What are AI hallucinations?
When the term AI hallucinations first came into use to describe errors made by chatbots, its definition was narrow: it referred to cases where an AI model presents fabricated information as output. For example, in June 2023, a U.S. attorney admitted to using ChatGPT to prepare court filings after the chatbot added false citations to the submissions, pointing to cases that did not exist.
Today, hallucination has become an umbrella term for different types of mistakes made by chatbots. This includes cases where the output is factually correct but not relevant to the question actually asked.
Why do AI hallucinations occur?
ChatGPT, o3, o4-mini, Gemini, Perplexity, Grok, etc. are all examples of what are known as large language models (LLMs). These models essentially take text as input and produce synthetic text as output.
LLMs are able to do this because they are built using vast amounts of digital text taken from the internet. Simply put, computer scientists feed huge quantities of text into these models, which helps them identify patterns and relationships within that text, predict text sequences, and generate output in response to user inputs (called prompts).
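To make this concrete, here is a deliberately oversimplified sketch in Python. It builds a toy “model” that only counts which word tends to follow which in a tiny made-up corpus, and then predicts the next word from those counts. Real LLMs use neural networks trained on vastly more text, but the core idea, pattern-based next-word prediction with no notion of truth, is the same. The corpus and the predict_next function here are invented purely for illustration.

# Toy next-word predictor: counts which word follows which in a tiny corpus.
# This is an illustration of pattern-based prediction, not how real LLMs are built.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat chased the dog".split()

# For each word, count how often each other word follows it.
followers = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    followers[current_word][next_word] += 1

def predict_next(word):
    # Return the most frequently seen follower, or None if the word was never
    # seen in a position where something follows it.
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))   # "cat" - the most frequent follower of "the" in the corpus
print(predict_next("on"))    # "the" - the only follower it has ever seen
print(predict_next("dog"))   # None - no pattern learned; the model has no idea what a dog is

The point of the sketch is that the “model” never checks whether its answer is true; it only reproduces patterns it has seen, which is why flawed or unusual inputs can produce confident nonsense.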
Note that LLMs are always making inferences when producing their output. They do not know what is true and what is not, and they cannot fact-check their output against, say, Wikipedia the way humans can.
LLMs “know what words are, and they know which words predict which other words in context. They know what kinds of words come together in what order, and that’s it. They don’t work like you or me,” scientist Gary Marcus writes in his Substack newsletter, Marcus on AI.
As a result, when an LLM is trained on inaccurate text, it will produce inaccurate outputs, in other words, hallucinations.
However, even accurate training text does not prevent an LLM from making mistakes. That is because these models combine billions of patterns in unexpected ways to generate new text in response to prompts, so there is always a possibility that an LLM will give fabricated information as output.
Additionally, because LLMs are trained on such vast amounts of data, experts cannot explain why a model produces a particular sequence of text at a particular moment.
Why is OpenAI’s new report important?
Hallucinations have been a problem in AI models from the beginning, and major AI companies and research labs have repeatedly claimed that the problem would be solved in the near future. That seemed plausible, since hallucination rates tended to fall with each successive model update.
However, since the release of the new report on OpenAI’s latest models, it has become increasingly clear that hallucinations persist. The issue is not limited to OpenAI either: other reports say that Chinese startup DeepSeek’s R1 model showed a double-digit rise in hallucination rates compared with the company’s previous model.
This means that the applications of AI models must remain limited, at least for now. For example, they cannot reliably be used as research assistants (because they create fake citations in research papers) or as paralegal bots (because they present fictitious legal cases).
Computer scientists like Arvind Narayanan, a professor at Princeton University, believe that hallucinations are, to some extent, inherent in the way LLMs work, and that as these models become more capable, people will use them for more difficult tasks with higher failure rates.
In a 2024 interview with Time magazine, he said: “There is always a line between what people want to use [LLMs] for … It is as much a sociological problem as it is a technical one. And I don’t think there’s a clear technical solution to that.”
