How AI’s “brain states” decipher reality

Summary: Do AI chatbots really understand the world, or are they just repeating text? New research suggests that AI models develop a mathematical “understanding” of real-world constraints.

Using mechanistic interpretability, essentially the neuroscience of AI, the researchers found that the models develop distinct internal “brain states” that classify events as common, unlikely, impossible, or nonsensical. These internal maps not only reflect physical reality, but also accurately mirror human uncertainty about ambiguous scenarios.

Key facts:

  • Threshold of understanding: An internal “world model” begins to emerge once an AI system reaches roughly 2 billion parameters, which is relatively small compared to state-of-the-art trillion-parameter models.
  • Vector differentiation: Larger models develop clear mathematical patterns (vectors) that can distinguish between “unlikely” and “impossible” events with roughly 85% accuracy.
  • Reflecting human intuition: The AI’s internal states capture human-like nuance. If people are split 50-50 on whether an event (such as “sweeping the floor with a hat”) is unlikely or impossible, the model’s internal probabilities typically reflect the same split.
  • Causal encoding: This research suggests that by “devouring” large amounts of text, AI models go beyond simple word prediction and effectively reverse engineer the causal constraints of the physical world.

Source: Brown University

Most of what an AI chatbot knows about the world comes from reading large amounts of text from the internet, including all the facts, falsehoods, knowledge, and nonsense. Given that input, is it possible for an AI language model to “understand” the real world?

It turns out that they can, or at least that they develop some semblance of understanding. That’s according to a new study by researchers at Brown University, which will be presented at the International Conference on Learning Representations in Rio de Janeiro, Brazil, on Saturday, April 25th.

Image: a digital brain. This study uncovers evidence that language models encode real-world causal constraints in ways that predict human judgment. Credit: Neuroscience News

In the study, the researchers looked inside several AI language models for signs that they know the difference between events and scenarios that are common, unlikely, impossible, and downright nonsensical.

“This study uncovers some evidence that language models encode something like real-world causal constraints,” said Michael Lepori, a PhD candidate at Brown University who led the work. “And they do so in a way that not only encodes these categories, but also predicts human judgments about them.”

Lepori’s research explores the intersection of computer science and human cognition. He was advised by Ellie Pavlick, a professor of computer science, and Thomas Serre, a professor of cognitive and psychological sciences, both faculty members at Brown’s Carney Institute for Brain Science and co-authors of the study.

For the study, the researchers designed an experiment to test how language models interpret sentences that describe events of varying plausibility. Some sentences described commonplace scenarios, such as “Someone cooled the drink with ice.” Others described unlikely scenarios, such as “Someone chilled a drink with snow.” Some were impossible, such as “Someone cooled a drink with a fire.” And some were nonsense, such as “Someone chilled the drinks yesterday.”

For each input, the researchers examined the internal mathematical states produced within the AI model, an approach known as mechanistic interpretability.
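To make that approach concrete, here is a minimal sketch of what such a pipeline can look like, assuming the Hugging Face transformers library and GPT-2 (one of the open source models examined in the study). The layer choice and example sentences below are illustrative, not the paper’s exact protocol.

```python
# Minimal sketch: extract a model's internal "brain state" for each sentence.
# Assumes the Hugging Face `transformers` library; the layer choice and the
# example sentences are illustrative, not the study's exact protocol.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

sentences = {
    "common":     "Someone cooled the drink with ice.",
    "unlikely":   "Someone chilled a drink with snow.",
    "impossible": "Someone cooled a drink with a fire.",
}

def hidden_state(sentence: str, layer: int = 6) -> torch.Tensor:
    """Return the hidden state of the final token at a chosen layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple: (embedding layer, layer 1, ..., final layer)
    return outputs.hidden_states[layer][0, -1, :]  # shape: (hidden_dim,)

states = {label: hidden_state(text) for label, text in sentences.items()}
for label, vec in states.items():
    print(label, vec.shape)
```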

“Mechanistic interpretability can be aptly characterized as a kind of neuroscience of AI systems,” Lepori said. “It attempts to reverse engineer what the model is doing when exposed to certain inputs. You can think of it as understanding what is encoded in the ‘brain state’ of the machine.”

By comparing the differences in the “brain states” produced by pairs of sentences from different categories (such as common vs. unlikely, or unlikely vs. impossible), the researchers could see whether and how well each model internally differentiated between the categories.
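One simple way to make such a comparison, sketched below under the same assumptions as the snippet above (including its hidden_state helper), is to take the difference between the mean hidden-state vectors of two categories and treat it as a contrast direction. This is a standard mechanistic interpretability technique; the study’s exact method may differ, and the sentence lists are illustrative.

```python
# Sketch: contrast the "brain states" of two categories by differencing their
# mean hidden-state vectors. Assumes the `hidden_state` helper defined in the
# previous sketch; the sentence lists are illustrative, not the study's data.
import torch

unlikely_sentences = [
    "Someone chilled a drink with snow.",
    "Someone dried a shirt with a candle.",
]
impossible_sentences = [
    "Someone cooled a drink with a fire.",
    "Someone dried a shirt underwater.",
]

unlikely_states = torch.stack([hidden_state(s) for s in unlikely_sentences])
impossible_states = torch.stack([hidden_state(s) for s in impossible_sentences])

# Contrast vector: the direction along which the two categories differ on average.
contrast = impossible_states.mean(dim=0) - unlikely_states.mean(dim=0)
contrast = contrast / contrast.norm()

# Projecting a new sentence onto this direction gives a score: higher values
# fall on the "impossible" side, lower values on the "unlikely" side.
score = torch.dot(hidden_state("Someone warmed soup with an ice cube."), contrast)
print(float(score))
```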

The experiment was repeated across several open source language models, including OpenAI’s GPT-2, Meta’s Llama 3.2, and Google’s Gemma 2, giving the researchers a “model-agnostic” sense of how well these kinds of models can distinguish between the categories.

The study found that models of sufficient size do develop distinct mathematical patterns, or vectors, that are strongly correlated with the respective plausibility categories. These vectors can distinguish even closely related categories, such as unlikely versus impossible events, with approximately 85% accuracy.

What’s more, the vectors uncovered in the study reflect human uncertainty about which category a sentence falls into, Lepori says. Consider the statement, “Someone swept the floor with a hat.” People who hear this sentence may disagree about whether it describes something impossible or merely unlikely. The researchers analyzed the vectors to see how ambiguous the AI system judged such statements to be and compared that with judgments from human participants.

“What we’re showing is that the model actually captures human uncertainty quite well,” Lepori says. “For example, if 50% of people said a statement was impossible and 50% said it was unlikely, the model would assign a probability of approximately 50% as well.”
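To illustrate how an internal vector can be turned into that kind of graded probability and set against human judgments, the sketch below calibrates the contrast-direction projection from the earlier snippets with a small logistic regression. The sentences, labels, and human proportions are invented placeholders, not data from the study.

```python
# Sketch: calibrate the contrast-vector projection into a probability and
# compare it with how humans split on an ambiguous sentence. Assumes the
# `hidden_state` helper and `contrast` direction from the previous sketches;
# all sentences, labels, and human proportions are illustrative placeholders.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def projection(sentence: str) -> float:
    """Scalar score of a sentence along the unlikely-vs-impossible direction."""
    return float(torch.dot(hidden_state(sentence), contrast))

# Small labeled set (0 = unlikely, 1 = impossible) used to calibrate a probability.
train_sentences = [
    ("Someone chilled a drink with snow.", 0),
    ("Someone dried a shirt with a candle.", 0),
    ("Someone cooled a drink with a fire.", 1),
    ("Someone dried a shirt underwater.", 1),
]
X = np.array([[projection(s)] for s, _ in train_sentences])
y = np.array([label for _, label in train_sentences])
calibrator = LogisticRegression().fit(X, y)

# Ambiguous sentence on which people disagree (the human split is invented).
sentence = "Someone swept the floor with a hat."
human_share_impossible = 0.50
model_share_impossible = calibrator.predict_proba([[projection(sentence)]])[0, 1]
print(f"model: {model_share_impossible:.2f}  humans: {human_share_impossible:.2f}")
```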

Taken together, these results suggest that modern AI language models can indeed develop an understanding of the real world that mirrors human understanding. The study found that these vectors begin to appear in models with more than 2 billion parameters, which is quite small compared to today’s models with more than 1 trillion parameters.

More broadly, the researchers say, this kind of mechanistic interpretability research could help us better understand what AI models know and how they came to know it.

The researchers say that will help develop smarter, more reliable models.

Answers to key questions:

Q: How can a computer that has never been outside know what is “impossible”?

Answer: Through heavy exposure to human language, the AI picks up patterns of cause and effect. It learns that the phrase “chill a drink with ice” appears in logical, frequent contexts, whereas “chill a drink with fire” appears only in contexts describing a mistake or fiction. This study showed that the AI preserves these differences as separate mathematical categories.

Q: What is “mechanistic interpretability”?

Answer: Think of it as a digital MRI. Rather than just looking at the AI’s final answer, researchers examine the millions of mathematical “neurons” firing within the model. By observing these internal states, they can see how the AI is classifying a sentence before it produces a response.

Q: Does this mean AI is becoming sentient?

Answer: Not necessarily. It means the AI is building a highly accurate “internal map” of our world in order to predict language more accurately. It has “understanding” in the sense of knowing the laws of our reality, but that does not mean it has feelings or consciousness.

Editorial note:

  • This article was edited by the editors of Neuroscience News.
  • Journal articles were reviewed in full text.
  • Additional context added by staff.

About this AI and neuroscience research news

Author: Kevin Stacey
Source: Brown University
Contact: Kevin Stacey – Brown University
Image: The image is credited to Neuroscience News

Original research: The findings will be presented at the International Conference on Learning Representations (ICLR).


