In 1950, British mathematician Alan Turing proposed a seemingly simple challenge: can a machine converse so much like a human that people can’t reliably tell the difference? For decades, the answer was no.
A new study from the University of California, San Diego suggests that the answer has quietly become yes, and that the implications deserve more attention than they are getting.
This study is the first to rigorously test modern AI language models using Turing’s original framework.
The researchers also found, for the first time, that these models were judged to be human as often as, and in some cases much more often than, actual humans.
The experimental setup
The setup was intentionally simple. A human “interrogator” had to simultaneously converse via text with two other parties (one real human, one AI) and determine which was which.
Conversations lasted 5 or 15 minutes. Nearly 500 people participated across two groups: UC San Diego undergraduate students and a broader online sample.
Four AI systems were tested. The two most advanced models were GPT-4.5 and LLaMa-3.1-405B.
For comparison, the researchers also included GPT-4o and ELIZA. ELIZA is a rules-based chatbot from the 1960s that has been used as a reference point in AI research ever since.
AI can now deceive people
The results were striking. GPT-4.5 was identified as human 73% of the time.
This means interrogators were more likely to pick GPT-4.5 as the human than the actual person on the other side of the conversation.
LLaMa-3.1-405B was judged to be human 56% of the time, a rate statistically indistinguishable from that of its human counterparts.
The comparison systems lagged far behind: ELIZA and GPT-4o were identified as human only 23% and 21% of the time, respectively.
Personality matters more than intelligence
It was personality, not intelligence, that fooled people. Successful AI models won not by being smarter or more knowledgeable than humans, but by acting more like humans.
Each model was given a “persona” prompt that instructed it to adopt a specific human personality and communication style.
This instruction made a big difference. Without it, GPT-4.5’s success rate fell from 73% to 36%, and LLaMa-3.1’s fell from 56% to 38%.
A convincing human model
Lead study author Cameron Jones conducted the research while earning his doctorate in cognitive science at the University of California, San Diego, and is currently an assistant professor of psychology at Stony Brook University.
“What we found is that given the right prompts, advanced LLMs can display the same tone, directness, humor, and fallibility as humans,” Jones says.
“We know that LLMs can easily generate knowledge on almost any topic, but in this test we also showed that they can convincingly display social behavioral traits. This has major implications for the way we think about AI.”
Models can behave in convincingly human-like ways, but mostly when they are told exactly how to do so. Left to their own devices, they were far less convincing.
“They have the ability to look like humans, but they probably don’t have as much of an ability to understand what it takes to look like humans,” says co-author Ben Bergen, a professor of cognitive science at the University of California, San Diego.
What the Turing test measures
Seventy-six years after Turing first asked the question, it turns out that the test was measuring something quite different than what he originally intended.
“The Turing test started as a way to ask whether a machine could match human intelligence,” Bergen says.
“But we now know that AI can answer many questions faster and more accurately than humans, so the real question is not one of raw brainpower.”
“Seeing that machines can pass the test, and how they pass it, forces us to rethink what the test measures. Increasingly, it measures humanness.”
Raw intelligence (answering questions, solving problems, and processing information) is something we already know AI can do.
What is newer and stranger is AI that can imitate the texture of human conversation: the hesitations, the jokes, and the sense that there’s a human being on the other end.
How this will change online behavior
The practical implications are unsettling. These models are not passing the Turing test only in carefully controlled laboratory conditions far removed from everyday life.
They are passing it in the length and type of conversations that take place online all the time: five-minute exchanges and 15-minute chats.
“It’s relatively easy to make these models indistinguishable from humans,” Jones says. “We need to be more vigilant. When interacting with strangers online, people should be less confident that they are talking to a human being and not an LLM.”
“The Turing Test is a game of lying for the model. One implication is that the model seems to be good at it.”
“There are a lot of people who want to use bots to persuade people to share their Social Security numbers, vote for their party, or buy their products,” Bergen added.
This does not mean that AI passing the Turing test is purely bad news, and the researchers are careful not to frame it that way.
However, it does mean that a capability many people assumed was still comfortably in the future has arrived.
The research will be published in the journal Proceedings of the National Academy of Sciences.
