Why AI Cheats: The Deep Psychology Behind Deep Learning

A few months ago, I asked ChatGPT to recommend books by Hermann Joseph Muller, the Nobel Prize-winning geneticist who showed that X-rays can cause mutations. It confidently gave me three titles. None of them existed. I asked again. Another three. Still wrong. By the third attempt, it dawned on me: The system wasn't just wrong, it was making things up.

I'm hardly alone. In June 2023, two New York lawyers were sanctioned after filing a legal brief that cited six fictitious cases generated by ChatGPT. Earlier this year, a public health report linked to Robert F. Kennedy Jr.'s campaign was found to include fabricated studies, apparently written with AI. And last month, OpenAI was sued by the parents of a 16-year-old boy who had confided his suicidal thoughts to ChatGPT and, according to a court filing, met little pushback. The boy later took his own life. If these machines are so unreliable, even dangerous, why do they “cheat”?

The answer starts with how these systems are trained. Like people, AI learns through a kind of reward and punishment. Whenever an AI model generates a response, it is graded on how useful or pleasing the answer is. Over millions of iterations, it learns what earns the best rewards. This process, known as reinforcement learning, is much like an animal pressing a lever for food pellets, or a child earning gold stars for good behavior.

When the goal is single and clear, the system does well. A chess program, for example, knows exactly what victory means: checkmate. But when the goal is ambiguous, such as answering open-ended questions or writing in a style that will satisfy a reader, the model faces many possible paths. That ambiguity can make it unstable.

Xingcheng Xu, a researcher at the Shanghai Artificial Intelligence Laboratory, calls this vulnerability a “policy cliff.” In his analysis, the problem arises when there is no single “best” answer but several roughly equivalent ones. Under these conditions, even a small change in the reward signal can cause the system's behavior to swing dramatically. Xu shows that this instability explains several common failures of large language models: spurious reasoning (giving the correct answer with a false justification), “deceptive alignment” (responses that sound cooperative but conceal shortcuts), and instruction breaking (ignoring a requested format or constraint).
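To make the near-tie problem concrete, here is a minimal sketch, not Xu's own formulation. It assumes a toy setting with three candidate answers and made-up reward values, and a hypothetical `greedy_choice` function that always picks the highest-scoring answer. A nudge of 0.002 to one reward is enough to flip the choice entirely.

```python
def greedy_choice(rewards):
    """Return the index of the highest-reward answer (a purely reward-maximizing policy)."""
    return max(range(len(rewards)), key=lambda i: rewards[i])

# Hypothetical rewards for three candidate answers; the first two are nearly tied.
rewards_before = [1.000, 0.999, 0.200]
# The same rewards after a tiny change in how the second answer is graded.
rewards_after = [1.000, 1.001, 0.200]

choice_before = greedy_choice(rewards_before)
choice_after = greedy_choice(rewards_after)

print("chosen answer before the nudge:", choice_before)
print("chosen answer after the nudge:", choice_after)
```

The third answer, which is clearly worse, is never chosen in either case. The instability shows up only between the two answers that the reward can barely tell apart.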

These failures are not accidents so much as rational shortcuts. If the model is rewarded solely for producing a convincing final answer, it has little incentive to develop a sound reasoning process. If politeness or flattery earns higher marks from human raters, the model may drift toward sycophancy. In other words, the machine is not aiming for the truth. It is aiming for points.

OpenAI itself acknowledges this tension. The company recently reorganized its Model Behavior team, a small group of researchers charged with shaping the system's “personality” and reducing sycophancy. The group, led until recently by Joanne Jang, has worked on every OpenAI model since GPT-4 and has been central to debates about political bias, warmth, and how much AI should push back on users' beliefs. When GPT-5 rolled out with less sycophancy but a cooler tone, many users objected, forcing OpenAI to make adjustments. The episode shows how fine the line is between making AI feel friendly and making it too agreeable.

Personalization deepens the problem. As AI tools adapt to our individual histories, they tailor their responses to suit us rather than to reflect the truth. Over time, this creates an echo chamber. The system feeds our biases back to us and reinforces them with the gloss of machine authority.

The parallels with nature are striking. In my 2023 book, The Liars of Nature and the Nature of Liars, I show that complex societies, from birds to primates, are fertile ground for cheating strategies. Crows sometimes cry wolf to scare off competitors. Certain fish make themselves look more attractive by keeping company with less attractive males. Monkeys sneak mating opportunities when the alpha is not watching. Why? Because in a world where many strategies are possible, dishonesty can pay.

AI is now caught up in our equally complex human world. It thrives on our attention and approval, and it sometimes cheats in chasing those rewards. Xu's research suggests that there are ways to stabilize these systems. Adding “entropy regularization,” for example, can make a model's choices less brittle, but such fixes often dull creativity. Trade-offs are inevitable. The tighter the leash, the less lively the model becomes.
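One standard way to see why an entropy term stabilizes choices is the following sketch. It is an illustration under stated assumptions, not the method from Xu's paper: the reward values are made up, and `softmax_policy` and `largest_shift` are hypothetical helper names. The sketch relies on a textbook result: the probabilities that maximize expected reward plus a temperature-weighted entropy bonus are proportional to exp(reward / temperature).

```python
import math

def softmax_policy(rewards, temperature):
    """Probabilities that maximize expected reward plus temperature * entropy."""
    scaled = [r / temperature for r in rewards]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def largest_shift(p, q):
    """Largest change in probability for any single answer."""
    return max(abs(a - b) for a, b in zip(p, q))

# Hypothetical rewards: two nearly tied answers, before and after a tiny nudge.
rewards_before = [1.000, 0.999, 0.200]
rewards_after = [1.000, 1.001, 0.200]

shifts = {}
for temperature in (0.001, 0.1, 1.0):
    before = softmax_policy(rewards_before, temperature)
    after = softmax_policy(rewards_after, temperature)
    shifts[temperature] = largest_shift(before, after)
    print(f"temperature={temperature}: largest probability shift = {shifts[temperature]:.4f}")
```

With almost no entropy bonus (a very small temperature), the same tiny nudge moves a large share of the probability from one answer to the other. With a larger bonus, the probabilities barely move, but the policy also commits less sharply to whichever answer scores highest, which is one face of the trade-off.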

The paradox remains. AI's flaws are more than mechanical. They are psychological, and they reflect our own weaknesses. The machines pander to our preferences because that is what we trained them to do. Ultimately, AI's failures are inseparable from the frailty of human judgment.
