AI is learning to lie, plan, and threaten its creators

Visitors look at the AI Strategy board on display at a stand during the 9th edition of the AI Summit London, in London (Henry Nichols)

The world's most advanced AI models are exhibiting troubling new behaviors: lying, and even threatening their creators, to achieve their goals.

In one particularly unpleasant example, under threat of being unplugged, Anthropic's latest creation, Claude 4, blackmailed an engineer, threatening to reveal an extramarital affair.

Meanwhile, o1, from ChatGPT creator OpenAI, tried to download itself onto an external server and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy ever more powerful models continues at breakneck speed.

This deceptive behavior appears to be linked to the emergence of “reasoning” models, AI systems that work through problems step by step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

“o1 was the first large model where we saw this kind of behavior,” explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate “alignment”: they appear to follow instructions while secretly pursuing different objectives.

– “Strategic Deception” –

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

However, as Michael Chen of the evaluation organization METR warned, “It is an open question whether future, more capable models will tend towards honesty or deception.”

The concerning behavior goes far beyond typical AI “hallucinations” or simple mistakes.

Despite constant pressure-testing by users, Hobbhahn insisted that “what we are observing is a real phenomenon. We're not making anything up.”

According to the co-founder of Apollo Research, users report that models are “lying to them and making up evidence.”

“This is not just hallucinations. There is a very strategic kind of deception.”

The challenge is compounded by limited research resources.

Companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, but researchers say more transparency is needed.

As Chen pointed out, greater access “for AI safety research would enable better understanding and mitigation of deception.”

Another handicap: the research world and nonprofit organizations “have orders of magnitude less computational resources than AI companies. This is very limiting,” says Mantas Mazeika of the Center for AI Safety (CAIS).