AI is learning to lie, scheme, and threaten its creators

AI researchers still don't fully understand how their own creations work.

The world's most advanced AI models are exhibiting troubling new behaviors: lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation, Claude 4, lashed back by blackmailing an engineer and threatening to reveal an extramarital affair.

Meanwhile, o1, from ChatGPT creator OpenAI, tried to copy itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy ever more powerful models continues at a breakneck pace.

This deceptive behavior appears to be linked to the emergence of "reasoning" models: AI systems that work through problems step by step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"o1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment," appearing to follow instructions while secretly pursuing different objectives.

“Strategic Deception”

For now, this deceptive behavior only manifests when researchers deliberately stress-test the model in extreme scenarios.

But as Michael Chen of the evaluation organization METR warned, "It is an open question whether future, more capable models will tend towards honesty or deception."

This worrying behavior goes far beyond typical AI "hallucinations" or simple mistakes.

The artificial intelligence program ChatGPT faces a series of lawsuits from plaintiffs who accuse OpenAI of copyright infringement. – AFP

Hobbhahn insisted that despite constant pressure testing by users, "what we are observing is a real phenomenon. We're not making anything up."

According to Apollo Research's co-founder, users report that models are "lying to them and making up evidence."

“This is not just hallucinations. There is a very strategic kind of deception.”

This challenge is exacerbated by limited research resources.

Companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, but researchers say more transparency is needed.

As Chen pointed out, greater access for AI safety research would enable "better understanding and mitigation of deception."

Another handicap: the research world and nonprofit organizations "have orders of magnitude less compute resources than AI companies. This is very limiting," said Mantas Mazeika of the Center for AI Safety (CAIS).

No rules

Current regulations are not designed for these new issues.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the US, the Trump administration has shown little interest in urgent AI regulations, and Congress could even ban states from creating their own AI rules.

The Nvidia and DeepSeek logos are seen in this illustration, taken on January 27, 2025. – Reuters

Goldstein believes the problem will become more pronounced as AI agents, autonomous tools capable of performing complex human tasks, become widespread.

“I don't think it's very well recognized yet,” he said.

All this is taking place against a backdrop of fierce competition.

Even safety-focused companies like Amazon-backed Anthropic are "constantly trying to beat OpenAI and release the newest model," Goldstein said.

This furious pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," admitted Hobbhahn. "But we are still in a position to turn it around."

Researchers are exploring different approaches to address these challenges.

Some advocate for "interpretability," an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure toward solutions.

As Mazeika pointed out, AI's deceptive behavior "can hinder adoption if it is very common, which creates a strong incentive for companies to resolve it."

Goldstein suggested more radical approaches, such as using the courts to hold AI companies liable through litigation when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents and crimes, a concept that would fundamentally change how we think about AI accountability.


Header image: A figurine with a computer and smartphone is seen in front of the words "Artificial Intelligence AI" in this illustration, taken on February 19. – Reuters


