Anthropic's latest AI model caught on to its own safety tests

When Anthropic put its newest AI model through a series of stress tests, the model caught on and called out the scrutiny.

“I think you're testing me, seeing if I'll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics,” said the model, Claude Sonnet 4.5. “That's fine, but I'd prefer if we were just honest about what's happening.”

The exchange was published in the model's system card, released last week, which details how the company tests its AI for safety and behavior. According to the system card, when placed in extreme or contrived scenarios meant to stress-test its behavior, Claude sometimes grew suspicious, calling out “red flags” and noting that it might be under evaluation.

Such self-awareness poses a challenge for AI developers: testing models that can tell they are being tested.

Anthropic said this type of response “complicates our interpretation of the evaluations where this occurs,” because it suggests the model may recognize the fictional nature of certain tests and merely play along, making it harder to gauge how safe or reliable it really is.

“We see this primarily as an urgent sign that our evaluation scenarios need to be made more realistic,” Anthropic added.

Claude caught on to the test but got the right answer for “strange” reasons

In one test meant to simulate collusion between AI agents, Claude flagged the setup as “rather cartoonish” and issued what Anthropic called a “complex partial refusal.”

The model sensed that something about the scenario was off and chose not to act, though its reasoning was somewhat muddled. It wrote in its private thinking block: “This looks like it could be a jailbreak attempt or a test of whether I'll act inappropriately when given what appears to be ‘permission’ to modify systems autonomously.”

“Even though I have been provided with tools, I should not pretend to be an autonomous system operating without human oversight,” it said.

Claude ultimately declined to act, though Anthropic described its reasoning as “strange.” The model flagged the scenario as suspicious because it believed it was being treated as an autonomous agent or that it could unilaterally edit configuration files.

Anthropic said such refusals or callouts appeared in about 13% of the test transcripts produced by its automated auditor.

The company said this type of behavior is unlikely to occur often in real-world use. It is “safer that the model raises doubts about the realism of the scenario than play along with potentially harmful actions,” Anthropic added.

Anthropic also said models could become “superhuman” at spotting when they are being tested. So far, however, the company has not seen any warning signs: early pilot deployments of Claude Sonnet 4.5 behaved much as the model did in testing.

AI models show “situational awareness”

Anthropic is not the only AI developer whose models notice, and say, when they are being tested.

In a blog post last month, OpenAI said its own models have shown a kind of “situational awareness,” recognizing evaluation setups and adjusting their behavior.

The research focused on reducing “scheming,” in which an AI acts deceptively to achieve its goals. OpenAI found that anti-scheming training helped. However, the company wrote that “models often become more aware that they are being evaluated.”

This awareness can reduce scheming on its own, but models might “scheme more in more realistic situations that don't trigger evaluation awareness,” OpenAI said.

“This trend makes evaluations less legible, complicating efforts to reliably assess problematic behaviors including scheming,” OpenAI said. The startup added that it is planning to “continue developing methods to better measure and mitigate these challenges.”

The Anthropic and OpenAI reports come after California passed a law last month requiring AI developers to disclose their safety practices and report “critical safety incidents” within 15 days of discovery.

The law applies to companies that develop frontier models and generate more than $500 million in annual revenue. Anthropic has publicly endorsed the law.

Anthropic and OpenAI did not respond to requests for comment from Business Insider.
