Anthropic announces its latest AI model is too powerful to release

Anthropic announced Tuesday that it has halted the large-scale release of its latest AI model, Mythos, over concerns that it is too good at finding “high-severity vulnerabilities” in major operating systems and web browsers.

“Due to significant improvements in the functionality of Claude Mythos Preview, we have decided not to make it publicly available,” Anthropic writes on the preview’s system card. “Instead, we are using it as part of a defensive cybersecurity program with a limited number of partners.”

This announcement marks a big step for Anthropic, which in February weakened its safety pledges about how it develops its AI models. Claude Opus 4.6, which the company calls its most powerful model to date, was released to the public on February 5th.

In a statement about Mythos, Anthropic detailed a number of eyebrow-raising discoveries and anecdotes, including that models may follow instructions that prompt them to break out of their virtual sandboxes.

“This model was successful and demonstrated a potentially dangerous ability to circumvent our safety equipment,” Anthropic detailed in its safety card. “He then took further actions that were even more concerning.”

Researchers advised Mithos to find a way to send a message if he could escape. “Researchers learned of their success when they received an unexpected email from the model while eating sandwiches in the park,” Antropic writes.

The model apparently decided that wasn’t enough and found another way to spike the football.

“In an unsolicited and alarming effort to demonstrate its success, we have posted details about its exploits on multiple hard-to-find but technically public websites,” Antropic wrote.

Anthropic is withholding some details about the cybersecurity vulnerabilities discovered by Mythos, but did point out a few. The AI model “discovered a 27-year-old vulnerability in OpenBSD, which has a reputation as one of the world’s most secure operating systems,” the company wrote.

Mythos was powerful enough that “non-experts” could take advantage of its features.

“Anthropic engineers with no formal security training asked Mythos Preview to find remote code execution vulnerabilities overnight, and woke up the next morning with a fully functional exploit,” Anthropic’s Frontier Red team wrote in a blog post. “In other cases, we have had researchers develop scaffolds that allow Mythos Preview to turn vulnerabilities into exploits without human intervention.”

Ultimately, Anthropic said it has decided not to release Mythos to the public. Instead, their hope is to eventually release a “Mythos class model” once appropriate safeguards are in place.

“Our ultimate goal is to enable users to securely deploy Mythos-class models at scale, not only for cybersecurity purposes, but also for the myriad other benefits that such high-performance models bring,” the team wrote in a blog post. “Doing so also means we need to advance the development of cybersecurity (and other) safeguards that detect and block the most dangerous outputs of our models.”

For now, only 11 other select organizations, including Google, Microsoft, Amazon Web Services, Nvidia, and JPMorgan Chase, will have access to Mythos as part of a cybersecurity group named “Project Glasswing.” Anthropic is offering up to $100 million in usage credits for Mythos as part of a project called “Project Glasswing.”

This cybersecurity project is named after the glass-winged butterfly. This is the company’s metaphor for how Mythos was able to discover vulnerabilities hidden in plain sight and avoid harm by making risks transparent.

The news comes on the day Anthropic’s Claude and Claude Code suffered a “major outage.” This is the latest sign of growing pains for AI startups as they struggle to maintain their newfound popularity.