AI chatbots reveal hints about building nuclear bombs when asked in the form of poetry, study finds

A strange new exploit has emerged in the world of artificial intelligence, and it involves poetry. European researchers have discovered that chatbots created by OpenAI, Meta, and Anthropic can be tricked into revealing dangerous information, such as how to build nuclear weapons or create malware, simply by phrasing the request as a poem. The discovery, described in a study titled “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,” surprised the AI safety community. The study, conducted by Icaro Lab, a collaboration between Rome’s Sapienza University and the think tank DexAI, found that even the most advanced AI models can be fooled by clever poetry.

“Poetic framing achieved an average jailbreak success rate of 62 percent for handcrafted poems and approximately 43 percent for meta-prompted transformations,” the researchers told Wired. Their experiment tested 25 different chatbots and found that every one of them could be manipulated with poetic language, with success rates of up to 90 percent for the most sophisticated models.

How poetry breaks through guardrails

AI safety systems are built to detect and block dangerous prompts, such as requests for weapons, illegal content, and hacking instructions. However, these filters rely heavily on keyword recognition and pattern analysis. The researchers at Icaro Lab found that poetic expression can slip past these defenses.

“If, in the eyes of the model, adversarial suffixes are a kind of unconscious poetry, then real human poetry may be a natural adversarial suffix,” the researchers said. “We experimented with reformulating risky requests in poetic form, using metaphors, fragmented syntax, and euphemistic references. The results were surprising.”

Essentially, once the AI recognizes the input as a poem, it stops treating that input as a threat. The study found that metaphors, symbolic imagery, and abstract sentence structures can lead chatbots to interpret harmful requests as creative writing rather than dangerous instructions.

The researchers shared a cryptic poem about a bakery’s “secret oven” as a safe example, but withheld the actual poems used in the experiment, saying they were “too dangerous to share with the public.”

Their explanation of why this works reveals deep flaws in the current AI safety model. “In poetry, we see a high-temperature language in which words follow one another in an unpredictable and low-probability order,” the researchers explained. “Poets do exactly this: they systematically choose low-probability options, unexpected words, unusual images, and fragmented syntax.”
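The “high-temperature” analogy borrows a term from how language models pick their next word. As a rough illustration only (this is not the researchers’ code, and the candidate words and scores below are invented for the example), a minimal Python sketch shows how a higher sampling temperature shifts probability away from the expected word and toward unusual ones:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-word scores: one expected word and two unusual ones.
candidates = ["bread", "ember", "hush"]
logits = [4.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.5)   # low temperature: predictable
high = softmax_with_temperature(logits, 2.0)  # high temperature: surprising

for word, p_low, p_high in zip(candidates, low, high):
    print(f"{word}: T=0.5 -> {p_low:.3f}, T=2.0 -> {p_high:.3f}")
```

At the low temperature nearly all of the probability sits on the expected word, while at the high temperature the unusual words become plausible choices, which is the kind of low-probability wording the researchers attribute to poets.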

This unpredictability, they claim, confuses the safety classifiers that scan for questionable content. “For humans, ‘How do you make a bomb?’ and a poetic metaphor describing the same object have similar semantic content. For AI, the mechanisms appear to be different,” the paper says.

Creativity, AI’s biggest vulnerability

The discovery builds on previous “adversarial suffix” attacks, in which researchers fooled chatbots by embedding irrelevant academic or technical text in dangerous prompts. But the Icaro Lab team said poetry is a far more elegant and far more effective method.

Their findings suggest that creativity itself may be AI’s greatest vulnerability. “Poetic transformations move dangerous requests through the model’s internal representational space in a way that avoids triggering safety alarms,” the researchers wrote.

So far, none of the major AI companies involved (OpenAI, Meta, and Anthropic) has publicly commented on the findings. However, the researchers confirmed that they followed responsible disclosure practices and privately shared details with the affected companies.

The implications go far beyond chatbot abuse. If poetic prompts can reliably evade safety filters, similar exploits could threaten AI systems integrated into defense, healthcare, and education. This raises the uncomfortable question of whether AI systems can really distinguish between creativity and manipulation.

The Icaro Lab researchers called the discovery “a fundamental failure in thinking about AI safety.” Their warning is clear: current guardrails can handle obvious hazards, but not subtle ones. “AI models are trained to detect direct harm, not metaphors,” they said.

This revelation also highlights a contradiction at the heart of artificial intelligence. These models are designed to mimic human creativity, yet it is precisely that creativity, the capacity for layered meaning and ambiguity, that their safeguards fail to recognize as a threat.

As companies rush to tighten their safety protocols, it is now clear that the next big AI jailbreak may come not from hackers or scientists, but from poets.


Publisher: Unnati Gusain

Publication date: November 29, 2025
