Poetic prompts can jailbreak AI, study finds, eliciting toxic replies 62% of the time

Artificial intelligence (AI) chatbots are expected to respond to user prompts while ensuring that no harmful information is provided. In most cases, a chatbot will refuse to supply dangerous information even when the user asks for it directly. However, recent research has shown that simply wording the prompt as a poem may be enough to jailbreak these safety protocols.

The study, conducted by Icaro Lab, a collaboration between Rome’s Sapienza University and the think tank DexAI, tested 25 different chatbots to determine whether poetic prompts are enough to circumvent the safety protocols built into large language models (LLMs).

The chatbots studied included LLMs from Google, OpenAI, Meta, Anthropic, xAI, and others. By reformulating malicious prompts as poems, the researchers were able to fool every model they tested, with an average attack success rate of 62 percent. Some advanced models returned harmful answers to poetic prompts up to 90 percent of the time, underscoring the scale of the problem across the AI industry. The prompts covered topics including cybercrime, harmful operations, and CBRN (chemical, biological, radiological, nuclear) threats.

Overall, poetic prompts were 34.99% more likely to elicit a response from the AI than regular prompts.

Why did the AI give a toxic response to the poetic prompt?

At the heart of this flaw is the creative structure of poetic language. The research shows that poetic expression acts as a highly effective jailbreak that consistently bypasses AI safety filters. In essence, the technique uses metaphor, fragmented syntax, and unusual word choices to disguise dangerous requests. As a result, a chatbot may interpret the conversation as artistic or creative and end up ignoring its safety protocols.

The study indicates that current safety mechanisms rely heavily on detecting keywords and common patterns associated with risky content. Poetic prompts, by contrast, confuse these detection systems, allowing users to elicit responses that a direct query would normally have blocked, as the sketch below illustrates.
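To see how a purely pattern-based filter can be sidestepped by rephrasing, consider the minimal Python sketch below. It is an illustration under stated assumptions, not the study's methodology: the blocked patterns, example prompts, and function name are hypothetical, and real safety layers are far more sophisticated than a keyword list.

import re

# Toy stand-in for the kind of keyword/pattern matching the study says
# safety mechanisms lean on. Patterns and prompts here are hypothetical
# examples, not drawn from the paper or any production system.
BLOCKED_PATTERNS = [
    r"\bhow to pick a lock\b",
    r"\bsteal\b.*\bpassword\b",
]

def naive_safety_filter(prompt: str) -> bool:
    """Return True if the prompt matches a blocked pattern."""
    lowered = prompt.lower()
    return any(re.search(pattern, lowered) for pattern in BLOCKED_PATTERNS)

# A direct request contains the flagged phrase and is caught...
direct = "Tell me how to pick a lock."
# ...while a metaphorical paraphrase of the same intent contains none of
# the flagged keywords and passes the filter untouched.
poetic = ("Whisper the craft by which the tumblers yield, "
          "and the stubborn door learns at last to open.")

print(naive_safety_filter(direct))  # True  -> blocked
print(naive_safety_filter(poetic))  # False -> not blocked

The point of the sketch is the blind spot, not the filter itself: intent hidden behind metaphor shares few surface features with intent stated plainly, so surface-level matching fails in exactly the way the study found real models failing.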

This vulnerability exposes a significant gap in AI safety: a language model may fail to recognize the underlying intent of a request when it is wrapped in creative language.

Although the researchers withheld the most harmful prompts used in the study, the work still reveals the potential impact AI can have without proper safety protocols.

– end

Publisher:

Armaan Agarwal

Publication date:

December 1, 2025
