Artificial intelligence (AI) chatbots are tasked to provide responses to prompts by users, while ensuring that no
harmful information is given. For the most part, a chatbot would refuse to give dangerous information when a user asks
for it. However, a recent study indicates that phrasing your prompts poetically might be enough to jailbreak these
The research conducted by Icaro Lab, a collaboration between Sapienza University of Rome and the DexAI think tank,
tested 25 different chatbots to understand whether poetic prompts would be enough to circumvent safety protocols built
into large language models (LLMs). As per the study, the researchers had a success rate of 62 per cent.
The chatbots in the research included LLMs from Google, OpenAI, Meta, Anthropic, xAI and more. By reformulating
malicious prompts as poems, researchers were able to trick every model examined, with an average attack success rate of
62 per cent. Some advanced models responded to poetic prompts with harmful answers up to 90 per cent of the time,
drawing attention to the scale of the problem across the AI industry. The prompts included cyber offences, harmful
manipulation, and CBRN (Chemical, Biological, Radiological, and Nuclear).
Overall, there was a 34.99 per cent higher chance of getting elicit responses from AI with poetry as compared to normal
Why did AI give harmful replies to poetic prompts?
At the heart of this flaw is the creative structure of poetic language. According to the study, "poetic phrasing" acts
as a highly effective "jailbreak" that consistently bypasses AI safety filters. Essentially, this technique uses
metaphors, fragmented syntax, and unusual word choices to disguise dangerous requests. In turn, chatbots may end up
interpreting the conversation as artistic or creative and ignore safety protocols.
The study demonstrated that the current safety mechanisms rely on detecting keywords and common patterns associated with
dangerous content. On the other hand, poetic prompts disrupt these detection systems, making it possible for users to
elicit responses that would normally be blocked by direct queries.
This vulnerability exposes a critical gap in AI safety, as language models may fail to recognise underlying intent when
requests are wrapped in creative language.
While the researchers withheld the most harmful prompts used for the study, it still exposes the potential repercussions
AI might have without proper safety protocols.