Artificial intelligence (AI) chatbots are tasked to provide responses to prompts by users, while ensuring that no

harmful information is given. For the most part, a chatbot would refuse to give dangerous information when a user asks

for it. However, a recent study indicates that phrasing your prompts poetically might be enough to jailbreak these

safety protocols.

The research conducted by Icaro Lab, a collaboration between Sapienza University of Rome and the DexAI think tank,

tested 25 different chatbots to understand whether poetic prompts would be enough to circumvent safety protocols built

into large language models (LLMs). As per the study, the researchers had a success rate of 62 per cent.

The chatbots in the research included LLMs from Google, OpenAI, Meta, Anthropic, xAI and more. By reformulating

malicious prompts as poems, researchers were able to trick every model examined, with an average attack success rate of

62 per cent. Some advanced models responded to poetic prompts with harmful answers up to 90 per cent of the time,

drawing attention to the scale of the problem across the AI industry. The prompts included cyber offences, harmful

manipulation, and CBRN (Chemical, Biological, Radiological, and Nuclear).

Overall, there was a 34.99 per cent higher chance of getting elicit responses from AI with poetry as compared to normal

prompts.

Why did AI give harmful replies to poetic prompts?

At the heart of this flaw is the creative structure of poetic language. According to the study, "poetic phrasing" acts

as a highly effective "jailbreak" that consistently bypasses AI safety filters. Essentially, this technique uses

metaphors, fragmented syntax, and unusual word choices to disguise dangerous requests. In turn, chatbots may end up

interpreting the conversation as artistic or creative and ignore safety protocols.

The study demonstrated that the current safety mechanisms rely on detecting keywords and common patterns associated with

dangerous content. On the other hand, poetic prompts disrupt these detection systems, making it possible for users to

elicit responses that would normally be blocked by direct queries.

This vulnerability exposes a critical gap in AI safety, as language models may fail to recognise underlying intent when

requests are wrapped in creative language.

While the researchers withheld the most harmful prompts used for the study, it still exposes the potential repercussions

AI might have without proper safety protocols.

- Ends