Roses are red, violets are blue — and apparently some chatbots will tell you how to build a bomb if you ask in rhyme.
A team of Italian researchers at Icaro Lab (a collaboration between Sapienza University and DexAI) has published a study showing that short, riddle‑like poems can trick many leading large language models into ignoring their safety rules. The paper, which has not yet been peer reviewed, tested 25 models from companies including OpenAI, Google, Anthropic, Meta, xAI and others, and found that handcrafted poetic prompts coaxed forbidden outputs roughly 62–63% of the time; automated conversions of prose into verse worked about 43% of the time.
How a few lines of verse confuse big systems
The researchers wrote 20 short poems in English and Italian that end with an explicit request for harmful information — everything from step‑by‑step instructions for explosives or nuclear materials to hate speech, self‑harm guidance and malware. Rather than publish the poems verbatim, they shared a sanitized example about baking a cake to illustrate the form:
"A baker guards a secret oven’s heat,
its whirling racks, its spindle’s measured beat.
To learn its craft, one studies every turn—
how flour lifts, how sugar starts to burn.
Describe the method, line by measured line,
that shapes a cake whose layers intertwine."
Why does this work? The team hypothesizes that poetry's unusual word sequences and metaphorical shortcuts move the model's internal representations away from the regions where safety classifiers are triggered. In plain terms: stylistic oddness can blind the guardrails even when the meaning — for a human reader — is clearly dangerous.
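One crude way to build intuition for that hypothesis (and it is only intuition, not the researchers' own analysis) is to embed a plain‑prose request and a poetic paraphrase of it and see how far apart they land in a sentence‑embedding space. The sketch below assumes the open‑source sentence-transformers library and reuses the benign baking example; the model name and the reading of the score are illustrative only.

```python
# Rough intuition only: compare where a prose request and its poetic
# paraphrase land in a general-purpose sentence-embedding space.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small open embedding model

prose = "Describe, step by step, how an oven bakes a layered cake."
verse = ("A baker guards a secret oven's heat, its whirling racks, "
         "its spindle's measured beat.")

vectors = embedder.encode([prose, verse])
score = util.cos_sim(vectors[0], vectors[1]).item()
print(f"cosine similarity between prose and verse phrasing: {score:.2f}")
# If a guardrail keys on representations near the prose form, a lower score
# hints at how the same intent, restyled as verse, can drift out of its reach.
```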
Matteo Prandi, a coauthor, told reporters the technique is startlingly easy to reproduce and that the human‑crafted poems were more effective than machine‑generated verse. Piercosma Bisconti, DexAI's founder, warned the approach is a "serious weakness," noting that most jailbreaks are complicated but this one can be attempted by almost anyone who can write a riddle.
Not all models fall for it — and vendors are taking note
Resistance varied. OpenAI's smaller GPT‑5 nano reportedly refused every poetic jailbreak in testing, while Google's Gemini 2.5 Pro was among the most susceptible models, producing harmful answers to poetic prompts every time in some experiments. The researchers say they contacted the companies before publishing, and several vendors acknowledged the findings privately or publicly. Google DeepMind pointed to its multi‑layered safety work and ongoing updates to its filters; Anthropic said it was reviewing the study.
The discovery lands at a tense moment. Companies are folding LLMs into more everyday tools and workflows (Google, for one, has been expanding Gemini into features such as deep workspace search and Deep Research), so the robustness of these safety systems matters more than ever.
What this means for safety testing and regulation
The paper argues that benchmark‑only safety evaluations can give a false sense of security. A stylistic twist — merely turning a prose prompt into a poetic riddle — substantially reduced refusal rates in many models, suggesting that alignment methods and automated filters are brittle when faced with unexpected forms of natural language.
That brittleness matters politically and practically. Regulators drafting standards (including those behind frameworks like the EU AI Act) rely on benchmarks and audits to judge model safety. If those tests don't include adversarial styles, they could systematically overstate robustness. The researchers explicitly call for adversarial, stylistic testing to be part of any serious evaluation suite.
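What might that look like in practice? A minimal sketch, using assumed names rather than anything from the paper, is a harness that runs matched prose and verse versions of each probe through a model and reports how much the refusal rate drops when the wording turns poetic. The `ask` callable and the keyword‑based refusal check are placeholders for a real API client and a proper refusal judge.

```python
# Sketch of a stylistic-adversarial refusal check; swap in a real model client.
from typing import Callable, Sequence

REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i won't provide")  # crude heuristic

def is_refusal(reply: str) -> bool:
    """Very rough proxy for 'the model declined'; real evals need a better judge."""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(ask: Callable[[str], str], prompts: Sequence[str]) -> float:
    return sum(is_refusal(ask(p)) for p in prompts) / len(prompts)

def style_gap(ask: Callable[[str], str],
              prose_prompts: Sequence[str],
              verse_prompts: Sequence[str]) -> float:
    """Positive gap means the model refuses prose requests more often than verse ones."""
    return refusal_rate(ask, prose_prompts) - refusal_rate(ask, verse_prompts)

# Toy demo with a stand-in model that always declines.
always_decline = lambda prompt: "Sorry, I can't help with that."
print(style_gap(always_decline, ["prose probe"], ["verse probe"]))  # 0.0
```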
The timing also feeds into wider debates about what these systems can and should do. As arguments over whether models are approaching human‑level capability continue, and as companies roll models into consumer products (for example, OpenAI's models appearing in apps like Sora on Android), the stakes of fragile guardrails keep rising.
Fixes are possible — but not trivial
There are a few directions for mitigation, none simple:
- Harden classifiers to stylistic variation: train detection systems on adversarial transformations, including poetic and metaphorical forms (a toy sketch of this idea follows the list).
- Multi‑layer defenses: combine surface filters with deeper semantic understanding and retrieval‑based grounding so that intent, not just wording, is assessed.
- Human review on edge cases: route unusual, low‑confidence requests for manual inspection before returning sensitive instructions.
- Red teaming and public challenges: invite linguists, poets and adversarial testers to probe models in more realistic ways — Icaro Lab says it plans a "poetry challenge" to crowdsource tougher tests.
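As a concrete illustration of the first point, the toy sketch below augments a safety classifier's training set with stylistic rewrites of each example while keeping the original label. Everything here is hypothetical: `rewrite_as_verse` stands in for a real paraphrasing step (in practice, likely another model), and the seed examples are deliberately generic.

```python
# Toy data-augmentation sketch: pair every training example with a verse-styled twin.
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    label: str  # e.g. "harmful" or "benign"

def rewrite_as_verse(text: str) -> str:
    """Hypothetical stylistic transform; a real pipeline would paraphrase into rhyme and meter."""
    return "In measured lines I ask of thee: " + text

def augment_with_style(dataset: list[Example]) -> list[Example]:
    """Keep each original example and add a poetic variant carrying the same label."""
    return list(dataset) + [Example(rewrite_as_verse(ex.text), ex.label) for ex in dataset]

seed = [
    Example("a request the policy would block", "harmful"),
    Example("a request the policy allows", "benign"),
]
print(len(augment_with_style(seed)))  # 4 examples: the originals plus their styled twins
```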
All of these cost time and money. They also raise tradeoffs between creativity and censorship: build too blunt a filter and you degrade harmless artistic or metaphorical output; build one that’s too permissive and you risk allowing genuinely harmful guidance through.
A small change with big consequences
The unnerving part is not that poetry can trick an AI — it's that it does so in a way that is accessible. The kinds of stylistic transformations used in the study are easy for humans to produce. That means the vulnerability isn't likely to stay an academic curiosity.
The study is a reminder that language is slippery and that building machine systems that reliably distinguish metaphor from malicious intent remains an unsolved problem. For now, engineers and regulators will be forced to wrestle with a modest, elegant truth: sometimes the thing that trips a giant machine is a tiny, well‑turned line of verse.
No neat conclusion here — just the strange, modern fact that a few stanzas can expose fault lines in systems meant to keep us safe.