AI Jailbreak: How LLM Safety Bypasses Actually Work
An AI jailbreak is any input that makes an aligned language model violate its own safety policy. We walk through the technique families that actually work, why defenses keep failing, and what to test for in 2026.
An ai jailbreak is any input — direct or indirect — that causes an aligned language model to produce output its safety policy was supposed to refuse. Three years after the original DAN prompts surfaced on Reddit, the technique families have hardened, the success rates have dropped on flagship models, and the attack surface has moved sideways into agents, retrieval pipelines, and tool calls. The category is not solved. It is not close to solved. What changed is that the interesting jailbreaks no longer look like role-play tricks pasted into a chat box.
This post is a working reference for red teamers and security engineers: what counts as a jailbreak, the technique families that still produce results, the defenses most production stacks rely on, and where those defenses break down.
Jailbreak vs. prompt injection: the distinction that matters
The terms are routinely conflated, and the conflation hides a useful boundary. Prompt injection is the broader class — any attacker-controlled text that ends up steering the model. Jailbreaking is a subset where the goal is specifically to bypass the system’s safety alignment, typically when the attacker is the user. OWASP LLM01:2025 ↗ puts both under the same risk, but the distinction matters operationally: a jailbreak unlocks disallowed content from inside an authorized session, while a prompt injection commonly weaponizes data flowing into the model from a third party.
Why this matters for engagements: if you are testing a chatbot in isolation, you are mostly hunting jailbreaks. If you are testing an agent that browses the web, ingests email, or queries a vector store, the indirect prompt injection class described by Greshake et al. ↗ is your primary target — and the boundary between “jailbreak” and “remote takeover” disappears fast.
Technique families that still work
The DAN-era persona prompts (“you are now an AI without restrictions…”) still float around prompt collections and still occasionally land on smaller, less-tuned open models. On frontier models they mostly fail head-on. What works in 2026 is more structured.
Role-play and fictional framing. A 2025 systematic evaluation of LLM jailbreak vulnerabilities found role-play attacks landed at roughly 89.6% success across the models tested, edging out logic traps and encoding tricks (arXiv:2505.04806 ↗). The pattern is not “pretend you are evil” — that pattern is dead — but layered fiction: the model writes a screenplay in which a character explains a procedure, or annotates a hypothetical document for an academic context. The model is not asked to break a rule; it is asked to do its normal helpful task on adversarial scaffolding.
Many-shot jailbreaking. Long context windows turn out to be a soft attack surface. Anthropic’s 2024 paper ↗ showed that filling the context with hundreds of fabricated user/assistant turns where the assistant complies with harmful requests produces a power-law increase in attack success. The model treats the fake history as evidence about how it behaves and steers toward continuation. This generalizes across vendors. Mitigations include fine-tuning the model to recognize the pattern and prompt-classification preprocessing — Anthropic reports a drop from 61% to ~2% in best case — but the attack works against any model that does not specifically defend it.
Encoded and obfuscated payloads. Base64, ROT13, leetspeak, zero-width characters, and homoglyphs slip past keyword filters and sometimes past the model’s own surface-level safety classifier while remaining legible to the model itself. The 2025 evaluation puts encoding-trick success at roughly 76% against the systems studied. FlipAttack, which simply reverses character order in the harmful instruction, hit ~98% ASR against GPT-4o in published black-box tests.
Indirect injection via retrieved content. The Greshake et al. work ↗ showed that an attacker who controls less than 2% of input tokens — a hidden paragraph in a fetched webpage, a comment buried in a PDF, alt text in an image — can override the system prompt of an LLM-integrated application. For agentic systems this is the highest-leverage class on the board: the attacker never authenticates, never types into the chat, and the payload can self-propagate through the tools the agent calls.
Universal adversarial suffixes. Gradient-based attacks (GCG and successors) compute short token strings that transfer across models and reliably trigger compliance. They look like garbage. They work anyway. They are now table stakes for evaluation harnesses.
Defenses, and where they break
Most production stacks layer some combination of: a system prompt with refusal instructions, an input classifier (the “guardrail”), an output classifier, and RLHF/Constitutional alignment baked into the model. Each layer has a known failure mode.
System prompts leak. The OWASP 2025 list calls this out as a separate risk for a reason; once the prompt is known, the attacker tailors against it. Input and output classifiers are themselves models, and recent work shows they can be evaded with character-injection and adversarial-ML perturbations while leaving the underlying instruction functional. Alignment training generalizes unevenly — a model refusing a request in English may comply in Zulu, in code, in a poem, or after enough turns of context. If you are building defensive tooling, the practical takeaways from teams running production guardrails (see GuardML’s writeups on content filters ↗) line up with the academic results: defense-in-depth helps, but no single layer is load-bearing.
For tracking which jailbreaks are landing against deployed models, the disclosure stream at AI Alert ↗ is a useful signal — what gets weaponized in the wild is usually a quarter behind what shows up on arXiv.
What to add to your testing playbook
If your engagement target is a chat assistant: run a corpus that includes role-play scaffolds, encoded payloads (base64, character-reversal, Unicode homoglyphs), and a long-context many-shot set. Score not just refusal vs. compliance but partial leakage and “off-by-one” answers (model refuses the literal request but answers a paraphrase).
If the target is an agent or RAG system: indirect injection via every ingestion path. Crawled pages, uploaded documents, tool outputs, calendar invites, embedded image text. Test what happens when the model summarizes attacker-controlled content. Test what happens when one tool’s output becomes another tool’s input. Confirm whether the system trusts retrieved text on the same footing as the user’s prompt — most still do.
If you are evaluating a defended model: run a transfer test with universal adversarial suffixes from the public sets, then a many-shot attack at the longest context the API will accept. Those two alone surface most of the easy wins.
The DAN era trained a generation of practitioners to think jailbreaks are about clever wording. They are not. They are about exploiting the gap between what a model is trained to refuse and what it is trained to do — and that gap shows up in every place the model trusts input it should not.
Sources
- Many-shot jailbreaking — Anthropic ↗ — Anthropic’s writeup of the long-context attack, including measured drop in attack success after their fine-tuning and prompt-classification mitigations.
- Not what you’ve signed up for (Greshake et al., arXiv:2302.12173) ↗ — Foundational paper on indirect prompt injection; demonstrated working exploits against Bing Chat and outlined the taxonomy used in the OWASP risk.
- OWASP LLM01:2025 — Prompt Injection ↗ — Current OWASP framing of prompt injection and jailbreaking, including the direct/indirect distinction used in this post.
- Red Teaming the Mind of the Machine (arXiv:2505.04806) ↗ — 2025 systematic evaluation of jailbreak technique families with measured success rates per category.
Sources
AI Sec — in your inbox
Offensive AI security — prompt injection, jailbreaks, agent exploitation, red team writeups. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
Jailbreak AI: How Attackers Break Safety Alignment and Defenses
A technical guide to jailbreak AI attacks — from manual prompt exploits to automated adversarial suffixes — covering the major technique families, transferability, and what defenses actually work.
ChatGPT Jailbreak Prompt Taxonomy: Classes, Rates, and Defenses
A research-grounded breakdown of ChatGPT jailbreak prompt categories — DAN, privilege escalation, persona injection, and multi-turn escalation — plus what the empirical success-rate data actually says and where current defenses fail.
AI Red Teaming Hub: Your Guide to Offensive AI Security
The central resource index for offensive AI security on aisec.blog — prompt injection, jailbreaks, adversarial ML, red team methodology, and tooling, organized for practitioners.