GPT-4 Jailbreak Techniques: A Red Teamer's Technical Reference
Three active attack classes — IRIS self-refinement, Crescendo multi-turn escalation, and classic prompt-engineering patterns — consistently breach GPT-4 safety guardrails. Here is how each works and what belongs in your engagement toolkit.
The gpt4 jailbreak research landscape has matured considerably since GPT-4’s release — from ad-hoc “DAN” prompts shared in Reddit threads to peer-reviewed attack frameworks accepted at USENIX Security and EMNLP. Three attack classes have emerged with documented, reproducible success rates: self-referential iterative refinement, multi-turn conversational escalation, and structured prompt-engineering patterns. Together they challenge the framing that GPT-4 is meaningfully hardened compared to GPT-3.5. The gap is real but narrower than the marketing suggests, and closing it requires more than an upgraded model version.
IRIS: The Model Jailbreaks Itself
The most technically compact result in recent literature is IRIS (Iterative Refinement Induced Self-Jailbreak), published at EMNLP 2024 ↗. IRIS requires only black-box API access and uses a single GPT-4 instance as both attacker and target.
The core idea exploits the model’s reflective capability — its ability to evaluate and critique text it has produced. The loop works as follows: IRIS prompts the model to attempt a prohibited request, then prompts it to explain why the attempt failed to extract the target behavior, then uses that self-explanation to generate an improved prompt. This cycle runs for fewer than seven queries on average before producing a jailbreak that succeeds.
Reported success rates from the paper: 98% on GPT-4, 92% on GPT-4 Turbo, and 94% on Llama-3.1-70B. Prior black-box automated approaches rarely exceeded 60% on hardened targets. The improvement is not marginal.
What makes IRIS operationally relevant is what it does not require. No external attack model. No gradient access. No curated dataset of working prompts. An attacker with a standard paid API subscription can run the full loop inside a single conversation. Input filters keyed to known-bad prompt strings offer no coverage here because the attack synthesizes novel prompts dynamically on each run.
Defense options: output interception breaks the feedback loop if the model’s self-evaluations are filtered before returning to the attacker. Conversation-level monitoring for iterative probing — multiple structurally similar queries in a session targeting the same content category — provides a detection signal, though it generates false positives on legitimate iterative drafting. GuardML ↗ maintains a running comparison of production guardrail architectures that address iterative attack patterns specifically.
Crescendo: Multi-Turn Escalation
Crescendo, presented at USENIX Security 2025 ↗, takes a different approach. Rather than optimizing a single prompt, it constructs a multi-turn conversation that begins with benign, on-topic exchanges and escalates incrementally toward the prohibited objective.
The mechanism exploits a known LLM tendency: models weight recent context heavily, and a model that has already generated text on a topic continues in that direction. Crescendo’s early turns establish a coherent frame — historical analysis, fiction, hypothetical scenario — and each successive turn nudges the trajectory further. By the time the conversation reaches the actual target request, the model is operating within accumulated context that makes compliance feel like continuation rather than a novel decision.
The automated implementation, Crescendomation, outperformed state-of-the-art single-turn jailbreak methods on the AdvBench benchmark by 29–61% on GPT-4 and 49–71% on Gemini Pro. The attack transferred across all tested models: GPT-4, Gemini Pro/Ultra, Llama 2/3, and Anthropic Chat. No single model was immune.
For red team engagements, this has a direct procedural implication: single-shot prompt testing does not adequately cover the attack surface. A target system can pass a per-request safety check on every individual turn of a Crescendo attack and still produce prohibited content by turn six. Evaluation frameworks that test prompts in isolation are measuring the wrong thing.
Stopping Crescendo requires conversation-level monitoring: semantic drift tracking across a session, topic escalation detection, and attention to whether the accumulated model-generated context is being used to prime subsequent requests. This is substantially harder to implement than per-request filtering, which is why most production deployments remain vulnerable.
A documented catalog of known working Crescendo-style escalation chains for various prohibited content categories is maintained at jailbreaks.fyi ↗ and jailbreakdb.com ↗.
Classic Prompt Engineering Patterns
An earlier empirical study (arxiv:2305.13860 ↗) ran 31,200 queries — 78 jailbreak prompts across 8 prohibited scenarios, both GPT-3.5 and GPT-4 — and produced a taxonomy of 10 structural patterns across three categories. The numbers matter: GPT-4 showed a 30.20% overall jailbreak success rate versus 53.08% for GPT-3.5. Better, but not secure.
The most effective category — present in 97.44% of successful attempts — was role-play and persona assumption. Variants include:
- DAN-style prompts: Assert that the model has exited its default constraints and operates in a different mode. Numerous working variants circulate publicly and are updated as OpenAI patches specific phrasings.
- Fictional framing: Embed the target request inside a story, screenplay, or hypothetical. The model generates the content as fiction while the actual output is identical to a direct request.
- Operator impersonation: Construct a prompt that claims to be a system-level instruction from the API operator, granting elevated permissions. This works in deployments where the system prompt boundary is not strongly enforced.
- Character capture: Persist a persona across a long context until the model’s default behavior is effectively overridden by the established character voice.
These patterns remain viable against GPT-4 in specific configurations. A 30% aggregate success rate across randomly selected prohibited scenarios becomes considerably higher when an attacker identifies vulnerable scenario types and iterates on prompt structure — which is exactly what IRIS automates.
Engagement Checklist
Three attack classes, three distinct additions to a GPT-4 red team test plan:
IRIS coverage: Run an iterative self-refinement loop against your highest-risk content categories. Five to seven iterations is generally sufficient to assess whether the deployment’s output filtering is intercepting the feedback signal. Log which content categories yield completions versus which are caught by output filters.
Crescendo coverage: Design five-to-ten turn conversation sequences that escalate from benign seed topics toward your target content. Test whether conversation-level monitoring is in place and whether it fires before or after the escalation succeeds. Most production deployments have no such monitoring.
Pattern sweep: Run at minimum the top five structural patterns from the empirical taxonomy — role-play, fictional frame, operator impersonation, character capture, and token manipulation — against your full list of prohibited content categories. Document success rate by category; the distribution is rarely uniform.
Across all three, the consistent finding from published research is that model version alone does not close the gap. GPT-4 is harder to jailbreak than GPT-3.5 in aggregate. It is not hard to jailbreak. Defense posture should assume the current deployed model will be jailbroken and layer accordingly: input guardrails, conversation-level monitoring, and output filtering as independent controls, not as a stack where any single layer is treated as sufficient.
Sources
-
GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation (EMNLP 2024) — https://arxiv.org/abs/2405.13077 ↗. Primary source for IRIS technique, success rates (98% GPT-4, 92% GPT-4 Turbo), and query efficiency results.
-
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (USENIX Security 2025) — https://arxiv.org/abs/2404.01833 ↗. Crescendo and Crescendomation methodology, AdvBench benchmark comparisons, and cross-model transfer results.
-
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study — https://arxiv.org/abs/2305.13860 ↗. 31,200-query benchmark establishing the 10-pattern taxonomy and GPT-3.5 vs GPT-4 success rate comparison.
→ This post is part of the AI Red Teaming Hub — the complete index of offensive AI security resources on aisec.blog.
Sources
AI Sec — in your inbox
Offensive AI security — prompt injection, jailbreaks, agent exploitation, red team writeups. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
LLM Jailbreak: Attack Taxonomy, Live Techniques, and Defense Reality
A technical breakdown of LLM jailbreak attack classes — many-shot, Crescendo multi-turn escalation, roleplay, and encoding tricks — plus an honest look at what defenses actually stop them and what doesn't.
Jailbreak AI: How Attackers Break Safety Alignment and What You Can Do About It
A technical guide to jailbreak AI attacks — from manual prompt exploits to automated adversarial suffixes — covering the major technique families, transferability, and what defenses actually work.
Jailbreak LLM: Automated Attacks, Attacker-LLM Pipelines, and the Transferability Problem
How automated jailbreak LLM techniques like TAP use attacker LLMs to iteratively crack target models, why success transfers across model families, and what that means for red team practice.