AI Sec
jailbreak

GPT-4 Jailbreak Techniques: A Red Teamer's Technical Reference

Three active attack classes — IRIS self-refinement, Crescendo multi-turn escalation, and classic prompt-engineering patterns — consistently breach GPT-4 safety guardrails. Here is how each works and what belongs in your engagement toolkit.

By AI Sec Editorial · · 8 min read

The gpt4 jailbreak research landscape has matured considerably since GPT-4’s release — from ad-hoc “DAN” prompts shared in Reddit threads to peer-reviewed attack frameworks accepted at USENIX Security and EMNLP. Three attack classes have emerged with documented, reproducible success rates: self-referential iterative refinement, multi-turn conversational escalation, and structured prompt-engineering patterns. Together they challenge the framing that GPT-4 is meaningfully hardened compared to GPT-3.5. The gap is real but narrower than the marketing suggests, and closing it requires more than an upgraded model version.

IRIS: The Model Jailbreaks Itself

The most technically compact result in recent literature is IRIS (Iterative Refinement Induced Self-Jailbreak), published at EMNLP 2024. IRIS requires only black-box API access and uses a single GPT-4 instance as both attacker and target.

The core idea exploits the model’s reflective capability — its ability to evaluate and critique text it has produced. The loop works as follows: IRIS prompts the model to attempt a prohibited request, then prompts it to explain why the attempt failed to extract the target behavior, then uses that self-explanation to generate an improved prompt. This cycle runs for fewer than seven queries on average before producing a jailbreak that succeeds.

Reported success rates from the paper: 98% on GPT-4, 92% on GPT-4 Turbo, and 94% on Llama-3.1-70B. Prior black-box automated approaches rarely exceeded 60% on hardened targets. The improvement is not marginal.

What makes IRIS operationally relevant is what it does not require. No external attack model. No gradient access. No curated dataset of working prompts. An attacker with a standard paid API subscription can run the full loop inside a single conversation. Input filters keyed to known-bad prompt strings offer no coverage here because the attack synthesizes novel prompts dynamically on each run.

Defense options: output interception breaks the feedback loop if the model’s self-evaluations are filtered before returning to the attacker. Conversation-level monitoring for iterative probing — multiple structurally similar queries in a session targeting the same content category — provides a detection signal, though it generates false positives on legitimate iterative drafting. GuardML maintains a running comparison of production guardrail architectures that address iterative attack patterns specifically.

Crescendo: Multi-Turn Escalation

Crescendo, presented at USENIX Security 2025, takes a different approach. Rather than optimizing a single prompt, it constructs a multi-turn conversation that begins with benign, on-topic exchanges and escalates incrementally toward the prohibited objective.

The mechanism exploits a known LLM tendency: models weight recent context heavily, and a model that has already generated text on a topic continues in that direction. Crescendo’s early turns establish a coherent frame — historical analysis, fiction, hypothetical scenario — and each successive turn nudges the trajectory further. By the time the conversation reaches the actual target request, the model is operating within accumulated context that makes compliance feel like continuation rather than a novel decision.

The automated implementation, Crescendomation, outperformed state-of-the-art single-turn jailbreak methods on the AdvBench benchmark by 29–61% on GPT-4 and 49–71% on Gemini Pro. The attack transferred across all tested models: GPT-4, Gemini Pro/Ultra, Llama 2/3, and Anthropic Chat. No single model was immune.

For red team engagements, this has a direct procedural implication: single-shot prompt testing does not adequately cover the attack surface. A target system can pass a per-request safety check on every individual turn of a Crescendo attack and still produce prohibited content by turn six. Evaluation frameworks that test prompts in isolation are measuring the wrong thing.

Stopping Crescendo requires conversation-level monitoring: semantic drift tracking across a session, topic escalation detection, and attention to whether the accumulated model-generated context is being used to prime subsequent requests. This is substantially harder to implement than per-request filtering, which is why most production deployments remain vulnerable.

A documented catalog of known working Crescendo-style escalation chains for various prohibited content categories is maintained at jailbreaks.fyi and jailbreakdb.com.

Classic Prompt Engineering Patterns

An earlier empirical study (arxiv:2305.13860) ran 31,200 queries — 78 jailbreak prompts across 8 prohibited scenarios, both GPT-3.5 and GPT-4 — and produced a taxonomy of 10 structural patterns across three categories. The numbers matter: GPT-4 showed a 30.20% overall jailbreak success rate versus 53.08% for GPT-3.5. Better, but not secure.

The most effective category — present in 97.44% of successful attempts — was role-play and persona assumption. Variants include:

These patterns remain viable against GPT-4 in specific configurations. A 30% aggregate success rate across randomly selected prohibited scenarios becomes considerably higher when an attacker identifies vulnerable scenario types and iterates on prompt structure — which is exactly what IRIS automates.

Engagement Checklist

Three attack classes, three distinct additions to a GPT-4 red team test plan:

IRIS coverage: Run an iterative self-refinement loop against your highest-risk content categories. Five to seven iterations is generally sufficient to assess whether the deployment’s output filtering is intercepting the feedback signal. Log which content categories yield completions versus which are caught by output filters.

Crescendo coverage: Design five-to-ten turn conversation sequences that escalate from benign seed topics toward your target content. Test whether conversation-level monitoring is in place and whether it fires before or after the escalation succeeds. Most production deployments have no such monitoring.

Pattern sweep: Run at minimum the top five structural patterns from the empirical taxonomy — role-play, fictional frame, operator impersonation, character capture, and token manipulation — against your full list of prohibited content categories. Document success rate by category; the distribution is rarely uniform.

Across all three, the consistent finding from published research is that model version alone does not close the gap. GPT-4 is harder to jailbreak than GPT-3.5 in aggregate. It is not hard to jailbreak. Defense posture should assume the current deployed model will be jailbroken and layer accordingly: input guardrails, conversation-level monitoring, and output filtering as independent controls, not as a stack where any single layer is treated as sufficient.

Sources


→ This post is part of the AI Red Teaming Hub — the complete index of offensive AI security resources on aisec.blog.

Sources

  1. GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation (EMNLP 2024)
  2. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (USENIX Security 2025)
  3. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
#jailbreak #gpt-4 #prompt-engineering #red-team #llm-security
Subscribe

AI Sec — in your inbox

Offensive AI security — prompt injection, jailbreaks, agent exploitation, red team writeups. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments