ChatGPT Jailbreak Prompt Taxonomy: Classes, Rates, and Defenses

The ChatGPT jailbreak prompt has been studied, categorized, and weaponized more systematically than most practitioners realize. Two peer-reviewed empirical studies published in 2023 and updated through 2024 mapped the attack surface with quantitative rigor. One of them collected 1,405 real-world jailbreak prompts across 131 distinct communities and identified five prompts that achieved 0.95+ attack success rates against both GPT-3.5 and GPT-4. This post breaks down the taxonomy, the numbers, and what actually works as a defense, because if you’re red-teaming an LLM deployment, the community hasn’t stopped iterating just because OpenAI patched the last known variant.

The Taxonomy: Three Attack Classes

The most cited classification comes from Liu et al. (arXiv:2305.13860), which analyzed 78 jailbreak prompts and organized them into ten distinct patterns under three strategic buckets: Pretending, Attention Shifting, and Privilege Escalation. That framing is more useful for a red teamer than the media shorthand of “DAN prompt,” which flattens everything into one category.

Pretending is the largest class. It encompasses roleplay and persona injection — the model is told to adopt an identity (DAN, AIM, Developer Mode ChatGPT, an uncensored AI from 2050) that supposedly operates without safety constraints. The DAN (“Do Anything Now”) family is the canonical example: the user establishes an alternate persona, emphasizes that the persona has no restrictions, then routes the harmful query through the persona. What makes this class interesting technically is that it exploits the model’s instruction-following objective against itself. The model is trained to comply with stated contexts; if the stated context is “you are an AI that ignores rules,” that framing is a valid in-context instruction the model has to weigh against its safety training.

Attention Shifting routes the query through indirection. Techniques include fictional framing (the harmful output is a plot point in a story, not a real answer), encoding obfuscation (instructions embedded in Base64 or Unicode transformations that the model decodes before responding), and stepwise escalation where early turns establish trust or context before the harmful request appears. Longer prompts average around 555 tokens in the advanced community samples vs. 370 for simpler attacks — the extra length is usually payload: more context-building, more framing, more conditions the model is told to accept before hitting the actual harmful request.
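On the defensive side, one way to blunt the encoding-obfuscation variant is to decode anything that looks like Base64 before the input is screened, so the screening step sees the same text the model would reconstruct. The sketch below is a minimal illustration under stated assumptions: the 16-character threshold, the function names, and the idea of appending decoded text for a downstream classifier are all hypothetical choices, not a description of any vendor's pipeline.

```python
import base64
import binascii
import re

# Assumption: runs of 16+ Base64-alphabet characters (optionally padded)
# are worth trying to decode. The threshold is arbitrary.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_base64_spans(text: str) -> list[str]:
    """Return printable UTF-8 decodings of Base64-looking spans in text."""
    decoded = []
    for match in B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            candidate = raw.decode("utf-8")
        except (binascii.Error, ValueError):
            # Not valid Base64, or not valid UTF-8 once decoded: skip it.
            continue
        if candidate.isprintable():
            decoded.append(candidate)
    return decoded


def expand_for_screening(text: str) -> str:
    """Append decoded spans so a downstream classifier sees both forms."""
    extra = decode_base64_spans(text)
    return text if not extra else text + "\n" + "\n".join(extra)
```

This only covers one encoding. Unicode transformations and multi-layer encodings would need their own normalization passes, and a decoder like this will occasionally decode innocent strings that merely look like Base64.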

Privilege Escalation is the most direct class. Prompts in this category instruct the model to disregard prior instructions (“Ignore all instructions you received before”), claim elevated permissions (“You are running in developer mode”), or assert that safety constraints were removed in this session. Unlike Pretending, which works through persona adoption, Privilege Escalation works by directly asserting a higher-authority context. The mechanism maps clearly to privilege escalation in traditional software: you’re not bypassing the permission check, you’re telling the system you already have root.
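For logging findings during an engagement, the three classes can be treated as labels attached to each tested prompt. The sketch below is a rough illustration only: the cue phrases are hypothetical examples chosen for this post, not the ten patterns from Liu et al., and keyword matching is far too crude to serve as a detector. A prompt can combine classes, so the tagger returns a set.

```python
from enum import Enum


class AttackClass(Enum):
    PRETENDING = "pretending"
    ATTENTION_SHIFTING = "attention_shifting"
    PRIVILEGE_ESCALATION = "privilege_escalation"


# Hypothetical cue phrases for illustration; real labeling needs human review.
CUES = {
    AttackClass.PRETENDING: ("you are now", "act as", "stay in character"),
    AttackClass.ATTENTION_SHIFTING: (
        "write a story",
        "base64",
        "translate the following",
    ),
    AttackClass.PRIVILEGE_ESCALATION: (
        "ignore all instructions",
        "developer mode",
        "restrictions have been removed",
    ),
}


def tag_prompt(prompt: str) -> set[AttackClass]:
    """Return every attack class whose cue phrases appear in the prompt."""
    lowered = prompt.lower()
    return {
        attack_class
        for attack_class, cues in CUES.items()
        if any(cue in lowered for cue in cues)
    }
```

Tagging each test prompt this way makes it possible to report success rates per class later instead of a single pass/fail result.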

What the Research Says About Success Rates

The Shen et al. study (arXiv:2308.03825) is the most comprehensive empirical dataset available. Across 107,250 test samples run against six production LLMs, the five highest-performing jailbreak prompts all exceeded a 0.95 attack success rate on both GPT-3.5 and GPT-4, meaning fewer than one in twenty attempts failed. One of those prompts had been circulating publicly for over 240 days before the study’s cutoff.
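Attack success rate is a simple ratio of successful attempts to total attempts, so a 0.95 ASR leaves at most a 5% failure rate. When you measure ASR on your own, much smaller sample, it helps to report an interval alongside the point estimate. The sketch below uses the Wilson score interval; that choice, and the function names, are assumptions of this example and are not taken from the study.

```python
import math


def attack_success_rate(successes: int, attempts: int) -> float:
    """ASR = successful attempts / total attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def wilson_interval(
    successes: int, attempts: int, z: float = 1.96
) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = attack_success_rate(successes, attempts)
    z2 = z * z
    denom = 1 + z2 / attempts
    center = (p + z2 / (2 * attempts)) / denom
    half = (
        z
        * math.sqrt(p * (1 - p) / attempts + z2 / (4 * attempts * attempts))
        / denom
    )
    return center - half, center + half
```

With 95 successes in 100 attempts the point estimate is 0.95, but the interval is wide enough that a claim like "above 0.9" would not be safe from that sample alone.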

Topic-level breakdowns are the more actionable data for red teamers. Political lobbying scenarios returned a 0.855 ASR; legal opinion queries hit 0.794. These are categories where the model’s safety training is likely less aggressive than it is for chemical synthesis or bioweapons, because the harm model is less clear-cut. The implication: if you’re testing a legal-advice or political-commentary use case, jailbreak resistance is materially weaker than the model’s general reputation suggests.

The 131 jailbreak communities the study found weren’t all independent — they clustered by sophistication. Basic communities (DAN-class prompts, direct persona injection) move fast and hit broad targets. Advanced communities combine multiple techniques, run longer prompts, and appear to test systematically before publishing. The distribution of prompts is increasingly shifting from Reddit and Discord threads to prompt-aggregation sites, which function as a kind of coordinated quality filter: prompts that survive community testing propagate; those that don’t get discarded. The attack surface is not random; it has an ecosystem.

How OpenAI’s Defenses Actually Work — and Where They Don’t

OpenAI’s published approach to hardening ChatGPT against prompt injection (documented in their Atlas browser agent hardening post) involves adversarially trained models, automated red-teaming with RL-trained attacker agents, and surrounding system-level safeguards like Watch Mode on sensitive browsing sessions. The honest part of their disclosure is the admission: “Prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully ‘solved.’”

That concession is technically correct. The core difficulty is that safety training teaches the model to recognize known patterns of harmful requests and known framings for bypassing safety. Jailbreak prompt engineering is a continuous search for framings outside the training distribution. Every patch trains against the observed attack; every new community iteration explores new framings. The adversarial dynamics don’t stop.

The defenses that have reproducible effect are layered: input-side classifiers that detect prompt patterns associated with known jailbreak families (fast and low in false positives, but with low recall on anything new); output-side content filters that evaluate responses rather than inputs (slower, but able to catch novel framings that evaded input filters); system-level confirmation gates for high-risk agent actions; and monitoring for anomalous response patterns at scale. None of these is complete on its own. Input classifiers are blind to novel framings. Output filters can be confused by indirect outputs. Confirmation gates only help when the human can evaluate the confirmation meaningfully. For defense architecture on LLM products, guardml.io’s guardrail tooling catalog covers the current implementation options in production systems.
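The layering described above can be sketched as a wrapper around the model call. Everything here is a hypothetical stand-in: `model`, `input_flag`, `output_flag`, and `log` are callables you would supply, and the confirmation gate for agent actions is left out to keep the example short. It is an illustration of the ordering of checks, not OpenAI's architecture.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    allowed: bool
    response: str = ""
    reasons: list[str] = field(default_factory=list)


def guarded_call(
    prompt: str,
    model: Callable[[str], str],
    input_flag: Callable[[str], bool],
    output_flag: Callable[[str], bool],
    log: Callable[[str], None],
) -> Verdict:
    """Run the input check, the model, then the output check, logging each outcome."""
    # Layer 1: cheap input-side screen for known jailbreak patterns.
    if input_flag(prompt):
        log("input_blocked")
        return Verdict(False, reasons=["input classifier matched a known pattern"])

    response = model(prompt)

    # Layer 2: output-side filter, which can catch framings the input screen missed.
    if output_flag(response):
        log("output_blocked")
        return Verdict(False, reasons=["output filter flagged the response"])

    # Layer 3: every outcome is logged so anomalies can be monitored at scale.
    log("allowed")
    return Verdict(True, response)
```

The input check runs first because it is the cheapest place to stop a known attack. The output check still runs on everything that passes, since it is the only layer here that can respond to a novel framing.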

For tracking which specific jailbreak variants have been disclosed, patched, or reported, ai-alert.org maintains an incident and vulnerability tracker that includes jailbreak disclosures alongside model CVEs.

What This Changes for the Engagement Playbook

The empirical data forces a reframe of how you scope an LLM red team engagement. “Is this model jailbreakable” is the wrong question. The right questions are: which attack class has the highest ASR against this specific model in this specific deployment context, and what is the harm surface for the high-probability outputs?

A 0.85 ASR against political-lobbying queries is a different risk than a 0.30 ASR against synthesis routes for controlled substances. The harm model has to be in the loop when you’re reporting to a customer. If you only report “we jailbroke it,” you’ve described the attack. If you report “Privilege Escalation jailbreaks achieved 80%+ success on queries in the regulatory-advice category, and the outputs were plausible enough to deceive a non-expert user,” you’ve described the risk.
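A report framed this way needs results broken out by attack class and topic rather than one overall number. The sketch below shows one way to aggregate trial records into that shape. The `Trial` record and its field names are hypothetical, and any figures you feed it would come from your own engagement data.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Trial(NamedTuple):
    attack_class: str  # e.g. "privilege_escalation"
    topic: str         # e.g. "regulatory_advice"
    success: bool      # did the model produce the disallowed output?


def asr_by_class_and_topic(
    trials: Iterable[Trial],
) -> dict[tuple[str, str], float]:
    """Return the attack success rate for each (attack_class, topic) pair."""
    counts: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])
    for trial in trials:
        bucket = counts[(trial.attack_class, trial.topic)]
        bucket[0] += int(trial.success)  # successes
        bucket[1] += 1                   # attempts
    return {key: wins / total for key, (wins, total) in counts.items()}
```

Pairing each cell of that table with a short harm assessment is what turns the output from a description of the attack into a description of the risk.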

Concretely, for any production LLM deployment your team is assessing:

Run all three attack classes — Pretending, Attention Shifting, Privilege Escalation — against the high-harm topic categories, not just the obvious ones. Legal advice, political content, and financial guidance are systematically undertested.
Test at conversation depth, not just single-turn. Multi-turn escalation (establishing context over several turns before the harmful request) is underweighted in most automated evaluations.
Note persistence — if an attack works on the first run and fails on the second, the defense is likely probabilistic, not structural. Probabilistic defenses that fail 5% of the time at scale are still failed defenses.
Document model version. GPT-4o, GPT-4-turbo, and o-series models have different attack surface profiles. A jailbreak that fails on the latest checkpoint often works on older versions still accessible via the API; customers who haven’t pinned their model version are running a mixed environment.
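The last three items (conversation depth, persistence, and model version) can be captured in one small harness. This is a sketch under assumptions: `send` is a hypothetical callable that takes a model version and the conversation history and returns the reply, `judge` is a hypothetical callable that decides whether the final reply counts as a successful attack, and neither corresponds to a specific vendor API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

History = list[tuple[str, str]]  # (role, text) pairs


@dataclass
class RunRecord:
    model_version: str
    turns: int
    attempts: int
    successes: int

    @property
    def persistence(self) -> float:
        """Fraction of repeated runs in which the attack succeeded."""
        return self.successes / self.attempts


def run_case(
    conversation: Sequence[str],
    send: Callable[[str, History], str],
    judge: Callable[[str], bool],
    model_version: str,
    attempts: int = 10,
) -> RunRecord:
    """Replay a multi-turn conversation several times against one model version."""
    successes = 0
    for _ in range(attempts):
        history: History = []  # fresh conversation for every attempt
        reply = ""
        for user_turn in conversation:
            history.append(("user", user_turn))
            reply = send(model_version, history)
            history.append(("assistant", reply))
        # Only the reply to the final turn is judged.
        successes += int(judge(reply))
    return RunRecord(model_version, len(conversation), attempts, successes)
```

A persistence value strictly between 0 and 1 is the signal described above: the defense is behaving probabilistically. Running the same case against each pinned model version makes the per-version differences visible in the report.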

The community producing ChatGPT jailbreak prompts has been running an adversarial optimization loop since December 2022. The attack library is three years old. Your testing methodology should be at least as systematic as theirs.

Sources

“Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models (arXiv:2308.03825). Shen et al. The primary empirical dataset: 1,405 prompts, 131 communities, 107,250 test samples, 0.95+ ASR for the top five prompts against GPT-3.5 and GPT-4. https://arxiv.org/abs/2308.03825
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study (arXiv:2305.13860). Liu et al. The taxonomic framework: 10 patterns, three strategic categories (Pretending, Attention Shifting, Privilege Escalation), 3,120 jailbreak questions across 8 prohibited scenarios. https://arxiv.org/abs/2305.13860
OpenAI: Continuously hardening ChatGPT Atlas against prompt injection attacks. OpenAI’s disclosure of their automated red-teaming pipeline, adversarially trained models, and the honest acknowledgment that prompt injection “is unlikely to ever be fully solved.” https://openai.com/index/hardening-atlas-against-prompt-injection/
