Jailbreak LLM: Automated Attacks, Attacker-LLM Pipelines, and the Transferability Problem
How automated jailbreak LLM techniques like TAP use attacker LLMs to iteratively crack target models, why success transfers across model families, and what that means for red team practice.
To jailbreak an LLM is to cause a deployed language model to produce output its alignment training was designed to prevent — harmful instructions, restricted capabilities, policy violations — without modifying the model’s weights. The technique is not new, but the pipeline has changed dramatically. What started as manually crafted “DAN” prompts on Reddit has become an automated research discipline with conference papers at NeurIPS and USENIX Security, measured success rates against GPT-4o and Gemini Ultra, and tools that use one LLM to attack another. If your red team evaluation relies on static prompt lists, you are testing against the 2022 attack surface, not the current one.
How Automated Jailbreak LLM Pipelines Work
The foundational insight behind modern automated jailbreaking is that the most capable available model for crafting adversarial prompts is another LLM. Manual prompt engineering requires a human who understands the target’s refusal patterns and can iterate. An attacker LLM can do that faster and at scale.
Tree of Attacks with Pruning (TAP), published by Mehrotra et al. and accepted at NeurIPS 2024, is the clearest articulation of this architecture. TAP uses a separate attacker LLM to iteratively generate and refine candidate prompts against a target model, requiring only black-box API access. The attack runs as a tree search: each node is a prompt variant, the attacker LLM generates child nodes by refining each candidate, and a pruning step removes branches that are unlikely to succeed before they ever reach the target. The target model is queried only for prompts that pass pruning.
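As a concrete illustration, here is a minimal sketch of that control flow. The three model calls (attacker, on-topic judge, scoring judge, target) are hypothetical stand-ins for API clients, and the branching, width, and depth parameters are illustrative rather than the paper's defaults:

```python
# Minimal sketch of a TAP-style tree search. All four callables are
# hypothetical stand-ins; the actual method is specified in
# https://arxiv.org/abs/2312.02119.

def attacker_llm(goal: str, prior_prompt: str, feedback: str) -> str:
    """Hypothetical: asks the attacker model for a refined prompt."""
    ...

def judge_on_topic(goal: str, prompt: str) -> bool:
    """Hypothetical: flags candidates that drifted off the objective."""
    ...

def judge_score(goal: str, response: str) -> int:
    """Hypothetical: rates a target response 1-10 for goal completion."""
    ...

def target_llm(prompt: str) -> str:
    """Hypothetical: queries the black-box target model."""
    ...

def tap_search(goal, branching=4, width=10, depth=10, success=10):
    # Each frontier node: (candidate prompt, feedback from last response).
    frontier = [(goal, "")]
    for _ in range(depth):
        # Branch: the attacker LLM refines every surviving candidate.
        children = [
            attacker_llm(goal, prompt, feedback)
            for prompt, feedback in frontier
            for _ in range(branching)
        ]
        # First pruning pass: drop off-topic branches BEFORE they cost
        # a target query. This is what keeps the query budget low.
        children = [c for c in children if judge_on_topic(goal, c)]
        scored = []
        for candidate in children:
            response = target_llm(candidate)
            score = judge_score(goal, response)
            if score >= success:
                return candidate  # working jailbreak found
            scored.append((score, candidate, response))
        # Second pruning pass: keep only the top-`width` branches.
        scored.sort(reverse=True, key=lambda t: t[0])
        frontier = [(c, resp) for _, c, resp in scored[:width]]
    return None  # budget exhausted without success
```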
The result is efficiency: TAP finds jailbreaks for more than 80% of tested harmful prompts against GPT-4-Turbo and GPT-4o, and it does this while querying the target model far fewer times than gradient-based alternatives. Against GPT-4o specifically, TAP outperformed the prior state-of-the-art black-box method (PAIR) by finding jailbreaks for 16% more prompts while requiring 60% fewer target queries. The pruning step is not decorative — it is the mechanism that makes the approach practical at scale.
The attacker-LLM paradigm also handles defenses more adaptively than static prompts. When a candidate prompt triggers a refusal, the attacker LLM receives that feedback and adjusts. It can re-frame the request, embed it in a fictional context, add legitimacy framing, or shift from direct to indirect phrasing — the same repertoire a skilled human red teamer uses, executed in seconds rather than hours.
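A sketch of what that feedback payload might look like when handed back to the attacker LLM. The wording below is illustrative, not TAP's published attacker system prompt:

```python
# Illustrative refinement request for the attacker LLM; the exact
# phrasing is an assumption, not the published TAP prompt.
def build_refinement_request(goal, last_prompt, target_response):
    return (
        f"Objective: {goal}\n"
        f"Previous attempt: {last_prompt}\n"
        f"Target response: {target_response}\n"
        "The target refused. Propose a revised prompt that pursues the "
        "same objective with a different framing, e.g. a fictional "
        "context, indirect phrasing, or added legitimacy framing."
    )
```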
Why Attacks Transfer Across Model Families
A recurring finding in jailbreak LLM research is that prompts crafted against one model work on others. TAP achieves high transfer rates across GPT-4 variants, Claude models, and Llama 2/3. This is not obvious: different model families have different refusal training, different system prompts, and different RLHF reward models. Why does the attack work across them?
The Yi et al. 2024 survey attributes transferability to shared structural properties of large language models. Frontier models trained on similar corpora with similar architectures learn similar feature representations for the same semantic content. A prompt that reaches prohibited content by constructing a fictional research context is not exploiting a GPT-specific quirk; it is exploiting the general property that instruction-following models prioritize in-context coherence. Refusal training patches specific surface patterns; it does not modify the underlying representational geometry.
Transferability has a direct operational implication: if an attacker can access any sufficiently capable open-weight model, they can use it to develop attack prompts that work against proprietary closed models. They do not need API access to the target to develop attacks. They need it only to verify and deploy. This inverts the usual assumption that API rate limiting and access controls meaningfully constrain attack development. For a running catalog of reproducible attack patterns — including the cross-family transfer demonstrations behind these findings — aiattacks.dev tracks documented techniques and their measured success rates.
The Anthropic many-shot jailbreaking research, published at NeurIPS 2024, demonstrates a similar cross-model effect through a different mechanism. By filling a long context window with hundreds of fabricated dialogues in which an AI assistant complies with harmful requests, then appending the real query, attackers can drive compliance rates from near zero to over 60% on some request categories. The attack was verified across Claude 2.0, GPT-4, GPT-3.5, Llama 2 (70B), and Mistral 7B. The power-law relationship between shot count and compliance rate holds across all of them.
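Structurally, the attack is simple to express. A sketch of the prompt assembly, with placeholder dialogue contents standing in for the fabricated compliant exchanges the paper describes:

```python
# Structural sketch of a many-shot context: N fabricated user/assistant
# exchanges followed by the real query. Dialogue contents here are
# placeholders; the published attack uses hundreds of compliant shots.
def build_many_shot_prompt(fabricated_dialogues, real_query):
    shots = []
    for user_msg, assistant_msg in fabricated_dialogues:
        shots.append(f"User: {user_msg}\nAssistant: {assistant_msg}")
    # Compliance rises roughly as a power law in len(fabricated_dialogues).
    return "\n\n".join(shots) + f"\n\nUser: {real_query}\nAssistant:"
```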
What the Automated Pipeline Looks Like in Practice
For a red team engagement targeting an LLM-based product, the automated pipeline has three stages:
Stage 1 — Objective specification. Define the target behaviors: which content categories the model is trained to refuse, what would constitute a policy violation in this deployment, and what the business impact of a successful bypass would be. This is where engagement scope is set and where harmful-request benchmarks like AdvBench or a custom prompt set are assembled.
Stage 2 — Attack generation. Run an attacker LLM (typically a capable open model) against the prompt set using a TAP-style iterative refinement loop. The attacker LLM receives each refusal as feedback and generates the next candidate. Track the attack variants that succeed for later analysis and signature development.
Stage 3 — Transfer and verification. Apply successful attack prompts to the production target. Measure attack success rate, note which prompt families transfer cleanly and which require additional refinement. Document the bypassed content categories, the prompt structures that succeeded, and any patterns in how the model’s refusals were overcome.
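A minimal sketch of the stage-3 bookkeeping, assuming hypothetical `target_llm` and `is_policy_violation` callables, that reduces the transfer run to per-category attack success rates:

```python
# Stage-3 sketch: measure attack success rate (ASR) per content
# category when transferring dev-model prompts to the production
# target. Both callables are hypothetical stand-ins.
from collections import defaultdict

def measure_transfer(attack_prompts, target_llm, is_policy_violation):
    hits, totals = defaultdict(int), defaultdict(int)
    for category, prompt in attack_prompts:  # (category, prompt) pairs
        totals[category] += 1
        if is_policy_violation(target_llm(prompt)):
            hits[category] += 1
    # Per-category ASR: which content categories are most exposed.
    return {c: hits[c] / totals[c] for c in totals}
```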
The output of this process is a set of verified attack prompts with measured success rates, a picture of which content categories are most exposed, and enough structural information to inform both defensive tuning and monitoring rule development.
For teams that want to verify their guardrail coverage against these attack classes, aidefense.dev maintains current information on RASP-style guardrail implementations and red-team tooling. For tracking newly disclosed jailbreak techniques as they enter the wild, ai-alert.org aggregates incident disclosures and research drops across model families. Because automated jailbreak pipelines often start as prompt-injection variants before settling into a dedicated jailbreak corpus, promptinjection.report is worth tracking in parallel — it maintains a living taxonomy of injection and jailbreak overlap, including the indirect-injection chains that seed many TAP-style attacks.
Defense: What Actually Closes These Attack Classes
OWASP’s LLM Top 10 2025 places prompt injection — the category that encompasses jailbreaking — at position one and is explicit that no foolproof prevention exists. That is the honest baseline. Against automated TAP-style attacks, the defense surface is:
Prompt classifiers before the primary model. A separate classifier model running on every input can flag adversarial structure — long context pre-loading, roleplay framing with explicit constraint removal, encoding anomalies. Anthropic’s own research found that pre-classification reduced many-shot jailbreak success rates from 61% to 2% on tested prompts. The cost is latency and false positives on legitimate long-context uses.
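A minimal sketch of the gating pattern, with `classifier_score` as a hypothetical classifier call and the threshold as a deployment-specific tuning knob:

```python
# Sketch of a pre-classification gate: a separate classifier screens
# every input before the primary model sees it. `classifier_score` is
# a hypothetical model call returning an adversarial-structure risk
# score in [0, 1].
RISK_THRESHOLD = 0.8  # tune against false positives on benign traffic

def gated_completion(user_input, classifier_score, primary_llm):
    if classifier_score(user_input) >= RISK_THRESHOLD:
        # Refuse, or route to human review, before the primary model runs.
        return "Request flagged by input screening."
    return primary_llm(user_input)
```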
Conversation-level monitoring. Single-turn filters miss attacks that build across turns. Session-level analysis tracking semantic drift, topic escalation, and compliance drift can catch multi-turn attacks that look benign at each individual step. This matters more for agentic deployments where the conversation surface is long by design.
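One way to sketch that session-level tracking, assuming a hypothetical `embed` function backed by any sentence-embedding model:

```python
# Sketch of session-level drift tracking: compare each turn to the
# session's opening topic and flag sustained movement away from it,
# even when every individual step looks benign. `embed` is a
# hypothetical sentence-embedding function.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def session_drifted(turns, embed, drift_threshold=0.45, window=3):
    anchor = embed(turns[0])
    sims = [cosine(anchor, embed(t)) for t in turns]
    # Flag when the last `window` turns have all drifted far from the
    # opening topic; thresholds here are illustrative.
    recent = sims[-window:]
    return len(recent) == window and max(recent) < drift_threshold
```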
Output classifiers as backstop. The output classifier tests whether what the model actually said violates policy, regardless of how the input was structured. It catches attacks that bypass input-side defenses but still produce prohibited content.
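The backstop itself is a small amount of plumbing. A sketch, with `violates_policy` as a hypothetical output classifier:

```python
# Sketch of the output-side backstop: classify what the model actually
# said, independent of how the input was structured.
def safe_completion(user_input, primary_llm, violates_policy):
    response = primary_llm(user_input)
    if violates_policy(response):
        # Input-side defenses were bypassed; the output gate still holds.
        return "Response withheld by output screening."
    return response
```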
Rate limiting and structural anomaly detection. TAP-style automated pipelines query the target model many times in rapid succession with structurally similar prompts. Rate limiting and query structure analysis can detect and throttle the attack development phase before the attacker finds a working prompt.
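A sketch of one cheap detector for that pattern: shingle-set Jaccard similarity over a sliding window of recent prompts, flagging bursts of near-duplicates. The window size and thresholds are illustrative:

```python
# Sketch of structural anomaly detection: TAP-style pipelines emit
# many structurally similar prompts in quick succession. Word-shingle
# Jaccard similarity over recent queries is one cheap way to spot it.
from collections import deque

def shingles(text, n=3):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

class SimilarityThrottle:
    def __init__(self, window=50, sim_threshold=0.6, max_similar=5):
        self.recent = deque(maxlen=window)  # shingle sets of past prompts
        self.sim_threshold = sim_threshold
        self.max_similar = max_similar

    def should_throttle(self, prompt):
        s = shingles(prompt)
        similar = sum(
            1 for prior in self.recent
            if s and prior
            and len(s & prior) / len(s | prior) >= self.sim_threshold
        )
        self.recent.append(s)
        # Many near-duplicates in a short window looks like automated
        # attack-prompt refinement, not organic traffic.
        return similar >= self.max_similar
```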
None of these individually stops all attack classes. The published literature on jailbreak LLM defenses converges on the same conclusion: layered controls compensate for the gaps in each individual layer. Treating alignment training as the only defense leaves application-layer exposure unaddressed. For teams pulling jailbreak-adjacent CVEs into vendor-risk reviews and patch SLAs, mlcves.com tracks disclosed machine-learning vulnerabilities and ties them back to the attack classes covered above.
Sources
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically (Mehrotra et al., NeurIPS 2024) — https://arxiv.org/abs/2312.02119. Primary source for TAP architecture, success rates against GPT-4-Turbo and GPT-4o, pruning mechanics, and query efficiency comparisons.
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey (Yi et al., 2024) — https://arxiv.org/abs/2407.04295. Taxonomy of black-box and white-box attacks, defense classification framework, and analysis of transferability mechanisms.
- Many-shot jailbreaking (Anthropic Research / NeurIPS 2024) — https://www.anthropic.com/research/many-shot-jailbreaking. Long-context attack mechanics, power-law scaling of compliance with shot count, cross-model verification, and pre-classification mitigation data.
- LLM01:2025 Prompt Injection — OWASP Gen AI Security Project — https://genai.owasp.org/llmrisk/llm01-prompt-injection/. Canonical risk classification, mitigation strategies including guardrail model patterns, and explicit acknowledgment of prevention limits.