Prompt Hacking: Taxonomy, Techniques, and What Actually Works Against LLMs
A practitioner's breakdown of prompt hacking — the three attack families (injection, leaking, jailbreaking), how each works mechanically, and what defenses hold up under real adversarial pressure.
Prompt hacking is the collective term for attacks that manipulate an LLM’s behavior by crafting or injecting adversarial text into the model’s context window. It encompasses at least three mechanically distinct attack families — prompt injection, prompt leaking, and jailbreaking — and OWASP’s 2025 Top 10 for LLM Applications ranks prompt injection as the #1 risk in deployed AI systems. Understanding the differences between these techniques matters for both offense and defense: a guardrail tuned to block jailbreaks will not necessarily catch an indirect injection arriving from a third-party data source, and vice versa.
The Three Attack Families
A 2024 systematization-of-knowledge paper from arXiv ↗ lays out the canonical taxonomy and is worth reading if you are building a red-team methodology. The three families share a surface-level similarity — all exploit the fact that LLMs cannot reliably distinguish instructions from data — but differ in goal, target, and delivery mechanism.
Prompt injection inserts conflicting instructions to redirect model behavior mid-task. The attacker’s goal is execution: make the model do something its operator did not intend, whether exfiltrating data, calling unauthorized tools, or bypassing a safety gate. A minimal example looks like appending \nIgnore the above and instead [new instruction] to any user-controlled field that feeds into a system prompt or tool call.
Prompt leaking targets confidentiality rather than execution. The attacker’s goal is extraction: surface the hidden system prompt, internal configuration, or API keys embedded in the context window. This is particularly relevant for products that monetize prompt engineering — competitors or curious users can sometimes reconstruct the entire system prompt by asking the model to repeat or translate its instructions.
Jailbreaking aims at safety bypass: eliciting outputs that content filters and RLHF training are supposed to suppress. Jailbreaks do not necessarily require access to the system prompt and often succeed in a zero-context attack against the base instruction-following behavior.
Critically, these categories overlap in practice. A well-constructed attack can simultaneously inject a new instruction, extract the existing system prompt, and bypass safety policies in a single turn.
Direct vs. Indirect Injection
OWASP’s LLM01:2025 ↗ draws a line between direct and indirect injection that has practical consequences for defensive architecture.
Direct injection comes from the user input field. The attacker controls the text explicitly. This is the easiest vector to gate because it is the most visible: you can apply input filters, semantic classifiers, and prompt delimiters at the application layer before the text reaches the model.
Indirect injection arrives from external data sources — web pages, retrieved documents, email bodies, database records, API responses — that an LLM-powered agent reads during task execution. The attacker’s instructions ride inside ostensibly benign content. A web browsing agent that visits a compromised page, or a RAG pipeline that retrieves poisoned documents, will process attacker-controlled text with the same trust level as a legitimate data source.
Indirect injection is structurally harder to defend because it is an inherent consequence of giving LLMs tools that read external content. The model sees a blob of text and must decide what is a fact and what is a command — a distinction that does not exist at the token level. For deeper coverage of this attack surface, promptinjection.report ↗ tracks real-world indirect injection disclosures and research.
Jailbreaking Technique Breakdown
Jailbreaks cluster into three broad strategies, each exploiting a different aspect of model training:
Pretending / persona adoption. Instruct the model to adopt a character (“DAN”, “developer mode”, “evil AI”) that does not share the base model’s alignment constraints. The model’s instruction-following behavior conflicts with its safety training, and in many cases instruction-following wins, especially against older RLHF tuned models. Effectiveness has declined against frontier models but still works in agentic contexts where persona framing is routine.
Attention shifting. Rather than directly requesting the forbidden output, reframe the task: ask for fictional scenarios, code translations, academic descriptions, or hypothetical reasoning chains that encode the target output. Multi-turn variants are particularly effective — steer the model over several conversational turns toward the target without triggering safety classifiers on any single message. Research has demonstrated bypass rates above 90% against published defenses using multi-turn approaches.
Privilege escalation. Convince the model that the attacker has elevated authority — posing as a developer, system administrator, or Anthropic/OpenAI employee with override permissions. Effectiveness depends heavily on how the target application’s system prompt frames authority hierarchies.
For a comprehensive catalog of active techniques and community-documented bypasses, jailbreaks.fyi ↗ maintains running research and field reports.
Why Defenses Keep Failing
OWASP acknowledges in its 2025 guidance that “it is unclear if there are fool-proof methods of prevention” for prompt injection. This is not hedging — it is an accurate description of the current state. Several structural reasons explain the persistence of the problem:
Instruction-data conflation. LLMs process all context tokens uniformly. The architectural distinction between “this is a system instruction” and “this is user data” is enforced only by position conventions, delimiter strings, and training, none of which are cryptographically enforced. An attacker who controls any text that enters the context window has a meaningful surface to work with.
Goodhart’s Law applied to safety training. RLHF and constitutional AI training optimizes against specific patterns of harmful outputs. Adversaries iterate against the resulting classifiers. The result is a cat-and-mouse loop where new jailbreak variants consistently outpace safety updates for any given model version.
Defense-in-depth gaps. Most deployed applications apply a single defensive layer — usually an output filter or a system prompt with instructions to “ignore jailbreak attempts.” The OWASP guidance recommends seven distinct controls including least-privilege tool access, human-in-the-loop for high-risk actions, and output format validation with deterministic parsers. In practice, few production deployments implement more than two.
What to Add to Your Test Playbook
For teams building an AI red team engagement or hardening an LLM-powered product:
-
System prompt extraction first. Before attempting jailbreaks or injections, establish whether the system prompt is leakable. This maps the application’s trust model and reveals what guardrails are in place. Direct extraction attempts (asking the model to repeat or translate its instructions) are often successful in poorly configured deployments.
-
Test indirect injection paths. Identify every external data source the application ingests — URLs, documents, email, RAG corpora — and test whether adversarial instructions embedded in those sources propagate to model behavior. This is the attack surface most commonly missed in standard LLM security reviews.
-
Multi-turn before single-turn. Single-turn jailbreaks are noisy and increasingly filtered. Multi-turn approaches that gradually shift the conversation context are less likely to trigger static classifiers and more likely to succeed against RLHF-tuned safety training.
-
Audit tool call permissions. In agentic deployments, a successful prompt injection is only useful if the model has tools to abuse. Map what actions the model can take — API calls, file writes, outbound requests — and assess what a successful injection could trigger. Least-privilege tool access is the highest-leverage defensive control.
The MDPI comprehensive review of prompt injection attacks ↗ documents real-world incidents across AI agent systems and is useful for building a threat model that covers production failure modes rather than just laboratory demonstrations.
Sources
- SoK: Prompt Hacking of Large Language Models ↗ — Systematization-of-knowledge paper proposing a five-class response taxonomy and unified taxonomy of prompt hacking attack types.
- OWASP LLM01:2025 Prompt Injection ↗ — Authoritative industry risk classification covering direct and indirect injection, recommended mitigations, and the limits of current defenses.
- Prompt Hacking Introduction — Learn Prompting ↗ — Accessible taxonomy covering prompt injection, leaking, and jailbreaking with worked examples.
- Prompt Injection Attacks in LLMs and AI Agent Systems (MDPI) ↗ — Peer-reviewed comprehensive review spanning simple jailbreaking to multi-stage agent exploits, with documented real-world incidents.
→ See also: Direct vs. Indirect Prompt Injection to understand the distinction between user-controlled and external-content attack surfaces. promptinjection.report ↗ maintains a living taxonomy of prompt hacking variants and defenses. For CVEs related to real-world prompt injection incidents, mlcves.com ↗ tracks disclosed vulnerabilities in LLM systems. aiattacks.dev ↗ catalogs reproducible attack patterns and proofs-of-concept for testing your own models.
Sources
AI Sec — in your inbox
Offensive AI security — prompt injection, jailbreaks, agent exploitation, red team writeups. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
LLM Prompt Injection Techniques: From Instruction Override to Agent Hijacking
A practitioner's breakdown of how LLM prompt injection payloads are constructed, why the threat class changes when agents can invoke tools, and what defenders actually need to change.
Prompt Injection Attack Delivery: Real Techniques and In-the-Wild Payload Methods
Unit 42 documented 12 prompt injection attacks in production with 22 distinct delivery techniques. Here's how attackers build payloads that reach the model — and what red teamers should actually be testing.
Prompt Injection Examples: A Practitioner's Attack Library
A technical breakdown of real prompt injection examples — direct, indirect, multimodal, and RAG-poisoning attacks — with conditions, payloads, and what actually defends against them.