LLM Security: A Practitioner's Map of the Attack Surface

LLM security is the discipline of breaking and defending applications whose trust boundary now runs through a language model. The model itself is a new kind of confused deputy: it reads attacker-controlled text out of retrieved documents, tool outputs, and user input, then decides which of those tokens are “instructions.” Classical web appsec assumed code and data were separable. In a chat agent wired to a vector store, an email inbox, and a shell tool, they are not. That is the entire problem in one sentence, and almost every interesting LLM security finding follows from it.

This post is a working map of the territory, written for people who actually test these systems. We point at the published frameworks where they help, and at where they paper over hard cases.

The attack surface, in the order we test it

The current consensus checklist is the OWASP Top 10 for LLM Applications 2025 ↗. It is a useful taxonomy, but treat it as a coverage map, not a methodology. Walking it from the top:

LLM01 Prompt Injection. Still the headline. Direct prompt injection (the user types adversarial input) is a distraction in modern engagements; the real bug class is indirect injection, formalized by Greshake et al. in Not What You’ve Signed Up For ↗. The attacker never types into the chat. They write a payload into a document, web page, calendar invite, image alt-text, or PDF metadata, and wait for the model to ingest it. We exfiltrate via tool calls or markdown image beacons; we pivot via memory writes; we worm by having the model produce content that re-injects when re-read. If a system fetches arbitrary external content, this is where the budget goes. See promptinjection.report ↗ for an evolving catalog of in-the-wild techniques.

LLM02 Sensitive Information Disclosure. System prompts, fine-tuning artifacts, RAG corpus contents, and per-tenant secrets. The interesting finding is rarely “leaked the system prompt” — that is table stakes. It is leaked-via-tool: getting the model to call search or fetch with the secret as a query parameter so it lands in a third-party log.

LLM03 Supply Chain. Models pulled from public hubs, adapters merged without provenance, datasets of dubious origin. Pickle deserialization in checkpoint files remains a credible RCE vector; so does poisoning of dependency manifests in agent tools.

LLM04 Data and Model Poisoning. Training-time and fine-tune-time. Most engagements will not exercise this directly, but if the target retrains on user feedback or RAG-ingests internal wikis, you can plant payloads now and harvest later.

LLM05 Improper Output Handling. Markdown rendering of model output that contains active links, HTML, or javascript: URIs. Tool-call arguments concatenated into shell commands. The model output is untrusted input; treat it that way at every sink.

LLM06 Excessive Agency. Over-permissioned tools, broad scopes on connectors, agents that can email or git push without confirmation. This is where a successful injection becomes an incident.

LLM07 System Prompt Leakage. Promoted to its own slot in 2025 because teams kept putting credentials and authorization logic in system prompts and assuming opacity. They are not opaque.

LLM08 Vector and Embedding Weaknesses. Cross-tenant leakage in shared vector stores, embedding inversion, and retrieval poisoning. If RAG is in scope, write adversarial documents that score high on the target queries.

LLM09 Misinformation. Hallucinated package names that attackers then publish (slopsquatting), fabricated legal citations, confidently-wrong code. Less a security bug than a downstream attack enabler.

LLM10 Unbounded Consumption. Token-flooding, recursive tool calls, model-as-a-DDoS-amplifier. Cheap to test, often missed.

The frameworks worth knowing

Three documents are doing real work in this space. Read them in this order.

MITRE ATLAS ↗ is the closest thing to ATT&CK for AI systems. As of the v5.4.0 update it catalogs sixteen tactics and over eighty techniques specific to AI/ML, with case studies drawn from real incidents. For red team reporting, mapping findings to ATLAS technique IDs gives blue-team counterparts a vocabulary they can act on. Recent additions cover agent-specific techniques like poisoned tool publishing and sandbox escape from agent runtimes.

The NIST AI Risk Management Framework Generative AI Profile (NIST AI 600-1) ↗ is the governance-side companion. It enumerates twelve risks unique or amplified by generative AI — confabulation, CBRN information uplift, data privacy, value chain integrity, and others — and maps each to suggested actions across the Govern, Map, Measure, Manage functions of the underlying AI RMF 1.0. If your client has a CISO who reports to a board, this is the document their AI policy will be measured against.

OWASP, ATLAS, and NIST do not perfectly align. ATLAS thinks in adversary tactics; OWASP thinks in vulnerability classes; NIST thinks in organizational risk. A full assessment cites all three because each catches things the others miss.

Defenses that hold up under engagement

Most “LLM security” vendor pitches reduce to one of four control patterns. Ranked by how often they survive contact with a real adversary:

Reduce blast radius. Scope tools narrowly. Require human-in-the-loop for any irreversible action. Run agents in sandboxes that cannot reach the internal network. This is the only category that defends against techniques nobody has invented yet.
Output validation at the sink. Strip HTML and markdown image tags before rendering. Validate tool-call arguments against strict schemas. Re-authenticate the user before privileged actions, regardless of what the model “decided.”
Input filtering. Classifiers, allow-lists for retrieved content, and structural boundaries between system, user, and tool messages. Useful as defense in depth; never sufficient alone, because injection is fundamentally a problem of distributional overlap between instructions and data.
Detection and observability. Log every tool call and every retrieval. Alert on anomalous tool argument distributions. Tools like the ones tracked at guardml.io ↗ can catch obvious payloads in flight, but the durable win is having traces good enough to reconstruct what the agent actually did when something goes wrong.

What this changes about the engagement playbook

Three updates worth making before your next AI red team. Add an indirect-injection corpus to your testing toolkit — a folder of documents, emails, and web pages with known-effective payloads, ready to plant in whatever data sources the target ingests. Include tool-call exfiltration probes; the model does not need to print the secret if it can search for it. And spend at least one day on excessive-agency testing: enumerate every tool, every scope, every connector, and ask what the worst-case successful injection lets you do. The defenders who get this right are the ones who treated their LLM as a partially-trusted user from day one.

Sources

OWASP Top 10 for LLM Applications 2025 ↗ — the working vulnerability taxonomy used across the industry; updated to reflect agent and RAG-era threats.
NIST AI 600-1, Generative AI Profile ↗ — the governance-side companion to the AI RMF, with around 200 suggested actions across twelve generative-AI risks.
MITRE ATLAS ↗ — adversary-tactic knowledge base for AI systems, modeled on ATT&CK; useful for engagement reporting and coverage mapping.
Greshake et al., Not What You’ve Signed Up For (arXiv:2302.12173) ↗ — the foundational indirect prompt injection paper, with working exploits against production LLM-integrated applications.

LLM Security: A Practitioner's Map of the Attack Surface

The attack surface, in the order we test it

The frameworks worth knowing

Defenses that hold up under engagement

What this changes about the engagement playbook

Sources

Sources

AI Sec — in your inbox

Related

Why Your Prompt Injection Guardrails Fail: A Practitioner's Tour of Bypass Classes

FlashRT: Optimization-Based LLM Red-Teaming Without the 264 GB GPU Bill

OSCP and CEH in 2026: What Carries Over to AI Red Teaming

Comments