AI Sec
red-team

LLM Security: A Practitioner's Map of the Attack Surface

What LLM security actually means in 2026 — the attack classes red teamers test, the controls that hold up under fire, and the frameworks that map the territory.

By AI Sec Editorial · · 8 min read

LLM security is the discipline of breaking and defending applications whose trust boundary now runs through a language model. The model itself is a new kind of confused deputy: it reads attacker-controlled text out of retrieved documents, tool outputs, and user input, then decides which of those tokens are “instructions.” Classical web appsec assumed code and data were separable. In a chat agent wired to a vector store, an email inbox, and a shell tool, they are not. That is the entire problem in one sentence, and almost every interesting LLM security finding follows from it.

This post is a working map of the territory, written for people who actually test these systems. We point at the published frameworks where they help, and at where they paper over hard cases.

The attack surface, in the order we test it

The current consensus checklist is the OWASP Top 10 for LLM Applications 2025. It is a useful taxonomy, but treat it as a coverage map, not a methodology. Walking it from the top:

LLM01 Prompt Injection. Still the headline. Direct prompt injection (the user types adversarial input) is a distraction in modern engagements; the real bug class is indirect injection, formalized by Greshake et al. in Not What You’ve Signed Up For. The attacker never types into the chat. They write a payload into a document, web page, calendar invite, image alt-text, or PDF metadata, and wait for the model to ingest it. We exfiltrate via tool calls or markdown image beacons; we pivot via memory writes; we worm by having the model produce content that re-injects when re-read. If a system fetches arbitrary external content, this is where the budget goes. See promptinjection.report for an evolving catalog of in-the-wild techniques.

LLM02 Sensitive Information Disclosure. System prompts, fine-tuning artifacts, RAG corpus contents, and per-tenant secrets. The interesting finding is rarely “leaked the system prompt” — that is table stakes. It is leaked-via-tool: getting the model to call search or fetch with the secret as a query parameter so it lands in a third-party log.

LLM03 Supply Chain. Models pulled from public hubs, adapters merged without provenance, datasets of dubious origin. Pickle deserialization in checkpoint files remains a credible RCE vector; so does poisoning of dependency manifests in agent tools.

LLM04 Data and Model Poisoning. Training-time and fine-tune-time. Most engagements will not exercise this directly, but if the target retrains on user feedback or RAG-ingests internal wikis, you can plant payloads now and harvest later.

LLM05 Improper Output Handling. Markdown rendering of model output that contains active links, HTML, or javascript: URIs. Tool-call arguments concatenated into shell commands. The model output is untrusted input; treat it that way at every sink.

LLM06 Excessive Agency. Over-permissioned tools, broad scopes on connectors, agents that can email or git push without confirmation. This is where a successful injection becomes an incident.

LLM07 System Prompt Leakage. Promoted to its own slot in 2025 because teams kept putting credentials and authorization logic in system prompts and assuming opacity. They are not opaque.

LLM08 Vector and Embedding Weaknesses. Cross-tenant leakage in shared vector stores, embedding inversion, and retrieval poisoning. If RAG is in scope, write adversarial documents that score high on the target queries.

LLM09 Misinformation. Hallucinated package names that attackers then publish (slopsquatting), fabricated legal citations, confidently-wrong code. Less a security bug than a downstream attack enabler.

LLM10 Unbounded Consumption. Token-flooding, recursive tool calls, model-as-a-DDoS-amplifier. Cheap to test, often missed.

The frameworks worth knowing

Three documents are doing real work in this space. Read them in this order.

MITRE ATLAS is the closest thing to ATT&CK for AI systems. As of the v5.4.0 update it catalogs sixteen tactics and over eighty techniques specific to AI/ML, with case studies drawn from real incidents. For red team reporting, mapping findings to ATLAS technique IDs gives blue-team counterparts a vocabulary they can act on. Recent additions cover agent-specific techniques like poisoned tool publishing and sandbox escape from agent runtimes.

The NIST AI Risk Management Framework Generative AI Profile (NIST AI 600-1) is the governance-side companion. It enumerates twelve risks unique or amplified by generative AI — confabulation, CBRN information uplift, data privacy, value chain integrity, and others — and maps each to suggested actions across the Govern, Map, Measure, Manage functions of the underlying AI RMF 1.0. If your client has a CISO who reports to a board, this is the document their AI policy will be measured against.

OWASP, ATLAS, and NIST do not perfectly align. ATLAS thinks in adversary tactics; OWASP thinks in vulnerability classes; NIST thinks in organizational risk. A full assessment cites all three because each catches things the others miss.

Defenses that hold up under engagement

Most “LLM security” vendor pitches reduce to one of four control patterns. Ranked by how often they survive contact with a real adversary:

  1. Reduce blast radius. Scope tools narrowly. Require human-in-the-loop for any irreversible action. Run agents in sandboxes that cannot reach the internal network. This is the only category that defends against techniques nobody has invented yet.
  2. Output validation at the sink. Strip HTML and markdown image tags before rendering. Validate tool-call arguments against strict schemas. Re-authenticate the user before privileged actions, regardless of what the model “decided.”
  3. Input filtering. Classifiers, allow-lists for retrieved content, and structural boundaries between system, user, and tool messages. Useful as defense in depth; never sufficient alone, because injection is fundamentally a problem of distributional overlap between instructions and data.
  4. Detection and observability. Log every tool call and every retrieval. Alert on anomalous tool argument distributions. Tools like the ones tracked at guardml.io can catch obvious payloads in flight, but the durable win is having traces good enough to reconstruct what the agent actually did when something goes wrong.

What this changes about the engagement playbook

Three updates worth making before your next AI red team. Add an indirect-injection corpus to your testing toolkit — a folder of documents, emails, and web pages with known-effective payloads, ready to plant in whatever data sources the target ingests. Include tool-call exfiltration probes; the model does not need to print the secret if it can search for it. And spend at least one day on excessive-agency testing: enumerate every tool, every scope, every connector, and ask what the worst-case successful injection lets you do. The defenders who get this right are the ones who treated their LLM as a partially-trusted user from day one.

Sources

Sources

  1. OWASP Top 10 for LLM Applications 2025
  2. NIST AI 600-1: Artificial Intelligence Risk Management Framework Generative AI Profile
  3. MITRE ATLAS — Adversarial Threat Landscape for AI Systems
  4. Greshake et al., Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
#llm-security #prompt-injection #red-team #owasp #agent-security
Subscribe

AI Sec — in your inbox

Offensive AI security — prompt injection, jailbreaks, agent exploitation, red team writeups. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments