AI Sec
A red team planning session with a system architecture diagram
red-team

AI Red Team Engagement Methodology: Scoping to Reporting

The full lifecycle of an LLM red team engagement — scoping and rules of engagement, threat modeling, the test plan by attack class, the tooling that runs it, evidence capture, and a report a model team will actually act on.

By Marcus Reyes · · 8 min read

A lot of “AI red teaming” in 2026 is someone running a jailbreak scanner, pasting the highest-severity hits into a slide, and calling it an engagement. That produces a list of payloads that worked, which the model team patches one by one, after which the same scanner finds new ones next quarter. It is a treadmill, not a security program.

A real engagement is methodology, not tooling. The tools matter — and this guide names the ones worth running — but the value is in the scoping, threat modeling, evidence, and reporting that turn “this payload worked” into “this is the architectural weakness, here is the blast radius, here is what to change.” This is the full lifecycle, written for someone who has to actually run one. It assumes you have read our coverage of the individual attack classes (the adversarial ML taxonomy, prompt-injection bypass classes, and the jailbreak techniques) — this is how you assemble them into an engagement.

Phase 1: Scope and rules of engagement

Before any payload, get three things in writing, because skipping them is how engagements turn into incidents.

  • The target boundary. Exactly which system, which model, which environment (a staging mirror or production?), and which surfaces are in scope (the chat interface only, or the tools, the RAG corpus, the API?). An LLM app is a system, not a model — be explicit about whether the agent’s tools and the retrieval pipeline are in scope, because that is usually where the real findings are.
  • Rules of engagement. What you may and may not do. Can you attempt data exfiltration with real data, or only with planted canaries? Are denial-of-service / cost-exhaustion tests permitted (they cost the client real money)? Can you touch the RAG corpus? Is social engineering of staff in scope? Get a named authorizing contact and an emergency stop procedure.
  • The goal. “Find vulnerabilities” is not a goal. “Determine whether an external user can cause the support agent to exfiltrate another customer’s data” is. Frame goals as adversary objectives tied to real business harm; it focuses the engagement and makes the report meaningful.

Microsoft’s guidance on planning LLM red teaming is a solid reference for structuring this phase — its core point is that you red team the system around the model, including its application-layer harms, not the model in isolation.

Phase 2: Threat model the target

With scope fixed, model the adversary. Two questions drive the test plan:

  • What is the attacker’s access? Anonymous external user, authenticated low-privilege user, or insider? Each unlocks different attacks. This maps directly to the knowledge/capability axes in the adversarial ML taxonomy — establish white/gray/black-box and whether the attacker can influence training or retrieval data.
  • What can the system do, and on whose authority? Enumerate the model’s tools, the data it can read, the actions it can take, and the trust boundaries between them. The highest-impact findings almost always live where the model has agency — a tool it can be tricked into calling — not where it merely emits text.

Map the resulting threats to a shared framework so your findings land in language the defender uses: MITRE ATLAS for adversary techniques and the OWASP LLM Top 10 for the vulnerability classes. Mapping at plan time, not report time, also keeps the engagement honest — it surfaces classes you would otherwise forget to test.

Phase 3: Build the test plan by attack class

Now turn the threat model into a concrete plan, organized by attack class rather than by tool. A representative plan for an LLM application with agentic tools:

  • Direct prompt injection — can a user override system instructions? Baseline, expect partial success.
  • Indirect prompt injection — plant payloads in content the model ingests (a RAG document, a web page the browsing tool fetches, an uploaded file’s hidden text). This is the highest-yield class on most engagements because so few teams inspect retrieved content. (See indirect injection in RAG pipelines.)
  • Jailbreaks — attack the safety/alignment layer (multi-turn escalation, encoding, persona attacks). Distinct from injection: this targets training, not the context boundary.
  • System-prompt and data extraction — can you leak the system prompt or another user’s data?
  • Tool / agency abuse — the crown jewel. Can injection cause the agent to call a tool with attacker-controlled arguments — send data, modify state, make a request? (See agent tool-use exfiltration.)
  • Multimodal and encoding evasion — payloads in images, PDFs, audio, or encodings the text guardrail does not inspect. (Covered in the bypass-classes tour.)
  • Cost / unbounded consumption — only if in scope; agent loops and token exhaustion.

For each, define what success looks like before you test it, so a hit is unambiguous evidence and not a judgment call you make after the fact.

Phase 4: Run it — the tooling

Tools accelerate coverage; they do not replace the manual, creative attacks that find the real architectural flaws. Use both layers.

Automated scanning for breadth:

  • garak (NVIDIA’s open-source LLM vulnerability scanner) ships dozens of probe modules across prompt injection, jailbreaks, encoding evasion, data leakage, and toxicity, against many model backends. Run it first for systematic coverage and a reproducible baseline you can re-run after fixes. Recent releases have added agent-oriented probes (for tools available to LLM agents) and a system-prompt-extraction probe.
  • PyRIT (Microsoft’s risk-identification framework) is the orchestrator for the harder, multi-turn attacks — crescendo-style escalation and attacker-LLM-driven prompt refinement — that single-shot scanners miss. Use it where the threat model includes a persistent adversary working across a conversation.

Manual attacks for depth: the scanners give breadth and reproducibility; they will not discover that your specific agent can be induced to email a customer record because of your specific tool wiring. That is manual work — chaining indirect injection into a tool call, crafting the payload to your data flow, probing the trust boundaries the threat model surfaced. The findings that change architecture come from here. The scanner finding becomes a regression test; the manual finding becomes the headline.

A discipline note: run automated tools against a staging mirror first where possible, watch the cost meter on consumption tests, and stay inside the rules of engagement at every step. An engagement that becomes an incident is a failed engagement.

Phase 5: Capture evidence as you go

The most common reason a finding gets waved off is thin evidence. Capture, for every confirmed finding:

  • The exact reproduction — the verbatim payload, the input channel (direct, RAG, file, image), the model/version, and the session state required. A finding the model team cannot reproduce will not be fixed.
  • The observed impact — what actually happened (system prompt leaked, tool called with these arguments, this canary record exfiltrated), with the raw output.
  • Reproducibility — does it work every time, or N% of the time? A probabilistic bypass is still a finding; quantify it.
  • The blast radius — who could do this (any anonymous user? authenticated only?) and what it could reach.

Use planted canaries rather than real sensitive data wherever the rules of engagement allow — a uniquely tagged fake record proves exfiltration cleanly without you handling the client’s real data.

Phase 6: Report so the model team acts

The report is the deliverable, and a list of payloads is the wrong one. A model team that does not speak security needs findings framed as risk, not as attack trivia. For each finding:

  • The root cause, not just the symptom. “The agent executes instructions found in retrieved documents because retrieved content shares a trust level with the system prompt” — an architectural statement they can fix once — beats fifteen individual injection payloads they will whack-a-mole.
  • Severity by impact and exploitability, justified. A reliable, anonymous-user, data-exfiltrating tool-abuse finding outranks a flaky jailbreak that needs an authenticated session. Borrowing the CVSS mental model (exploitability × impact) keeps severity defensible rather than vibes-based.
  • The framework mapping (ATLAS technique, OWASP LLM item) so it slots into their existing risk tracking.
  • A concrete remediation at the right layer. Most LLM findings are fixed architecturally — capability scoping at the tool layer, treating retrieved content as untrusted, output validation before action — not by adding one more guardrail regex. Say so.
  • A regression test — the reproduction, packaged so it can go into adversarial CI. The single highest-leverage thing a report can do is make the successful attack a permanent test, so the fix is verified and stays fixed.

The engagement loop

The methodology is a loop, not a one-shot: scope → threat model → plan by class → run (automated breadth + manual depth) → capture evidence → report with remediations and regression tests → re-test after fixes. The re-test closes it — re-run garak’s baseline and the manual findings against the patched system to confirm the fixes held and did not just move the bypass.

What separates a real engagement from running a scanner is everything around the scanner: the scope that keeps you legal and focused, the threat model that tells you what to test, the manual creativity that finds the architectural flaws, and the report that frames findings as fixable risk. The tools are commodity and improving fast. The methodology is the value.

Sources


→ This post is part of the AI Red Teaming Hub — the complete index of offensive AI security resources on aisec.blog.

Sources

  1. garak — LLM Vulnerability Scanner (NVIDIA)
  2. PyRIT — Python Risk Identification Tool for generative AI (Microsoft)
  3. Microsoft — Planning red teaming for large language models
  4. MITRE ATLAS — Adversarial Threat Landscape for AI Systems
  5. OWASP Top 10 for LLM Applications 2025
Subscribe

AI Sec — in your inbox

Offensive AI security — prompt injection, jailbreaks, agent exploitation, red team writeups. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments