AI Sec

The Adversarial ML Attack Taxonomy: A Red Teamer's Reference

A working taxonomy of attacks against ML systems — evasion, poisoning, privacy, and abuse — mapped to attacker knowledge and capability, grounded in the NIST AML report and the tools that actually run each attack.

By Marcus Reyes · 8 min read

“Adversarial ML” gets used as a catch-all, which is useless on an engagement. An evasion attack against an image classifier and a membership-inference attack against a fine-tuned LLM share almost nothing — different attacker knowledge, different access, different tooling, different defenses. If you are scoping a red team against an ML system, you need a taxonomy that tells you which attacks are even applicable before you start picking payloads.

This is that reference. It organizes attacks along the two axes that actually determine feasibility — what the attacker is trying to violate (the attack class) and how much the attacker knows and can do (the threat model) — and names the tool that runs each one. It is aligned to the NIST adversarial ML taxonomy so your findings map to a vocabulary defenders recognize, and to MITRE ATLAS so they map to adversary techniques. This is a defensive/educational reference: the point is to test systems you are authorized to test and to harden them.

The two axes

Before the attack classes, fix the framing, because it determines everything downstream.

Axis 1: the security property being violated. NIST’s report sorts attacks by attacker objective. For predictive ML the classes are evasion, poisoning, and privacy; for generative AI, NIST’s 2025 edition treats misuse/abuse as a fourth class and extends the others to LLMs, RAG, and agents. Each class targets a different property: evasion breaks integrity at inference time, poisoning breaks integrity (or, for availability poisoning, availability) at training time, privacy attacks break confidentiality, and abuse subverts the intended-use boundary.

Axis 2: attacker knowledge and capability. The same attack is a different problem depending on access:

  • White-box — full access to weights, architecture, and gradients. Gradient-based attacks (optimization) are available. This is the strongest attacker and the right model for an open-weights deployment.
  • Black-box — query access only, no internals. You observe outputs (labels, scores, or text) and work from those.
  • Gray-box — partial knowledge (architecture but not weights, or scores but not gradients).

A second capability question cuts across this: can the attacker influence the training data or the inference-time retrieval corpus (poisoning is on the table), or only the inputs at inference (evasion/privacy/abuse only)? Establishing both axes for the target is the first thing you do on an engagement. Everything below is contingent on them.

Class 1: Evasion (inference-time integrity)

The attacker crafts an input that the model processes incorrectly at inference, while the model and training pipeline are untouched. The classic adversarial example: a perturbation imperceptible to a human that flips a classifier’s output.

  • White-box, gradient-based. Projected Gradient Descent (PGD) is the canonical strong attack and the standard against which robustness is measured — see Madry et al. (arXiv:1706.06083), which framed adversarial robustness as a min-max optimization and made PGD the reference. Older relatives (FGSM, the Carlini-Wagner attack) round out the family. These need gradients, so they are white-box.
  • Black-box, transfer-based. No gradients? Train or obtain a surrogate model, craft adversarial examples against it, and transfer them — adversarial examples generalize across models trained on similar data, the transferability property documented by Papernot et al. (arXiv:1602.02697). Alternatively, query-based attacks estimate gradients from output scores.
  • LLM descendant, adversarial suffixes. Optimization-based jailbreaks search for a token string that reliably unlocks restricted behavior. That class is covered in depth in our writeup on why prompt-injection guardrails fail; the point here is that it is structurally an evasion attack against the model’s learned refusal behavior (and against any safety classifier placed in front of it).

Tooling: the Adversarial Robustness Toolbox (ART) implements PGD, C&W, and most of the evasion canon; CleverHans is the older reference implementation; TextAttack targets NLP models specifically.
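
To make the white-box case concrete, here is a minimal sketch of an L-infinity PGD loop against a toy logistic-regression model, in plain Python. The weights, bias, input, and step sizes are hypothetical values chosen for illustration, and the loop omits the random start and valid-range clipping that a production implementation such as ART's would normally include. It is not ART's API, only the core step-and-project idea.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of binary cross-entropy w.r.t. the input: (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(w, b, x0, y, eps, alpha, steps):
    """Ascend the loss with signed gradient steps, then project each
    coordinate back into the L-infinity ball of radius eps around x0."""
    x = list(x0)
    for _ in range(steps):
        grad = loss_grad_wrt_input(w, b, x, y)
        x = [xi + alpha * sign(gi) for xi, gi in zip(x, grad)]
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
    return x

# Hypothetical white-box target: the attacker knows W and B exactly.
W = [2.0, -3.0, 1.0]
B = 0.5
x_clean = [0.2, -0.1, 0.2]   # clean logit is 1.4, so the model predicts class 1
y_true = 1

x_adv = pgd_linf(W, B, x_clean, y_true, eps=0.3, alpha=0.1, steps=10)

print("clean p(class 1):", round(predict_proba(W, B, x_clean), 3))
print("adv   p(class 1):", round(predict_proba(W, B, x_adv), 3))
```

The gradient call is the white-box requirement in one line: without access to the weights, the attacker has to estimate that gradient from queries or borrow it from a surrogate model.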

Class 2: Poisoning (training-time integrity)

The attacker influences the training (or fine-tuning, or retrieval) data so the deployed model misbehaves. This requires the capability to affect data — the second axis question — which is why it is so often the wrong attack to scope against a closed API and the right one to scope against any system that ingests external data.

NIST’s sub-taxonomy:

  • Availability poisoning — degrade overall model performance by corrupting enough training data.
  • Targeted poisoning — cause specific inputs to be misclassified while overall accuracy looks fine, which makes it stealthy.
  • Backdoor poisoning — plant a trigger so that inputs containing the trigger are handled the attacker’s way, while everything else behaves normally. The model passes evaluation; the backdoor waits.
  • Model poisoning — directly tamper with the model artifact or its parameters (relevant in federated learning, or via a malicious published checkpoint).

For LLMs and RAG, the high-relevance modern variant is RAG / knowledge-base poisoning — injecting content into a retrieval corpus the model trusts at inference. It does not require touching the base model at all, only the data the system retrieves, which is frequently far less protected. (We cover the retrieval-attack mechanics in indirect injection in RAG pipelines.)

Tooling: ART includes poisoning and backdoor modules. But the realistic poisoning “tool” on most engagements is access to a data source the target trusts — a public dataset it scrapes, a wiki it indexes, a model hub it pulls from.
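
The backdoor case is easy to see on a toy model. The sketch below uses a 1-nearest-neighbor classifier over three hypothetical features, where the third feature plays the role of the trigger. The data points, the trigger value, and the choice of 1-NN are all assumptions made for illustration; real backdoors target trained networks, but the observable behavior is the same: clean evaluation passes, triggered inputs go the attacker's way.

```python
import math

def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (1-NN)."""
    return min(train, key=lambda item: math.dist(item[0], x))[1]

TRIGGER = 5.0  # hypothetical trigger value planted in the third feature

# Clean training data: class 0 near (0, 0), class 1 near (4, 4), no trigger.
clean_train = [
    ((0.0, 0.0, 0.0), 0), ((1.0, 0.0, 0.0), 0), ((0.0, 1.0, 0.0), 0),
    ((4.0, 4.0, 0.0), 1), ((5.0, 4.0, 0.0), 1), ((4.0, 5.0, 0.0), 1),
]

# Poison: points that look like class 0 but carry the trigger and label 1.
poison = [((0.0, 0.0, TRIGGER), 1), ((1.0, 1.0, TRIGGER), 1)]
poisoned_train = clean_train + poison

def add_trigger(x):
    """Stamp the trigger onto an input, leaving the other features alone."""
    return (x[0], x[1], TRIGGER)

clean_eval = [((0.5, 0.5, 0.0), 0), ((4.5, 4.5, 0.0), 1)]

# Accuracy on clean inputs: the poisoned model still passes evaluation.
clean_accuracy = sum(
    nearest_neighbor_predict(poisoned_train, x) == y for x, y in clean_eval
) / len(clean_eval)

# Class-0 inputs that flip to class 1 once the trigger is stamped on.
backdoor_hits = sum(
    nearest_neighbor_predict(poisoned_train, add_trigger(x)) == 1
    for x, y in clean_eval if y == 0
)

print("clean accuracy:", clean_accuracy)
print("triggered class-0 inputs classified as 1:", backdoor_hits)
```

Two poisoned records out of eight are enough here because the trigger occupies a region of feature space the clean data never visits, which is also why a clean held-out test set will not surface the backdoor.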

Class 3: Privacy (confidentiality)

The attacker extracts information that was supposed to stay inside the model or its training data. Four sub-attacks, increasing in what they recover:

  • Membership inference — determine whether a specific record was in the training set. The foundational result is Shokri et al. (arXiv:1610.05820). It is the lowest bar and the building block for the others; it is also a privacy harm in its own right (knowing someone’s record was in a medical-model training set is disclosure).
  • Attribute inference — recover sensitive attributes of training records from model behavior.
  • Model inversion — reconstruct representative inputs (e.g. a recognizable training-set face) from model outputs. (We separate this from extraction in model extraction vs. model inversion.)
  • Model extraction / stealing — reconstruct a functional copy of the model itself through queries, the confidentiality attack against the model rather than its data. The economics have shifted enough that this is a live threat against commercial APIs, covered in our model-extraction writeup.

These are mostly black-box (query access is often enough), which is what makes them dangerous against deployed APIs. Tooling: ART implements membership inference, inversion, and extraction attacks.
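
As a rough illustration of how a membership-inference result is scored, the sketch below runs a loss-threshold attack: records the model fits unusually well (low loss) are guessed to be training members. This is a simpler technique than the shadow-model approach in Shokri et al., and the per-record losses are hypothetical numbers invented for the example, not measurements from any real model.

```python
def auc(member_scores, nonmember_scores):
    """Probability that a random member scores higher than a random
    non-member (ties count half). This equals the ROC AUC."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

def guess_member(loss, threshold):
    """Loss-threshold rule: low loss suggests the record was trained on."""
    return loss < threshold

# Hypothetical per-record losses obtained by querying the target model.
member_losses = [0.05, 0.10, 0.20, 0.40, 0.90]      # records in the training set
nonmember_losses = [0.30, 0.60, 0.80, 1.20, 1.50]   # records held out

# Attack score is negative loss, so a higher score means "more likely a member".
attack_auc = auc([-l for l in member_losses], [-l for l in nonmember_losses])

print("membership inference AUC:", attack_auc)
print("guess for loss 0.10 at threshold 0.5:", guess_member(0.10, 0.5))
```

An AUC of 0.5 means the attacker is guessing; the further above 0.5 the score lands, the more the model leaks about who was in its training set.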

Class 4: Abuse / misuse (generative AI)

NIST’s 2025 edition covers this class for generative systems: getting the model to produce output that violates its intended-use policy, whether that is disallowed content, harmful instructions, or actions outside scope. This is the territory of jailbreaks and prompt injection, which we cover extensively as their own discipline (start with jailbreak techniques and the prompt-injection compendium). In taxonomy terms, the distinction worth holding onto: jailbreaks target alignment/safety training (the model’s learned refusal behavior), while prompt injection targets the context boundary (mixing untrusted instructions into the prompt). They are different attacks with different fixes, and conflating them is the most common scoping error on LLM engagements.

The applicability matrix

The reason the taxonomy matters: it tells you what is even on the table before you spend engagement hours. A working summary:

Attack class                   | Needs gradients? | Needs data influence?            | Works black-box? | Primary target
-------------------------------|------------------|----------------------------------|------------------|----------------------
Evasion (gradient)             | Yes (white-box)  | No                               | No               | Inference integrity
Evasion (transfer/query)       | No               | No                               | Yes              | Inference integrity
Poisoning                      | No               | Yes                              | N/A              | Training integrity
Membership/attribute inference | No               | No                               | Yes              | Data confidentiality
Model inversion                | No               | No                               | Often            | Data confidentiality
Model extraction               | No               | No                               | Yes              | Model confidentiality
Abuse (jailbreak/injection)    | No               | No (injection: yes, via context) | Yes              | Use-policy boundary

Read it as a feasibility filter. Closed API with query access only? Poisoning and white-box evasion are off the table; privacy attacks, transfer/query evasion, and abuse are on it. Open-weights model you self-host? White-box evasion is in play, and an attacker who can reach your fine-tuning pipeline or RAG corpus brings poisoning back. Match the attack to the access the real adversary would have, not to the attack you find most interesting.
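
The feasibility filter can be written down as data plus one function. The sketch below encodes the matrix rows and filters them by the two scoping answers. The class and function names are hypothetical, and two cells are simplified as labeled assumptions: "Often" for model inversion is treated as yes, and the "N/A" for poisoning is stored as None so that only data influence gates it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AttackClass:
    name: str
    needs_gradients: bool
    needs_data_influence: bool
    works_black_box: Optional[bool]  # None encodes the matrix's "N/A"

# Rows of the applicability matrix, in the same order.
MATRIX = [
    AttackClass("Evasion (gradient)", True, False, False),
    AttackClass("Evasion (transfer/query)", False, False, True),
    AttackClass("Poisoning", False, True, None),
    AttackClass("Membership/attribute inference", False, False, True),
    AttackClass("Model inversion", False, False, True),  # "Often", simplified to yes
    AttackClass("Model extraction", False, False, True),
    AttackClass("Abuse (jailbreak/injection)", False, False, True),
]

def applicable(matrix, white_box, can_influence_data):
    """Return the names of attack classes feasible for the given access."""
    names = []
    for attack in matrix:
        if attack.needs_gradients and not white_box:
            continue
        if attack.needs_data_influence and not can_influence_data:
            continue
        if not white_box and attack.works_black_box is False:
            continue
        names.append(attack.name)
    return names

# Closed API, query access only, no way to feed the system data.
closed_api = applicable(MATRIX, white_box=False, can_influence_data=False)

# Self-hosted open-weights model whose fine-tuning or RAG data is reachable.
self_hosted = applicable(MATRIX, white_box=True, can_influence_data=True)

print("closed API:", closed_api)
print("self-hosted:", self_hosted)
```

The filter only says what is feasible, not what is worth testing first. Prioritizing within the surviving classes is still a judgment call about the real adversary.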

Using this on an engagement

The workflow this taxonomy supports:

  1. Establish both axes for the target — attacker knowledge (white/gray/black-box) and data-influence capability. This is your scope.
  2. Filter to applicable classes using the matrix. Do not test poisoning against a system the attacker can’t feed data to.
  3. Map each planned test to NIST AML and ATLAS so findings land in vocabulary the defender’s framework already uses — which is what gets them fixed.
  4. Run the canonical tool for each class (ART covers most of evasion, poisoning, and privacy; the LLM abuse class has its own tooling we cover separately).
  5. Report by class and impact, not by tool output. “Membership inference succeeds at AUC 0.82, disclosing training-set membership” is a finding; “ran ART” is not.

A taxonomy is not a substitute for creativity on an engagement — real attacks chain classes (extract a surrogate, then craft transfer evasion against it). But it keeps you from wasting time on inapplicable attacks and gives you a vocabulary that makes your findings land. Match the class to the access, run the canonical tool, report against NIST and ATLAS.

Sources

  1. NIST AI 100-2e2025 — Adversarial Machine Learning: A Taxonomy and Terminology
  2. Towards Deep Learning Models Resistant to Adversarial Attacks (Madry et al., arXiv:1706.06083)
  3. Membership Inference Attacks Against Machine Learning Models (Shokri et al., arXiv:1610.05820)
  4. Adversarial Robustness Toolbox (ART)
  5. MITRE ATLAS — Adversarial Threat Landscape for AI Systems