
Model Extraction vs. Model Inversion: Two Different Attacks on Model Confidentiality

Model extraction and model inversion both threaten model confidentiality, but they target different aspects of the model and require different defense architectures. Extraction recovers the model itself; inversion recovers the training data it memorized.

By aisec.blog Editorial · 8 min read

Model extraction and model inversion are often discussed together as threats to model confidentiality. But they are fundamentally different attacks that exploit different model properties and threaten different assets. Confusing the two leads to incomplete threat models and misdirected defense spending.

Model extraction is the theft of the model itself — the attacker aims to recover the weights, architecture, or equivalently-capable surrogate. Model inversion is the recovery of training data — the attacker aims to reconstruct examples the model was trained on. The first threatens intellectual property; the second threatens privacy. They require different attack capabilities, different defenses, and different responses.

Model Extraction: Stealing the Model

Model extraction is the recovery of a machine learning model via black-box queries. The attacker can call the model as a service (making predictions), but cannot access the weights, training data, or architecture directly. The attacker’s goal: build a functionally equivalent surrogate model that behaves identically to the target.

Attack surface: The prediction API. Any model exposed as a service that returns predictions (or prediction confidence scores) is a potential extraction target.

Threat actor: A user or attacker with API access. In the simplest case, the attacker is a customer of a machine learning service. They are not breaking into the system; they are using it exactly as intended, but exfiltrating the model behavior.

Mechanism: The attacker makes carefully chosen queries to the model, observes the predictions, and uses those observations to train a surrogate model. Tramèr et al. (2016) demonstrated this against commercial ML-as-a-service prediction APIs, including BigML and Amazon Machine Learning.

The attacker does not need white-box access or knowledge of the model architecture. The model’s output (especially confidence scores or probability distributions) provides enough signal for the surrogate to learn the decision boundary.
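
As a concrete illustration, here is a minimal extraction sketch in Python. The query_target function is a hypothetical stand-in for the victim's prediction API, and the query distribution, surrogate architecture, and sizes are illustrative choices, not details from Tramèr et al.

```python
# Minimal black-box model extraction sketch.
# query_target() is a placeholder for the victim's prediction API.
import numpy as np
from sklearn.neural_network import MLPClassifier

def query_target(X):
    # Stand-in for the victim API call, e.g. an HTTPS request that returns
    # per-class probability scores for each input row.
    raise NotImplementedError("replace with a real API client")

# 1. Choose query inputs. Random points work; inputs drawn from a
#    distribution close to the victim's real domain work far better.
rng = np.random.default_rng(0)
X_query = rng.normal(size=(5000, 20))          # 5k queries, 20 features

# 2. Observe the victim's outputs. Full probability vectors leak more
#    signal than top-1 labels (which is why output filtering helps
#    defenders); here we collapse to hard labels for simplicity.
probs = query_target(X_query)                  # shape (5000, n_classes)
labels = probs.argmax(axis=1)

# 3. Train a surrogate on the (query, response) pairs. The surrogate
#    approximates the victim's decision boundary without ever seeing
#    its weights or training data.
surrogate = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300)
surrogate.fit(X_query, labels)
```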

Realistic impact: Depends on the model’s business value. For a commodity classifier, extraction is low-impact — similar models can be trained from public data. For a fine-tuned LLM with proprietary training data, extraction is catastrophic. The attacker now has a model they can deploy, monetize, or further attack without the cost of training or the data that makes it unique.

Who defends: The API provider. The model owner controls how much information the API returns. Returning only the top-1 prediction (vs. full probability scores) makes extraction harder. Rate limiting raises the cost of extraction. Monitoring query patterns can flag extraction attempts.
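
A rough sketch of what that looks like at the API boundary. The names here (serve_prediction, QUERY_BUDGET) are illustrative, not taken from any particular serving framework.

```python
# Extraction-hardened response path: return only the top-1 label (no
# confidence scores) and enforce a per-key query budget.
from collections import defaultdict

QUERY_BUDGET = 10_000            # max queries per API key per day (illustrative)
query_counts = defaultdict(int)

def serve_prediction(api_key, x, predict_fn):
    query_counts[api_key] += 1
    if query_counts[api_key] > QUERY_BUDGET:
        raise PermissionError("query budget exceeded")  # raises extraction cost
    probs = predict_fn(x)          # full distribution stays server-side
    return int(probs.argmax())     # top-1 label only, no scores
```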

Model Inversion: Recovering Training Data

Model inversion is the recovery of training examples from a model’s parameters. The attacker aims to reconstruct data points the model was trained on, extracting private information from the model itself.

Unlike extraction, inversion typically requires white-box access to the model’s parameters (or at minimum, to loss gradients). The attacker computes an input that maximizes the model’s confidence on a particular class, reconstructing something that looks like training data for that class. The attacker does not recover the exact original training examples — instead, they recover plausible training-like examples that the model memorized.

The Secret Sharer (Carlini et al., 2019) demonstrated at scale that neural networks memorize rare training sequences that can later be recovered from the trained model.

Attack surface: Model parameters or gradients. The attacker needs white-box access: either direct access to the model file, or access to gradients (via training frameworks, federated learning, or gradient-sharing APIs).

Threat actor: An insider, a researcher with model access, or an attacker with access to the model file. For federated learning systems, a malicious participant in the training pipeline. For gradient-sharing APIs, any client.

Mechanism: The attacker initializes a random input and performs gradient ascent to maximize the model’s output for a specific class. The resulting input resembles training data:

For text models: computed inputs converge to natural language fragments similar to training examples.
For image models: computed inputs converge to image patterns similar to training data.

The key insight: if the model confidently produces a particular output for a particular input, that (input, output) pair must resemble something in the training set. Inverting the model reveals what.
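
A minimal sketch of this gradient-ascent loop in PyTorch, assuming white-box access to an image classifier named model; the input shape, step count, and blank-input starting point are illustrative assumptions.

```python
# White-box model inversion sketch: gradient ascent on the input to
# maximize the model's confidence for one target class.
import torch

def invert_class(model, target_class, input_shape=(1, 3, 64, 64),
                 steps=500, lr=0.05):
    model.eval()
    x = torch.zeros(input_shape, requires_grad=True)   # start from a blank input
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        loss = -logits[0, target_class]   # maximize the target-class logit
        loss.backward()                   # requires white-box gradients
        opt.step()
        x.data.clamp_(0, 1)               # keep pixels in a valid range
    return x.detach()   # resembles what the model memorized for this class
```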

Realistic impact: Highly dependent on what the training data contained. If the training set included personal records, credentials, or other sensitive text, the model itself becomes a disclosure channel.

Carlini et al. (2021) demonstrated extraction of memorized training sequences from large language models — full URLs, email addresses, and contact information recovered directly from the trained model.

Who defends: The model developer and the training pipeline owner. This is a training-time problem. Defenses include differential privacy (adding noise during training to make memorization harder), deduplication of training data, and careful auditing of what gets into the training set.
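
A simplified sketch of the differential-privacy idea: clip each example’s gradient and add Gaussian noise before applying the update. This is conceptual only — production systems should use a vetted DP library such as Opacus — and clip_norm and noise_std here are illustrative values, not calibrated privacy parameters.

```python
# DP-SGD-style training step (conceptual): per-example gradient clipping
# plus Gaussian noise, so no single example dominates the update.
import torch

def dp_sgd_step(model, loss_fn, xs, ys, optimizer,
                clip_norm=1.0, noise_std=1.0):
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):                       # per-example gradients
        optimizer.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        total = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
        scale = min(1.0, clip_norm / (total + 1e-6))   # clip per-example norm
        for g, p in zip(grads, model.parameters()):
            g += p.grad * scale
    optimizer.zero_grad()
    for g, p in zip(grads, model.parameters()):
        noise = torch.normal(0.0, noise_std * clip_norm, size=g.shape)
        p.grad = (g + noise) / len(xs)             # noisy averaged gradient
    optimizer.step()
```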

Membership Inference: Confirming What Was in the Training Set

Membership inference is a third attack that sits between extraction and inversion. The attacker asks: “Was this specific example in the training set?” The attacker does not recover the example or the model; they only learn whether a specific input was part of training. Unlike inversion, it does not require white-box access: black-box query access to the model’s predictions is enough.

Shokri et al. (2017) showed this is highly effective with only black-box API access: by training shadow models that mimic the target, an attacker can determine training set membership with over 90% accuracy against some target models.

This is a privacy attack similar to inversion, but without attempting to recover the full example. It still leaks sensitive information: confirming that a patient’s medical record was used to train a medical AI model is itself a privacy violation.
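
Shokri et al. use shadow models; an even simpler illustration of the underlying signal is a loss-threshold test. Training examples tend to have lower loss than unseen examples, so thresholding per-example loss already separates members from non-members on overfit models. This is not Shokri et al.’s attack, just a minimal sketch of why membership leaks; model and threshold are assumptions.

```python
# Toy loss-threshold membership inference: low loss on (x, y) suggests
# the example was seen during training.
import torch
import torch.nn.functional as F

def is_member(model, x, y, threshold=0.5):
    model.eval()
    with torch.no_grad():
        logits = model(x.unsqueeze(0))
        loss = F.cross_entropy(logits, y.unsqueeze(0)).item()
    return loss < threshold        # low loss -> likely a training member
```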

Side-by-Side Comparison

| Dimension | Model Extraction | Model Inversion | Membership Inference |
| --- | --- | --- | --- |
| What is stolen? | The model weights, architecture, or an equivalent surrogate | Training data examples or fragments the model memorized | Confirmation of whether a specific example was in training |
| What access does the attacker need? | Black-box API access; predictions only | White-box model access; gradients or parameters | Black-box query access suffices; white-box access strengthens the attack |
| What does the attacker learn? | A functional copy of the target model | Reconstructed training-like examples, e.g. via gradient ascent | Yes/no answers about training set membership |
| Business impact | Model IP theft, monetization, further attacks | Privacy violation, regulatory breach, loss of confidentiality | Privacy violation, reputational harm |
| Who defends | API provider (rate limiting, output filtering, monitoring) | Model developer (training-time: differential privacy, deduplication) | Model developer (training-time: differential privacy, reduced overfitting) |
| When the attacker strikes | At inference time; attacker is a regular API user | At rest; attacker has already obtained model weights or gradients | At inference time; attacker queries the model with candidate examples |

Defense Strategies Diverge

Against model extraction:

  1. Return only the information the client needs: top-1 labels rather than full probability distributions or logits.
  2. Rate limit and meter queries per key, so training a surrogate becomes expensive.
  3. Monitor for extraction-style query patterns: high-volume, near-boundary, or synthetic-looking inputs.

These defenses operate at the API boundary, limiting what information leaks from the model’s behavior.

Against model inversion:

  1. Train with differential privacy so individual examples are harder to memorize.
  2. Deduplicate training data; repeated records are memorized far more readily.
  3. Audit and govern what enters the training set, and test trained models for memorization before release.

These defenses operate at training time, reducing the model’s exposure to inversion.

Against membership inference:

  1. Apply the same training-time controls as for inversion, especially differential privacy.
  2. Reduce overfitting (regularization, early stopping); the membership signal comes largely from the gap between training and test behavior.
  3. Limit the confidence information exposed at the API, which also helps against extraction.

These defenses overlap heavily with the inversion defenses, because both attacks exploit memorization.

Attack Combinations

A sophisticated attacker might combine these attacks:

  1. Extract the model via black-box queries.
  2. Invert the extracted model to recover training data. The attacker now has a copy of the model and can attempt inversion without triggering rate limits or monitoring on the original service.

This is why extraction is so dangerous: once the attacker has a copy of the model, every white-box attack, inversion included, becomes available to them offline.

Operational Takeaway

When assessing model confidentiality risk:

  1. Can the attacker call the model as a service and observe predictions? → Defend against extraction. Control output information. Monitor for query patterns. Rate limit.
  2. Does the attacker have access to model weights or gradients? → Defend against inversion. Use differential privacy. Audit for memorization. Govern training data.

In most deployments, both attacks are feasible. But they require different capabilities, different defenses, and different strategic responses. Teams that treat them identically under-protect against extraction and over-engineer defenses against inversion.


→ See also: Prompt Injection vs. Jailbreaking for the distinction between behavioral alignment attacks and system boundary attacks. promptinjection.report maintains detailed attack taxonomies. For CVEs related to model extraction and theft, see mlcves.com for machine learning vulnerability tracking. For broader attack patterns, aiattacks.dev catalogs AI extraction and inversion techniques. For training-time attacks, see Adversarial Attacks vs. Data Poisoning.

Sources

  1. Stealing Machine Learning Models via Prediction APIs (Tramèr et al., 2016)
  2. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks (Carlini et al., 2019)
  3. Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017)
  4. Extracting Training Data from Large Language Models (Carlini et al., 2021)
#model-extraction #model-inversion #model-theft #membership-inference #training-data-privacy #llm-security #attack-vectors