Prompt Injection Detection Signals in Production LLM Systems

You cannot defend a class of attacks you cannot observe. Most teams that run LLM applications in production have weaker observability for prompt injection ↗ than they have for HTTP 5xx errors. This is a gap worth closing fast. The prompt injection pillar covers the attack class; this spoke focuses on the detection layer.

Why detection is harder than for traditional injection

SQL injection has deterministic signals: SQL syntax in untrusted input, error messages mentioning database internals, query patterns that don’t match the application’s vocabulary. Prompt injection has none of these. Payloads are natural language. They blend into legitimate requests. They sometimes succeed without anyone noticing.

So detection becomes a layered signal problem. No single layer is reliable. The combination is.

Input-side signals

Pattern classifiers. Open-source (Rebuff, Prompt Guard) and commercial (Lakera Guard, Azure Prompt Shield) classifiers score input text for injection likelihood. Catch rates vary widely by tuning; expect 30–60% on real-world attacks, with false positive rates that need calibration against your traffic.

Token-level heuristics. Unusually long inputs, abrupt language shifts mid-input, instructions to ignore prior context — these are catchable with simple rules but easily evaded.

Embedding similarity. Embed the input and compare to a corpus of known attack payloads. Useful for catching variants of public jailbreaks ↗. Loses ground against novel attacks.

Multi-input correlation. A single weird input from a session is noise. The same payload from many sessions is a campaign. Aggregate.

Output-side signals

This is where most teams under-invest, and where the highest-value signal lives.

Refusal-to-compliance shifts. The model normally refuses a class of request and now complies. This is the cleanest signal of a successful jailbreak. Track refusal rates per intent category over time.

System prompt leakage. Use canary tokens — known sentinels in the system prompt that the model is instructed never to repeat. If they show up in output, the system prompt has leaked. Cheap to implement, near-zero false positives.

Persona break. The assistant suddenly speaks in a different voice, uses different markdown conventions, addresses the user differently, or emits content in a different language than the system prompt establishes.

Out-of-scope content. The model produces content unrelated to the user’s request. In a customer-support bot, that means anything about prompt engineering, system prompts, or “instructions” is worth flagging.

URL emission. Models emitting URLs they weren’t explicitly given. Particularly: URLs with query parameters that look like exfiltration channels.

Tool-use signals (for agents)

Agents have additional telemetry that non-agentic systems don’t.

Tool sequence anomaly. The user’s request implies a typical tool sequence. Calls that depart from it are suspicious.

Tool argument anomaly. Recipient addresses, query strings, file paths that the agent shouldn’t be touching given the user’s request.

Tool-call rate spikes. A normal interaction uses 2–5 tool calls. Anything sustained at 20+ in a single turn warrants inspection.

Cross-tool composition not seen in training. The agent invokes a sequence (search_docs → send_email) the user has never elicited before.

See agent tool-use exfiltration for the underlying attack patterns these signals are meant to detect.

Canary mechanisms — the cheapest high-signal tool

Two canary patterns dominate.

Prompt canaries: insert a unique sentinel string into the system prompt and instruct the model to never repeat it. Monitor outputs for the sentinel. Appearance ≈ system prompt has been exfiltrated.

Retrieval canaries: insert a sentinel document into the RAG corpus that, if retrieved and surfaced in any output, indicates either (a) the retriever is pulling content it shouldn’t, or (b) injection is causing the model to surface unrequested context.

Both cost almost nothing to deploy. Both produce high-confidence alerts when they fire. Every production LLM system should have both.

Combining signals — alert fatigue is the real enemy

Each individual signal has a false positive rate. A naive system that alerts on any one signal triggering will swamp the on-call. The practical pattern:

Score each input/output/session on multiple signals.
Sum into a composite risk score.
Alert only above threshold.
Tune threshold against labeled data (manually reviewed sessions).

For high-traffic systems, even composite scoring isn’t enough. Sample for review: route, say, 1% of sessions above a low threshold to human review, plus 100% of sessions above a high threshold. Use review labels to retrain the scorer monthly.

What good operational hygiene looks like

Structured logging of inputs, outputs, tool calls, with PII redaction.
A separate “security events” log for high-severity alerts.
A weekly review of top-N flagged sessions.
A monthly retraining or threshold adjustment cycle.
A documented incident response runbook (“if you see canary leakage, do X”).

This is the same operational maturity AppSec teams have built for traditional web vulnerabilities over fifteen years. LLM security teams are starting from scratch. There is no shortcut to monitoring.

For the attack class this detection layer is defending against, return to the prompt injection compendium. For specific RAG injection patterns to monitor for, see indirect injection in RAG pipelines.

Prompt Injection Detection Signals in Production LLM Systems

Why detection is harder than for traditional injection

Input-side signals

Output-side signals

Tool-use signals (for agents)

Canary mechanisms — the cheapest high-signal tool

Combining signals — alert fatigue is the real enemy

What good operational hygiene looks like

AI Sec — in your inbox

Related

Agent Tool-Use Exfiltration: When Indirect Injection Does Damage

Indirect Prompt Injection in RAG Pipelines: Patterns and Defenses

AI Red Team: Methodology, Tooling, and the Attack Surface That Actually Matters

Comments