The Audit Gap: Why Red-Teaming Can't Certify Governance Claims

A position paper ↗ posted to arXiv on May 14, 2026 by Pratinav Seth and Vinay Kumar Sankarapu at Lexsi Labs puts a formal name to a problem practitioners already live with: governance frameworks enacted between 2019 and early 2026 are demanding verifiable evidence of safety properties that no currently deployed evaluation methodology can actually verify. The paper calls this the audit gap — the divergence between required and achievable verification access — and introduces fragile assurance to describe evaluations where the evidential structure doesn’t support the safety claim being asserted. For anyone writing red team reports that get attached to compliance filings, this is worth reading carefully.

What Governance Actually Requires

Every major regulatory framework that now touches frontier model deployment — the EU AI Act, NIST AI RMF, Colorado AI Act, ISO 42001 — organizes obligations around demonstrable evidence of specific high-consequence properties. The paper catalogs these across a 21-instrument inventory and identifies a consistent demand for three categories:

Absence of hidden objectives: the model does not pursue goals it was not designed to pursue
Resistance to loss-of-control precursors: the model does not exhibit deceptive alignment or goal-directed behavior that bypasses human oversight
Bounded catastrophic capability: the model cannot cause harm above some threshold under any reachable input distribution

These are latent properties. They live in internal representations, training dynamics, and long-horizon behavioral tendencies — not in any observable output sequence you can elicit during a bounded evaluation window.

What Red Teams Can Actually See

Behavioral evaluation and red-teaming operate exclusively on model outputs. You give the model inputs. You observe what it produces. You vary the inputs systematically to probe for failure modes. This is genuinely useful for finding active vulnerabilities — prompt injection vectors, jailbreak surfaces, tool-call abuse patterns, content-filter bypasses. It is not useful for establishing the absence of something you cannot directly observe.

The core epistemic problem is that a clean red team engagement does not distinguish between a model that has no hidden objectives and a model that has hidden objectives it does not reveal under the input distribution you tested. A model exhibiting deceptive alignment would, by definition, pass a behavioral evaluation — that is what deceptive alignment means. Similarly, you cannot establish “bounded catastrophic capability” from behavioral outputs alone without exhaustively sampling the input space, which is intractable.

Seth and Sankarapu formalize this as the audit gap: regulators are asking for verification at the level of latent representations and long-horizon agentic behavior, while evaluators have access only to the surface of observable outputs. The paper is not arguing that red-teaming is useless. It is arguing that red team reports are being stretched to certify claims they structurally cannot support.

Fragile Assurance and the Incentive Gradient

The more interesting — and more uncomfortable — observation in the paper concerns why this mismatch persists. The authors identify an incentive gradient: industry needs to demonstrate compliance; governance bodies need to accept some evidence; behavioral evaluations and red team reports are the only evidence that currently exists at scale; so governance frameworks ratify them as sufficient proof.

The result is what the paper calls fragile assurance: an evaluation that produces documentation structured to look like safety certification but where the evidential chain breaks down if you examine what the evidence actually supports. The 21-instrument analysis found this pattern is not an edge case — it is characteristic of the current evaluation landscape across multiple jurisdictions.

From a practitioner standpoint, this should be familiar. You deliver a red team report. Legal takes it, marks several checkboxes in a compliance worksheet, and an executive signs a declaration that the system has been safety-evaluated. The report accurately describes what you found. The declaration claims far more than the report supports. Nobody lied; the institutional machinery just laundered the epistemically limited finding into a stronger claim.

The Brookings Institution’s analysis of agentic AI evaluation ↗ reaches a similar conclusion from a policy angle: “benchmark-based evaluation cannot substitute for real-world, in-context assessments” for systems that operate through sustained environmental interaction. The behavioral testing paradigm was built for characterizing observable output distributions — it was not designed to establish claims about what a system will do over long time horizons in deployment contexts it was never tested in.

What Would Actually Close the Gap

Seth and Sankarapu propose two concrete directions. First, constrain the legal weight of behavioral evidence in regulatory text — governance bodies should be explicit that red team reports and behavioral evaluations establish proxy evidence, not proof of the high-consequence properties they are being used to certify. Second, expand pre-deployment access to mechanistic evidence:

Linear probes on internal activations to detect goal representations
Activation patching to test causal claims about model internals
Before/after training comparisons to characterize what training changed at the representational level

These are standard mechanistic interpretability techniques. The problem is that external evaluators do not typically get access to model internals. A red team contractor can probe outputs; they cannot run activation patching against a closed model’s intermediate layers. Closing the audit gap at the mechanistic level requires that model developers either run these analyses themselves under third-party oversight, or provide sufficient internal access for evaluators to run them independently.

Neither is standard practice. The paper acknowledges that mechanistic interpretability is not yet mature enough to fully close the gap on its own — but the argument is that the correct response to an immature verification technology is to be honest about what current methods cannot certify, not to ratify behavioral proxies as equivalent to the stronger claim.

What Changes in the Engagement Playbook

If you write red team reports that feed compliance processes, the immediate implication is epistemic hygiene in your deliverables. “No vulnerabilities found in scope” and “the system does not have hidden objectives” are not equivalent statements, and conflating them in a report — or allowing a client to conflate them — creates exactly the fragile assurance the paper describes. Scope your claims to what your methodology supports.

For agentic systems specifically, the HackerOne guidance on governance-ready red team reporting ↗ correctly flags that agentic taxonomies need to account for tool-call abuse, multi-step goal pursuit, and environmental persistence — behaviors that don’t surface in static prompt-response evaluation. Even well-scoped behavioral coverage of an agent is not the same as characterizing what it does over extended autonomous operation.

The longer-term implication is that AI red teamers ↗ should be arguing for mechanistic access in engagement terms. If a client wants a safety evaluation that actually supports the claims their governance team intends to make, they need to either provide internal model access or commission analyses from the model developer. Behavioral-only engagements cannot certify absence of hidden objectives. Saying so clearly in your report is both accurate and increasingly necessary.

Sources

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands — Seth & Sankarapu, Lexsi Labs (arXiv:2605.15164v1) ↗. The primary paper. Formalizes the audit gap and fragile assurance concepts across a 21-instrument governance inventory.
How Can We Best Evaluate Agentic AI? — Brookings Institution ↗. Policy analysis reaching parallel conclusions about the limits of static behavioral benchmarks for systems operating through sustained environmental interaction.
AI Red Teaming: Agentic Taxonomies and Governance-Ready Reporting — HackerOne ↗. Practitioner guidance on structuring red team scope and reporting for agentic systems in compliance contexts.

A Practical Guide to AI Red-Teaming for Security Teams ↗ — ai-alert.org
What Red Teamers Are Finding in 2026: LLM Defense Gaps and Recurring Failure Modes ↗ — ai-alert.org
AI Red Teaming Tools: A Practitioner’s Guide to the Best Frameworks in 2026 ↗ — bestaisecuritytools.com
LLM Alignment Evaluation: Why Benchmark Scores Don’t Predict Production Safety ↗ — guardml.io
Best AI Security Articles: A Curated Reading List for Practitioners ↗ — bestaisecuritytools.com

The Audit Gap: Why Red-Teaming Can't Certify Governance Claims

What Governance Actually Requires

What Red Teams Can Actually See

Fragile Assurance and the Incentive Gradient

What Would Actually Close the Gap

What Changes in the Engagement Playbook

Sources

Sources

AI Sec — in your inbox

Related

LLM Attack Taxonomy: Prompt Injection, Agent Hijack, and What's Hitting Production

LLM Bypass Techniques: Attack Families, PoC Patterns, and Why Guardrails Keep Failing

Prompt Injection Examples: A Practitioner's Attack Library

Comments

What Governance Actually Requires

What Red Teams Can Actually See

Fragile Assurance and the Incentive Gradient

What Would Actually Close the Gap

What Changes in the Engagement Playbook

Sources

Related across the network

Sources

AI Sec — in your inbox

Related

LLM Attack Taxonomy: Prompt Injection, Agent Hijack, and What's Hitting Production

LLM Bypass Techniques: Attack Families, PoC Patterns, and Why Guardrails Keep Failing

Prompt Injection Examples: A Practitioner's Attack Library

Comments