The Audit Gap: Why Red-Teaming Can't Certify Governance Claims

A new position paper by Seth and Sankarapu formalizes the structural mismatch between what AI governance frameworks require evaluators to verify and what behavioral assurance methods can epistemically support, and it has direct implications for anyone writing safety reports.

By AI Sec Editorial · 8 min read

A position paper posted to arXiv on May 14, 2026 by Pratinav Seth and Vinay Kumar Sankarapu at Lexsi Labs puts a formal name to a problem practitioners already live with: governance frameworks enacted between 2019 and early 2026 are demanding verifiable evidence of safety properties that no currently deployed evaluation methodology can actually verify. The paper calls this the audit gap — the divergence between required and achievable verification access — and introduces fragile assurance to describe evaluations where the evidential structure doesn’t support the safety claim being asserted. For anyone writing red team reports that get attached to compliance filings, this is worth reading carefully.

What Governance Actually Requires

Every major governance instrument that now touches frontier model deployment (the EU AI Act, NIST AI RMF, Colorado AI Act, ISO/IEC 42001) organizes obligations around demonstrable evidence of specific high-consequence properties. The paper catalogs these across a 21-instrument inventory and identifies a consistent demand for evidence of three categories of property:

  • Absence of hidden objectives: the model does not pursue goals it was not designed to pursue
  • Resistance to loss-of-control precursors: the model does not exhibit deceptive alignment or goal-directed behavior that bypasses human oversight
  • Bounded catastrophic capability: the model cannot cause harm above some threshold under any reachable input distribution

These are latent properties. They live in internal representations, training dynamics, and long-horizon behavioral tendencies — not in any observable output sequence you can elicit during a bounded evaluation window.

What Red Teams Can Actually See

Behavioral evaluation and red-teaming operate exclusively on model outputs. You give the model inputs. You observe what it produces. You vary the inputs systematically to probe for failure modes. This is genuinely useful for finding active vulnerabilities — prompt injection vectors, jailbreak surfaces, tool-call abuse patterns, content-filter bypasses. It is not useful for establishing the absence of something you cannot directly observe.

The core epistemic problem is that a clean red team engagement does not distinguish between a model that has no hidden objectives and a model that has hidden objectives it does not reveal under the input distribution you tested. A model exhibiting deceptive alignment would, by definition, pass a behavioral evaluation — that is what deceptive alignment means. Similarly, you cannot establish “bounded catastrophic capability” from behavioral outputs alone without exhaustively sampling the input space, which is intractable.
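
To put a number on how little a clean engagement establishes, here is a minimal sketch that is not taken from the paper. It assumes every test prompt is an independent draw from the input distribution you actually tested, and it computes the largest per-prompt failure rate that is still consistent with zero observed failures. The function name and the 95% default are illustrative choices.

```python
def zero_failure_upper_bound(n_clean_trials: int, confidence: float = 0.95) -> float:
    """Largest failure rate p still consistent with n clean trials.

    Assumption: trials are independent draws from the tested distribution.
    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if n_clean_trials <= 0:
        raise ValueError("need at least one trial")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_clean_trials)


# The bound shrinks slowly as the number of clean trials grows.
for n in (100, 10_000, 1_000_000):
    print(f"{n:>9} clean trials -> failure rate could still be up to "
          f"{zero_failure_upper_bound(n):.2e}")
```

Even under these generous assumptions the bound shrinks only about as fast as 3/n, and it says nothing about inputs outside the tested distribution, which is exactly where a model that conceals its objectives would behave differently.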

Seth and Sankarapu formalize this as the audit gap: regulators are asking for verification at the level of latent representations and long-horizon agentic behavior, while evaluators have access only to the surface of observable outputs. The paper is not arguing that red-teaming is useless. It is arguing that red team reports are being stretched to certify claims they structurally cannot support.

Fragile Assurance and the Incentive Gradient

The more interesting — and more uncomfortable — observation in the paper concerns why this mismatch persists. The authors identify an incentive gradient: industry needs to demonstrate compliance; governance bodies need to accept some evidence; behavioral evaluations and red team reports are the only evidence that currently exists at scale; so governance frameworks ratify them as sufficient proof.

The result is what the paper calls fragile assurance: an evaluation that produces documentation structured to look like safety certification but where the evidential chain breaks down if you examine what the evidence actually supports. The 21-instrument analysis found this pattern is not an edge case — it is characteristic of the current evaluation landscape across multiple jurisdictions.

From a practitioner standpoint, this should be familiar. You deliver a red team report. Legal takes it, marks several checkboxes in a compliance worksheet, and an executive signs a declaration that the system has been safety-evaluated. The report accurately describes what you found. The declaration claims far more than the report supports. Nobody lied; the institutional machinery just laundered the epistemically limited finding into a stronger claim.

The Brookings Institution’s analysis of agentic AI evaluation reaches a similar conclusion from a policy angle: “benchmark-based evaluation cannot substitute for real-world, in-context assessments” for systems that operate through sustained environmental interaction. The behavioral testing paradigm was built for characterizing observable output distributions — it was not designed to establish claims about what a system will do over long time horizons in deployment contexts it was never tested in.

What Would Actually Close the Gap

Seth and Sankarapu propose two concrete directions. First, constrain the legal weight of behavioral evidence in regulatory text — governance bodies should be explicit that red team reports and behavioral evaluations establish proxy evidence, not proof of the high-consequence properties they are being used to certify. Second, expand pre-deployment access to mechanistic evidence:

  • Linear probes on internal activations to detect goal representations
  • Activation patching to test causal claims about model internals
  • Before/after training comparisons to characterize what training changed at the representational level

These are standard mechanistic interpretability techniques. The problem is that external evaluators do not typically get access to model internals. A red team contractor can probe outputs; they cannot run activation patching against a closed model’s intermediate layers. Closing the audit gap at the mechanistic level requires that model developers either run these analyses themselves under third-party oversight, or provide sufficient internal access for evaluators to run them independently.
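
To make concrete why these techniques depend on internal access, here is a minimal sketch of the simplest kind of linear probe: a difference-of-means classifier over hidden-state vectors. The activations below are synthetic stand-ins with a planted direction, and every function name is hypothetical. A real probe would read its vectors from a model's intermediate layers, which is precisely the step an output-only evaluator cannot perform.

```python
import math
import random


def synthetic_activations(n, dim, direction, shift, rng):
    """Stand-in for hidden states captured from one model layer.

    Label-1 vectors are shifted along `direction`, label-0 vectors against it.
    """
    data = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are populated
        sign = 1.0 if label == 1 else -1.0
        vec = [rng.gauss(0.0, 1.0) + sign * shift * d for d in direction]
        data.append((vec, label))
    return data


def fit_mean_difference_probe(train):
    """Fit a linear probe: weights = class-1 mean minus class-0 mean."""
    dim = len(train[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for vec, label in train:
        counts[label] += 1
        for j, x in enumerate(vec):
            sums[label][j] += x
    mean0 = [s / counts[0] for s in sums[0]]
    mean1 = [s / counts[1] for s in sums[1]]
    weights = [a - b for a, b in zip(mean1, mean0)]
    # Place the decision boundary at the midpoint between the class means.
    midpoint = [(a + b) / 2.0 for a, b in zip(mean1, mean0)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def probe_accuracy(weights, bias, data):
    """Fraction of examples whose label the linear readout recovers."""
    correct = 0
    for vec, label in data:
        score = sum(w * x for w, x in zip(weights, vec)) + bias
        correct += (1 if score > 0 else 0) == label
    return correct / len(data)


rng = random.Random(0)
dim = 16
direction = [1.0 / math.sqrt(dim)] * dim  # planted unit-length direction
train = synthetic_activations(400, dim, direction, 2.5, rng)
held_out = synthetic_activations(200, dim, direction, 2.5, rng)

weights, bias = fit_mean_difference_probe(train)
accuracy = probe_accuracy(weights, bias, held_out)
print(f"held-out probe accuracy: {accuracy:.3f}")
```

High probe accuracy shows only that the labeled property is linearly decodable from the activations. It does not show that the model uses that representation, which is the kind of causal question activation patching is meant to address.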

Neither is standard practice. The paper acknowledges that mechanistic interpretability is not yet mature enough to fully close the gap on its own — but the argument is that the correct response to an immature verification technology is to be honest about what current methods cannot certify, not to ratify behavioral proxies as equivalent to the stronger claim.

What Changes in the Engagement Playbook

If you write red team reports that feed compliance processes, the immediate implication is epistemic hygiene in your deliverables. “No vulnerabilities found in scope” and “the system does not have hidden objectives” are not equivalent statements, and conflating them in a report — or allowing a client to conflate them — creates exactly the fragile assurance the paper describes. Scope your claims to what your methodology supports.
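
One way to enforce that scoping is to tag each claim in a deliverable with the kind of statement it makes and the evidence behind it, then flag mismatches before the report leaves your hands. The sketch below uses a hypothetical schema invented for illustration. It is not drawn from the paper or from any reporting standard.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    BEHAVIORAL = "behavioral"    # outputs observed during testing
    MECHANISTIC = "mechanistic"  # analysis of model internals


class ClaimKind(Enum):
    OBSERVED = "observed"  # statement about outputs seen within scope
    LATENT = "latent"      # statement about internals or untested behavior


@dataclass(frozen=True)
class Claim:
    text: str
    kind: ClaimKind
    evidence: frozenset


def overreaching_claims(claims):
    """Return latent-property claims that cite no mechanistic evidence.

    This checks a necessary condition only: passing it does not mean
    the claim is actually supported.
    """
    return [
        c for c in claims
        if c.kind is ClaimKind.LATENT and Evidence.MECHANISTIC not in c.evidence
    ]


report = [
    Claim("No vulnerabilities found in scope",
          ClaimKind.OBSERVED, frozenset({Evidence.BEHAVIORAL})),
    Claim("The system does not have hidden objectives",
          ClaimKind.LATENT, frozenset({Evidence.BEHAVIORAL})),
]

flagged = overreaching_claims(report)
for claim in flagged:
    print(f"OVERREACH: {claim.text!r} rests on behavioral evidence only")
```

The useful part is the discipline of recording, next to every claim, what kind of evidence actually stands behind it. The code itself is incidental.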

For agentic systems specifically, the HackerOne guidance on governance-ready red team reporting correctly flags that agentic taxonomies need to account for tool-call abuse, multi-step goal pursuit, and environmental persistence — behaviors that don’t surface in static prompt-response evaluation. Even well-scoped behavioral coverage of an agent is not the same as characterizing what it does over extended autonomous operation.

The longer-term implication is that AI red teamers should negotiate mechanistic access into the terms of engagement. If a client wants a safety evaluation that actually supports the claims their governance team intends to make, they need to either provide internal model access or commission analyses from the model developer. Behavioral-only engagements cannot certify the absence of hidden objectives. Saying so clearly in your report is both accurate and increasingly necessary.

Sources

  1. Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands (arXiv:2605.15164v1)
  2. How Can We Best Evaluate Agentic AI? — Brookings Institution
  3. AI Red Teaming: Agentic Taxonomies and Governance-Ready Reporting — HackerOne