Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

Pratinav Seth, Vinay Kumar Sankarapu | May 14, 2026 | arXiv

Key Takeaway

Behavioral evaluations alone cannot verify the safety claims regulators now demand. Confirming what happens inside an AI model, rather than only what it outputs, requires mechanistic evidence such as activation analysis.

Summary

This paper argues that current AI safety evaluation methods, such as red-teaming and behavioral testing, cannot verify the deep safety properties that AI governance frameworks now require, such as the absence of hidden objectives or resistance to loss of control.
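
To make the contrast between output-level testing and mechanistic evidence concrete, here is a minimal sketch of a linear probe, one of the techniques listed under Key Terms. It is not taken from the paper: the activations are synthetic, and the function names (`make_synthetic_activations`, `fit_linear_probe`, `probe_accuracy`) and all parameters are hypothetical, chosen only for illustration. The probe is a plain logistic regression trained to read a binary latent property out of hidden-layer vectors.

```python
import numpy as np


def make_synthetic_activations(n=400, dim=16, seed=0):
    """Hypothetical stand-in for hidden-layer activations (not real model data).

    A binary latent property shifts each activation vector along one fixed
    direction; everything else is Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    noise = rng.normal(size=(n, dim))
    # Map labels {0, 1} to signs {-1, +1} and shift along the direction.
    acts = noise + 3.0 * np.outer(labels * 2 - 1, direction)
    return acts, labels


def fit_linear_probe(acts, labels, lr=0.1, steps=500):
    """Train a logistic-regression probe with plain gradient descent."""
    n, dim = acts.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(acts, labels, w, b):
    """Fraction of examples whose latent property the probe recovers."""
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())


acts, labels = make_synthetic_activations()
train, test = slice(0, 300), slice(300, 400)
w, b = fit_linear_probe(acts[train], labels[train])
accuracy = probe_accuracy(acts[test], labels[test], w, b)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy would indicate only that the property is linearly decodable from the representation, not that the model acts on it. Establishing that typically calls for a causal intervention such as activation patching.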

safety evaluation alignment

Key Terms

mechanistic-interpretability activation-patching linear-probes red-teaming latent-representations