Behavioral evaluations alone cannot verify the safety claims regulators now demand—you need mechanistic evidence like activation analysis to actually verify what's happening inside AI models, not just what they output.
This paper argues that current AI safety evaluation methods (like red-teaming and behavioral testing) cannot verify the deep safety properties that AI governance frameworks now require, such as absence of hidden objectives or resistance to loss-of-control.