You can now formally verify security policies for AI agents that rely on probabilistic components (such as imperfect detectors) and get mathematical guarantees on violation rates, even when you don't know how the components' errors correlate.
This paper presents a framework for verifying that AI agents follow security policies even when they depend on unreliable components, such as PII detectors or classifiers that sometimes fail. Unlike existing approaches that assume perfect detection, this method uses robust optimization to compute guaranteed upper bounds on the policy violation rate, without requiring assumptions about how errors correlate.
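
To make the idea of a bound that holds "without assumptions about how errors correlate" concrete, here is a minimal sketch. It is not the paper's robust-optimization procedure. It uses the classical Fréchet inequalities, which give the worst-case probability of a combination of events when only each event's individual probability is known. The function names and the miss rates are hypothetical, chosen only for illustration.

```python
def worst_case_all_fail(miss_rates):
    """Upper bound on P(every component fails) under ANY dependence.

    Frechet upper bound for an intersection: the smallest marginal.
    """
    return min(miss_rates)


def worst_case_any_fail(miss_rates):
    """Upper bound on P(at least one component fails) under ANY dependence.

    Frechet upper bound for a union: the sum of marginals, capped at 1.
    """
    return min(1.0, sum(miss_rates))


def independent_all_fail(miss_rates):
    """P(every component fails) IF errors were independent (an assumption)."""
    result = 1.0
    for p in miss_rates:
        result *= p
    return result


# Hypothetical miss rates for two redundant PII detectors (illustrative values).
miss_rates = [0.02, 0.05]

# Suppose a violation requires both detectors to miss the same item.
print("assuming independence:", independent_all_fail(miss_rates))
print("worst case, any correlation:", worst_case_all_fail(miss_rates))

# Suppose instead a violation occurs if either of two required checks fails.
print("worst case, either fails:", worst_case_any_fail(miss_rates))
```

In this toy setting the independence assumption gives a much smaller violation rate than the correlation-free bound, which shows why a guarantee that assumes independent errors can be overly optimistic.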