You can't verify that a general agent works everywhere, but you can certify that it works reliably on specific transitions by examining its internal world model. This makes it practical to deploy capable agents in complex environments.
This paper addresses how to verify that general-purpose AI agents work reliably in complex environments. Instead of checking whether an agent handles every situation perfectly, the authors propose 'structural certification', a method that identifies which specific situations the agent understands well and which it does not.