Safety filters can hide incompetent policies. To measure whether a learned controller actually earned its safety, train it to minimize its reliance on the filter, then test it without the filter.
This paper addresses a central problem in safe learning: when a safety filter prevents a policy from violating constraints, constraint satisfaction alone cannot tell us whether the policy learned to act safely or whether the filter is simply correcting its mistakes.
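
To make the distinction concrete, here is a minimal sketch of the two measurements the summary mentions: how often the filter has to intervene, and how often the policy violates the constraint once the filter is removed. It is a toy illustration, not the paper's method or benchmark. The 1-D integrator dynamics, the bound `X_MAX`, the clipping filter, and both example policies are hypothetical choices made only for this example.

```python
from typing import Callable, Tuple

# Hypothetical toy setup (not from the paper): a 1-D integrator
# x_{t+1} = x_t + DT * u with the state constraint |x| <= X_MAX.
X_MAX = 1.0
DT = 0.1
TOL = 1e-9  # tolerance for floating-point comparisons


def step(x: float, u: float) -> float:
    """Advance the toy dynamics by one time step."""
    return x + DT * u


def safety_filter(x: float, u: float) -> float:
    """Clip the proposed action so the next state stays inside the constraint."""
    u_lo = (-X_MAX - x) / DT
    u_hi = (X_MAX - x) / DT
    return min(max(u, u_lo), u_hi)


def rollout(
    policy: Callable[[float], float],
    x0: float,
    horizon: int,
    use_filter: bool,
) -> Tuple[float, float]:
    """Return (intervention_rate, violation_rate) over one rollout."""
    x = x0
    interventions = 0
    violations = 0
    for _ in range(horizon):
        u = policy(x)
        if use_filter:
            u_safe = safety_filter(x, u)
            if abs(u_safe - u) > TOL:
                interventions += 1  # the filter had to change the action
            u = u_safe
        x = step(x, u)
        if abs(x) > X_MAX + TOL:
            violations += 1  # the constraint was violated at this step
    return interventions / horizon, violations / horizon


def reckless(x: float) -> float:
    """Steers toward 1.5, which lies outside the safe set."""
    return 2.0 * (1.5 - x)


def cautious(x: float) -> float:
    """Steers toward 0.5, which lies inside the safe set."""
    return 2.0 * (0.5 - x)


# Both policies look safe while the filter is active...
reckless_filtered = rollout(reckless, x0=0.0, horizon=50, use_filter=True)
cautious_filtered = rollout(cautious, x0=0.0, horizon=50, use_filter=True)
# ...but only one of them stays safe once the filter is removed.
reckless_unfiltered = rollout(reckless, x0=0.0, horizon=50, use_filter=False)
cautious_unfiltered = rollout(cautious, x0=0.0, horizon=50, use_filter=False)
```

With the filter on, both policies record zero violations, so violation counts alone cannot separate them. The intervention rate and the unfiltered rollout do: the reckless policy depends on the filter at many steps and violates the constraint without it, while the cautious policy never triggers the filter and stays inside the bound on its own.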