PGP provides global convergence guarantees for constrained exploration by reformulating the problem with penalty regularization, yielding a single deployable policy that satisfies the constraints, rather than one that satisfies them only on average over a mixture of policies.
This paper addresses a hard problem in reinforcement learning: how to explore efficiently while respecting constraints such as safety requirements or resource budgets. The authors propose PGP, a method that uses penalty-based regularization to enforce the constraints while maximizing exploration entropy.
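The paper's exact objective is not reproduced here, but the general idea of penalty-regularized entropy maximization can be sketched in a few lines. The example below is purely illustrative and is not PGP itself: it maximizes the entropy of a tabular softmax policy while penalizing violations of an expected-cost budget, using analytic gradients. All quantities (the cost vector `c`, budget `d`, and penalty weight `lam`) are made-up assumptions.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Illustrative setup (not from the paper): 4 actions with per-action
# costs, an expected-cost budget d, and a penalty weight lam.
c = np.array([0.0, 0.1, 0.2, 1.0])  # hypothetical action costs
d = 0.2                              # constraint budget on expected cost
lam = 10.0                           # penalty weight
theta = np.zeros(4)                  # policy logits

for _ in range(2000):
    pi = softmax(theta)
    H = -np.sum(pi * np.log(pi))
    # Gradient of entropy w.r.t. the logits: dH/dtheta_i = -pi_i (log pi_i + H)
    grad = -pi * (np.log(pi) + H)
    # Penalty gradient is applied only while the budget is violated,
    # i.e. ascent on  H - lam * max(0, pi.c - d)
    if pi @ c > d:
        grad -= lam * pi * (c - pi @ c)
    theta += 0.1 * grad

pi = softmax(theta)
```

Gradient ascent drives the policy toward the maximum-entropy distribution on the boundary of the cost constraint: mass shifts away from the expensive action until the expected cost sits near the budget, while entropy keeps the remaining actions close to uniform.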