PGP provides global convergence guarantees for constrained exploration by reformulating the problem with penalty regularization, yielding a single deployable policy that satisfies the constraints, rather than one that satisfies them only on average over a mixture of policies.
This paper addresses a hard problem in reinforcement learning: how to explore efficiently while respecting constraints such as safety requirements or resource budgets. The authors propose PGP, a method that uses penalty-based regularization to enforce the constraints while maximizing exploration entropy.
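The paper's exact objective is not reproduced here, but the general idea of penalty-regularized entropy maximization can be sketched in a few lines. The example below is purely illustrative and is not PGP itself: it maximizes the entropy of a tabular softmax policy while penalizing violations of an expected-cost budget, using analytic gradients. All quantities (the cost vector `c`, budget `d`, and penalty weight `lam`) are made-up assumptions.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Illustrative setup (not from the paper): 4 actions with per-action
# costs, an expected-cost budget d, and a penalty weight lam.
c = np.array([0.0, 0.1, 0.2, 1.0])  # hypothetical action costs
d = 0.2                              # constraint budget on expected cost
lam = 10.0                           # penalty weight
theta = np.zeros(4)                  # policy logits

for _ in range(2000):
    pi = softmax(theta)
    H = -np.sum(pi * np.log(pi))
    # Gradient of entropy w.r.t. the logits: dH/dtheta_i = -pi_i (log pi_i + H)
    grad = -pi * (np.log(pi) + H)
    # Penalty gradient is applied only while the budget is violated,
    # i.e. ascent on  H - lam * max(0, pi.c - d)
    if pi @ c > d:
        grad -= lam * pi * (c - pi @ c)
    theta += 0.1 * grad

pi = softmax(theta)
```

Gradient ascent drives the policy toward the maximum-entropy distribution on the boundary of the cost constraint: mass shifts away from the expensive action until the expected cost sits near the budget, while entropy keeps the remaining actions close to uniform.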