Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

Zhaoyu Zhu, Rui Gao, Shuang Li | May 25, 2026 | arXiv

Key Takeaway

Wasserstein Policy Gradient (WPG) is theoretically sound for continuous control: the Bellman recursion gives the RL objective convergence properties similar to those of convex optimization, even though the problem is not convex.
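For orientation, the kind of "convex-like" property this refers to can be illustrated with the classical Polyak-Lojasiewicz (PL) condition in Euclidean space. This is the textbook form only; the paper's version lives on a space of policies with Wasserstein geometry and is not reproduced in this summary, so the symbols f, mu, and L below are illustrative assumptions rather than the paper's notation.

```latex
% Classical PL inequality for a differentiable f with minimum value f^*:
\frac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr)
\qquad \text{for all } x, \quad \mu > 0.

% If f is also L-smooth, gradient descent with step size 1/L satisfies
f(x_{k}) - f^{*} \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_{0}) - f^{*}\bigr).
```

The inequality does not require convexity, which is why it is a natural tool for nonconvex objectives whose stationary points are all global optima.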

Summary

This paper proves that Wasserstein Policy Gradient (WPG), a reinforcement learning algorithm that updates policies along the geometry of optimal transport, converges globally to an optimal policy in the entropy-regularized setting. The key insight is that, although RL objectives are not convex in the traditional sense, the Bellman equation induces a geometric structure that guarantees convergence.
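To make "moving a policy along optimal transport geometry" concrete, here is a minimal toy sketch. It is not the paper's algorithm. It assumes a single-state problem with a one-dimensional action and the hypothetical reward r(a) = -(a - 1)^2 / 2, and maximizes E[r(a)] + tau * H(pi), where H is the differential entropy. The Wasserstein gradient flow of this objective is known to be realized by Langevin dynamics on action particles, and its maximizer is the Gaussian N(1, tau). All function names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def langevin_particles(grad_reward, tau, n_particles=2000, n_steps=500,
                       step=0.02, seed=0):
    """Euler-Maruyama simulation of da = grad r(a) dt + sqrt(2 tau) dW.

    Each particle is one sampled action; the empirical distribution of the
    particles plays the role of the policy. This is a toy stand-in for a
    Wasserstein gradient flow, not the WPG algorithm from the paper.
    """
    rng = random.Random(seed)
    # Start far from the optimum: actions drawn from N(-2, 1).
    actions = [rng.gauss(-2.0, 1.0) for _ in range(n_particles)]
    noise_scale = math.sqrt(2.0 * tau * step)
    for _ in range(n_steps):
        # Drift transports each action uphill on the reward;
        # the noise term accounts for the entropy bonus.
        actions = [a + step * grad_reward(a) + noise_scale * rng.gauss(0.0, 1.0)
                   for a in actions]
    return actions


def grad_reward(a):
    # Gradient of the assumed reward r(a) = -(a - 1)^2 / 2.
    return -(a - 1.0)


tau = 0.5  # entropy-regularization strength (assumed value)
actions = langevin_particles(grad_reward, tau)
mean = sum(actions) / len(actions)
var = sum((a - mean) ** 2 for a in actions) / len(actions)
print(f"empirical mean={mean:.3f}, variance={var:.3f}, target N(1, {tau})")
```

In this toy the particle cloud drifts from its initial location toward the reward peak and settles with a spread set by tau, which is the qualitative behavior one expects from an entropy-regularized optimum. The multi-state case treated in the paper additionally involves the Bellman recursion, which this sketch omits.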

training

Key Terms

wasserstein-distance, policy-gradient, entropy-regularization, bellman-equation, polyak-lojasiewicz-condition