Autonomous policy improvement requires more than winning individual tasks: agents must discover task-specific mechanisms and efficiently convert feedback into parameter updates under constrained budgets.
EvoPolicyGym is a benchmark for evaluating how AI agents autonomously improve executable policies through iterative editing and feedback.
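To make the idea of budget-constrained iterative improvement concrete, here is a minimal sketch of an edit-and-evaluate loop. It is not EvoPolicyGym's interface or scoring method: `improve_policy`, `toy_feedback`, the Gaussian perturbation, and the greedy acceptance rule are all hypothetical choices made for illustration. The only assumptions carried over from the description above are that a policy is edited repeatedly, each edit is scored by feedback, and the number of evaluations is limited.

```python
import random
from typing import Callable, List, Sequence, Tuple


def improve_policy(
    params: Sequence[float],
    evaluate: Callable[[Sequence[float]], float],
    budget: int,
    step: float = 0.5,
    seed: int = 0,
) -> Tuple[List[float], float, List[float]]:
    """Greedy hill climbing in which every evaluation costs one unit of budget.

    Hypothetical illustration only; the benchmark's real agents edit
    executable policies, not just a flat parameter vector.
    """
    if budget < 1:
        raise ValueError("budget must allow at least one evaluation")

    rng = random.Random(seed)  # seeded so runs are reproducible
    best = list(params)
    best_score = evaluate(best)  # scoring the starting policy uses budget too
    used = 1
    history = [best_score]  # best score seen after each evaluation

    while used < budget:
        # Propose an edit: perturb every parameter with Gaussian noise.
        candidate = [p + rng.gauss(0.0, step) for p in best]
        score = evaluate(candidate)
        used += 1
        # Keep the edit only if the feedback improved.
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)

    return best, best_score, history


def toy_feedback(params: Sequence[float]) -> float:
    """Hypothetical feedback signal: highest (0.0) at the target (1.0, -2.0)."""
    target = (1.0, -2.0)
    return -sum((p - t) ** 2 for p, t in zip(params, target))


best, score, history = improve_policy([0.0, 0.0], toy_feedback, budget=50)
```

Because `history` records the best score after each evaluation, it traces how quickly feedback is converted into improvement within the budget, which is a different question from whether the final policy solves the task.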