An Agency-Transferring Model-Free Policy Enhancement Technique

Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko | June 8, 2026 | arXiv

Key Takeaway

You can accelerate RL training by blending a working baseline policy with a learnable policy and progressively shifting control to the learner. This keeps success rates high early in training while still producing a better final policy.

Summary

This paper presents a method for training RL policies more efficiently by starting with an existing suboptimal policy and gradually transferring control to a new learning policy. The approach maintains high success rates throughout training and produces a final standalone policy that outperforms the baseline, without requiring the baseline at test time.
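
To make the idea of gradual control transfer concrete, here is a minimal Python sketch. It is not the paper's algorithm: the summary does not say how the two policies are blended, so this sketch assumes a per-step random switch whose learner share grows linearly over training. The names `agency_schedule`, `blended_action`, and `learner_share` are hypothetical, and the policy-gradient update of the learner is omitted.

```python
import random


def agency_schedule(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: fraction of control given to the learner.

    Assumption: control moves from the baseline (0.0) to the learner (1.0)
    linearly over training. The paper may use a different rule.
    """
    if total_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / total_steps))


def blended_action(state, baseline_policy, learner_policy, learner_share, rng):
    """Return (action, source): the learner acts with probability
    learner_share, otherwise the baseline acts."""
    if rng.random() < learner_share:
        return learner_policy(state), "learner"
    return baseline_policy(state), "baseline"


if __name__ == "__main__":
    rng = random.Random(0)

    # Toy stand-ins for real policies (assumptions for illustration only).
    def baseline(state):
        return -0.5 * state  # suboptimal but working controller

    def learner(state):
        return -1.0 * state  # policy being trained

    total_steps = 1000
    for step in (0, 250, 500, 750, 1000):
        share = agency_schedule(step, total_steps)
        sources = [blended_action(1.0, baseline, learner, share, rng)[1]
                   for _ in range(200)]
        frac = sources.count("learner") / len(sources)
        print(f"step={step:4d} learner_share={share:.2f} observed={frac:.2f}")
```

Once the learner share reaches 1.0, the baseline is never queried, which matches the summary's claim that the final policy runs standalone at test time.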

training efficiency reasoning

Key Terms

policy-gradient, baseline-policy, goal-reaching-probability, agency-transfer