You can accelerate RL training by blending a working baseline policy with a learnable policy and progressively shifting control to the learner. This keeps success rates high early in training while still producing a final policy that outperforms the baseline.
This paper presents a method for training RL policies more efficiently: start from an existing, suboptimal baseline policy and gradually transfer control to a new learning policy. The approach maintains high success rates throughout training and yields a standalone final policy that outperforms the baseline and does not need it at test time.
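The summary does not say how control is shifted, so the following is only a minimal sketch of one possible scheme, not the paper's actual mechanism. It assumes that at each step the learner acts with some probability and the baseline acts otherwise, and that this probability grows linearly over training. The names `learner_share`, `blended_action`, and `total_steps` are hypothetical and chosen for illustration.

```python
import random
from typing import Callable, Sequence

# A policy maps an observation to a discrete action (assumed interface).
Policy = Callable[[Sequence[float]], int]


def learner_share(step: int, total_steps: int) -> float:
    """Fraction of control given to the learner, clipped to [0, 1].

    Assumption: a linear ramp from 0 to 1 over `total_steps`.
    The source does not specify the schedule.
    """
    if total_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / total_steps))


def blended_action(
    obs: Sequence[float],
    baseline: Policy,
    learner: Policy,
    share: float,
    rng: random.Random,
) -> int:
    """Act with the learner with probability `share`, else with the baseline."""
    if rng.random() < share:
        return learner(obs)
    return baseline(obs)
```

Under this sketch, a training loop would call `learner_share(step, total_steps)` at every step and pass the result to `blended_action`. Once the share reaches 1.0 the baseline is never queried, which matches the claim that the final policy runs without the baseline at test time.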