You can bootstrap reinforcement learning from behavior cloning by extracting Q-values from the cloned policy, then using those values to safely blend cloned and learned actions. This prevents catastrophic forgetting while enabling rapid online improvement on real robots.
This paper presents Q2RL, a method that extracts Q-functions from behavior cloning policies to enable efficient online learning on robots. By switching between cloned and learned actions based on their estimated Q-values, the approach avoids forgetting good behaviors while still improving through real-world interaction, achieving 100% success on complex manipulation tasks within 1-2 hours of training.
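The core decision rule can be sketched in a few lines. The Python snippet below is an illustrative sketch, not the paper's implementation: the names `bc_policy`, `rl_policy`, `q_fn`, and `margin` are hypothetical stand-ins for a cloned policy, an online-learned policy, a Q-function extracted from the BC policy, and a switching threshold.

```python
def select_action(state, bc_policy, rl_policy, q_fn, margin=0.0):
    """Choose between the cloned and the learned action by estimated value.

    Defaulting to the cloned action unless the learned one looks strictly
    better (by `margin`) is what guards demonstrated behavior against
    forgetting while still permitting online improvement.
    (Illustrative sketch only; names are assumptions, not the paper's API.)
    """
    a_bc = bc_policy(state)   # action proposed by the behavior-cloned policy
    a_rl = rl_policy(state)   # action proposed by the online RL policy
    # Score both candidates with the Q-function extracted from the BC policy.
    if q_fn(state, a_rl) > q_fn(state, a_bc) + margin:
        return a_rl           # learned action is estimated to be clearly better
    return a_bc               # otherwise keep the safe cloned behavior
```

A positive `margin` makes the switch conservative; a softer variant could instead blend the two actions with weights derived from their Q-values, matching the "blend" framing above.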