You can bootstrap reinforcement learning from behavior cloning by extracting Q-values from the cloned policy, then using those values to safely blend cloned and learned actions. This prevents catastrophic forgetting while enabling rapid online improvement on real robots.
This paper presents Q2RL, a method that extracts Q-functions from behavior cloning policies to enable efficient online learning on robots. By switching between cloned and learned actions based on their estimated Q-values, the approach avoids forgetting good behaviors while still improving through real-world interaction, achieving 100% success on complex manipulation tasks within 1-2 hours of training.
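The core decision rule can be sketched in a few lines. The Python snippet below is an illustrative sketch, not the paper's implementation: the names `bc_policy`, `rl_policy`, `q_fn`, and `margin` are hypothetical stand-ins for a cloned policy, an online-learned policy, a Q-function extracted from the BC policy, and a switching threshold.

```python
def select_action(state, bc_policy, rl_policy, q_fn, margin=0.0):
    """Choose between the cloned and the learned action by estimated value.

    Defaulting to the cloned action unless the learned one looks strictly
    better (by `margin`) is what guards demonstrated behavior against
    forgetting while still permitting online improvement.
    (Illustrative sketch only; names are assumptions, not the paper's API.)
    """
    a_bc = bc_policy(state)   # action proposed by the behavior-cloned policy
    a_rl = rl_policy(state)   # action proposed by the online RL policy
    # Score both candidates with the Q-function extracted from the BC policy.
    if q_fn(state, a_rl) > q_fn(state, a_bc) + margin:
        return a_rl           # learned action is estimated to be clearly better
    return a_bc               # otherwise keep the safe cloned behavior
```

A positive `margin` makes the switch conservative; a softer variant could instead blend the two actions with weights derived from their Q-values, matching the "blend" framing above.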