Splitting sparse episode outcomes into separate success and efficiency signals with state-adaptive weighting, plus intervention-aware credit assignment, enables effective online RL fine-tuning of robot policies from minimal supervision.
This paper addresses a key problem in robot learning: when a pretrained vision-language-action model is fine-tuned through trial and error, each episode yields only a binary success/failure signal, yet the policy update needs feedback at every step.
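To make the gap between an episode-level outcome and per-step feedback concrete, the sketch below shows the simplest baseline: spreading one binary outcome over all steps as a discounted return-to-go. This is a generic illustration, not the paper's method; the function name `discounted_returns_from_outcome` and the discount factor `gamma` are assumptions introduced here.

```python
from typing import List


def discounted_returns_from_outcome(
    num_steps: int, success: bool, gamma: float = 0.99
) -> List[float]:
    """Turn one binary episode outcome into a per-step return-to-go.

    Hypothetical helper for illustration only: the reward is 1.0 on the
    final step of a successful episode and 0.0 everywhere else.
    """
    if num_steps <= 0:
        return []
    terminal_reward = 1.0 if success else 0.0
    # Step t sees the terminal reward discounted by gamma^(T - 1 - t).
    return [
        terminal_reward * gamma ** (num_steps - 1 - t)
        for t in range(num_steps)
    ]


if __name__ == "__main__":
    print(discounted_returns_from_outcome(3, True, gamma=0.5))
    print(discounted_returns_from_outcome(3, False, gamma=0.5))
```

Under this baseline a failed episode assigns zero to every step, and a successful one credits every step by its distance from the end regardless of whether that step helped, which is the kind of uninformative per-step signal the summary describes.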