Combining RL with selective token-level distillation through a gating mechanism significantly improves LLM agent performance on complex tasks, achieving 7-10% gains over standard RL approaches while avoiding training instability.
This paper improves how language model agents learn through reinforcement learning by combining trajectory-level rewards with dense token-level guidance from a teacher. The key innovation is a gating mechanism that uses teacher signals selectively: it strengthens learning from good decisions and softly ignores bad teacher suggestions, which makes multi-turn agent training more stable and effective.
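To make the idea of gating concrete, here is a minimal NumPy sketch of one way a trajectory-level RL term could be combined with a gated token-level distillation term. It is not the paper's formulation. The gate (a sigmoid of the advantage times the teacher-student log-probability gap on the sampled token), the forward-KL distillation term, and the names `gate_weights`, `combined_loss`, `beta`, and `temperature` are all assumptions chosen for illustration.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def gate_weights(advantage, teacher_logp_a, student_logp_a, temperature=1.0):
    # Hypothetical gate, not taken from the paper.
    # It opens (> 0.5) when the teacher's view of the sampled token agrees with
    # the trajectory outcome, and closes (< 0.5) when the teacher disagrees,
    # so disagreeing teacher signals are down-weighted rather than hard-masked.
    agreement = advantage * (teacher_logp_a - student_logp_a)
    return 1.0 / (1.0 + np.exp(-agreement / temperature))

def combined_loss(student_logits, teacher_logits, actions, advantage,
                  beta=0.5, temperature=1.0):
    # student_logits, teacher_logits: arrays of shape [T, V]
    # actions: integer array of shape [T] with the sampled token ids
    # advantage: scalar trajectory-level advantage (assumed precomputed)
    student_logp = log_softmax(student_logits)
    teacher_logp = log_softmax(teacher_logits)
    steps = np.arange(len(actions))
    student_logp_a = student_logp[steps, actions]
    teacher_logp_a = teacher_logp[steps, actions]

    # REINFORCE-style surrogate: one trajectory-level signal shared by all tokens.
    rl_term = -advantage * student_logp_a

    # Dense token-level guidance: forward KL(teacher || student) per position.
    kl_term = (np.exp(teacher_logp) * (teacher_logp - student_logp)).sum(axis=-1)

    # In an autodiff framework the gate would normally be treated as a constant
    # (stop-gradient); NumPy has no gradients, so this only computes the value.
    gate = gate_weights(advantage, teacher_logp_a, student_logp_a, temperature)
    loss = float(np.mean(rl_term + beta * gate * kl_term))
    return loss, gate

# Toy example: uniform student, teacher favors token 0 at step 0 and token 1 at step 1.
student_logits = np.zeros((2, 3))
teacher_logits = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
actions = np.array([0, 0])  # the agent sampled token 0 at both steps
loss, gate = combined_loss(student_logits, teacher_logits, actions, advantage=1.0)
print("gate per token:", gate)
print("loss:", loss)
```

In this toy case the trajectory has a positive advantage. At the first step the teacher assigns the sampled token more probability than the student does, so the gate is above 0.5 and the distillation term counts more. At the second step the teacher prefers a different token, so the gate falls below 0.5 and its signal is softly discounted.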