Self-Distilled Agentic Reinforcement Learning

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu et al. | May 14, 2026 | arXiv

Key Takeaway

Combining RL with selective token-level distillation through a gating mechanism significantly improves LLM agent performance on complex tasks, achieving 7-10% gains over standard RL approaches while avoiding training instability.

Summary

This paper improves how language model agents learn through reinforcement learning by combining trajectory-level rewards with dense token-level guidance. The key innovation is a gating mechanism that uses teacher signals selectively: it strengthens learning from good decisions and softly ignores bad teacher suggestions, which makes multi-turn agent training more stable and effective.
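To make the gating idea concrete, here is a minimal sketch of how a trajectory-level advantage could be combined with a gated token-level teacher signal. It is not the paper's formula. The function name `gated_token_weights`, the parameters `beta` and `tau`, and the exponential form of the gate are all assumptions chosen for illustration. The sketch assumes the teacher and student each provide a log-probability for every sampled token.

```python
import math


def gated_token_weights(student_logps, teacher_logps, advantage, beta=0.5, tau=1.0):
    """Per-token training weights: trajectory advantage plus a gated teacher term.

    Hypothetical illustration only; the gate shape and hyperparameters are
    assumptions, not taken from the paper.
    """
    weights = []
    for s_lp, t_lp in zip(student_logps, teacher_logps):
        # Positive gap: the teacher rates the sampled token higher than the student does.
        gap = t_lp - s_lp
        if gap >= 0:
            gate = 1.0  # teacher endorses the token: pass the signal through in full
        else:
            # Teacher disagrees: shrink the signal smoothly instead of cutting it off.
            gate = math.exp(gap / tau)
        weights.append(advantage + beta * gate * gap)
    return weights


weights = gated_token_weights(
    student_logps=[-1.0, -1.0],
    teacher_logps=[-0.5, -3.0],
    advantage=1.0,
)
```

In this sketch the first token, which the teacher endorses, receives a weight above the bare advantage. The second token, which the teacher rates much lower, is only mildly penalized because the gate attenuates the negative signal. Without the gate, that second token's weight would drop to zero.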

agents training

Key Terms

agentic-reinforcement-learning self-distillation token-level-reward gating-mechanism privileged-context