Reinforcement Learning from Rich Feedback with Distributional DAgger

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad | June 3, 2026 | arXiv

Key Takeaway

Rich feedback signals (execution traces, intermediate corrections, self-evaluations) can improve reasoning-model training more than binary right/wrong rewards. Training with a forward cross-entropy loss provides better credit assignment and stronger theoretical guarantees than reverse-KL approaches.
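For readers unfamiliar with the two objectives being contrasted, the standard textbook definitions are sketched below. The symbols are assumptions introduced here for illustration and may not match the paper's notation: \pi_\theta is the learner policy, p_E is the expert's action distribution, and d is whatever state distribution the training data is drawn from.

```latex
% Forward cross-entropy: expectation over the expert's distribution,
% scored under the learner (assumed notation, not the paper's).
\mathcal{L}_{\mathrm{fwd}}(\theta)
  = \mathbb{E}_{s \sim d}\Big[ -\sum_{a} p_E(a \mid s)\,\log \pi_\theta(a \mid s) \Big]

% Reverse KL: expectation over the learner's own distribution.
\mathcal{L}_{\mathrm{rev}}(\theta)
  = \mathbb{E}_{s \sim d}\Big[ \sum_{a} \pi_\theta(a \mid s)\,
      \log \frac{\pi_\theta(a \mid s)}{p_E(a \mid s)} \Big]
```

For a fixed set of states, the forward cross-entropy differs from the forward KL divergence KL(p_E || \pi_\theta) only by the expert's entropy, which does not depend on \theta, so the two share the same gradient on those states.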

Summary

This paper introduces DistIL, a method for training reasoning models with rich feedback (such as execution traces and expert corrections) rather than only right/wrong labels. It adapts DAgger, a classic imitation learning algorithm, to work with distributional expert knowledge and uses a forward cross-entropy loss to assign credit to earlier decisions.
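The general pattern described above (roll out the learner, have an expert label the visited states with a distribution over actions, then fit the learner with forward cross-entropy on the aggregated data) can be illustrated with a toy tabular sketch. This is not the DistIL implementation: the three-state environment, the `expert_distribution` table, the function names, and the hyperparameters are all hypothetical and chosen only to keep the example self-contained.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expert_distribution(state):
    """Hypothetical distributional expert: a full action distribution per state."""
    table = {0: [0.8, 0.2], 1: [0.3, 0.7], 2: [0.5, 0.5]}
    return table[state]


def rollout(policy_logits, horizon, rng):
    """Sample states by running the *learner's* policy (the DAgger ingredient)."""
    state, visited = 0, []
    for _ in range(horizon):
        visited.append(state)
        probs = softmax(policy_logits[state])
        action = rng.choices(range(len(probs)), weights=probs)[0]
        state = (state + action + 1) % 3  # toy dynamics, assumed for illustration
    return visited


def dagger_forward_ce(iterations=200, horizon=6, lr=0.5, seed=0):
    """DAgger-style loop with a forward cross-entropy loss on expert distributions."""
    rng = random.Random(seed)
    policy_logits = {s: [0.0, 0.0] for s in range(3)}
    dataset = []  # aggregated (state, expert distribution) pairs

    for _ in range(iterations):
        # 1) Collect states under the current learner; 2) relabel with the expert.
        for state in rollout(policy_logits, horizon, rng):
            dataset.append((state, expert_distribution(state)))

        # 3) One gradient step on the mean forward cross-entropy over all data.
        #    For softmax logits, d/dlogit_a of -sum_a p_E(a) log pi(a) is pi(a) - p_E(a).
        grads = {s: [0.0, 0.0] for s in range(3)}
        for state, target in dataset:
            probs = softmax(policy_logits[state])
            for a in range(2):
                grads[state][a] += (probs[a] - target[a]) / len(dataset)
        for s in grads:
            policy_logits[s] = [l - lr * g for l, g in zip(policy_logits[s], grads[s])]

    return {s: softmax(policy_logits[s]) for s in range(3)}


learned = dagger_forward_ce()
```

In this toy setting the learned per-state distributions move toward the expert's table, because minimizing forward cross-entropy on a state is minimized exactly when the learner's distribution matches the expert's there. A reverse-KL variant would instead weight the log-ratio by the learner's own probabilities, which is the contrast the summary draws.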

training reasoning alignment

Key Terms

dagger reinforcement-learning-from-verifiable-rewards forward-kl-divergence execution-trace-feedback credit-assignment