Rich feedback signals (execution traces, intermediate corrections, self-evaluations) can improve reasoning model training more than binary right/wrong rewards, and forward cross-entropy loss provides better credit assignment and theoretical guarantees than reverse KL approaches.
This paper introduces DistIL, a method for training reasoning models on rich feedback, such as execution traces and expert corrections, rather than binary right/wrong labels. It adapts DAgger, a classic imitation learning algorithm, to distributional expert knowledge and uses a forward cross-entropy loss to assign credit to earlier decisions.
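
To make the contrast between the two objectives concrete, the sketch below compares them on a single decision point. It assumes the textbook definitions: forward cross-entropy as H(expert, learner) and reverse KL as KL(learner || expert). The distributions, function names, and the `eps` clipping are hypothetical and chosen only for illustration; this is not DistIL's implementation, and it shows only the standard mass-covering versus mode-seeking behavior, not the paper's credit-assignment argument or its guarantees.

```python
import math


def forward_cross_entropy(expert, learner, eps=1e-12):
    """H(expert, learner) = -sum_a expert(a) * log learner(a)."""
    # eps clipping (an assumption) keeps log() finite when the learner assigns 0.
    return -sum(p * math.log(max(q, eps)) for p, q in zip(expert, learner))


def reverse_kl(expert, learner, eps=1e-12):
    """KL(learner || expert) = sum_a learner(a) * log(learner(a) / expert(a))."""
    # Terms with learner(a) == 0 contribute nothing, by the 0 * log 0 = 0 convention.
    return sum(
        q * math.log(q / max(p, eps))
        for p, q in zip(expert, learner)
        if q > 0
    )


# Hypothetical expert distribution over three candidate next steps:
# two steps are equally acceptable, the third is never chosen.
expert = [0.5, 0.5, 0.0]

# Two hypothetical learners evaluated at the same decision point.
mode_seeking = [1.0, 0.0, 0.0]     # commits to one acceptable step
covering = [0.45, 0.45, 0.10]      # spreads mass over both, with some leakage

for name, learner in [("mode_seeking", mode_seeking), ("covering", covering)]:
    fce = forward_cross_entropy(expert, learner)
    rkl = reverse_kl(expert, learner)
    print(f"{name}: forward CE = {fce:.3f}, reverse KL = {rkl:.3f}")
```

Under these assumed distributions, forward cross-entropy heavily penalizes the learner that drops an expert-supported step, while reverse KL instead penalizes the learner that places mass where the expert has none. In a DAgger-style loop, a loss of this kind would be evaluated at states reached by the learner's own rollouts, with the expert supplying the target distribution at each one.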