When training models to improve without external feedback, align your critique to the model's actual reasoning steps—this focuses learning on real errors rather than forcing unnecessary changes to correct behavior.
This paper studies how to design feedback for self-distillation in language models. The key finding: step-by-step critiques aligned with the model's reasoning trace work better than binary rewards or reference solutions because they target only the tokens where reasoning actually fails, leaving correct steps unchanged.