When training reasoning models through self-distillation, adopting teacher guidance selectively, based on distribution disagreement, prevents information leakage and preserves exploration better than forcing the student to match the teacher exactly.
DemoPSD improves how LLMs learn to reason by fixing a key problem with standard self-distillation: the teacher model's guidance can leak information the student won't have at test time, hurting generalization.