DemoPSD: Disagreement-Modulated Policy Self-Distillation

Yunhe Li, Hao Shi, Wenhao Liu, Mengzhe Ruan, Hanxu Hou et al. | July 2, 2026 | arXiv

Key Takeaway

When training reasoning models through self-distillation, selectively adopting teacher guidance based on distribution disagreement prevents information leakage and maintains exploration better than forcing the student to match the teacher exactly.

Summary

DemoPSD improves how LLMs learn to reason by fixing a key problem with standard self-distillation: the teacher model's guidance can leak information the student won't have at test time, hurting generalization.
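
To make the idea of disagreement-modulated, token-level supervision concrete, the following is a minimal sketch and not the paper's actual objective. It computes a per-token reverse KL divergence, KL(student || teacher), and scales each token's term by a gate that shrinks as the two distributions disagree. The gate form `exp(-kl / tau)`, the temperature `tau`, the direction of the modulation, and all function names are assumptions made purely for illustration; the summary does not specify them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """Reverse KL, KL(student || teacher), for one token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )

def disagreement_weight(kl, tau=1.0):
    """Hypothetical gate (an assumption, not from the source):
    close to 1 when student and teacher agree, decaying toward 0
    as the per-token divergence grows."""
    return math.exp(-kl / tau)

def modulated_distillation_loss(student_logits, teacher_logits, tau=1.0):
    """Average of gate-weighted per-token reverse KL terms.

    Both arguments are lists of per-token logit lists of equal shape.
    Returns the scalar loss and the per-token gate values.
    """
    total = 0.0
    weights = []
    for s_logits, t_logits in zip(student_logits, teacher_logits):
        kl = reverse_kl(softmax(s_logits), softmax(t_logits))
        w = disagreement_weight(kl, tau)
        weights.append(w)
        total += w * kl
    return total / len(student_logits), weights

# Illustrative inputs: token 0 agrees, token 1 disagrees strongly.
student = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
loss, gates = modulated_distillation_loss(student, teacher)
```

In this sketch the agreeing token keeps a gate of 1 while the disagreeing token is down-weighted, so the student is pulled less strongly toward teacher predictions that may rest on privileged information. In an autodiff framework the gate would likely be treated as a constant during backpropagation, though that too is an assumption here.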

training reasoning alignment

Key Terms

on-policy-learning self-distillation privileged-information reverse-kl-divergence token-level-supervision