Skill-Conditioned Gated Self-Distillation for LLM Reasoning

Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai, Senkang Hu et al. | May 27, 2026 | arXiv

Key Takeaway

You can train reasoning models with imperfect, reusable skills drawn from past experience instead of perfect reference answers. Multiple skill-conditioned teachers vote on whether each skill helps or hurts, and the model learns from the disagreements between those teachers.

Summary

This paper improves how language models learn to reason by using a skill bank (a collection of past problem-solving techniques) as training guidance. Instead of assuming perfect reference answers, the method validates whether each retrieved skill actually helps or hurts on a new problem, then uses the outcome of that validation to guide training.
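The gating idea in the summary can be illustrated with a small sketch. It is not the paper's algorithm: the names `TeacherRollout`, `gate_skills`, and `distillation_weights`, the `margin` parameter, and the rule "a skill helps if its verifier score beats the unconditioned baseline" are all assumptions made for illustration. The sketch only shows how a verifier score could decide which skill-conditioned teachers are allowed to contribute to training.

```python
from dataclasses import dataclass


@dataclass
class TeacherRollout:
    """One skill-conditioned attempt at a problem (hypothetical structure)."""
    skill: str     # text or id of the retrieved skill
    reward: float  # verifier score of the skill-conditioned answer, in [0, 1]


def gate_skills(rollouts, baseline_reward, margin=0.0):
    """Split rollouts into helpful and harmful relative to the no-skill baseline.

    Assumed rule: a skill helps if its reward exceeds the baseline by more
    than `margin`, and hurts if it falls below the baseline by more than
    `margin`. Anything in between is treated as neutral and dropped.
    """
    helpful = [r for r in rollouts if r.reward > baseline_reward + margin]
    harmful = [r for r in rollouts if r.reward < baseline_reward - margin]
    return helpful, harmful


def distillation_weights(rollouts, baseline_reward, margin=0.0):
    """Give each helpful teacher a weight proportional to its improvement.

    Returns an empty dict when no skill passes the gate, so the caller can
    skip distillation on that problem.
    """
    helpful, _ = gate_skills(rollouts, baseline_reward, margin)
    if not helpful:
        return {}
    total = sum(r.reward - baseline_reward for r in helpful)
    return {r.skill: (r.reward - baseline_reward) / total for r in helpful}


if __name__ == "__main__":
    rollouts = [
        TeacherRollout("case-split", 1.0),
        TeacherRollout("guess-and-check", 0.0),
        TeacherRollout("work-backwards", 0.75),
    ]
    # Baseline: the same model answering without any retrieved skill.
    print(distillation_weights(rollouts, baseline_reward=0.5))
```

In this sketch, disagreement between teachers shows up as a mix of helpful and harmful rollouts on the same problem; a real training loop would turn the gated weights into a distillation loss, which is outside what the summary specifies.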

training, reasoning

Key Terms

self-distillation, skill-bank, privileged-information, verifier, on-policy-learning