Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying | June 17, 2026 | arXiv

Key Takeaway

Using structured rubrics as fine-grained feedback during training helps reasoning models learn better than scalar rewards or single reference solutions, because rubrics specify what makes a good response without forcing the model to copy one specific reasoning path.

Summary

This paper proposes a training method for reasoning language models that uses detailed rubrics (scoring criteria) instead of single correct answers or scalar rewards. A teacher model generates token-level feedback based on the rubric, and that feedback guides the student model's own reasoning steps. The resulting learning signal is finer-grained than what traditional distillation or reward-based methods provide.
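The summary does not state the training objective, so the following is only a minimal sketch of one way token-level feedback from a rubric-aware teacher could be turned into a loss. It assumes the feedback is a per-token KL divergence between the teacher's and the student's next-token distributions, evaluated at each position of the student's own sampled response. The function names, the direction of the KL term, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position (assumed direction)."""
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over a response, plus the per-token values.

    teacher_seq: logits from a teacher that sees the rubric (assumption).
    student_seq: logits from the student at the same positions of its own response.
    """
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token

# Toy example: a 3-token vocabulary and a 2-token student response.
teacher = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # hypothetical rubric-conditioned logits
student = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # hypothetical student logits

loss, per_token = distillation_loss(teacher, student)
print(loss, per_token)
```

In this toy case the first position contributes no loss because teacher and student agree, while the second position is penalized where the teacher prefers a different token. That is the sense in which a token-level signal is denser than a single scalar reward for the whole response.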

training reasoning

Key Terms

rubric-generation, token-level-guidance, criterion-level-feedback, self-distillation