Using structured rubrics as fine-grained feedback during training helps reasoning models learn more effectively than training on scalar rewards or single reference solutions, because a rubric specifies what makes a response good without forcing the model to copy one specific reasoning path.
This paper proposes a training method for reasoning language models that replaces single correct answers and scalar rewards with detailed rubrics (scoring criteria). A teacher model uses the rubric to generate token-level feedback on the student model's own reasoning steps, which gives the student a finer-grained learning signal than traditional distillation or reward-based methods.
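
As a rough illustration of the idea above, the Python sketch below shows one way rubric-conditioned, token-level feedback could be computed. Everything in it is an assumption made for illustration and should not be read as the paper's actual method: the `Criterion` class, the `build_teacher_context` and `token_feedback` helpers, the prompt layout, and the choice of a teacher-minus-student log-probability difference as the feedback signal are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not taken from the paper)."""
    description: str
    weight: float


def build_teacher_context(question, rubric):
    """Format the question and rubric into the text the teacher is conditioned on.

    The layout is an assumption; the summary only says the teacher's
    feedback is based on rubrics.
    """
    lines = [f"Question: {question}", "Rubric:"]
    lines += [f"- ({c.weight:g}) {c.description}" for c in rubric]
    return "\n".join(lines)


def token_feedback(student_logprobs, teacher_logprobs):
    """Per-token feedback on the student's own sampled tokens.

    Assumed signal: teacher log-prob minus student log-prob for each token.
    A positive value means the rubric-aware teacher finds the token more
    likely than the student did; a negative value means less likely.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("both sequences must score the same tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


# Toy usage with made-up numbers for a two-token student response.
rubric = [
    Criterion("States the final answer explicitly", 1.0),
    Criterion("Justifies each algebraic step", 0.5),
]
context = build_teacher_context("Solve 2x + 3 = 7.", rubric)
feedback = token_feedback([-1.0, -2.0], [-0.5, -2.5])
```

In this sketch the student is scored on tokens it generated itself, so the signal steers its own reasoning path and does not require matching a single reference solution. A real system would obtain both sets of log-probabilities from model forward passes; the lists here are placeholders.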