Structured, multi-criterion rewards grounded in real documents help models develop generalizable reasoning skills that transfer to unseen tasks better than single holistic scores.
This paper shows how to train AI models to reason better by grading their responses on multiple specific criteria instead of just right or wrong. The researchers built detailed rubrics from scientific documents and used the resulting per-criterion scores, which award partial credit across different dimensions, as the reward signal for training a language model with GRPO (Group Relative Policy Optimization).
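To make the idea of a multi-criterion reward concrete, here is a minimal Python sketch of how rubric scores might be collapsed into a scalar reward and then normalized within a group of sampled responses, which is the core of GRPO's advantage estimate. The criterion names, the weights, and the helper names `rubric_reward` and `group_relative_advantages` are illustrative assumptions, not details taken from the paper, and the weighted average is only one plausible way to aggregate rubric scores.

```python
from statistics import mean, pstdev

# Hypothetical rubric: criterion names and weights are assumptions for illustration.
WEIGHTS = {"accuracy": 2.0, "evidence": 1.0, "clarity": 1.0}


def rubric_reward(scores, weights):
    """Weighted average of per-criterion scores, each expected in [0, 1]."""
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total_weight


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is rewarded for beating its siblings, not for its raw score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to one prompt, graded per criterion (made-up scores).
graded = [
    {"accuracy": 1.0, "evidence": 0.5, "clarity": 0.0},
    {"accuracy": 1.0, "evidence": 1.0, "clarity": 1.0},
    {"accuracy": 0.0, "evidence": 0.5, "clarity": 0.5},
]
rewards = [rubric_reward(g, WEIGHTS) for g in graded]
advantages = group_relative_advantages(rewards)
```

Because each criterion contributes its own fraction of the reward, a response that is accurate but poorly supported still scores above one that fails every criterion, which a right/wrong grade cannot express.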