Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe, Scott Pakin, Dan O'Malley | May 8, 2026 | arXiv

Key Takeaway

Structured, multi-criterion rewards grounded in real documents help models develop reasoning skills that transfer to unseen tasks better than training on a single holistic score does.

Summary

This paper shows how to train language models to reason better by grading their responses against multiple specific criteria rather than as simply right or wrong. The researchers built detailed rubrics from scientific documents and used them to train a language model with GRPO (Group Relative Policy Optimization), using rewards that give partial credit across the rubric's dimensions.
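As a rough illustration of the idea, the sketch below combines per-criterion judge scores into one partial-credit reward and then computes group-relative advantages of the kind GRPO uses. The criterion names, weights, and scores are hypothetical and are not taken from the paper. The weighted-average aggregation and the use of the population standard deviation are also assumptions, and the paper's actual reward design may differ.

```python
from statistics import mean, pstdev

# Hypothetical rubric: criterion name -> weight (illustrative, not from the paper).
RUBRIC_WEIGHTS = {
    "factual_grounding": 0.4,
    "reasoning_steps": 0.4,
    "clarity": 0.2,
}


def rubric_reward(scores, weights=RUBRIC_WEIGHTS):
    """Weighted average of per-criterion judge scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up judge scores for three sampled responses to a single prompt.
group_scores = [
    {"factual_grounding": 1.0, "reasoning_steps": 0.5, "clarity": 1.0},
    {"factual_grounding": 0.5, "reasoning_steps": 0.5, "clarity": 0.5},
    {"factual_grounding": 0.0, "reasoning_steps": 0.0, "clarity": 0.5},
]

rewards = [rubric_reward(s) for s in group_scores]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

A response that satisfies only some criteria still earns a nonzero reward, so the policy update can tell a partly correct answer from one that fails on every criterion. A single right/wrong signal cannot make that distinction.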

training reasoning evaluation

Key Terms

rubric grpo partial-credit-optimization llm-judge document-grounding