QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin et al. | June 2, 2026 | arXiv

Key Takeaway

Co-designing queries and rubrics, rather than optimizing rubrics alone, solves a key bottleneck in rubric-based RL: vague queries lead to unusable rubrics, while overly narrow queries create unverifiable references that block learning.

Summary

QUBRIC co-designs queries and rubrics to enable reinforcement learning on tasks without verifiable rewards. The method has three stages: it transforms open-ended questions into scenario-based queries grounded in teacher insights, generates contrastive rubrics for those queries, and filters the resulting query-rubric pairs for learnability. It achieves a 5.5-point gain on ArenaHard and transfers to legal, moral, and narrative reasoning tasks.
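To make the learnability idea concrete, here is a minimal sketch of one way such a filter could work. It is not the paper's implementation: the summary does not state the filtering criterion, so the rule used here (keep a query only if rubric scores vary across sampled responses) is an assumption, as are the names `rubric_score`, `group_advantages`, `is_learnable`, and the threshold `min_std`. The motivation comes from GRPO-style group normalization, where a group of responses with identical rewards yields zero advantage for every response and so contributes no gradient signal.

```python
from statistics import mean, pstdev


def rubric_score(checks: list[bool], weights: list[float]) -> float:
    """Weighted fraction of rubric criteria that a response satisfies.

    Hypothetical scoring rule: the summary does not say how rubric
    judgments are aggregated into a scalar reward.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(w for ok, w in zip(checks, weights) if ok) / total


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-normalized advantages in the style of GRPO:
    subtract the group mean and divide by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_learnable(rewards: list[float], min_std: float = 0.05) -> bool:
    """Assumed filter: keep a query only if its sampled responses receive
    sufficiently different rubric scores. `min_std` is an illustrative
    threshold, not a value from the paper."""
    return pstdev(rewards) >= min_std


# A query where every sampled response scores the same gives no signal.
flat = [0.5, 0.5, 0.5, 0.5]
# A query where the rubric separates responses is kept.
spread = [0.0, 1.0, 0.5, 0.75]

kept = [name for name, r in (("flat", flat), ("spread", spread)) if is_learnable(r)]
```

Under these assumptions, a query that is too vague (every response passes the rubric) or too narrow (every response fails it) would be dropped, because both cases collapse the reward spread within a group.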

training reasoning evaluation

Key Terms

rubric-generation grpo reinforcement-learning-from-verifiable-rewards learnability-filtering contrastive-rubric-generation