Using detailed, instruction-specific rubrics to score model outputs significantly improves preference-based training for vision tasks, achieving 82.69% on benchmarks versus 75.82% with simpler outcome-based scoring.
This paper introduces rDPO, a method for improving visual AI models by using detailed rubrics (checklists of criteria) to evaluate and rank image responses. Instead of simple yes/no judgments, the approach creates specific evaluation criteria for each image-instruction pair, which helps the model learn finer distinctions in visual reasoning tasks.
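The core idea can be sketched in a few lines. The snippet below is a simplified illustration, not the paper's implementation: all names are hypothetical, and the per-criterion check functions stand in for the judge model (e.g. an LLM/VLM) that would actually grade each rubric item. It shows how per-instruction rubric scores can rank candidate responses into a (chosen, rejected) pair for DPO-style training.

```python
# Hypothetical sketch of rubric-based preference-pair construction.
# In the real rDPO pipeline, a judge model scores each criterion; here
# simple string checks stand in for those judge calls.

def score_with_rubric(response, rubric):
    """Return the weighted fraction of rubric criteria the response satisfies.

    `rubric` is a list of (description, check_fn, weight) tuples, where
    check_fn is a stand-in for a judge-model call on one criterion.
    """
    total = sum(weight for _, _, weight in rubric)
    earned = sum(weight for _, check, weight in rubric if check(response))
    return earned / total

def build_preference_pair(responses, rubric):
    """Rank candidates by rubric score; return (chosen, rejected) for DPO."""
    ranked = sorted(responses,
                    key=lambda r: score_with_rubric(r, rubric),
                    reverse=True)
    return ranked[0], ranked[-1]

# Example rubric for an instruction like "Count the red apples in the image".
rubric = [
    ("states a numeric count", lambda r: any(ch.isdigit() for ch in r), 2.0),
    ("mentions the color red", lambda r: "red" in r.lower(), 1.0),
    ("refers to apples", lambda r: "apple" in r.lower(), 1.0),
]

responses = [
    "There are 3 red apples on the table.",
    "I see some fruit in the picture.",
    "The apples look ripe.",
]

chosen, rejected = build_preference_pair(responses, rubric)
```

Because each criterion is specific to the instruction, partially correct answers receive intermediate scores rather than a flat pass/fail, which is what lets the preference pairs capture finer distinctions than outcome-based scoring.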