Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun et al. | May 18, 2026 | arXiv

Key Takeaway

Multimodal LLMs (MLLMs) answer more accurately when shown evidence-focused crops than when shown full images. On-policy self-distillation uses that crop-conditioned behavior as the teaching signal, transferring regional perception skills to full-image reasoning and improving fine-grained visual understanding.

Summary

This paper addresses a key weakness in multimodal AI models: they struggle to notice small but important details in images. The researchers found that models perform better when shown crops focused on the relevant regions than when shown the full image, which suggests the bottleneck is locating details rather than recognizing them.
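
The idea of distilling the crop-conditioned model into its full-image self can be illustrated with a toy token-level loss. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`log_softmax`, `reverse_kl`, `self_distillation_loss`), the choice of reverse KL as the divergence, and the logit values are all hypothetical. In an actual training loop, the student logits would come from the model conditioned on the full image, the teacher logits from the same model conditioned on the evidence-focused crop, and both would be scored on a response sampled by the student (the on-policy part).

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one token's vocabulary logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    # KL(student || teacher) at a single token position.
    # The direction of the divergence is an assumption of this sketch.
    log_q = log_softmax(student_logits)   # student: full-image view
    log_p = log_softmax(teacher_logits)   # teacher: cropped view
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))

def self_distillation_loss(student_seq, teacher_seq):
    # Mean per-token divergence over one student-sampled response.
    if len(student_seq) != len(teacher_seq):
        raise ValueError("student and teacher must score the same tokens")
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)

# Toy logits over a 3-word vocabulary for a 2-token response (made-up values).
student_full_image = [[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]]
teacher_crop = [[0.2, 3.0, 0.1], [0.5, 0.5, 0.5]]

loss = self_distillation_loss(student_full_image, teacher_crop)
```

The loss is zero when the full-image and crop-conditioned distributions agree at every token and grows as they diverge, so minimizing it pushes full-image predictions toward what the model already predicts when it sees the crop.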

multimodal training efficiency

Key Terms

self-distillation, on-policy-learning, fine-grained-visual-details, token-level-divergence, regional-to-global-perception