Multimodal large language models (MLLMs) can improve fine-grained visual understanding by learning from their own stronger performance on evidence-focused crops, using on-policy self-distillation to transfer that regional perception skill to full-image reasoning.
This paper addresses a key weakness in multimodal AI models: they struggle to notice small but important details in images. The researchers found that the same models perform better when shown a crop focused on the relevant region than when shown the full image, which suggests the bottleneck is not recognizing details but locating them.
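To make the idea concrete, here is a minimal sketch of what a self-distillation objective of this kind could look like. It is not the paper's implementation. It assumes the same model is run twice on a response sampled from its own full-image pass: once conditioned on the full image (the student) and once conditioned on the evidence-focused crop (the teacher, treated as a fixed target). The function names, the per-token averaging, and the choice of KL(student || teacher) as the divergence are all illustrative assumptions; the summary above does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position.

    The direction of the divergence is an assumption made for this sketch.
    """
    q = softmax(student_logits)  # full-image pass (receives gradients in training)
    p = softmax(teacher_logits)  # crop pass (treated as a fixed target)
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def self_distillation_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token divergence over one sampled response.

    Both sequences hold one logit vector per token of the same response,
    which is assumed to have been sampled from the full-image pass (on-policy).
    """
    if len(student_logits_seq) != len(teacher_logits_seq):
        raise ValueError("both passes must score the same sampled tokens")
    if not student_logits_seq:
        raise ValueError("the sampled response must contain at least one token")
    kls = [token_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)]
    return sum(kls) / len(kls)
```

Under these assumptions, the loss is zero when the full-image pass already assigns the same token probabilities as the crop pass, and it grows as the two diverge, so minimizing it pulls full-image behavior toward the model's own crop-conditioned behavior.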