Vision-language models can be trained to self-correct more effectively by explicitly grounding their reflection in visual inputs rather than generating purely text-based corrections. This matters especially when models encounter out-of-distribution images.
This paper improves how vision-language models correct their own mistakes by training them to look back at images while reasoning. The authors use reinforcement learning with two key techniques: masking earlier reasoning steps to force the model to recover from errors, and replaying diverse failure scenarios. Their method helps models stay accurate even when given unfamiliar images.
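The two techniques named above can be illustrated with a rough sketch. This is not the authors' implementation: the summary does not specify how masking or replay are realized, so everything here is an assumption made for illustration, including the names `mask_earlier_steps` and `FailureReplayBuffer`, the `[MASKED]` placeholder, the representation of a reasoning trace as a list of strings, and the round-robin sampling used to approximate "diverse" failures.

```python
import random
from collections import defaultdict

MASK = "[MASKED]"  # hypothetical placeholder, not taken from the paper


def mask_earlier_steps(steps, keep_last=1, mask=MASK):
    """Hide all but the last `keep_last` reasoning steps.

    Assumption: the model then has to continue from the remaining
    (possibly wrong) step without leaning on its earlier text.
    """
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    cut = max(len(steps) - keep_last, 0)
    return [mask] * cut + list(steps[cut:])


class FailureReplayBuffer:
    """Stores failed reasoning traces, grouped by a failure label."""

    def __init__(self, seed=0):
        self._by_kind = defaultdict(list)
        self._rng = random.Random(seed)

    def add(self, kind, steps):
        # Copy the trace so later edits by the caller do not affect the buffer.
        self._by_kind[kind].append(list(steps))

    def sample(self, n):
        """Draw n traces with replacement, cycling over failure kinds.

        Cycling is one simple way to keep a batch diverse; the paper may
        use a different criterion.
        """
        kinds = sorted(self._by_kind)
        if not kinds:
            return []
        batch = []
        for i in range(n):
            kind = kinds[i % len(kinds)]
            batch.append((kind, self._rng.choice(self._by_kind[kind])))
        return batch


if __name__ == "__main__":
    buf = FailureReplayBuffer(seed=0)
    buf.add("misread_chart", ["The bar is at 40.", "So the total is 80."])
    buf.add("wrong_object", ["The animal is a cat.", "Cats have stripes here."])
    for kind, trace in buf.sample(2):
        print(kind, mask_earlier_steps(trace, keep_last=1))
```

In a reinforcement learning loop, prompts built from the masked traces would presumably be paired with the original image so that the policy is rewarded for recovering the correct answer; that training step is omitted here because the summary gives no details about it.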