Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Liyan Tang, Fangcong Yin, Greg Durrett | July 2, 2026 | arXiv

Key Takeaway

Vision-language models can be trained to self-correct more effectively by explicitly grounding their reflection in visual inputs, rather than just generating text-based corrections. This matters especially when models encounter out-of-distribution images.

Summary

This paper improves how vision-language models correct their own mistakes by training them to look back at images while reasoning. The authors use reinforcement learning with two key techniques: masking earlier reasoning steps to force the model to recover from errors, and replaying diverse failure scenarios. Their method helps models stay accurate even when given unfamiliar images.
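The two techniques named above can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that "masking" means excluding earlier reasoning steps from the training update (the summary does not say whether the mask applies to the loss, the attention, or the context), and that "replaying diverse failure scenarios" can be approximated by sampling stored failures from distinct images. All names here (FailedRollout, FailureReplayBuffer, loss_mask, image_id) are hypothetical.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class FailedRollout:
    image_id: str      # hypothetical key identifying the input image
    steps: List[str]   # reasoning steps that ended in a wrong answer


class FailureReplayBuffer:
    """Toy buffer that stores failed rollouts grouped by image (an assumption)."""

    def __init__(self, per_image_cap: int = 4, seed: int = 0):
        self.per_image_cap = per_image_cap
        self.by_image = defaultdict(list)
        self.rng = random.Random(seed)

    def add(self, rollout: FailedRollout) -> None:
        bucket = self.by_image[rollout.image_id]
        bucket.append(rollout)
        # Drop the oldest failure once an image exceeds its cap.
        if len(bucket) > self.per_image_cap:
            bucket.pop(0)

    def sample(self, k: int) -> List[FailedRollout]:
        # Pick at most one rollout per distinct image so the batch
        # covers varied failures instead of repeating one image.
        image_ids = list(self.by_image)
        self.rng.shuffle(image_ids)
        return [self.rng.choice(self.by_image[i]) for i in image_ids[:k]]


def loss_mask(num_prefix_steps: int, num_new_steps: int) -> List[int]:
    # 0 = earlier (possibly flawed) step, excluded from the update.
    # 1 = newly generated recovery step, included in the update.
    return [0] * num_prefix_steps + [1] * num_new_steps
```

In this sketch, a training step would draw failed rollouts from the buffer, let the model continue from each flawed prefix, and apply the update only to positions where the mask is 1, so credit goes to the recovery rather than to the original mistake.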

reasoning training multimodal

Key Terms

chain-of-thought, self-reflection, reinforcement-learning, out-of-distribution, experience-replay-buffer