When training vision-language models with reinforcement learning, rewarding reasoning steps for being logically consistent with the final answer and visually grounded in the image, rather than rewarding answer accuracy alone, produces better explanations and even improves final answer accuracy.
This paper identifies a critical problem in multimodal models: they achieve high accuracy on visual reasoning tasks, yet the explanations they generate often contradict their own answers and do not match what is actually in the image.
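As a concrete illustration of the kind of training signal described above, here is a minimal, hypothetical sketch of a composite reward that scores a sampled rollout on answer accuracy, consistency between reasoning and answer, and visual grounding. The `Rollout` structure, the string-matching proxies, and the weights are illustrative assumptions, not the paper's actual reward design.

```python
from dataclasses import dataclass, field

# Hypothetical rollout record for one sampled response; field names are assumptions.
@dataclass
class Rollout:
    reasoning: str                      # model-generated reasoning chain
    answer: str                         # model's final answer
    gold_answer: str                    # ground-truth answer
    referenced_objects: list = field(default_factory=list)  # objects the reasoning mentions
    image_objects: list = field(default_factory=list)       # objects actually present in the image

def accuracy_reward(r: Rollout) -> float:
    """1.0 if the final answer matches the ground truth, else 0.0."""
    return float(r.answer.strip().lower() == r.gold_answer.strip().lower())

def consistency_reward(r: Rollout) -> float:
    """Proxy check that the reasoning actually supports the stated answer.
    A real system would use an entailment or judge model; containment is a stand-in."""
    return float(r.answer.strip().lower() in r.reasoning.lower())

def grounding_reward(r: Rollout) -> float:
    """Fraction of objects referenced in the reasoning that are verifiably in the image."""
    if not r.referenced_objects:
        return 0.0
    present = sum(obj in r.image_objects for obj in r.referenced_objects)
    return present / len(r.referenced_objects)

def total_reward(r: Rollout, w_acc: float = 1.0,
                 w_cons: float = 0.5, w_ground: float = 0.5) -> float:
    """Weighted sum used as the scalar reward for policy-gradient updates."""
    return (w_acc * accuracy_reward(r)
            + w_cons * consistency_reward(r)
            + w_ground * grounding_reward(r))
```

Under this kind of shaping, a rollout that reaches the right answer through ungrounded or self-contradictory reasoning receives a lower reward than one whose explanation actually supports the answer, which is the behavior the takeaway above attributes to the training setup.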