When training vision-language models with reinforcement learning, rewarding reasoning steps for being logically consistent with the final answer and visually grounded in the image, rather than rewarding answer accuracy alone, produces better explanations and even improves final answer accuracy.
This paper identifies a critical problem in multimodal models: they achieve high accuracy on visual reasoning tasks, yet the explanations they generate often contradict their own answers and do not match what is actually in the image.
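As a concrete illustration of the kind of training signal described above, here is a minimal, hypothetical sketch of a composite reward that scores a sampled rollout on answer accuracy, consistency between reasoning and answer, and visual grounding. The `Rollout` structure, the string-matching proxies, and the weights are illustrative assumptions, not the paper's actual reward design.

```python
from dataclasses import dataclass, field

# Hypothetical rollout record for one sampled response; field names are assumptions.
@dataclass
class Rollout:
    reasoning: str                      # model-generated reasoning chain
    answer: str                         # model's final answer
    gold_answer: str                    # ground-truth answer
    referenced_objects: list = field(default_factory=list)  # objects the reasoning mentions
    image_objects: list = field(default_factory=list)       # objects actually present in the image

def accuracy_reward(r: Rollout) -> float:
    """1.0 if the final answer matches the ground truth, else 0.0."""
    return float(r.answer.strip().lower() == r.gold_answer.strip().lower())

def consistency_reward(r: Rollout) -> float:
    """Proxy check that the reasoning actually supports the stated answer.
    A real system would use an entailment or judge model; containment is a stand-in."""
    return float(r.answer.strip().lower() in r.reasoning.lower())

def grounding_reward(r: Rollout) -> float:
    """Fraction of objects referenced in the reasoning that are verifiably in the image."""
    if not r.referenced_objects:
        return 0.0
    present = sum(obj in r.image_objects for obj in r.referenced_objects)
    return present / len(r.referenced_objects)

def total_reward(r: Rollout, w_acc: float = 1.0,
                 w_cons: float = 0.5, w_ground: float = 0.5) -> float:
    """Weighted sum used as the scalar reward for policy-gradient updates."""
    return (w_acc * accuracy_reward(r)
            + w_cons * consistency_reward(r)
            + w_ground * grounding_reward(r))
```

Under this kind of shaping, a rollout that reaches the right answer through ungrounded or self-contradictory reasoning receives a lower reward than one whose explanation actually supports the answer, which is the behavior the takeaway above attributes to the training setup.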