CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

Jiayue Cao, Zhicong Lu, Xuehan Sun, Wei Jia, Hongling Zheng et al. | June 12, 2026 | arXiv

Key Takeaway

When training vision-language models to reason step by step, explicitly enforce that the reasoning logically leads to the final answer, rather than optimizing only for a correct answer.

Summary

This paper identifies and addresses the thinking-answer gap in multimodal models trained with reinforcement learning from verifiable rewards (RLVR): the reasoning a model produces does not always match its final answer. The authors propose CORA, a method that adds a consistency check during training so that the model's thinking aligns with what it concludes, improving both accuracy and reasoning reliability.
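To make the idea of a consistency check concrete, here is a minimal sketch of how a consistency term could sit alongside a verifiable correctness reward in a GRPO-style setup. It is an illustration under stated assumptions, not the paper's formulation: the `<think>`/`<answer>` tag format, the "the answer is" heuristic, the `consistency_weight` parameter, and all function names are hypothetical.

```python
import re
from statistics import mean, pstdev


def extract_answer(text):
    """Return the content of the last <answer>...</answer> tag, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def extract_thinking_conclusion(text):
    """Hypothetical heuristic: take whatever follows the last
    'the answer is' inside <think>...</think> as the reasoning's conclusion."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not think:
        return None
    hits = re.findall(
        r"the answer is\s*([^\s.,]+)", think.group(1), flags=re.IGNORECASE
    )
    return hits[-1].strip() if hits else None


def reward(response, gold, consistency_weight=0.5):
    """Correctness reward plus an assumed bonus when thinking and answer agree."""
    answer = extract_answer(response)
    conclusion = extract_thinking_conclusion(response)
    correct = 1.0 if answer is not None and answer == gold else 0.0
    consistent = 1.0 if answer is not None and conclusion == answer else 0.0
    return correct + consistency_weight * consistent


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "<think>Two cats plus one. The answer is 3.</think><answer>3</answer>",
        "<think>I count four, so the answer is 4.</think><answer>3</answer>",
        "<think>The answer is 4.</think><answer>4</answer>",
    ]
    rewards = [reward(r, gold="3") for r in group]
    print(rewards)
    print(group_advantages(rewards))
```

In this sketch, the second response reaches the right answer but its reasoning concludes something else, so it scores lower than the first. A pure correctness reward would score the two identically, which is the gap the summary describes. Note that the additive form also gives partial credit to a wrong but self-consistent response; whether that is desirable is a design choice this sketch does not settle.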

reasoning, multimodal, training

Key Terms

reinforcement-learning-from-verifiable-rewards, grpo, consistency-oriented-reasoning-alignment, vision-language-model