When training vision-language models to reason step by step, you need to explicitly enforce that the reasoning logically leads to the final answer, rather than optimizing only for getting the right answer.
This paper identifies and addresses a problem in multimodal AI models: the reasoning they generate can be inconsistent with the final answer they give. The authors propose CORA, a method that adds a consistency check during training so that the model's reasoning aligns with its conclusion, improving both accuracy and reasoning reliability.
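
To make the idea of a training-time consistency check concrete, here is a minimal sketch of a reward that scores both answer correctness and agreement between the reasoning and the final answer. It is not CORA's actual formulation, which this summary does not specify. The function names, the "answer is X" extraction heuristic, and the `consistency_weight` parameter are all assumptions chosen for illustration.

```python
import re
from typing import Optional


def extract_reasoning_conclusion(reasoning: str) -> Optional[str]:
    """Hypothetical heuristic: return the last 'answer is X' claim in the reasoning text."""
    matches = re.findall(r"answer is\s*\(?([A-Za-z0-9]+)\)?", reasoning, flags=re.IGNORECASE)
    return matches[-1].lower() if matches else None


def combined_reward(
    reasoning: str,
    final_answer: str,
    gold_answer: str,
    consistency_weight: float = 0.5,  # assumed value, not taken from the paper
) -> float:
    """Reward = correctness term + weighted consistency term (illustrative only)."""
    final = final_answer.strip().lower()
    gold = gold_answer.strip().lower()

    # Standard outcome reward: does the final answer match the reference?
    correct = 1.0 if final == gold else 0.0

    # Consistency term: does the conclusion stated in the reasoning match the final answer?
    implied = extract_reasoning_conclusion(reasoning)
    consistent = 1.0 if implied is not None and implied == final else 0.0

    return correct + consistency_weight * consistent
```

Under this sketch, a response whose final answer is correct but contradicts its own reasoning earns less than one where the two agree, so optimizing the reward pushes toward reasoning that actually supports the conclusion. A practical system would likely replace the regex heuristic with a learned or model-based consistency judge.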