Vision-language models need separate confidence scores for perception and reasoning, not a single overall confidence score, to better detect hallucinations and improve reliability in real-world applications.
This paper addresses a critical problem in vision-language models: they often give confident wrong answers, which is especially dangerous in high-stakes applications. The authors propose VL-Calibration, which uses reinforcement learning to separate confidence into two parts: visual confidence (did the model see the right thing?) and reasoning confidence (did it reason correctly about what it saw?).
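To illustrate why two scores are more diagnostic than one, here is a minimal sketch of how separate visual and reasoning confidences could be used to classify failures. The function name, thresholds, and decision rule are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch: using two confidence scores instead of one overall
# score lets us distinguish perception failures from reasoning failures.
# The threshold value and decision rule are illustrative assumptions.

def diagnose(visual_conf: float, reasoning_conf: float,
             threshold: float = 0.5) -> str:
    """Classify a prediction given separate visual and reasoning confidences."""
    if visual_conf < threshold:
        return "perception failure"   # model likely misread the image
    if reasoning_conf < threshold:
        return "reasoning failure"    # model saw correctly but reasoned poorly
    return "reliable"

# A single overall score (e.g. the product 0.9 * 0.3 = 0.27) would flag this
# prediction as low-confidence without saying *why*; two scores localize it.
print(diagnose(0.9, 0.3))  # reasoning failure
print(diagnose(0.2, 0.9))  # perception failure
```

Separating the scores this way is what lets a downstream system respond appropriately, for example re-encoding the image on a perception failure versus re-prompting on a reasoning failure.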