Vision-language models need separate confidence scores for perception and reasoning, not a single overall confidence score, to better detect hallucinations and improve reliability in real-world applications.
This paper addresses a critical problem in vision-language models: they often give confident wrong answers, which is especially dangerous in high-stakes applications. The authors propose VL-Calibration, which uses reinforcement learning to separate confidence into two parts: visual confidence (did the model see the right thing?) and reasoning confidence (did it reason correctly about what it saw?).
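To illustrate why two scores are more diagnostic than one, here is a minimal sketch of how separate visual and reasoning confidences could be used to classify failures. The function name, thresholds, and decision rule are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch: using two confidence scores instead of one overall
# score lets us distinguish perception failures from reasoning failures.
# The threshold value and decision rule are illustrative assumptions.

def diagnose(visual_conf: float, reasoning_conf: float,
             threshold: float = 0.5) -> str:
    """Classify a prediction given separate visual and reasoning confidences."""
    if visual_conf < threshold:
        return "perception failure"   # model likely misread the image
    if reasoning_conf < threshold:
        return "reasoning failure"    # model saw correctly but reasoned poorly
    return "reliable"

# A single overall score (e.g. the product 0.9 * 0.3 = 0.27) would flag this
# prediction as low-confidence without saying *why*; two scores localize it.
print(diagnose(0.9, 0.3))  # reasoning failure
print(diagnose(0.2, 0.9))  # perception failure
```

Separating the scores this way is what lets a downstream system respond appropriately, for example re-encoding the image on a perception failure versus re-prompting on a reasoning failure.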