CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation

Sanghyuk Chun, William Yang, Amaya Dharmasiri, Olga Russakovsky | June 30, 2026 | arXiv

Key Takeaway

Breaking uncertainty into two interpretable components (how ambiguous the task or prompt is, and how many correct answers exist) lets you estimate a multimodal model's confidence efficiently, without expensive sampling.

Summary

CoMet decomposes uncertainty in multimodal AI models into two components: context-specific ambiguity (from the task or prompt) and multiplicity (how many valid answers exist). A lightweight module estimates these without generating multiple answers, enabling efficient uncertainty quantification for open-ended tasks like visual question answering.
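The summary does not describe the module's architecture, so the following is only a minimal sketch of the general idea: a small head that reads one feature vector and returns the two uncertainty components in a single pass, with no answer sampling. It assumes a frozen model exposes a feature vector for each image and question pair. The class name `TwoHeadUncertaintyModule`, the linear heads, and the softplus activation are illustrative assumptions, not CoMet's actual design.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class UncertaintyEstimate:
    context: float       # ambiguity attributed to the task or prompt
    multiplicity: float  # how many valid answers are plausible


def softplus(x: float) -> float:
    # Numerically stable softplus; keeps each score non-negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("feature and weight vectors must have equal length")
    return sum(x * y for x, y in zip(a, b))


class TwoHeadUncertaintyModule:
    """Hypothetical post-hoc module: two linear heads over frozen features."""

    def __init__(
        self,
        w_context: Sequence[float],
        w_multiplicity: Sequence[float],
        b_context: float = 0.0,
        b_multiplicity: float = 0.0,
    ) -> None:
        self.w_context = list(w_context)
        self.w_multiplicity = list(w_multiplicity)
        self.b_context = b_context
        self.b_multiplicity = b_multiplicity

    def estimate(self, features: Sequence[float]) -> UncertaintyEstimate:
        # One forward pass: no candidate answers are generated or compared.
        return UncertaintyEstimate(
            context=softplus(dot(features, self.w_context) + self.b_context),
            multiplicity=softplus(
                dot(features, self.w_multiplicity) + self.b_multiplicity
            ),
        )


# Usage with made-up weights and features (in practice the weights would be learned).
module = TwoHeadUncertaintyModule(w_context=[0.5, 0.0], w_multiplicity=[0.0, 1.0])
estimate = module.estimate([1.0, 2.0])
print(estimate)
```

Keeping the two scores separate, rather than collapsing them into one number, preserves the interpretability the summary emphasizes; how the paper combines or uses them is not stated here.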

multimodal, evaluation, safety

Key Terms

uncertainty estimation, multimodal large language model, post-hoc uncertainty module, hallucination detection