Vision-language models often hallucinate objects that are not present in the image. This paper argues that using attention weights to detect such hallucinations is fundamentally unreliable, because hidden confounders such as token position influence both attention patterns and hallucination rates. HaloProbe's Bayesian approach separates external and internal signals, detecting hallucinations more reliably and mitigating them without degrading model performance.
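The confounding argument can be illustrated with a toy simulation (a hedged sketch, not HaloProbe's actual method — the variable names and the linear/Bernoulli generative model are assumptions for illustration): if token position drives both attention mass and hallucination risk, attention appears predictive of hallucination marginally, but the association collapses once position is held roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Toy generative model: token position is a confounder that lowers
# attention mass AND raises hallucination probability.
position = rng.uniform(0.0, 1.0, n)                       # normalized token position
attention = 0.8 - 0.5 * position + rng.normal(0, 0.05, n) # later tokens get less attention
halluc = rng.uniform(0.0, 1.0, n) < (0.1 + 0.4 * position)  # later tokens hallucinate more

# Marginally, low attention looks like a hallucination signal...
r_marginal = np.corrcoef(attention, halluc)[0, 1]

# ...but within a narrow position band the association vanishes,
# revealing that position, not attention, carried the signal.
band = (position > 0.45) & (position < 0.55)
r_within = np.corrcoef(attention[band], halluc[band])[0, 1]

print(f"marginal corr: {r_marginal:.2f}, within-band corr: {r_within:.2f}")
```

Stratifying on the confounder is the simplest diagnostic; the paper's Bayesian decomposition plays an analogous role, separating the signal attributable to the input from the model's internal state.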