Hallucinations in vision-language models stem primarily from over-reliance on textual instructions rather than from limitations in visual perception, and preference-based fine-tuning can effectively reduce them by teaching models to prioritize visual grounding.
Vision-language models often generate false descriptions that are not supported by the image, especially when the text instructions are misleading. This paper introduces HalluScope, a benchmark for measuring when and why this happens, and HalluVL-DPO, a preference-based fine-tuning method that teaches models to trust the image over the text instruction by learning from pairs of correct and hallucinated responses.
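The summary does not spell out the training objective, but "learning from pairs of correct and hallucinated responses" is the setup of a standard Direct Preference Optimization (DPO) loss. The sketch below is a minimal, illustrative PyTorch version of that objective under the assumption that HalluVL-DPO uses the vanilla DPO formulation; the function name `dpo_loss` and the `beta` value are placeholders of mine, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over (chosen, rejected) response pairs.

    Each argument holds per-response sequence log-probabilities; here
    'chosen' would be the visually grounded answer and 'rejected' the
    hallucinated one.
    """
    # Implicit rewards: how far the trainable policy has shifted probability
    # mass toward each response relative to the frozen reference model.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps

    # Push the grounded response above the hallucinated one; beta scales
    # how strongly the policy is allowed to deviate from the reference.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage: summed token log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-12.3, -8.1]), torch.tensor([-15.0, -9.4]),
                torch.tensor([-13.0, -8.5]), torch.tensor([-14.2, -9.0]))
```

In this reading, the grounded response plays the "chosen" role and the hallucinated one the "rejected" role, so minimizing the loss widens the policy's preference margin toward image-supported answers; any additional terms specific to visual grounding would sit on top of this objective.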