ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian et al.|June 12, 2026arXiv

Key Takeaway

Medical AI hallucinations have different sources (visual, knowledge, reasoning); diagnosing which stage fails helps you fix the right problem and improve trustworthiness.

Summary

ClinHallu is a benchmark with 7,031 medical cases that diagnoses where hallucinations occur in medical AI systems—whether from misreading images, recalling wrong medical facts, or flawed reasoning. It includes detailed reasoning traces and shows that training on these traces reduces errors.

evaluation safety multimodal

Key Terms

hallucination multimodal-large-language-model reasoning-trace trace-supervised-fine-tuning