Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

Akshay Paruchuri, Sanmi Koyejo, Ehsan Adeli | June 24, 2026 | arXiv

Key Takeaway

Multimodal AI models are sensitive to the order of their inputs, even though order invariance should be a baseline property for production systems. Simple prompt fixes don't solve this; the problem likely requires changes during model training or design.

Summary

This paper audits 18 multimodal AI models to check whether they give consistent answers when the same information is presented in different orders. The researchers found that all models fail this basic reliability test, with 24-50% of answers changing based on order.
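
To make the consistency check described above concrete, here is a minimal sketch of how an order-sensitivity rate could be computed. It is an illustration under stated assumptions, not the paper's protocol or metric: `flip_rate`, `answer_fn`, and the two toy models are hypothetical names, and each question is reduced to a short list of evidence items that can be exhaustively reordered.

```python
from itertools import permutations
from typing import Callable, Sequence


def flip_rate(
    answer_fn: Callable[[Sequence[str]], str],
    questions: Sequence[Sequence[str]],
) -> float:
    """Fraction of questions whose answer differs across reorderings.

    answer_fn is a hypothetical stand-in for a model call: it receives
    the evidence items in one particular order and returns an answer.
    """
    if not questions:
        return 0.0
    flipped = 0
    for items in questions:
        # Query every ordering of the same evidence and collect the answers.
        answers = {answer_fn(order) for order in permutations(items)}
        # More than one distinct answer means order changed the outcome.
        if len(answers) > 1:
            flipped += 1
    return flipped / len(questions)


# Toy stand-ins for a model (assumptions for illustration only).
def first_item_model(items: Sequence[str]) -> str:
    # Order-sensitive: always echoes whichever item is presented first.
    return items[0]


def order_invariant_model(items: Sequence[str]) -> str:
    # Order-invariant: the answer depends only on the set of items.
    return min(items)
```

Exhaustive permutation is only practical for a handful of items per question; with longer inputs, one would presumably sample a fixed number of random orderings instead. Repeating the same ordering several times would also help separate order effects from ordinary sampling randomness in the decoder.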

evaluation safety multimodal

Key Terms

multimodal-large-language-model order-invariance item-response-theory decoder-stochasticity