Current vision-language models struggle with multi-image reasoning, even on problems they might otherwise solve if the relevant information appeared in a single image. This benchmark shows that connecting information across multiple images remains a major unsolved challenge.
OMIBench is a benchmark for testing how well vision-language models can solve Olympiad-level problems that require reasoning across multiple images. Unlike existing benchmarks that focus on single images, OMIBench tests whether models can connect evidence scattered across different images to solve complex problems in biology, chemistry, math, and physics.
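To make the task format concrete, here is a minimal sketch of what a multi-image evaluation loop over such problems might look like. The `MultiImageProblem` class, the `evaluate` function, and the exact-match scoring are illustrative assumptions, not OMIBench's actual data schema or metric.

```python
from dataclasses import dataclass

@dataclass
class MultiImageProblem:
    """One OMIBench-style item: a question whose answer requires
    combining evidence from several images. (Hypothetical schema.)"""
    question: str
    image_paths: list[str]  # two or more images per problem
    subject: str            # "biology", "chemistry", "math", or "physics"
    answer: str             # gold answer used for scoring

def evaluate(model_fn, problems):
    """Score a model over multi-image problems.

    `model_fn(question, image_paths) -> str` is a stand-in for any
    vision-language model call; it receives all images at once, so the
    model must integrate evidence across them to answer correctly.
    """
    correct = 0
    for p in problems:
        prediction = model_fn(p.question, p.image_paths)
        # Exact-match scoring, shown here only for illustration.
        correct += prediction.strip().lower() == p.answer.strip().lower()
    return correct / len(problems)
```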