Current vision-language models struggle with multi-image reasoning, even on problems they might otherwise solve if the relevant information appeared in a single image. This benchmark shows that connecting information across multiple images remains a major unsolved challenge.
OMIBench is a benchmark for testing how well vision-language models can solve Olympiad-level problems that require reasoning across multiple images. Unlike existing benchmarks that focus on single images, OMIBench tests whether models can connect evidence scattered across different images to solve complex problems in biology, chemistry, math, and physics.
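To make the task format concrete, here is a minimal sketch of what a multi-image evaluation loop over such problems might look like. The `MultiImageProblem` class, the `evaluate` function, and the exact-match scoring are illustrative assumptions, not OMIBench's actual data schema or metric.

```python
from dataclasses import dataclass

@dataclass
class MultiImageProblem:
    """One OMIBench-style item: a question whose answer requires
    combining evidence from several images. (Hypothetical schema.)"""
    question: str
    image_paths: list[str]  # two or more images per problem
    subject: str            # "biology", "chemistry", "math", or "physics"
    answer: str             # gold answer used for scoring

def evaluate(model_fn, problems):
    """Score a model over multi-image problems.

    `model_fn(question, image_paths) -> str` is a stand-in for any
    vision-language model call; it receives all images at once, so the
    model must integrate evidence across them to answer correctly.
    """
    correct = 0
    for p in problems:
        prediction = model_fn(p.question, p.image_paths)
        # Exact-match scoring, shown here only for illustration.
        correct += prediction.strip().lower() == p.answer.strip().lower()
    return correct / len(problems)
```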