Most vision-language models struggle with knowledge-grounded visual reasoning—even large models only reach 75% accuracy when questions require combining visual evidence with external facts, suggesting a major gap in real-world VQA capabilities.
WikiVQABench is a new benchmark for testing vision-language models on questions that require both visual understanding and external knowledge from Wikipedia and Wikidata.