Vision-language models can identify visual features but fail to infer structured cultural metadata from images, with significant performance gaps across cultural regions: a critical limitation for cultural heritage applications.
This paper introduces a benchmark for evaluating how well vision-language models extract structured cultural metadata (such as creator, place of origin, and period) from images of cultural artifacts. The authors find that current models struggle with this task, performing inconsistently across cultures and metadata types, which reveals gaps in cultural reasoning beyond basic visual recognition.