Eye-tracking analysis can be enriched by measuring the semantic similarity of attended regions using vision-language models (VLMs) and NLP metrics, capturing content agreement that spatial-only metrics miss.
This paper proposes a new way to compare eye-tracking scanpaths that focuses on what people looked at (semantic content) rather than only where they looked (spatial position). Using vision-language models, the researchers convert fixated regions into text descriptions and measure their similarity with NLP metrics, revealing that two people can fixate different locations yet attend to the same meaningful content.
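A minimal sketch of the comparison step, assuming each viewer's fixations have already been captioned by a VLM (the captions below are hypothetical). A real pipeline would use learned sentence embeddings; here a simple bag-of-words cosine similarity stands in as the NLP metric:

```python
from collections import Counter
import math

def caption_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two fixation captions.

    Stand-in for a learned embedding similarity; 1.0 means identical
    word distributions, 0.0 means no shared words.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def scanpath_semantic_similarity(captions_a, captions_b) -> float:
    """Mean similarity over index-aligned fixation captions.

    Assumes the two caption lists are temporally aligned; real methods
    may use alignment algorithms (e.g. dynamic time warping) instead.
    """
    scores = [caption_similarity(x, y) for x, y in zip(captions_a, captions_b)]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical VLM captions for two viewers of the same scene: the
# fixated pixels differ, but the described content largely agrees.
viewer_a = ["a red stop sign", "a pedestrian crossing the street"]
viewer_b = ["stop sign at the corner", "a person crossing the street"]
print(round(scanpath_semantic_similarity(viewer_a, viewer_b), 3))
```

A spatial-only metric would score these two scanpaths as dissimilar whenever the fixation coordinates diverge; the semantic score above stays high because the captions describe the same objects.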