Eye-tracking analysis can be enriched by measuring the semantic similarity of attended regions using vision-language models (VLMs) and NLP metrics, capturing content agreement that spatial-only metrics miss.
This paper proposes a new way to compare eye-tracking scanpaths that focuses on what people looked at (semantic content) rather than only where they looked (spatial position). Using vision-language models, the researchers convert fixated regions into text descriptions and measure their similarity with NLP metrics, revealing that two people can fixate different locations yet attend to the same meaningful content.
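A minimal sketch of the comparison step, assuming each viewer's fixations have already been captioned by a VLM (the captions below are hypothetical). A real pipeline would use learned sentence embeddings; here a simple bag-of-words cosine similarity stands in as the NLP metric:

```python
from collections import Counter
import math

def caption_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two fixation captions.

    Stand-in for a learned embedding similarity; 1.0 means identical
    word distributions, 0.0 means no shared words.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def scanpath_semantic_similarity(captions_a, captions_b) -> float:
    """Mean similarity over index-aligned fixation captions.

    Assumes the two caption lists are temporally aligned; real methods
    may use alignment algorithms (e.g. dynamic time warping) instead.
    """
    scores = [caption_similarity(x, y) for x, y in zip(captions_a, captions_b)]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical VLM captions for two viewers of the same scene: the
# fixated pixels differ, but the described content largely agrees.
viewer_a = ["a red stop sign", "a pedestrian crossing the street"]
viewer_b = ["stop sign at the corner", "a person crossing the street"]
print(round(scanpath_semantic_similarity(viewer_a, viewer_b), 3))
```

A spatial-only metric would score these two scanpaths as dissimilar whenever the fixation coordinates diverge; the semantic score above stays high because the captions describe the same objects.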