Vision-language models (VLMs) struggle with emotion recognition for two fixable reasons: long-tailed training data biases them toward common emotions, collapsing rare categories into frequent ones, and they cannot effectively capture the fleeting temporal dynamics of facial expressions that are critical for understanding emotion. This paper identifies both vulnerabilities and proposes generating natural language summaries of intermediate video frames, preserving emotional context across time while staying within the model's memory constraints.
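The frame-summarization idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `caption_frame` stands in for whatever per-frame captioning model is used, and the key-frame stride, summary cap, and prompt wording are all assumptions.

```python
from typing import Callable, List, Sequence

def build_emotion_prompt(
    frames: Sequence[object],
    caption_frame: Callable[[object], str],  # hypothetical: any image -> text captioner
    key_stride: int = 8,       # assumed: every 8th frame is kept as a raw image input
    max_summaries: int = 12,   # assumed: cap on text summaries to respect memory limits
) -> str:
    """Condense a frame sequence into a text prompt for a VLM.

    Key frames (every `key_stride`-th frame) would be passed to the VLM
    directly as images; the frames in between are replaced by short
    natural-language captions, so transient expression changes survive
    without exhausting the context window.
    """
    summaries: List[str] = []
    for i, frame in enumerate(frames):
        if i % key_stride == 0:
            continue  # key frames go to the VLM as images, not as text
        if len(summaries) >= max_summaries:
            break  # stay within the summary budget
        summaries.append(f"t={i}: {caption_frame(frame)}")
    joined = "\n".join(summaries)
    return (
        "Between the attached key frames, the face changed as follows:\n"
        f"{joined}\n"
        "Considering these transient changes, what emotion is expressed?"
    )
```

In a full pipeline, the key frames themselves would be attached as image inputs alongside this text, and the captioner could be the same VLM queried once per intermediate frame; the point of the sketch is only how text summaries can stand in for frames that would otherwise be dropped.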