Vision-language models (VLMs) struggle with emotion recognition for two fixable reasons: long-tailed training data biases them toward common emotions, collapsing rare categories into frequent ones, and they cannot effectively capture the fleeting temporal dynamics of facial expressions that are critical for understanding emotion. This paper identifies both vulnerabilities and proposes generating natural language summaries of intermediate video frames, preserving emotional context across time while staying within the model's memory constraints.
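The frame-summarization idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `caption_frame` stands in for whatever per-frame captioning model is used, and the key-frame stride, summary cap, and prompt wording are all assumptions.

```python
from typing import Callable, List, Sequence

def build_emotion_prompt(
    frames: Sequence[object],
    caption_frame: Callable[[object], str],  # hypothetical: any image -> text captioner
    key_stride: int = 8,       # assumed: every 8th frame is kept as a raw image input
    max_summaries: int = 12,   # assumed: cap on text summaries to respect memory limits
) -> str:
    """Condense a frame sequence into a text prompt for a VLM.

    Key frames (every `key_stride`-th frame) would be passed to the VLM
    directly as images; the frames in between are replaced by short
    natural-language captions, so transient expression changes survive
    without exhausting the context window.
    """
    summaries: List[str] = []
    for i, frame in enumerate(frames):
        if i % key_stride == 0:
            continue  # key frames go to the VLM as images, not as text
        if len(summaries) >= max_summaries:
            break  # stay within the summary budget
        summaries.append(f"t={i}: {caption_frame(frame)}")
    joined = "\n".join(summaries)
    return (
        "Between the attached key frames, the face changed as follows:\n"
        f"{joined}\n"
        "Considering these transient changes, what emotion is expressed?"
    )
```

In a full pipeline, the key frames themselves would be attached as image inputs alongside this text, and the captioner could be the same VLM queried once per intermediate frame; the point of the sketch is only how text summaries can stand in for frames that would otherwise be dropped.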