Vision-language models describe traffic scenes well but fail to reason about crash mechanics, causality, and temporal progression, critical gaps for infrastructure-based autonomous driving safety systems.
CrashSight is a benchmark dataset of 250 real-world traffic crash videos paired with 13K questions that test how well vision-language models understand crash scenes captured by roadside cameras. Evaluation on the benchmark shows that current models, despite strong descriptive ability, struggle with the temporal reasoning and causal analysis these safety-critical scenarios demand.