Vision-language models describe traffic scenes well but fail to reason about crash mechanics, causality, and temporal progression, critical gaps for infrastructure-based autonomous driving safety systems.
CrashSight is a benchmark dataset of 250 real-world traffic crash videos paired with 13K questions that test how well vision-language models understand crash scenes captured by roadside cameras. Evaluation on the benchmark shows that current models, despite strong descriptive ability, struggle with the temporal reasoning and causal analysis these safety-critical scenarios demand.