Vision-language models perform surprisingly poorly on domain-specific action recognition even in simplified settings, but fine-tuning on domain-specific video data significantly closes the gap.
VideoNet is a new benchmark and dataset for testing how well AI models recognize specific actions in videos across 37 domains. The researchers found that current vision-language models struggle with domain-specific action recognition, even on simple binary-choice questions, and created a 500k video question-answer dataset to improve performance through fine-tuning.