Maintaining consistent characters and objects across long video sequences is hard; explicit memory of each entity's appearance significantly improves consistency, especially when characters reappear after many shots.
EntityBench is a benchmark for evaluating multi-shot video generation—creating coherent video sequences with multiple scenes. It includes 140 episodes with detailed tracking of characters, objects, and locations across shots, plus an evaluation system that measures both video quality and consistency.