When building video world models, memory capacity matters more than compression, and how memory is accessed (for example, through state-space recurrence) matters as much as whether memory is used at all.
This paper systematically compares memory mechanisms in video generation models that produce multi-segment videos from text and camera actions.