For long-form video understanding, decoupling perception (building a structured memory) from reasoning (agentic exploration of that memory) is more efficient than end-to-end processing: it achieves better accuracy while using only 2% of the context that full-video processing would require.
MemDreamer addresses the understanding of very long videos by splitting the task into two parts: a perception system that builds a memory structure from video frames, and a reasoning system that acts as an agent, using tools to explore that memory.
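
The two-part split can be illustrated with a minimal Python sketch. This is not MemDreamer's implementation: every name here (`MemoryEntry`, `build_memory`, `search_memory`, `read_entry`, `explore`) is hypothetical, the captions stand in for the output of a perception model, and a fixed keyword policy stands in for the agent that would normally decide which tool to call. It only shows the general shape of the idea, in which memory is built once and the reasoner pulls a few entries into context rather than the whole video.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    caption: str   # text summary of the segment


def build_memory(captions, segment_seconds=10.0):
    """Perception stage (stand-in): one entry per fixed-length segment.

    `captions` plays the role of a captioning model's per-segment output.
    """
    return [
        MemoryEntry(i * segment_seconds, (i + 1) * segment_seconds, text)
        for i, text in enumerate(captions)
    ]


def search_memory(memory, keyword):
    """Tool: indices of entries whose caption mentions the keyword."""
    kw = keyword.lower()
    return [i for i, e in enumerate(memory) if kw in e.caption.lower()]


def read_entry(memory, index):
    """Tool: one entry rendered as text for the reasoner's context."""
    e = memory[index]
    return f"[{e.start:.0f}s-{e.end:.0f}s] {e.caption}"


def explore(memory, keywords, max_reads=2):
    """Reasoning stage (stand-in): a fixed policy instead of an LLM agent.

    Searches for each keyword and reads at most `max_reads` entries,
    so the context holds a few entries rather than the full memory.
    """
    context = []
    seen = set()
    for kw in keywords:
        for i in search_memory(memory, kw):
            if i not in seen and len(context) < max_reads:
                seen.add(i)
                context.append(read_entry(memory, i))
    return context


captions = [
    "A person enters the kitchen.",
    "The person chops onions.",
    "A dog walks past the window.",
    "The person puts a pan on the stove.",
]
memory = build_memory(captions)
context = explore(memory, ["onions", "stove"])
print(context)
```

In this toy run the reasoner ends up with two of the four entries in context. In a real system the keyword policy would be replaced by a model choosing tool calls, and the memory would likely be richer than a flat list of captions.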