MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang et al. | June 5, 2026 | arXiv

Key Takeaway

For long-form video understanding, decoupling perception (building structured memory) from reasoning (agentic exploration) is more efficient than end-to-end processing, achieving better accuracy while using only 2% of the context that full-video processing would require.

Summary

MemDreamer solves the problem of understanding very long videos by splitting the task into two parts: a perception system that builds a memory structure from video frames, and a reasoning system that explores this memory like an agent using tools.
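The two-stage split described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `MemoryNode`, `build_memory`, and `explore` are hypothetical, the memory is simplified to a tree rather than a general graph, per-clip captions stand in for a vision-language model's output, and a keyword match stands in for the reasoning agent's tool-use policy.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One node of a toy hierarchical memory (hypothetical structure)."""
    node_id: str
    summary: str
    start: float  # seconds
    end: float    # seconds
    children: list[MemoryNode] = field(default_factory=list)


def build_memory(captions: list[str], clip_len: float = 10.0, group: int = 2) -> MemoryNode:
    """Perception stage (assumed): turn per-clip captions into a 3-level tree."""
    leaves = [
        MemoryNode(f"clip{i}", text, i * clip_len, (i + 1) * clip_len)
        for i, text in enumerate(captions)
    ]
    scenes = []
    for j in range(0, len(leaves), group):
        chunk = leaves[j:j + group]
        # A scene summary is just the joined clip captions in this sketch.
        scenes.append(MemoryNode(
            f"scene{j // group}",
            " / ".join(c.summary for c in chunk),
            chunk[0].start, chunk[-1].end, chunk,
        ))
    return MemoryNode(
        "video", " / ".join(s.summary for s in scenes),
        scenes[0].start, scenes[-1].end, scenes,
    )


def explore(root: MemoryNode, keyword: str) -> tuple[MemoryNode | None, list[str]]:
    """Reasoning stage (assumed): descend the tree, expanding only matching children."""
    visited = [root.node_id]
    node = root
    while node.children:
        matches = [c for c in node.children if keyword.lower() in c.summary.lower()]
        if not matches:
            return None, visited  # nothing relevant below this node
        node = matches[0]
        visited.append(node.node_id)
    return node, visited


captions = [
    "a man enters the kitchen",
    "he chops onions",
    "a dog barks outside",
    "the man serves dinner",
]
root = build_memory(captions)
leaf, visited = explore(root, "dog")
print(leaf.node_id, leaf.start, leaf.end, visited)
```

In this toy run the reasoning step reads only the summaries along one root-to-leaf path instead of every clip, which is the intuition behind the reduced context use; the actual savings reported in the paper depend on its own memory design and retrieval tools.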

multimodal, agents, reasoning

Key Terms

hierarchical-memory, agentic-retrieval-mechanism, long-context-handling, vision-language-model, token-explosion