Personal AI agents need dedicated visual memory systems that preserve image-specific information beyond captions: explicit entities and implicit user facts that text alone cannot capture.
This paper introduces a benchmark and a system that help AI agents remember personal information from images over long conversations. Most memory systems convert images to text captions, which loses visual details about people, objects, and their relationships.
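
To make the contrast with caption-only memory concrete, here is a minimal sketch of what a memory record might look like if it kept explicit entities and implicit user facts next to the caption. Everything in it is an assumption for illustration: the `VisualMemoryRecord` and `VisualMemory` names, the field names, the substring search, and the sample data are hypothetical and are not taken from the paper's system.

```python
from dataclasses import dataclass, field


@dataclass
class VisualMemoryRecord:
    """Hypothetical record layout; field names are illustrative only."""
    image_id: str
    caption: str                                          # what a caption-only system keeps
    entities: list[str] = field(default_factory=list)    # explicit people and objects
    user_facts: list[str] = field(default_factory=list)  # implicit facts about the user


class VisualMemory:
    """Toy in-memory store with case-insensitive substring search."""

    def __init__(self) -> None:
        self._records: list[VisualMemoryRecord] = []

    def add(self, record: VisualMemoryRecord) -> None:
        self._records.append(record)

    def search(self, term: str, caption_only: bool = False) -> list[str]:
        """Return ids of images whose stored text mentions `term`."""
        needle = term.lower()
        hits = []
        for record in self._records:
            texts = [record.caption]
            if not caption_only:
                # Also consult the image-specific fields a caption would drop.
                texts += record.entities + record.user_facts
            if any(needle in text.lower() for text in texts):
                hits.append(record.image_id)
        return hits


# Made-up sample data, used only to exercise the sketch.
memory = VisualMemory()
memory.add(
    VisualMemoryRecord(
        image_id="img_001",
        caption="Two people sitting at a table in a kitchen.",
        entities=["user's sister Maya", "blue ceramic teapot"],
        user_facts=["user visits family on weekends"],
    )
)

caption_hits = memory.search("teapot", caption_only=True)
full_hits = memory.search("teapot")
```

With this sample record, the caption-only search for "teapot" finds nothing, while the search over all fields returns the image, which mirrors the information loss described above. A real system would likely use learned retrieval rather than substring matching; the sketch only illustrates which information is kept.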