LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu et al. | May 12, 2026 | arXiv

Key Takeaway

Long-term memory for agents requires more than storing task outcomes. To become experienced colleagues, agents need to internalize environment-specific patterns, workflows, and failure modes, and current memory systems still struggle with this despite recent advances.

Summary

This paper introduces LongMemEval-V2, a benchmark for testing whether AI agents can build long-term memory of specialized web environments. It includes 451 questions covering five types of memory (including state recall, workflow knowledge, and failure modes), paired with long history trajectories of up to 500 steps and 115M tokens.

agents evaluation reasoning

Key Terms

rag-pipeline episodic-memory context-gathering agent-trajectory