Long-term memory for agents requires more than just storing task outcomes; agents need to internalize environment-specific patterns, workflows, and failure modes to become truly experienced colleagues, and current memory systems still struggle with this despite recent advances.
This paper introduces LongMemEval-V2, a benchmark for testing whether AI agents can build long-term memory of specialized web environments. It includes 451 questions about five types of memory (state recall, workflow knowledge, failure modes, etc.) paired with massive history trajectories up to 500 steps and 115M tokens.