Game benchmarks should measure how agents improve over time through iterative refinement, not just first-attempt performance, because this reveals which vision-language models (VLMs) can learn and adapt in interactive environments.
OmniGameArena is a benchmark for testing vision-language model agents in 12 Unreal Engine 5 games across different play modes (solo, competitive, cooperative). It introduces the Improvement Dynamics Curve, which measures how agents improve when given multiple attempts to refine their strategies through self-reflection, capturing how performance evolves beyond what single-attempt scores show.
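The summary does not give the exact formula behind the curve, so the following is only a minimal sketch of one plausible reading: record a score per attempt for each game, then average across games at each attempt index. The function names `improvement_curve` and `net_gain`, the game labels, and the assumption that scores are normalized to [0, 1] are all hypothetical and not taken from the benchmark.

```python
from statistics import mean


def improvement_curve(scores_by_game):
    """Average score at each attempt index across games.

    ASSUMPTION: scores_by_game maps a game name to a list of
    normalized scores, one per attempt, with attempt 1 first.
    This is an illustrative definition, not the benchmark's own.
    """
    runs = list(scores_by_game.values())
    if not runs:
        return []
    n_attempts = len(runs[0])
    if any(len(run) != n_attempts for run in runs):
        raise ValueError("every game needs the same number of attempts")
    # One point on the curve per attempt: the mean over all games.
    return [mean(run[k] for run in runs) for k in range(n_attempts)]


def net_gain(curve):
    """Last-attempt score minus first-attempt score (0.0 if empty)."""
    return curve[-1] - curve[0] if curve else 0.0


# Hypothetical scores for two games over three attempts.
scores = {
    "game_a": [0.25, 0.5, 0.75],
    "game_b": [0.25, 0.25, 0.5],
}
curve = improvement_curve(scores)  # [0.25, 0.375, 0.625]
gain = net_gain(curve)             # 0.375
```

Under this reading, two agents with the same first-attempt score of 0.25 can be told apart by the shape of the curve: a flat curve suggests the reflection step adds little, while a rising one suggests the agent is actually adapting its strategy.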