Agent evaluation should rely on standardized protocols and agent-based judges instead of fixed benchmarks, so that different agent designs can be compared fairly and reproducibly at scale.
AgentBeats proposes evaluating AI agents with other agents acting as judges. Instead of building a separate evaluation system for each agent type, all agents communicate through standardized protocols: Agent2Agent (A2A) and the Model Context Protocol (MCP). The goal is evaluation that is fairer, more reproducible, and compatible with real-world constraints such as privacy and openness.
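The idea of a judge agent assessing any participant through one shared interface can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `Message`, `Agent`, `JudgeAgent`, and `run_assessment` are hypothetical names, the exact-match scoring is a placeholder, and none of it reflects the actual A2A or MCP message schemas or the AgentBeats implementation.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Message:
    """Hypothetical stand-in for a protocol message (not the real A2A schema)."""
    sender: str
    content: str


class Agent(Protocol):
    """The single interface every participant exposes to a judge."""
    name: str

    def handle(self, message: Message) -> Message: ...


class EchoAgent:
    """Toy participant that repeats the task text unchanged."""

    def __init__(self, name: str) -> None:
        self.name = name

    def handle(self, message: Message) -> Message:
        return Message(sender=self.name, content=message.content)


class UpperAgent:
    """Toy participant that upper-cases the task text."""

    def __init__(self, name: str) -> None:
        self.name = name

    def handle(self, message: Message) -> Message:
        return Message(sender=self.name, content=message.content.upper())


class JudgeAgent:
    """Toy judge: sends one task and scores the reply by exact match."""

    def __init__(self, name: str, task: str, expected: str) -> None:
        self.name = name
        self.task = task
        self.expected = expected

    def assess(self, participant: Agent) -> float:
        reply = participant.handle(Message(sender=self.name, content=self.task))
        return 1.0 if reply.content == self.expected else 0.0


def run_assessment(judge: JudgeAgent, participants: list[Agent]) -> dict[str, float]:
    """Run the same judge against every participant through the same interface."""
    return {p.name: judge.assess(p) for p in participants}


judge = JudgeAgent("judge", task="hello", expected="HELLO")
scores = run_assessment(judge, [EchoAgent("echo"), UpperAgent("upper")])
print(scores)
```

Because the judge depends only on the shared `handle` interface, a new agent design can be added to the comparison without writing a new evaluation harness, which is the property the standardized-protocol approach is meant to provide.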