When designing agent benchmarks, treat tasks as adversarial tests rather than helpful prompts; focus on conceptual difficulty over environmental complexity, and rigorously verify that your evaluation logic actually measures what you intend.
This paper provides practical guidelines for designing high-quality benchmark tasks that evaluate AI agents' coding and system-administration abilities.