Evaluating agentic systems requires layering graders whose failure modes differ; a cascade combining strict pattern matching, lenient LLM grading, and human review is more reliable than any single approach.
This paper addresses how to evaluate agentic data analysis systems that produce complex outputs such as code, results, and explanations. The authors develop a three-layer grading cascade combining regex matching, LLM-based evaluation, and human review, achieving 97% recall while maintaining 100% precision. They also show that iterative nudging improves grading success from 36% to 97%.
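
To make the cascade concrete, the following is a minimal Python sketch of how three layers might be chained. It is not the authors' implementation. The escalation rule (a regex match is accepted outright, the LLM judge may accept, reject, or abstain, and only abstentions go to a human) is an assumption, as are the names `Verdict`, `grade`, `stub_llm_judge`, and `stub_human_review`. The two stubs stand in for a real model call and a real review queue.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    layer: str  # which layer made the final decision


def grade(
    answer: str,
    pattern: str,
    llm_judge: Callable[[str], Optional[bool]],
    human_review: Callable[[str], bool],
) -> Verdict:
    """Hypothetical three-layer cascade: regex, then LLM judge, then human."""
    # Layer 1 (strict): accept only if the expected pattern appears.
    if re.search(pattern, answer):
        return Verdict(True, "regex")

    # Layer 2 (lenient): the judge returns True, False, or None (abstain).
    # Assumption: a confident True/False from the judge is final.
    llm_result = llm_judge(answer)
    if llm_result is not None:
        return Verdict(llm_result, "llm")

    # Layer 3: anything the judge cannot settle goes to a person.
    return Verdict(human_review(answer), "human")


# Stand-ins for illustration only; a real system would call a model
# and a review workflow here.
def stub_llm_judge(answer: str) -> Optional[bool]:
    return True if "forty-two" in answer else None


def stub_human_review(answer: str) -> bool:
    return False


if __name__ == "__main__":
    pattern = r"\b42(\.0+)?\b"
    for text in ["The mean is 42.0", "roughly forty-two", "could not compute"]:
        print(text, "->", grade(text, pattern, stub_llm_judge, stub_human_review))
```

Ordering the layers from cheapest and strictest to most expensive keeps human effort for the cases the automated graders cannot settle. Whether a rejection from the LLM layer should also be escalated is a design choice the summary above does not specify.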