Grading the Grader: Lessons from Evaluating an Agentic Data Analysis System

Tian Zheng, Kai-Tai Hsu | June 23, 2026 | arXiv

Key Takeaway

Evaluating agentic systems requires layered grading, because each grading method fails in a different way; a cascade that combines strict pattern matching, lenient LLM grading, and human review is more reliable than any single approach.

Summary

This paper tackles the challenge of evaluating agentic data analysis systems that produce complex outputs like code, results, and explanations. The authors develop a three-layer grading cascade combining regex matching, LLM-based evaluation, and human review, achieving 97% recall while maintaining 100% precision. They show that iterative nudging improves grading success from 36% to 97%.
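The three-layer cascade described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Verdict`, `regex_layer`, and `grade`, the convention that a layer returns `None` to abstain, and the `llm_grader` callable (a stand-in for a real model call) are all assumptions made for illustration. The idea shown is only the escalation order: a strict regex layer accepts clear matches, a more lenient LLM layer handles what the regex cannot decide, and anything still undecided is queued for human review.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Outcome of grading one answer (hypothetical structure)."""
    correct: Optional[bool]  # None means no automated layer decided
    layer: str               # which layer produced the outcome


def regex_layer(answer: str, expected_pattern: str) -> Optional[bool]:
    """Strict layer: accept only on a full pattern match, otherwise abstain."""
    if re.fullmatch(expected_pattern, answer.strip()):
        return True
    return None  # abstain instead of rejecting, so later layers can decide


def grade(
    answer: str,
    expected_pattern: str,
    llm_grader: Callable[[str], Optional[bool]],
    human_queue: List[str],
) -> Verdict:
    """Run the cascade: regex first, then the LLM grader, then human review."""
    if regex_layer(answer, expected_pattern):
        return Verdict(True, "regex")

    # Assumed interface: the LLM grader returns True, False, or None (unsure).
    llm_result = llm_grader(answer)
    if llm_result is not None:
        return Verdict(llm_result, "llm")

    # Nothing automated was confident, so defer to a human reviewer.
    human_queue.append(answer)
    return Verdict(None, "human")
```

In this sketch the strict layer never rejects on its own; it either accepts or passes the answer along, which is one way to keep a high-precision first layer from lowering recall. Whether the paper's cascade works this way is not stated in the summary.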

evaluation agents reasoning

Key Terms

agentic-systems grading-cascade recall precision iterative-nudging