Clinician-authored rubrics can be validated and partially replaced by LLM-generated ones, enabling scalable clinical AI evaluation that preserves expert oversight while turning an expensive manual process into a nearly automatic one.
This paper presents a practical methodology for evaluating clinical AI systems using case-specific rubrics written by clinicians. The researchers tested whether AI-generated rubrics could match clinician judgment across 823 real and synthetic clinical cases, finding that LLM-based scoring reached agreement with clinicians comparable to clinician-to-clinician agreement, at roughly 1,000x lower cost.
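To make the rubric-grading idea concrete, here is a minimal Python sketch of scoring an AI response against a case-specific rubric with an LLM judge and measuring agreement with a clinician's verdicts. Everything here is illustrative: `call_llm`, the prompt wording, the JSON true/false output format, and the simple fraction-agreement metric are assumptions, not the paper's actual protocol.

```python
import json
from typing import Callable

def grade_response(case: str, response: str, rubric: list[str],
                   llm: Callable[[str], str]) -> list[bool]:
    """Ask an LLM judge whether the response satisfies each rubric criterion.

    `llm` is any function that takes a prompt string and returns the model's
    text output (a hypothetical placeholder for a real chat-completion call).
    """
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(rubric))
    prompt = (
        f"Clinical case:\n{case}\n\n"
        f"AI response:\n{response}\n\n"
        f"For each criterion below, decide whether the response satisfies it. "
        f"Answer with only a JSON list of true/false values, one per criterion:\n"
        f"{criteria}"
    )
    verdicts = json.loads(llm(prompt))
    if len(verdicts) != len(rubric):
        raise ValueError("judge returned wrong number of verdicts")
    return [bool(v) for v in verdicts]

def agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of rubric criteria on which two graders agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

if __name__ == "__main__":
    # Toy example with a canned judge, standing in for a real model call.
    rubric = ["Recommends an urgent ECG", "Asks about cardiac risk factors"]
    fake_llm = lambda prompt: "[true, false]"
    llm_verdicts = grade_response("55M with acute chest pain...",
                                  "Order an ECG immediately.", rubric, fake_llm)
    clinician_verdicts = [True, True]
    print(agreement(llm_verdicts, clinician_verdicts))  # 0.5
```

Comparing per-criterion LLM verdicts against clinician verdicts in this way is what makes the "LLM-vs-clinician agreement matches clinician-vs-clinician agreement" comparison possible; the paper may well use a chance-corrected statistic rather than raw fraction agreement.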