Clinician-authored rubrics can be validated and partially replaced by LLM-generated ones, enabling scalable clinical AI evaluation that preserves expert oversight while turning an expensive manual process into a nearly automatic one.
This paper presents a practical methodology for evaluating clinical AI systems using case-specific rubrics written by clinicians. The researchers tested whether AI-generated rubrics could match clinician judgment across 823 real and synthetic clinical cases, finding that LLM-based scoring reached agreement with clinicians comparable to clinician-to-clinician agreement, at roughly 1,000x lower cost.
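To make the rubric-grading idea concrete, here is a minimal Python sketch of scoring an AI response against a case-specific rubric with an LLM judge and measuring agreement with a clinician's verdicts. Everything here is illustrative: `call_llm`, the prompt wording, the JSON true/false output format, and the simple fraction-agreement metric are assumptions, not the paper's actual protocol.

```python
import json
from typing import Callable

def grade_response(case: str, response: str, rubric: list[str],
                   llm: Callable[[str], str]) -> list[bool]:
    """Ask an LLM judge whether the response satisfies each rubric criterion.

    `llm` is any function that takes a prompt string and returns the model's
    text output (a hypothetical placeholder for a real chat-completion call).
    """
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(rubric))
    prompt = (
        f"Clinical case:\n{case}\n\n"
        f"AI response:\n{response}\n\n"
        f"For each criterion below, decide whether the response satisfies it. "
        f"Answer with only a JSON list of true/false values, one per criterion:\n"
        f"{criteria}"
    )
    verdicts = json.loads(llm(prompt))
    if len(verdicts) != len(rubric):
        raise ValueError("judge returned wrong number of verdicts")
    return [bool(v) for v in verdicts]

def agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of rubric criteria on which two graders agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

if __name__ == "__main__":
    # Toy example with a canned judge, standing in for a real model call.
    rubric = ["Recommends an urgent ECG", "Asks about cardiac risk factors"]
    fake_llm = lambda prompt: "[true, false]"
    llm_verdicts = grade_response("55M with acute chest pain...",
                                  "Order an ECG immediately.", rubric, fake_llm)
    clinician_verdicts = [True, True]
    print(agreement(llm_verdicts, clinician_verdicts))  # 0.5
```

Comparing per-criterion LLM verdicts against clinician verdicts in this way is what makes the "LLM-vs-clinician agreement matches clinician-vs-clinician agreement" comparison possible; the paper may well use a chance-corrected statistic rather than raw fraction agreement.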