LLM judges appear reliable in aggregate but are inconsistent on individual inputs; prediction set width reliably tracks per-document difficulty and can serve as a confidence measure for automatic evaluation.
This paper diagnoses why LLM judges give inconsistent scores in text evaluation. Using two methods, checking whether judges contradict themselves and quantifying their uncertainty with conformal prediction, the authors show that judges are unreliable on individual documents even when they appear consistent in aggregate.
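As a rough illustration of the conformal step (not the paper's code), the sketch below applies split conformal prediction to a discrete judge score scale. It assumes that each document has a per-score probability estimate (e.g. frequencies from repeated judge samples) and that a calibration set with reference scores is available; the function name `conformal_score_sets` and the Dirichlet toy data are purely illustrative. The width of each resulting prediction set is the per-document confidence signal described above.

```python
import numpy as np

def conformal_score_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction over a discrete judge score scale.

    cal_probs:  (n_cal, n_scores) probability the judge assigns to each score
                per calibration document (e.g. from repeated judge samples)
    cal_labels: (n_cal,) index of the reference (human) score per calibration doc
    test_probs: (n_test, n_scores) the same per-score probabilities for new docs
    alpha:      target miscoverage rate (0.1 -> roughly 90% coverage)
    Returns a list of prediction sets (arrays of score indices), one per test doc.
    """
    n = len(cal_labels)
    # Nonconformity: how little probability the judge puts on the reference score.
    nonconformity = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration nonconformity scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q_hat = np.quantile(nonconformity, min(q_level, 1.0), method="higher")
    # Prediction set: every score whose probability clears the threshold.
    return [np.flatnonzero(p >= 1.0 - q_hat) for p in test_probs]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_cal, n_test, n_scores = 200, 5, 5
    # Toy stand-in for judge outputs: per-document distributions over 5 scores.
    cal_probs = rng.dirichlet(np.ones(n_scores), size=n_cal)
    cal_labels = np.array([rng.choice(n_scores, p=p) for p in cal_probs])
    test_probs = rng.dirichlet(np.ones(n_scores), size=n_test)

    sets = conformal_score_sets(cal_probs, cal_labels, test_probs, alpha=0.1)
    for i, s in enumerate(sets):
        # A wider set means the judge is less certain about this document.
        print(f"doc {i}: prediction set {s.tolist()} (width {len(s)})")
```

Under this setup, a width-1 set marks a document the judge scores confidently, while a set spanning several scores flags a document where its score should not be trusted on its own.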