LLM judges appear reliable in aggregate but are inconsistent on individual inputs; prediction set width reliably tracks per-document difficulty and can serve as a confidence measure for automatic evaluation.
This paper diagnoses why LLM judges give inconsistent scores in text evaluation. Using two methods, checking whether judges contradict themselves and quantifying their uncertainty with conformal prediction, the authors show that judges are unreliable on individual documents even when they appear consistent in aggregate.
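As a rough illustration of the conformal step (not the paper's code), the sketch below applies split conformal prediction to a discrete judge score scale. It assumes that each document has a per-score probability estimate (e.g. frequencies from repeated judge samples) and that a calibration set with reference scores is available; the function name `conformal_score_sets` and the Dirichlet toy data are purely illustrative. The width of each resulting prediction set is the per-document confidence signal described above.

```python
import numpy as np

def conformal_score_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction over a discrete judge score scale.

    cal_probs:  (n_cal, n_scores) probability the judge assigns to each score
                per calibration document (e.g. from repeated judge samples)
    cal_labels: (n_cal,) index of the reference (human) score per calibration doc
    test_probs: (n_test, n_scores) the same per-score probabilities for new docs
    alpha:      target miscoverage rate (0.1 -> roughly 90% coverage)
    Returns a list of prediction sets (arrays of score indices), one per test doc.
    """
    n = len(cal_labels)
    # Nonconformity: how little probability the judge puts on the reference score.
    nonconformity = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration nonconformity scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q_hat = np.quantile(nonconformity, min(q_level, 1.0), method="higher")
    # Prediction set: every score whose probability clears the threshold.
    return [np.flatnonzero(p >= 1.0 - q_hat) for p in test_probs]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_cal, n_test, n_scores = 200, 5, 5
    # Toy stand-in for judge outputs: per-document distributions over 5 scores.
    cal_probs = rng.dirichlet(np.ones(n_scores), size=n_cal)
    cal_labels = np.array([rng.choice(n_scores, p=p) for p in cal_probs])
    test_probs = rng.dirichlet(np.ones(n_scores), size=n_test)

    sets = conformal_score_sets(cal_probs, cal_labels, test_probs, alpha=0.1)
    for i, s in enumerate(sets):
        # A wider set means the judge is less certain about this document.
        print(f"doc {i}: prediction set {s.tolist()} (width {len(s)})")
```

Under this setup, a width-1 set marks a document the judge scores confidently, while a set spanning several scores flags a document where its score should not be trusted on its own.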