LLMs used as code judges have significant blind spots compared to human developers: they systematically misweight superficial factors such as explanation length when scoring code, so you can't rely on them alone for code evaluation in real applications.
This paper introduces TRACE, a framework that compares LLM judges' code evaluations against human developers' preferences.
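The paper does not specify TRACE's exact metrics here, but comparing judge and human preferences typically reduces to measuring agreement on paired judgments. A minimal sketch, assuming pairwise preference labels and using raw agreement plus Cohen's kappa as generic stand-ins:

```python
# Hypothetical sketch: comparing LLM-judge preferences with human
# developer preferences on pairs of candidate solutions. The data
# and metric choices are illustrative, not TRACE's actual protocol.
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Each entry: which of two candidate solutions ("A" or "B") was preferred.
human_prefs = ["A", "B", "A", "A", "B", "A", "B", "A"]
judge_prefs = ["A", "A", "A", "B", "B", "A", "A", "A"]

agreement = sum(h == j for h, j in zip(human_prefs, judge_prefs)) / len(human_prefs)
kappa = cohens_kappa(human_prefs, judge_prefs)
print(f"raw agreement: {agreement:.3f}, kappa: {kappa:.3f}")
```

High raw agreement can mask chance-level alignment when one label dominates, which is why a chance-corrected statistic like kappa is the more informative comparison.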