LLMs used as code judges have significant blind spots compared to human developers: they systematically misweight superficial factors such as explanation length when scoring code, so you can't rely on them alone for code evaluation in real applications.
This paper introduces TRACE, a framework that compares LLM judges' code evaluations against human developers' preferences.
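The paper does not specify TRACE's exact metrics here, but comparing judge and human preferences typically reduces to measuring agreement on paired judgments. A minimal sketch, assuming pairwise preference labels and using raw agreement plus Cohen's kappa as generic stand-ins:

```python
# Hypothetical sketch: comparing LLM-judge preferences with human
# developer preferences on pairs of candidate solutions. The data
# and metric choices are illustrative, not TRACE's actual protocol.
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Each entry: which of two candidate solutions ("A" or "B") was preferred.
human_prefs = ["A", "B", "A", "A", "B", "A", "B", "A"]
judge_prefs = ["A", "A", "A", "B", "B", "A", "A", "A"]

agreement = sum(h == j for h, j in zip(human_prefs, judge_prefs)) / len(human_prefs)
kappa = cohens_kappa(human_prefs, judge_prefs)
print(f"raw agreement: {agreement:.3f}, kappa: {kappa:.3f}")
```

High raw agreement can mask chance-level alignment when one label dominates, which is why a chance-corrected statistic like kappa is the more informative comparison.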