Model rankings on leaderboards aren't objective: they depend heavily on which evaluation priorities you choose. Interactive, customizable leaderboards could serve real-world deployment decisions better than one-size-fits-all rankings.
LLM leaderboards rank models using fixed evaluation criteria set by benchmark designers, but different users have different priorities. This paper analyzes the LMArena benchmark dataset and builds an interactive tool that lets users emphasize the prompt types that matter most to them, showing how model rankings shift with those priorities.
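To make the customization concrete, here is a minimal sketch of the re-weighting idea: each model gets a score per prompt category, and a user's category weights determine the final ranking. The model names, categories, scores, and weighting scheme below are hypothetical illustrations, not the paper's method or actual LMArena data.

```python
# Hypothetical per-model scores broken down by prompt category
# (e.g. Elo-like ratings computed separately per category).
scores = {
    "model-a": {"coding": 1250, "creative_writing": 1180, "math": 1220},
    "model-b": {"coding": 1190, "creative_writing": 1260, "math": 1170},
    "model-c": {"coding": 1210, "creative_writing": 1200, "math": 1240},
}

def rank(weights: dict[str, float]) -> list[tuple[str, float]]:
    """Rank models by a weighted average of their per-category scores."""
    total = sum(weights.values())
    weighted = {
        model: sum(weights[cat] * s for cat, s in cats.items()) / total
        for model, cats in scores.items()
    }
    return sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)

# A coding-focused user and a writing-focused user see different leaders.
print(rank({"coding": 3.0, "creative_writing": 1.0, "math": 1.0}))
print(rank({"coding": 1.0, "creative_writing": 3.0, "math": 1.0}))
```

With the coding-heavy weights, model-a comes out on top (1230 vs. 1214 and 1200); with the writing-heavy weights, model-b overtakes it (1228 vs. 1210 and 1202). Even this toy example shows how the same underlying data can yield different leaders depending on user priorities.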