Model rankings on leaderboards aren't objective: they depend heavily on which evaluation priorities you choose. Interactive, customizable leaderboards could serve real-world deployment decisions better than one-size-fits-all rankings.
LLM leaderboards rank models using fixed evaluation criteria set by benchmark designers, but different users have different priorities. This paper analyzes the LMArena benchmark dataset and builds an interactive tool that lets users emphasize the prompt types that matter most to them, showing how model rankings shift with those priorities.
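To make the customization concrete, here is a minimal sketch of the re-weighting idea: each model gets a score per prompt category, and a user's category weights determine the final ranking. The model names, categories, scores, and weighting scheme below are hypothetical illustrations, not the paper's method or actual LMArena data.

```python
# Hypothetical per-model scores broken down by prompt category
# (e.g. Elo-like ratings computed separately per category).
scores = {
    "model-a": {"coding": 1250, "creative_writing": 1180, "math": 1220},
    "model-b": {"coding": 1190, "creative_writing": 1260, "math": 1170},
    "model-c": {"coding": 1210, "creative_writing": 1200, "math": 1240},
}

def rank(weights: dict[str, float]) -> list[tuple[str, float]]:
    """Rank models by a weighted average of their per-category scores."""
    total = sum(weights.values())
    weighted = {
        model: sum(weights[cat] * s for cat, s in cats.items()) / total
        for model, cats in scores.items()
    }
    return sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)

# A coding-focused user and a writing-focused user see different leaders.
print(rank({"coding": 3.0, "creative_writing": 1.0, "math": 1.0}))
print(rank({"coding": 1.0, "creative_writing": 3.0, "math": 1.0}))
```

With the coding-heavy weights, model-a comes out on top (1230 vs. 1214 and 1200); with the writing-heavy weights, model-b overtakes it (1228 vs. 1210 and 1202). Even this toy example shows how the same underlying data can yield different leaders depending on user priorities.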