Don't trust global LLM leaderboards: they hide structured disagreement across languages and tasks. Use language-specific rankings or a small portfolio of models instead to match diverse user needs.
Current LLM leaderboards rank models by pooling votes globally, but this masks how sharply opinions differ by language and task. This paper shows that 2/3 of votes cancel out and that the top models are statistically indistinguishable in the global ranking. Grouping votes by language, by contrast, reveals coherent subpopulations with consistent rankings.
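To make the cancellation claim concrete, here is a minimal Python sketch. It assumes one plausible definition, which may not be the paper's exact metric: for each pair of models, a vote "cancels" when it is offset by an opposing vote on the same pair. The vote list, the model names `A` and `B`, and the helpers `cancelled_fraction` and `by_language` are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy votes as (language, winner, loser); not data from the paper.
VOTES = [
    ("en", "A", "B"), ("en", "A", "B"), ("en", "A", "B"), ("en", "B", "A"),
    ("zh", "B", "A"), ("zh", "B", "A"), ("zh", "B", "A"), ("zh", "A", "B"),
]


def cancelled_fraction(votes):
    """Share of votes offset by an opposing vote on the same model pair.

    Assumed definition: for each unordered pair (x, y), 2 * min(wins_x, wins_y)
    votes cancel; whatever remains is the net preference.
    """
    wins = Counter((winner, loser) for _, winner, loser in votes)
    pairs = {tuple(sorted(key)) for key in wins}
    cancelled = sum(2 * min(wins[(x, y)], wins[(y, x)]) for x, y in pairs)
    return cancelled / len(votes) if votes else 0.0


def by_language(votes):
    """Group votes by their language tag."""
    groups = defaultdict(list)
    for vote in votes:
        groups[vote[0]].append(vote)
    return dict(groups)


global_share = cancelled_fraction(VOTES)
per_language = {
    lang: cancelled_fraction(group) for lang, group in by_language(VOTES).items()
}
print(global_share, per_language)
```

In this toy setup every vote cancels in the pooled data, so the global ranking carries no signal, while each language group keeps a clear 3-to-1 preference. That is the pattern the summary describes, shown here on made-up numbers only.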