Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

Jai Moondra, Ayela Chughtai, Bhargavi Lanka, Swati Gupta | May 7, 2026 | arXiv

Key Takeaway

Don't trust global LLM leaderboards: they hide structured disagreement across languages and tasks. To match diverse user needs, use language-specific rankings or a small portfolio of models instead.
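To make the portfolio idea concrete, here is a minimal sketch that picks a small set of models so that every user group has at least one acceptable model. It uses the standard greedy heuristic for set cover, which is an assumption suggested by the key terms and not necessarily the paper's procedure. The group names, the model names, and the `acceptable` mapping are hypothetical.

```python
def greedy_portfolio(acceptable):
    """Greedy set cover: repeatedly add the model that serves the most
    still-uncovered groups.

    acceptable: dict mapping each group to the set of models judged
    good enough for that group (how that judgment is made is left open).
    """
    uncovered = set(acceptable)
    portfolio = []
    while uncovered:
        # Sorted so that ties are broken deterministically by name.
        candidates = sorted({m for g in uncovered for m in acceptable[g]})
        if not candidates:
            raise ValueError("some group has no acceptable model")
        best = max(
            candidates,
            key=lambda m: sum(m in acceptable[g] for g in uncovered),
        )
        portfolio.append(best)
        uncovered = {g for g in uncovered if best not in acceptable[g]}
    return portfolio


# Hypothetical toy input, not data from the paper.
acceptable = {
    "en": {"A", "B"},
    "es": {"A", "B"},
    "zh": {"B", "C"},
    "hi": {"C"},
}
portfolio = greedy_portfolio(acceptable)
print(portfolio)
```

In this toy input no single model is acceptable to all four groups, so a one-model "global winner" necessarily leaves some group underserved, while two models suffice.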

Summary

Current LLM leaderboards rank models from globally pooled votes, but pooling masks how sharply preferences differ by language and task. The paper shows that 2/3 of votes cancel out and that the top models are statistically indistinguishable in the global ranking. Grouping votes by language instead reveals coherent subpopulations with consistent rankings.
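The effect can be illustrated with a small sketch that fits a Bradley-Terry model to pairwise votes, once on the pooled data and once per language. The toy votes are hypothetical, and counting a vote as "cancelled" when it is offset by an opposite vote on the same model pair is one plausible definition, assumed here and not taken from the paper. The fit uses the classic minorization-maximization update for Bradley-Terry strengths.

```python
from collections import Counter, defaultdict

# Hypothetical toy votes as (language, winner, loser); not data from the paper.
votes = (
    [("en", "A", "B")] * 8 + [("en", "B", "A")] * 2
    + [("zh", "B", "A")] * 8 + [("zh", "A", "B")] * 2
)


def cancelled_fraction(votes):
    """Fraction of votes offset by an opposite vote on the same pair
    (an assumed definition of 'cancelling')."""
    wins = Counter((w, l) for _, w, l in votes)
    pairs = {tuple(sorted(pair)) for pair in wins}
    cancelled = sum(2 * min(wins[(a, b)], wins[(b, a)]) for a, b in pairs)
    return cancelled / len(votes)


def bradley_terry(votes, iters=200):
    """Fit Bradley-Terry strengths with the MM update; strengths sum to 1."""
    wins = Counter((w, l) for _, w, l in votes)
    models = sorted({m for pair in wins for m in pair})
    p = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            others = [j for j in models if j != i]
            w_i = sum(wins[(i, j)] for j in others)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (p[i] + p[j]) for j in others
            )
            new[i] = w_i / denom if denom > 0 else p[i]
        total = sum(new.values())
        p = {m: v / total for m, v in new.items()}
    return p


# Pooled fit versus one fit per language.
global_fit = bradley_terry(votes)
by_language = defaultdict(list)
for vote in votes:
    by_language[vote[0]].append(vote)
per_language = {lang: bradley_terry(v) for lang, v in by_language.items()}

print("cancelled globally:", cancelled_fraction(votes))
print("global:", global_fit)
print("per language:", per_language)
```

In this toy case the pooled votes cancel completely and the global fit cannot separate the two models, yet each language on its own shows a clear and opposite preference.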

evaluation multimodal

Key Terms

bradley-terry-model leaderboard set-cover-problem vc-dimension