When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models

Josef Chen | June 25, 2026 | arXiv

Key Takeaway

Before building a multi-model system, measure how often all of your models fail on the same question. That co-failure rate sets a hard ceiling on the gains any combination can deliver. Standard error-correlation metrics do not reveal this ceiling, but a simple statistical bound does.

Summary

This paper identifies a fundamental limit on multi-model LLM systems: their accuracy gains are capped by how often all models fail together on the same question. It measures this 'co-failure rate' across 67 frontier models and shows that standard metrics such as error correlation do not capture the ceiling, so practitioners who rely on them never see it.
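As a rough illustration of the idea, the sketch below computes a co-failure rate and the ceiling it implies from a table of per-question results. It is not the paper's method or its statistical bound: it assumes a hypothetical boolean matrix `correct` (models by questions), uses made-up toy data, and relies only on the elementary observation that a system which selects among its members' answers cannot be right on a question every member gets wrong. The function name `co_failure_ceiling` and the returned keys are illustrative choices.

```python
import numpy as np


def co_failure_ceiling(correct):
    """Summarise how much headroom a set of models leaves for combination.

    `correct` is a hypothetical (n_models, n_questions) boolean matrix where
    correct[i, j] is True if model i answered question j correctly.
    """
    correct = np.asarray(correct, dtype=bool)
    if correct.ndim != 2:
        raise ValueError("expected a 2-D (models x questions) matrix")

    # A question is a co-failure when no model at all gets it right.
    all_fail = ~correct.any(axis=0)
    co_failure_rate = float(all_fail.mean())

    # Accuracy of the single strongest model, the baseline to beat.
    best_single = float(correct.mean(axis=1).max())

    # Any selector over the members' answers is capped at 1 - co-failure rate.
    ceiling = 1.0 - co_failure_rate

    return {
        "co_failure_rate": co_failure_rate,
        "best_single": best_single,
        "ceiling": ceiling,
        "max_gain": ceiling - best_single,
    }


# Toy data (assumed, not from the paper): 3 models, 8 questions.
toy = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 0, 0],
]
result = co_failure_ceiling(toy)
print(result)
```

On this toy matrix the last two questions defeat every model, so even a perfect router over these three models cannot exceed 75% accuracy, whatever the pairwise error correlations look like.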

evaluation, scaling, agents

Key Terms

multi-agent-ensemble, co-failure-rate, error-correlation, routing, mixture-of-experts