Before building a multi-model system, measure how often all of your models fail on the same question, because that rate sets a hard ceiling on the possible gains. Standard error-correlation metrics do not expose this ceiling, but a simple statistical bound does.
This paper identifies a fundamental limit on multi-model LLM systems: their accuracy gains are capped by how often all models fail on the same question. The authors measure this 'co-failure rate' across 67 frontier models and show that standard metrics such as error correlation miss the ceiling, leaving it invisible to practitioners.
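
To make the ceiling concrete, here is a minimal sketch of how a co-failure rate could be computed from per-question correctness records. It is not the authors' code. The function names, the toy data, and the use of mean pairwise Pearson correlation as the "error correlation" metric are all assumptions made for illustration. The ceiling it reports applies to systems that select or vote among the models' own answers: if every model is wrong on a question, no such system can answer it correctly.

```python
import numpy as np


def co_failure_rate(correct):
    """Fraction of questions that every model answers incorrectly.

    `correct` is a (n_models, n_questions) boolean array where True means
    the model answered that question correctly (hypothetical layout).
    """
    correct = np.asarray(correct, dtype=bool)
    all_fail = ~correct.any(axis=0)  # True where no model got the question right
    return float(all_fail.mean())


def selection_ceiling(correct):
    """Upper bound on accuracy for any system that picks among the models' answers."""
    return 1.0 - co_failure_rate(correct)


def mean_pairwise_error_correlation(correct):
    """Mean Pearson correlation between models' error indicators.

    One common way to summarize error correlation; assumed here, not taken
    from the paper. Requires that no model is right (or wrong) on every question.
    """
    errors = (~np.asarray(correct, dtype=bool)).astype(float)
    corr = np.corrcoef(errors)                 # (n_models, n_models) matrix
    upper = np.triu_indices_from(corr, k=1)    # each unordered pair once
    return float(corr[upper].mean())


# Toy data: 3 models, 8 questions (made up for illustration only).
toy_correct = np.array([
    [1, 1, 1, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0],
], dtype=bool)

if __name__ == "__main__":
    print("best single model:", toy_correct.mean(axis=1).max())
    print("co-failure rate:  ", co_failure_rate(toy_correct))
    print("selection ceiling:", selection_ceiling(toy_correct))
    print("mean error corr.: ", mean_pairwise_error_correlation(toy_correct))
```

In the toy data, questions 4 and 8 are missed by all three models, so the co-failure rate is 0.25 and no selection scheme can exceed 0.75 accuracy, whatever the pairwise correlation figure suggests.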