The point at which a benchmark has become too easy: models achieve near-perfect scores, so the remaining score differences are too small to compare their true abilities.
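One rough way to see the effect is to ask whether two scores near the ceiling can still be told apart once sampling noise is taken into account. The sketch below is only an illustration under stated assumptions: the function names, the 500-item benchmark size, and the accuracy values are all hypothetical, and the check uses a simple normal-approximation interval rather than any standard saturation test.

```python
import math


def score_interval(accuracy, n_items, z=1.96):
    """Normal-approximation interval (about 95% for z=1.96) for an accuracy on n_items."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_items)
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)


def distinguishable(acc_a, acc_b, n_items, z=1.96):
    """Crude separation check: True only if the two intervals do not overlap."""
    lo_a, hi_a = score_interval(acc_a, n_items, z)
    lo_b, hi_b = score_interval(acc_b, n_items, z)
    return hi_a < lo_b or hi_b < lo_a


# Hypothetical scores on a hypothetical 500-item benchmark.
print(distinguishable(0.98, 0.99, 500))  # two models near the ceiling
print(distinguishable(0.60, 0.75, 500))  # two models with headroom left
```

With these made-up numbers, the two near-ceiling scores have overlapping intervals while the two mid-range scores do not, which is the sense in which a too-easy benchmark stops separating models. Overlapping intervals are a conservative heuristic, not a formal significance test.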