A situation in which AI judges appear to agree on scores, but the agreement comes from shallow patterns they share rather than from substantive reasoning about quality.
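One way to probe for this is to check whether agreement between two judges survives once a suspected shallow feature is controlled for. The sketch below is illustrative only. It assumes response length is the shallow pattern (the definition above does not name one), and the data, the helper names `pearson` and `partial_corr`, and the use of partial correlation as the diagnostic are all hypothetical choices, not an established procedure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def partial_corr(a, b, z):
    """Correlation of a and b after removing the linear effect of z."""
    r_ab, r_az, r_bz = pearson(a, b), pearson(a, z), pearson(b, z)
    return (r_ab - r_az * r_bz) / sqrt((1 - r_az ** 2) * (1 - r_bz ** 2))


# Hypothetical data (assumption): response lengths in tokens and the
# scores two AI judges gave to the same six responses.
lengths = [50, 120, 200, 310, 400, 520]
judge_a = [2, 3, 5, 6, 8, 9]
judge_b = [1, 4, 4, 7, 7, 10]

raw = pearson(judge_a, judge_b)                         # apparent agreement
controlled = partial_corr(judge_a, judge_b, lengths)    # agreement net of length

print(f"raw agreement:            {raw:.2f}")
print(f"agreement net of length:  {controlled:.2f}")
```

If the raw correlation is high but the length-controlled value is much lower, the apparent agreement is at least partly explained by both judges tracking length. A drop like this is a warning sign rather than proof, since length can also correlate with genuine quality.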