Evaluating models by comparing their outputs two at a time. The number of comparisons grows quadratically with the number of models: n models yield n(n-1)/2 distinct pairs.
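
To make the quadratic growth concrete, here is a minimal Python sketch. It assumes each unordered pair of models is compared exactly once (per prompt, if several prompts are used); the function names and model labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def count_pairwise_comparisons(num_models: int) -> int:
    """Return the number of unordered model pairs, n * (n - 1) / 2."""
    return num_models * (num_models - 1) // 2


def model_pairs(models):
    """Enumerate every unordered pair of models that would be compared."""
    return list(combinations(models, 2))


if __name__ == "__main__":
    # Hypothetical model labels, used only to show the enumeration.
    models = ["model_a", "model_b", "model_c", "model_d"]
    for left, right in model_pairs(models):
        print(f"compare {left} vs {right}")

    # Doubling the number of models roughly quadruples the comparisons.
    for n in (4, 8, 16):
        print(n, "models ->", count_pairwise_comparisons(n), "pairs")
```

With 4 models this gives 6 pairs, with 8 it gives 28, and with 16 it gives 120, so doubling the number of models roughly quadruples the work.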