Evaluation method in which human raters compare two model outputs and indicate which one is better, rather than scoring each output independently.
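As a rough illustration of how such judgments might be aggregated, the sketch below computes a simple win rate from a list of pairwise comparisons. The record layout `(output_a, output_b, winner)`, the use of `None` for ties, the model names, and the `win_rate` helper are all assumptions made for this example, not part of any particular evaluation framework.

```python
import math


def win_rate(judgments, model):
    """Return the fraction of decided comparisons involving `model` that it won.

    `judgments` is a list of (model_a, model_b, winner) tuples; `winner` is
    the name of the preferred model, or None when the rater declared a tie.
    Ties and comparisons that do not involve `model` are ignored.
    """
    wins = 0
    losses = 0
    for model_a, model_b, winner in judgments:
        if model not in (model_a, model_b) or winner is None:
            continue
        if winner == model:
            wins += 1
        else:
            losses += 1
    decided = wins + losses
    return wins / decided if decided else math.nan


# Hypothetical rater judgments: each row is one side-by-side comparison.
judgments = [
    ("model_a", "model_b", "model_a"),
    ("model_a", "model_b", "model_b"),
    ("model_b", "model_a", "model_a"),
    ("model_a", "model_b", None),  # rater saw no clear difference
]

rate_a = win_rate(judgments, "model_a")
rate_b = win_rate(judgments, "model_b")
```

Here the raters never assign an absolute score; each model's quality is inferred only from how often it was preferred. Excluding ties from the denominator is one design choice among several; counting a tie as half a win is another common convention.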