Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

Benedetta Muscato, Beiduo Chen, Gizem Gezici, Barbara Plank, Fosca Giannotti | May 29, 2026 | arXiv

Key Takeaway

When evaluating hate speech detection systems, soft labels and explanations that preserve human disagreement produce more reliable results than majority voting, which forces annotators into artificial agreement.

Summary

This paper examines how human disagreement affects both labels and explanations in hate speech detection. The authors unify different evaluation approaches and test how well models perform when trained on different representations of labels and rationales (explanations). They find that softer representations better capture human variation and disagreement.
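To make the contrast between hard and soft representations concrete, here is a minimal sketch in Python. It is not the paper's implementation: the class names, the three-annotator votes, the binary token masks, and all function names are hypothetical and chosen only for illustration. A soft label is taken to be the normalized vote distribution, and a soft rationale is taken to be the per-token fraction of annotators who highlighted that token.

```python
from collections import Counter

# Hypothetical class inventory, used only for this illustration.
CLASSES = ["hate", "offensive", "normal"]


def soft_label(votes, classes=CLASSES):
    """Turn annotator votes into a probability distribution over classes."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


def majority_label(votes, classes=CLASSES):
    """Collapse votes to the most frequent class (ties go to the earlier class)."""
    counts = Counter(votes)
    return max(classes, key=lambda c: counts[c])


def soft_rationale(masks):
    """Average binary token masks into per-token agreement scores in [0, 1]."""
    return [sum(column) / len(masks) for column in zip(*masks)]


def majority_rationale(masks):
    """Keep a token only if more than half of the annotators highlighted it."""
    return [1 if score > 0.5 else 0 for score in soft_rationale(masks)]


# Made-up annotations for one four-token post from three annotators.
votes = ["hate", "offensive", "offensive"]
masks = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
]

print(majority_label(votes))      # single hard class
print(soft_label(votes))          # distribution that keeps the dissenting vote
print(majority_rationale(masks))  # hard token mask
print(soft_rationale(masks))      # graded token scores
```

In this toy case the majority label discards the one "hate" vote and the majority rationale drops the last token entirely, whereas the soft versions keep both as lower-weight signals. That lost minority information is what the softer representations described above are meant to retain.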

evaluation safety

Key Terms

inter-annotator-agreement, rationale, soft-labels, explainability, faithfulness