A measure of uncertainty in model outputs based on whether different responses have the same meaning, without needing correct answers.