A measure of the extent to which multiple teacher models agree in their predictions on the same inputs, used to assess how reliable their supervision is.
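
The definition above does not fix a particular formula, so the following is only a minimal sketch of one plausible way to quantify agreement: for each example, take the fraction of teachers that voted for the most common label, then average over examples. The function names `per_example_agreement` and `mean_agreement`, and the choice of majority-vote fraction over alternatives such as pairwise agreement or Fleiss' kappa, are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def per_example_agreement(
    teacher_preds: Sequence[Sequence[Hashable]],
) -> List[float]:
    """Return, for each example, the share of teachers voting for the top label.

    teacher_preds[t][i] is the hard label teacher t assigns to example i
    (a hypothetical layout chosen for this sketch).
    """
    if not teacher_preds:
        raise ValueError("need at least one teacher")
    n_examples = len(teacher_preds[0])
    if n_examples == 0:
        raise ValueError("need at least one example")
    if any(len(preds) != n_examples for preds in teacher_preds):
        raise ValueError("all teachers must label the same examples")

    n_teachers = len(teacher_preds)
    scores = []
    # zip(*...) regroups the labels so each tuple holds one example's votes.
    for votes in zip(*teacher_preds):
        top_count = Counter(votes).most_common(1)[0][1]
        scores.append(top_count / n_teachers)
    return scores


def mean_agreement(teacher_preds: Sequence[Sequence[Hashable]]) -> float:
    """Average the per-example agreement into a single score in (0, 1]."""
    scores = per_example_agreement(teacher_preds)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Three hypothetical teachers labelling the same three examples.
    teachers = [
        ["cat", "dog", "cat"],
        ["cat", "dog", "dog"],
        ["cat", "cat", "bird"],
    ]
    print(per_example_agreement(teachers))
    print(mean_agreement(teachers))
```

In this toy input the teachers are unanimous on the first example, split two to one on the second, and all disagree on the third, so the per-example scores are 1, 2/3, and 1/3. Under this sketch, low-scoring examples would be the ones whose supervision is least reliable; a threshold for filtering or down-weighting them is left to the application.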