A measure of how consistently different evaluators score or judge the same items, often quantified with a rank-correlation statistic such as Kendall's tau.
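
As an illustration, the following is a minimal sketch of how Kendall's tau could be computed for two evaluators who scored the same items. It uses the tau-b variant, which adjusts for tied scores. The function name `kendall_tau_b` and the sample scores are hypothetical and chosen only for this example.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(scores_a, scores_b):
    """Kendall's tau-b between two evaluators' scores for the same items."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both evaluators must score the same items")

    concordant = discordant = ties_a_only = ties_b_only = 0
    # Compare every pair of items and check whether the two
    # evaluators order that pair the same way.
    for i, j in combinations(range(len(scores_a)), 2):
        da = scores_a[i] - scores_a[j]
        db = scores_b[i] - scores_b[j]
        if da == 0 and db == 0:
            continue  # tied for both evaluators: contributes to no term
        elif da == 0:
            ties_a_only += 1
        elif db == 0:
            ties_b_only += 1
        elif da * db > 0:
            concordant += 1
        else:
            discordant += 1

    # Each factor counts the pairs that one evaluator did not tie.
    denom = sqrt(
        (concordant + discordant + ties_a_only)
        * (concordant + discordant + ties_b_only)
    )
    if denom == 0:
        raise ValueError("tau is undefined when an evaluator ties every item")
    return (concordant - discordant) / denom


# Hypothetical scores: the evaluators disagree only on the order of items 2 and 3.
evaluator_1 = [1, 2, 3, 4, 5]
evaluator_2 = [1, 3, 2, 4, 5]
print(kendall_tau_b(evaluator_1, evaluator_2))  # 0.8
```

A value of 1 means the two evaluators rank every pair of items in the same order, -1 means they rank every pair in the opposite order, and values near 0 indicate little consistency between them.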