A measure of how consistently different judges rate the same outputs, typically using metrics like correlation or ICC.