RLHF pipelines should explicitly choose whether human annotators are extending designer intent, providing evidence about facts, or exercising authority, and should use different validation and aggregation methods for each rather than treating all annotations the same way.
This paper examines how human feedback shapes AI behavior through RLHF, identifying three distinct conceptual models: extension (annotators extend designer judgments), evidence (annotators provide factual information), and authority (annotators represent population preferences).
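To make the contrast between the three models concrete, here is a minimal Python sketch of how an aggregation step might branch on the chosen model. Everything in it is an illustrative assumption rather than the paper's method: the `FeedbackModel` enum, the `aggregate` function, the 0.8 qualification threshold, and the reliability and weight dictionaries are all invented for this example, and the evidence branch is only a crude stand-in for proper annotator models such as Dawid-Skene.

```python
from collections import defaultdict
from enum import Enum, auto

class FeedbackModel(Enum):
    EXTENSION = auto()   # annotators stand in for the designer's judgment
    EVIDENCE = auto()    # annotations are noisy evidence about a fact
    AUTHORITY = auto()   # annotations are preferences that count as such

def aggregate(votes, model, gold_agreement=None, reliability=None, weight=None):
    """Aggregate one item's annotations {annotator: label} under a chosen model.

    gold_agreement: annotator -> agreement rate with designer gold labels
                    (EXTENSION: validates alignment with designer intent)
    reliability:    annotator -> estimated accuracy
                    (EVIDENCE: crude stand-in for Dawid-Skene-style weighting)
    weight:         annotator -> representation weight
                    (AUTHORITY: corrects for sampling skew, not "correctness")
    """
    scores = defaultdict(float)
    for annotator, label in votes.items():
        if model is FeedbackModel.EXTENSION:
            # Validation: drop annotators who diverge from designer intent
            # (0.8 is an invented qualification threshold).
            if (gold_agreement or {}).get(annotator, 0.0) < 0.8:
                continue
            scores[label] += 1.0
        elif model is FeedbackModel.EVIDENCE:
            # Weight each vote by how reliable a witness the annotator is.
            scores[label] += (reliability or {}).get(annotator, 0.5)
        else:  # FeedbackModel.AUTHORITY
            # Every preference counts; there is no ground truth to recover.
            scores[label] += (weight or {}).get(annotator, 1.0)
    return max(scores, key=scores.get) if scores else None

votes = {"ann1": "B", "ann2": "A", "ann3": "A"}
# Extension: only ann1 clears the intent-alignment bar -> "B"
print(aggregate(votes, FeedbackModel.EXTENSION,
                gold_agreement={"ann1": 0.95, "ann2": 0.55, "ann3": 0.60}))
# Evidence: votes weighted by estimated annotator accuracy -> "B"
print(aggregate(votes, FeedbackModel.EVIDENCE,
                reliability={"ann1": 0.95, "ann2": 0.40, "ann3": 0.40}))
# Authority: every preference counts equally -> "A"
print(aggregate(votes, FeedbackModel.AUTHORITY))
```

The point of the sketch is that the same set of raw annotations yields different answers depending on which conceptual model the pipeline commits to: under extension the designer's intent filters who may vote, under evidence votes are weighted by estimated accuracy, and under authority the majority preference wins as a matter of standing rather than correctness.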