RLHF pipelines should explicitly choose whether human annotators are extending designer intent, providing evidence about facts, or exercising authority, and should use different validation and aggregation methods for each rather than treating all annotations the same way.
This paper examines how human feedback shapes AI behavior through RLHF, identifying three distinct conceptual models: extension (annotators extend designer judgments), evidence (annotators provide factual information), and authority (annotators represent population preferences).
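To make the contrast between the three models concrete, here is a minimal Python sketch of how an aggregation step might branch on the chosen model. Everything in it is an illustrative assumption rather than the paper's method: the `FeedbackModel` enum, the `aggregate` function, the 0.8 qualification threshold, and the reliability and weight dictionaries are all invented for this example, and the evidence branch is only a crude stand-in for proper annotator models such as Dawid-Skene.

```python
from collections import defaultdict
from enum import Enum, auto

class FeedbackModel(Enum):
    EXTENSION = auto()   # annotators stand in for the designer's judgment
    EVIDENCE = auto()    # annotations are noisy evidence about a fact
    AUTHORITY = auto()   # annotations are preferences that count as such

def aggregate(votes, model, gold_agreement=None, reliability=None, weight=None):
    """Aggregate one item's annotations {annotator: label} under a chosen model.

    gold_agreement: annotator -> agreement rate with designer gold labels
                    (EXTENSION: validates alignment with designer intent)
    reliability:    annotator -> estimated accuracy
                    (EVIDENCE: crude stand-in for Dawid-Skene-style weighting)
    weight:         annotator -> representation weight
                    (AUTHORITY: corrects for sampling skew, not "correctness")
    """
    scores = defaultdict(float)
    for annotator, label in votes.items():
        if model is FeedbackModel.EXTENSION:
            # Validation: drop annotators who diverge from designer intent
            # (0.8 is an invented qualification threshold).
            if (gold_agreement or {}).get(annotator, 0.0) < 0.8:
                continue
            scores[label] += 1.0
        elif model is FeedbackModel.EVIDENCE:
            # Weight each vote by how reliable a witness the annotator is.
            scores[label] += (reliability or {}).get(annotator, 0.5)
        else:  # FeedbackModel.AUTHORITY
            # Every preference counts; there is no ground truth to recover.
            scores[label] += (weight or {}).get(annotator, 1.0)
    return max(scores, key=scores.get) if scores else None

votes = {"ann1": "B", "ann2": "A", "ann3": "A"}
# Extension: only ann1 clears the intent-alignment bar -> "B"
print(aggregate(votes, FeedbackModel.EXTENSION,
                gold_agreement={"ann1": 0.95, "ann2": 0.55, "ann3": 0.60}))
# Evidence: votes weighted by estimated annotator accuracy -> "B"
print(aggregate(votes, FeedbackModel.EVIDENCE,
                reliability={"ann1": 0.95, "ann2": 0.40, "ann3": 0.40}))
# Authority: every preference counts equally -> "A"
print(aggregate(votes, FeedbackModel.AUTHORITY))
```

The point of the sketch is that the same set of raw annotations yields different answers depending on which conceptual model the pipeline commits to: under extension the designer's intent filters who may vote, under evidence votes are weighted by estimated accuracy, and under authority the majority preference wins as a matter of standing rather than correctness.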