A training technique where human evaluators rate model outputs, and the model learns to produce responses that humans prefer.