In-Context Reward Adaptation for Robust Preference Modeling

Zhenyu Sun, Zheng Xu, Ermin Wei | May 28, 2026 | arXiv

Key Takeaway

Instead of training separate reward models for each group of users, you can use a single transformer that learns to adapt its reward predictions from just a few preference examples, making alignment more scalable when human values differ.

Summary

This paper proposes a method that makes the reward models used in AI alignment more flexible: instead of relying on a single fixed reward model, the model adapts to different human preferences on the fly. The key insight is that adding human response time as an extra signal helps transformers learn to adjust their reward predictions from a few examples of a new preference.
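To make the idea of adapting a reward from a few preference examples concrete, here is a minimal sketch. It is not the paper's method: it swaps the transformer for an explicit gradient fit of a linear Bradley-Terry reward, so the adaptation that the transformer would perform implicitly in context is done here by hand. The feature names, the example data, and the helpers `bt_prob` and `adapt_reward` are all hypothetical, and response time is left out because the summary does not say how it enters the model.

```python
import math


def bt_prob(w, xa, xb):
    """Bradley-Terry probability that response a is preferred over b,
    under a linear reward r(x) = w . x (an assumption of this sketch)."""
    diff = sum(wi * (a - b) for wi, a, b in zip(w, xa, xb))
    return 1.0 / (1.0 + math.exp(-diff))


def adapt_reward(w0, examples, lr=0.5, steps=200):
    """Fit reward weights to a few (xa, xb, label) preference examples by
    gradient ascent on the Bradley-Terry log-likelihood.
    label is 1 if a was preferred, 0 if b was preferred."""
    w = list(w0)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xa, xb, y in examples:
            p = bt_prob(w, xa, xb)
            for i in range(len(w)):
                grad[i] += (y - p) * (xa[i] - xb[i])
        w = [wi + lr * g / len(examples) for wi, g in zip(w, grad)]
    return w


# Hypothetical features per response: (helpfulness, brevity).
# Group A tends to prefer the more helpful response.
group_a = [
    ([1.0, 0.0], [0.0, 1.0], 1),
    ([0.8, 0.2], [0.3, 0.9], 1),
    ([0.2, 0.1], [0.9, 0.3], 0),
]
# Group B tends to prefer the shorter response.
group_b = [
    ([1.0, 0.0], [0.0, 1.0], 0),
    ([0.8, 0.2], [0.3, 0.9], 0),
    ([0.1, 0.9], [0.9, 0.2], 1),
]

# The same starting reward is adapted separately to each group's examples.
w_a = adapt_reward([0.0, 0.0], group_a)
w_b = adapt_reward([0.0, 0.0], group_b)

# An unseen pair: a is more helpful, b is shorter.
query_a, query_b = [0.9, 0.1], [0.2, 0.8]
p_group_a = bt_prob(w_a, query_a, query_b)
p_group_b = bt_prob(w_b, query_a, query_b)
print(f"P(a preferred | group A examples) = {p_group_a:.3f}")
print(f"P(a preferred | group B examples) = {p_group_b:.3f}")
```

The same query pair is scored differently depending on which handful of examples the reward was adapted to, which is the behavior the summary attributes to the single in-context model.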

alignment training reasoning

Key Terms

reinforcement-learning-from-human-feedback in-context-learning preference-modeling reward-model distribution-shift