Instead of training a separate reward model for each group of users, you can train a single transformer that adapts its reward predictions from just a few preference examples, which makes alignment more scalable when human values differ.
This paper proposes a method that makes the reward models used in AI alignment more flexible: instead of relying on a single fixed reward model, the model adapts to different human preferences on the fly. The key insight is that adding human response time as an extra signal helps a transformer adjust its reward predictions from a few examples of a new preference.
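
As a rough illustration of what "a few examples plus response time" could look like as model input, here is a minimal sketch. It is not the paper's encoding: the `PreferenceExample` fields, the `build_context` helper, and the token layout are all hypothetical, chosen only to show how labelled comparisons and their response times might be packed alongside an unlabelled query.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PreferenceExample:
    """One labelled comparison from a user (all fields are illustrative)."""
    features_a: Sequence[float]   # features of response A
    features_b: Sequence[float]   # features of response B
    chose_a: bool                 # True if the user preferred A
    response_time_s: float        # seconds the user took to decide


def build_context(examples: List[PreferenceExample],
                  query_a: Sequence[float],
                  query_b: Sequence[float]) -> List[List[float]]:
    """Pack few-shot examples plus one unlabelled query into a token sequence.

    Assumed layout per token (hypothetical, not from the paper):
    [features_a..., features_b..., label, response_time, is_query]
    """
    tokens = []
    for ex in examples:
        label = 1.0 if ex.chose_a else -1.0
        tokens.append([*ex.features_a, *ex.features_b,
                       label, ex.response_time_s, 0.0])
    # The query has no label or response time yet, so both are zeroed
    # and the final flag marks it as the comparison to be predicted.
    tokens.append([*query_a, *query_b, 0.0, 0.0, 1.0])
    return tokens


examples = [
    PreferenceExample([1.0, 0.0], [0.0, 1.0], chose_a=True, response_time_s=0.8),
    PreferenceExample([0.2, 0.9], [0.9, 0.1], chose_a=False, response_time_s=3.5),
]
context = build_context(examples, query_a=[0.5, 0.5], query_b=[0.4, 0.6])
```

A transformer consuming `context` would be asked to predict the preference for the final token. In this sketch the response time is simply an extra feature the model can condition on.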