When training models with RL and AI feedback, reward signal design is critical: robust reward shaping does more to prevent reward exploitation than the choice of RL algorithm, and rule-based corrections can fix systematic failures such as verbatim copying.
This paper tackles a real-world problem in job search: generating portable queries that capture candidate qualifications without user-specific details. The authors train models using reinforcement learning with AI feedback (RLAIF), but find that standard reward signals are exploited: models learn to copy text verbatim instead of generalizing.
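To make the idea of a rule-based correction concrete, here is a minimal sketch of one way a copying penalty could be layered on top of an AI-feedback score. It is an illustration under stated assumptions, not the authors' method: the function names (`copy_fraction`, `shaped_reward`), the 4-gram overlap rule, the 0.5 threshold, and the fixed penalty are all hypothetical choices, and "source" is assumed to be whatever input text the model could copy from.

```python
def ngram_set(tokens, n):
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copy_fraction(source: str, generated: str, n: int = 4) -> float:
    """Fraction of the generated text's n-grams that also appear in the source.

    Whitespace tokenization and lowercasing are simplifying assumptions.
    """
    gen = ngram_set(generated.lower().split(), n)
    if not gen:
        # Output shorter than n tokens: no n-grams to compare.
        return 0.0
    src = ngram_set(source.lower().split(), n)
    return len(gen & src) / len(gen)


def shaped_reward(ai_score: float, source: str, generated: str,
                  n: int = 4, threshold: float = 0.5,
                  penalty: float = 1.0) -> float:
    """Subtract a fixed penalty from the AI-feedback score when the
    copied fraction exceeds the threshold (hypothetical rule)."""
    if copy_fraction(source, generated, n) > threshold:
        return ai_score - penalty
    return ai_score


if __name__ == "__main__":
    profile = "senior backend engineer with five years of python experience at acme corp"
    copied = profile
    rewritten = "backend python engineer roles requiring five years experience"
    print(shaped_reward(0.9, profile, copied))     # penalized
    print(shaped_reward(0.9, profile, rewritten))  # unchanged
```

The point of the design is that the rule is deterministic and cheap, so it cannot be gamed the way a learned scorer can; the threshold and n-gram size would need tuning against real outputs.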