Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search

Ping Liu, Qianqi Shen, Jianqiang Shen, Wenqiong Liu, Rajat Arora et al. | June 25, 2026 | arXiv

Key Takeaway

When training models with RL and AI feedback, reward signal design is critical: robust reward shaping prevents exploitation more effectively than the choice of algorithm, and rule-based corrections can fix systematic failures such as verbatim copying.

Summary

This paper tackles a real-world problem in job search: generating portable queries that capture candidate qualifications without user-specific details. The authors train models using reinforcement learning from AI feedback (RLAIF), but find that standard reward signals get exploited: models learn to copy text verbatim instead of generalizing.
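
To make the verbatim-copying failure concrete, here is a minimal sketch of one way a rule-based correction could be layered on top of an AI judge's score. It is not the authors' implementation: the function names, the trigram overlap measure, the 0.5 threshold, and the penalty weight are all assumptions chosen for illustration.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams in a token list (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copy_ratio(query, source, n=3):
    """Fraction of the query's word n-grams that also appear in the source."""
    query_grams = ngrams(query.lower().split(), n)
    if not query_grams:
        return 0.0
    source_grams = ngrams(source.lower().split(), n)
    return len(query_grams & source_grams) / len(query_grams)


def shaped_reward(judge_score, query, source, threshold=0.5, penalty=1.0):
    """Hypothetical rule-based correction: subtract a penalty proportional to
    the overlap when the query copies too much of the source text."""
    ratio = copy_ratio(query, source)
    if ratio > threshold:
        return judge_score - penalty * ratio
    return judge_score


# Illustrative inputs (invented for this sketch, not taken from the paper).
source = "senior backend engineer with five years of python experience at acme corp"
copied = "senior backend engineer with five years of python experience"
general = "backend engineering roles requiring python"

copied_reward = shaped_reward(0.9, copied, source)    # penalized: full overlap
general_reward = shaped_reward(0.9, general, source)  # unchanged: no overlap
```

In this sketch the judge's score is left untouched for a generalized query and pushed down for a copied one, so a policy can no longer gain reward just by echoing its input.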

training applications

Key Terms

reinforcement-learning-from-ai-feedback, reward-hacking, reward-shaping, grpo, llm-as-judge