A training technique that scores candidate examples with a reward signal and trains the model only on the highest-scoring ones.
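
The selection step can be illustrated with a minimal sketch. The function name `select_by_reward`, the `keep_fraction` parameter, and the toy reward used below are all hypothetical choices made for illustration; the definition above does not specify how rewards are computed or how many examples are kept.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_by_reward(
    examples: Sequence[T],
    reward_fn: Callable[[T], float],
    keep_fraction: float = 0.5,
) -> List[T]:
    """Return the highest-reward examples, best first.

    Assumption: "high-quality" means "top fraction by reward".
    A fixed reward threshold would be an equally plausible rule.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not examples:
        return []

    # Score every example once, remembering its original position
    # so that ties are broken by input order.
    scored = [(reward_fn(ex), i, ex) for i, ex in enumerate(examples)]
    scored.sort(key=lambda item: (-item[0], item[1]))

    # Always keep at least one example.
    n_keep = max(1, int(len(examples) * keep_fraction))
    return [ex for _, _, ex in scored[:n_keep]]


# Toy usage: string length stands in for a real reward model.
candidates = ["a", "bbb", "cc", "dddd"]
selected = select_by_reward(candidates, reward_fn=len, keep_fraction=0.5)
print(selected)  # ['dddd', 'bbb']
```

In an actual training setup the selected examples would then be passed to an ordinary training step, and `reward_fn` would typically be a learned reward model or a task-specific scorer rather than a length heuristic.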