Training with separate reward objectives for different tasks (e.g., binary judgment vs. error localization) instead of optimizing them jointly.
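As a rough illustration of the distinction, the sketch below contrasts a per-task reward with a jointly weighted one. Everything here is hypothetical and not taken from the source: the `Sample` fields, the exact-match scoring rules, the function names, and the 0.5 mixing weight are all assumptions chosen only to make the contrast concrete.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One model output plus its reference (hypothetical schema)."""
    task: str                                  # "judgment" or "localization"
    predicted_label: bool                      # model's correct/incorrect verdict
    gold_label: bool
    predicted_error_step: Optional[int] = None  # index of the step flagged as wrong
    gold_error_step: Optional[int] = None


def judgment_reward(s: Sample) -> float:
    # Assumed rule: 1.0 when the binary verdict matches the reference.
    return 1.0 if s.predicted_label == s.gold_label else 0.0


def localization_reward(s: Sample) -> float:
    # Assumed rule: 1.0 only when the flagged step matches exactly.
    return 1.0 if s.predicted_error_step == s.gold_error_step else 0.0


def separate_reward(s: Sample) -> float:
    """Score a sample with the objective of its own task only."""
    if s.task == "judgment":
        return judgment_reward(s)
    if s.task == "localization":
        return localization_reward(s)
    raise ValueError(f"unknown task: {s.task}")


def joint_reward(s: Sample, weight: float = 0.5) -> float:
    """Score every sample with a weighted mix of both objectives."""
    return weight * judgment_reward(s) + (1.0 - weight) * localization_reward(s)


# A sample whose verdict is right but whose flagged step is wrong.
sample = Sample(
    task="judgment",
    predicted_label=False,
    gold_label=False,
    predicted_error_step=3,
    gold_error_step=5,
)
print(separate_reward(sample))  # judgment signal only
print(joint_reward(sample))     # judgment signal diluted by the localization miss
```

In this toy setup the separate objective keeps the judgment signal intact, while the joint objective lets a localization miss lower the reward for an otherwise correct verdict.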