A post-training approach for language models that optimizes against rewards which can be objectively checked, such as exact-match correctness on benchmarks with known answers.
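
As a minimal sketch of the idea, a verifiable reward can be as simple as an exact-match check of a model's output against a known-correct reference answer; the function and variable names below are illustrative, not part of any specific implementation:

```python
def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the model's answer matches the known-correct
    reference exactly (after light normalization), else 0.0.
    The check is objective, so no learned reward model is needed."""
    def normalize(s: str) -> str:
        return s.strip().lower()
    return 1.0 if normalize(model_output) == normalize(reference_answer) else 0.0

# Example: scoring a batch of sampled rollouts for a problem whose answer is "42"
rollouts = ["42", " 42 ", "41"]
rewards = [verifiable_reward(r, "42") for r in rollouts]
# rewards == [1.0, 1.0, 0.0]
```

In practice the verifier is matched to the task (a unit-test runner for code, an answer extractor plus equivalence check for math), but the principle is the same: the reward signal comes from an objective check rather than a subjective judgment.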