The task of assessing and scoring the quality, correctness, or alignment of text outputs, often used to filter or rank model responses.
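As a minimal sketch of how scores can drive filtering and ranking: the scorer below, `toy_score`, is a hypothetical keyword-overlap heuristic that stands in for a real quality or alignment model. The names `rank_responses`, `min_score`, and `KEYWORDS` are assumptions made for this example only.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical reference keywords; a real evaluator would use a learned model
# or a rubric rather than a fixed word list.
KEYWORDS = {"paris", "capital", "france"}


def toy_score(response: str) -> float:
    """Return the fraction of KEYWORDS that appear in the response."""
    words = set(re.findall(r"[a-z]+", response.lower()))
    return len(KEYWORDS & words) / len(KEYWORDS)


def rank_responses(
    responses: List[str],
    score_fn: Callable[[str], float],
    min_score: Optional[float] = None,
) -> List[Tuple[float, str]]:
    """Score each response, optionally drop low scorers, and sort best-first."""
    scored = [(score_fn(r), r) for r in responses]
    if min_score is not None:
        # Filtering step: keep only responses at or above the threshold.
        scored = [(s, r) for s, r in scored if s >= min_score]
    # Ranking step: highest score first; ties keep their input order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


responses = [
    "I like turtles.",
    "The capital is Paris.",
    "Paris is the capital of France.",
]
ranked = rank_responses(responses, toy_score, min_score=0.5)
for score, text in ranked:
    print(f"{score:.2f}  {text}")
```

Passing the scorer in as a parameter keeps the filter-and-rank logic independent of how quality is actually measured, so the heuristic can be swapped for a model-based scorer without changing the rest of the code.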
Multi-step reasoning, logic puzzles, and mathematical problem-solving.