Training verifiers with access to correct answers creates a supervision signal that enables both test-time refinement and training-time self-improvement, two previously bottlenecked approaches to scaling reasoning models.
This paper tackles a key bottleneck in AI reasoning: building verifiers that can catch errors in model-generated solutions. The authors propose self-trained verification (STV), which trains verifiers by showing them reference solutions so they learn to spot mistakes.
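The idea above can be sketched as follows, assuming a verifier training example pairs a prompt that shows the reference solution next to the candidate with a verdict as the target. The function names, the prompt layout, the "Answer:" marker, and the answer-matching rule that produces the verdict are all hypothetical and chosen only for illustration. The text above does not say how STV builds or labels its examples.

```python
# Hypothetical sketch: names, prompt layout, and the labeling rule are
# illustrative assumptions, not the STV implementation.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VerifierExample:
    prompt: str  # text the verifier reads
    target: str  # verdict the verifier is trained to produce


def final_answer(solution: str) -> str:
    """Return the text after the last 'Answer:' marker (assumed solution format)."""
    return solution.rsplit("Answer:", 1)[-1].strip()


def build_prompt(problem: str, candidate: str, reference: Optional[str] = None) -> str:
    """Format the verifier input; the reference section appears only if one is given."""
    parts = [f"Problem:\n{problem}", f"Candidate solution:\n{candidate}"]
    if reference is not None:
        parts.append(f"Reference solution:\n{reference}")
    parts.append("Is the candidate solution correct?")
    return "\n\n".join(parts)


def build_training_example(problem: str, candidate: str, reference: str) -> VerifierExample:
    """Pair a reference-conditioned prompt with a verdict.

    Assumption: the verdict comes from comparing final answers. The source
    text does not state how verdicts are produced.
    """
    matches = final_answer(candidate) == final_answer(reference)
    verdict = "correct" if matches else "incorrect"
    return VerifierExample(build_prompt(problem, candidate, reference), verdict)


example = build_training_example(
    problem="What is 17 + 25?",
    candidate="17 + 25 = 32. Answer: 32",
    reference="17 + 25 = 42. Answer: 42",
)
print(example.target)  # prints "incorrect"
```

The reference is an optional argument so the same formatter could also serve a setting where no reference is available. Whether the verifier sees a reference outside of training is not stated above. The answer-matching rule is only a placeholder for whatever actually produces the verdict.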