Models generalize under weak supervision when training sustains a prolonged phase of steady improvement instead of saturating quickly; fine-tuning on explicit reasoning traces combined with domain-specific pre-training is one way to induce this regime.
This paper investigates when large language models can learn to reason effectively under weak supervision (scarce data, noisy rewards, or self-generated rewards). The key finding is that models generalize when training includes a prolonged phase of steadily improving performance, rather than quickly memorizing the training signal.
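The distinction between steady improvement and rapid saturation can be made concrete by measuring how long a validation-score curve keeps rising. Below is a minimal sketch of such a diagnostic; the function name `improvement_phase_fraction`, the window size, and the slope threshold are all illustrative assumptions, not anything specified by the paper.

```python
def improvement_phase_fraction(val_scores, window=5, eps=1e-3):
    """Fraction of training during which the validation score still improves.

    A value near 1.0 indicates a long, steady improvement phase (the regime
    the paper associates with generalization); a value near 0.0 indicates
    rapid saturation, consistent with memorization. Window and eps are
    illustrative hyperparameters.
    """
    if len(val_scores) <= window:
        return 1.0
    improving = 0
    for t in range(window, len(val_scores)):
        # Mean slope of the score over the trailing window.
        slope = (val_scores[t] - val_scores[t - window]) / window
        if slope > eps:
            improving += 1
    return improving / (len(val_scores) - window)

# Synthetic curves: one that keeps improving, one that plateaus early.
steady = [0.1 + 0.008 * t for t in range(100)]
saturated = [min(0.9, 0.1 + 0.1 * t) for t in range(100)]
print(improvement_phase_fraction(steady))     # close to 1.0
print(improvement_phase_fraction(saturated))  # small
```

On the synthetic curves above, the steadily improving run scores near 1.0 while the early-plateau run scores near 0.0, matching the paper's framing of which training dynamics predict generalization.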