Models generalize under weak supervision when training sustains a prolonged phase of steady improvement instead of saturating quickly; fine-tuning on explicit reasoning traces combined with domain-specific pre-training is one way to induce this regime.
This paper investigates when large language models can learn to reason effectively under weak supervision (scarce data, noisy rewards, or self-generated rewards). The key finding is that models generalize when training includes a prolonged phase of steadily improving performance, rather than quickly memorizing the training signal.
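The distinction between steady improvement and rapid saturation can be made concrete by measuring how long a validation-score curve keeps rising. Below is a minimal sketch of such a diagnostic; the function name `improvement_phase_fraction`, the window size, and the slope threshold are all illustrative assumptions, not anything specified by the paper.

```python
def improvement_phase_fraction(val_scores, window=5, eps=1e-3):
    """Fraction of training during which the validation score still improves.

    A value near 1.0 indicates a long, steady improvement phase (the regime
    the paper associates with generalization); a value near 0.0 indicates
    rapid saturation, consistent with memorization. Window and eps are
    illustrative hyperparameters.
    """
    if len(val_scores) <= window:
        return 1.0
    improving = 0
    for t in range(window, len(val_scores)):
        # Mean slope of the score over the trailing window.
        slope = (val_scores[t] - val_scores[t - window]) / window
        if slope > eps:
            improving += 1
    return improving / (len(val_scores) - window)

# Synthetic curves: one that keeps improving, one that plateaus early.
steady = [0.1 + 0.008 * t for t in range(100)]
saturated = [min(0.9, 0.1 + 0.1 * t) for t in range(100)]
print(improvement_phase_fraction(steady))     # close to 1.0
print(improvement_phase_fraction(saturated))  # small
```

On the synthetic curves above, the steadily improving run scores near 1.0 while the early-plateau run scores near 0.0, matching the paper's framing of which training dynamics predict generalization.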