Reasoning models can be trained with imperfect, reusable skills drawn from past experience rather than perfect reference answers: multiple skill-based teachers vote on whether the skills help or hurt, and the model learns from the disagreements between teachers.
This paper improves how language models learn to reason by using a skill bank, a collection of past problem-solving techniques, as training guidance. Rather than assuming perfect reference answers, the method validates whether each retrieved skill actually helps or hurts on a new problem, then uses the outcome of that validation to guide training.
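To make the "helps or hurts" validation concrete, here is a minimal sketch in Python. It is not the paper's procedure: the names (`SkillVerdict`, `validate_skills`, `teachers_disagree`) are hypothetical, and the rule used here, comparing success rate with a skill against success rate without it, is an assumed stand-in for whatever validation the authors actually use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SkillVerdict:
    """Outcome of checking one retrieved skill on one problem (hypothetical type)."""
    skill: str
    base_accuracy: float   # success rate of attempts made without the skill
    skill_accuracy: float  # success rate of attempts made with the skill

    @property
    def label(self) -> str:
        # Assumed rule: a skill "helps" if it raises the success rate,
        # "hurts" if it lowers it, and is "neutral" otherwise.
        if self.skill_accuracy > self.base_accuracy:
            return "helps"
        if self.skill_accuracy < self.base_accuracy:
            return "hurts"
        return "neutral"


def accuracy(outcomes: List[bool]) -> float:
    """Fraction of successful attempts; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def validate_skills(
    base_outcomes: List[bool],
    skill_outcomes: Dict[str, List[bool]],
) -> List[SkillVerdict]:
    """Compare each skill-guided set of attempts against the unguided baseline."""
    base = accuracy(base_outcomes)
    return [
        SkillVerdict(skill, base, accuracy(outcomes))
        for skill, outcomes in skill_outcomes.items()
    ]


def teachers_disagree(verdicts: List[SkillVerdict]) -> bool:
    """True when at least one skill helps and at least one hurts on this problem."""
    labels = {v.label for v in verdicts if v.label != "neutral"}
    return len(labels) > 1


# Illustrative, made-up outcomes for a single problem (True = correct answer).
example_verdicts = validate_skills(
    base_outcomes=[True, False, False, True],
    skill_outcomes={
        "skill_a": [True, True, True, False],
        "skill_b": [False, False, True, False],
    },
)
```

In this toy setup, a problem where `teachers_disagree` returns `True` is one where the skill-guided attempts pull in opposite directions. How such disagreements are turned into a training signal is left to the method itself and is not modeled here.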