When LLMs make reasoning mistakes, the fix isn't simply to provide the correct answer. Instead, extract the steps that multiple reasoning attempts agree on and use that consensus to rebuild a stronger reasoning chain.
This paper identifies two types of flaws in how language models reason: errors within individual steps and errors in how steps are sequenced. The authors show that merely giving models the correct final answer repairs neither.
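The consensus-extraction idea lends itself to a small sketch. The code below is a hypothetical, string-matching toy rather than the paper's actual method: it takes several sampled reasoning attempts, keeps only the steps shared by a majority of them, and reassembles those steps into a chain. The function name, the majority threshold, and the example attempts are all illustrative assumptions.

```python
from collections import Counter

def consensus_steps(attempts: list[list[str]], threshold: float = 0.5) -> list[str]:
    """Keep steps appearing in more than `threshold` of the attempts,
    ordered by first appearance. A toy stand-in for consensus extraction."""
    normalized = [[s.strip().lower() for s in attempt] for attempt in attempts]
    # Count how many attempts contain each step; set() avoids
    # double-counting a step repeated within a single attempt.
    counts = Counter(step for attempt in normalized for step in set(attempt))
    cutoff = threshold * len(attempts)
    seen, chain = set(), []
    for attempt in normalized:
        for step in attempt:
            if counts[step] > cutoff and step not in seen:
                seen.add(step)
                chain.append(step)
    return chain

# Three hypothetical attempts at "12 apples, eat a quarter, buy 5 more".
attempts = [
    ["a quarter of 12 is 3", "12 - 3 = 9", "9 + 5 = 14"],
    ["a quarter of 12 is 3", "12 - 3 = 9", "9 + 5 = 14"],
    ["a quarter of 12 is 4", "12 - 4 = 8", "9 + 5 = 14"],  # flawed first step
]
print(consensus_steps(attempts))
# ['a quarter of 12 is 3', '12 - 3 = 9', '9 + 5 = 14']
```

In this toy, the flawed steps from the third attempt fall below the majority cutoff and are dropped, while the shared steps survive and form the rebuilt chain. Real reasoning steps rarely match string-for-string, so a practical version would need semantic matching between steps; exact matching here just keeps the illustration self-contained.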