Process reward models need to account for the full context of a reasoning path and penalize risky intermediate steps rather than only rewarding final correctness. This matters most in domains where a wrong reasoning path is costly.
This paper addresses a key problem in evaluating AI reasoning: process reward models often assign high scores to flawed reasoning paths because later correct steps mask earlier mistakes. The authors propose SCPRM, which scores each reasoning step conditioned on the preceding steps and on its distance to the target, and they combine it with tree search to answer questions over knowledge graphs.
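Since the paper's implementation details are not reproduced here, the following is a minimal Python sketch of the idea under stated assumptions: the names (`KG`, `context_score`, `distance_score`, `step_reward`, `tree_search`), the toy graph, the mixing weight `alpha`, and the simple heuristics standing in for learned models are all hypothetical, not taken from the paper. The sketch only illustrates the two-part step score (context-conditioned quality plus distance to the target) guiding a beam-style tree search.

```python
# Hypothetical knowledge graph: entity -> list of (relation, neighbor) edges.
KG = {
    "Paris": [("capital_of", "France"), ("located_in", "Europe")],
    "France": [("member_of", "EU"), ("capital", "Paris")],
    "EU": [("headquartered_in", "Brussels")],
    "Europe": [("contains", "France")],
}

def context_score(path, step):
    """Toy stand-in for a learned scorer: rate a candidate step given the
    full prefix of earlier steps, so a risky prefix drags the score down."""
    penalty = sum(0.2 for (rel, _ent) in path if rel == "located_in")  # assume vague hops are risky
    return max(0.0, 1.0 - penalty)

def distance_score(entity, target, max_hops=3):
    """Toy stand-in for the distance-to-target term: BFS hop count in the KG,
    mapped to [0, 1] so entities closer to the target score higher."""
    frontier, seen = [entity], {entity}
    for hops in range(max_hops + 1):
        if target in frontier:
            return 1.0 - hops / (max_hops + 1)
        nxt = [n for e in frontier for (_r, n) in KG.get(e, []) if n not in seen]
        seen.update(nxt)
        frontier = nxt
    return 0.0

def step_reward(path, step, target, alpha=0.5):
    """Combine the context-conditioned score with distance to the target;
    alpha is an assumed mixing weight, not a value from the paper."""
    _rel, entity = step
    return alpha * context_score(path, step) + (1 - alpha) * distance_score(entity, target)

def tree_search(start, target, beam_width=2, max_depth=3):
    """Beam-style tree search guided by the step reward: at each depth,
    expand every partial path and keep the highest-scoring candidates."""
    beams = [([], start, 0.0)]  # (path, current entity, cumulative reward)
    for _ in range(max_depth):
        candidates = []
        for path, entity, score in beams:
            if entity == target:  # keep completed paths in the beam
                candidates.append((path, entity, score))
                continue
            for step in KG.get(entity, []):
                r = step_reward(path, step, target)
                candidates.append((path + [step], step[1], score + r))
        beams = sorted(candidates, key=lambda b: -b[2])[:beam_width]
    return beams[0] if beams else None

print(tree_search("Paris", "EU"))  # prefers capital_of -> member_of over the riskier located_in hop
```

Because the reward is computed per step over the whole prefix, a path whose early hops are penalized cannot recover a high cumulative score from later correct steps, which is the masking failure the summary describes.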