RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter | June 4, 2026 | arXiv

Key Takeaway

Instead of treating all reasoning steps equally when a final answer is wrong, you can use the model to identify which intermediate steps were on the right track. This reduces training variance and improves sample efficiency for reasoning models.

Summary

This paper tackles a key problem in training reasoning models: when you can only check if the final answer is correct, how do you know which steps in the reasoning process were actually helpful? RREDCoT solves this by using the model itself to figure out which parts of the reasoning chain deserve more credit, improving training efficiency without extra computation.
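
To make the idea of segment-level credit concrete, here is a minimal sketch of one generic way to spread a single delayed reward over reasoning segments. It is not taken from the paper: the function name `redistribute_reward`, the `prior` parameter, and the per-segment scores are all hypothetical, and the scores are assumed to be model-derived estimates of how likely a correct final answer is after each segment. Each segment is credited with the change in that estimate, and any residual is placed on the last segment so the segment rewards still sum to the original final reward.

```python
from typing import List


def redistribute_reward(
    success_estimates: List[float],
    final_reward: float,
    prior: float = 0.0,
) -> List[float]:
    """Split one delayed reward across reasoning segments (illustrative only).

    success_estimates[i] is an assumed score in [0, 1]: the estimated chance
    of a correct final answer after segment i. How such scores are obtained
    is not specified here; this sketch takes them as given.
    prior is the assumed estimate before any reasoning has been produced.
    """
    if not success_estimates:
        return []

    # Credit each segment with the change in estimated success it caused.
    previous = [prior] + list(success_estimates[:-1])
    rewards = [cur - prev for cur, prev in zip(success_estimates, previous)]

    # Put the residual on the last segment so the per-segment rewards
    # sum to the verifiable final reward.
    rewards[-1] += final_reward - (success_estimates[-1] - prior)
    return rewards


if __name__ == "__main__":
    # Hypothetical estimates for a four-segment chain with a correct answer.
    estimates = [0.2, 0.7, 0.6, 0.9]
    segment_rewards = redistribute_reward(estimates, final_reward=1.0)
    for i, r in enumerate(segment_rewards):
        print(f"segment {i}: {r:+.2f}")
    print(f"total: {sum(segment_rewards):.2f}")
```

In this toy run the second segment, which raises the estimate the most, receives the largest share, while the third segment, which lowers it, receives a negative reward. The total is unchanged, so the training signal is still anchored to the checked final answer.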

training, reasoning

Key Terms

chain-of-thought, credit-assignment, group-relative-policy-optimization, delayed-reward, monte-carlo-sampling