Predicting Future Behaviors in Reasoning Models Enables Better Steering

Evgenii Kortukov, Piotr Komorowski, Florian Klein, Paula Engl, Gabriele Sarti et al. | June 9, 2026 | arXiv

Key Takeaway

Predicting future behavior is more effective for steering reasoning models than detecting past behavior. This distinction enables better control with minimal loss in output quality.

Summary

This paper shows that steering large reasoning models works better when you predict what the model will do next, rather than detecting what it already did. The researchers train probes to forecast future behaviors from intermediate reasoning steps, then use these predictions to guide text generation toward desired outcomes with minimal quality loss.
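The probe-then-steer idea can be illustrated with a toy sketch. This is not the authors' implementation: the synthetic hidden states, the logistic-regression probe, the threshold, and the steering strength `alpha` are all assumptions chosen for illustration, and the names `train_probe`, `predict`, and `steer` are hypothetical. The sketch fits a linear probe that forecasts whether a behavior will appear later, then shifts a hidden state along the probe direction only when the forecast crosses the threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states at intermediate reasoning steps.
# (Assumption: real work would read these from a model's activations.)
d_model, n = 16, 400
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
H = rng.normal(size=(n, d_model))
# Label = 1 if the behavior shows up in a LATER step (synthetic rule here).
y = (H @ true_dir + 0.1 * rng.normal(size=n) > 0).astype(float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_probe(H, y, lr=0.5, epochs=300):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(H.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(H @ w + b)
        w -= lr * (H.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def predict(h, w, b):
    """Probe's forecast probability that the behavior will occur."""
    return sigmoid(h @ w + b)


def steer(h, w, b, threshold=0.5, alpha=4.0):
    """Shift h against the probe direction only if the behavior is forecast.

    A negative alpha would instead push toward the behavior.
    """
    if predict(h, w, b) > threshold:
        return h - alpha * w / np.linalg.norm(w)
    return h


w, b = train_probe(H, y)
accuracy = np.mean((predict(H, w, b) > 0.5) == (y == 1.0))
```

Because the intervention is gated on the probe's forecast, states where the behavior is not predicted pass through unchanged, which is one plausible reason a predictive approach could limit quality loss.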

reasoning safety evaluation

Key Terms

activation-probing, test-time-steering, reasoning-model, hidden-representations