Predicting future behavior is more effective for steering reasoning models than detecting past behavior, and this distinction enables better control with minimal loss of output quality.
This paper shows that steering large reasoning models works better when you predict what the model will do next, rather than detecting what it already did. The researchers train probes to forecast future behaviors from intermediate reasoning steps, then use these predictions to guide text generation toward desired outcomes with minimal quality loss.
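The probe-then-steer idea described above can be illustrated with a small sketch. It is not the paper's implementation: the synthetic "hidden states", the logistic-regression probe, and the candidate-selection rule in `pick_candidate` are all assumptions made for illustration, and every function name here is hypothetical. The sketch fits a linear probe that maps an intermediate state to the probability that a behavior appears later, then uses that forecast to choose among candidate continuations.

```python
import numpy as np


def make_synthetic_data(n=400, dim=16, seed=0):
    """Toy stand-in for (intermediate state, later behavior) pairs.

    Assumption: real data would pair a model's hidden state at one reasoning
    step with a label for a behavior observed at a later step.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    states = rng.normal(size=(n, dim))
    # The future behavior depends on one hidden direction plus a little noise.
    logits = 3.0 * (states @ direction) + 0.3 * rng.normal(size=n)
    labels = (logits > 0).astype(float)
    return states, labels


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_probe(states, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, dim = states.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(states @ w + b)
        grad = p - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (states.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def predict_future_behavior(w, b, state):
    """Probability that the behavior occurs later, given the current state."""
    return float(sigmoid(state @ w + b))


def pick_candidate(w, b, candidate_states):
    """Hypothetical steering rule: keep the candidate continuation whose
    state has the highest predicted probability of the desired behavior."""
    scores = [predict_future_behavior(w, b, s) for s in candidate_states]
    return int(np.argmax(scores)), scores


states, labels = make_synthetic_data()
train_x, train_y = states[:300], labels[:300]
test_x, test_y = states[300:], labels[300:]

w, b = train_probe(train_x, train_y)
test_acc = float(((sigmoid(test_x @ w + b) > 0.5) == (test_y > 0.5)).mean())

best, scores = pick_candidate(w, b, test_x[:5])
print(f"held-out probe accuracy: {test_acc:.2f}")
print(f"chosen candidate index: {best}")
```

Selecting among candidates is only one way to act on a forecast. The same probe output could instead gate an intervention on the hidden state itself, and the summary does not say which mechanism the authors use.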