Language models encode a linear representation of expected success that directly influences their confidence and decision-making. Understanding this representation could improve how we steer model behavior and diagnose when models are uncertain.
This paper reports that language models internally represent a 'value axis': a direction in their activation space that tracks whether the model's current strategy is expected to succeed. Analyzing Qwen3-8B, the authors show that this axis predicts confidence levels, code correctness, and backtracking behavior, and that steering along it causally changes whether the model keeps exploring or commits to a solution.
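To make the idea of a direction in activation space concrete, here is a minimal sketch on synthetic data. It is not the paper's procedure. The difference-of-means estimator, the helper names (`unit_direction`, `project`, `steer`), the hidden size, and the steering strength `alpha` are all assumptions chosen for illustration, and the random vectors stand in for real Qwen3-8B activations.

```python
import numpy as np


def unit_direction(success_acts, failure_acts):
    """Estimate an axis as the normalized difference of class means.

    Assumption: difference-of-means is a stand-in estimator, not
    necessarily how the paper finds its value axis.
    """
    diff = success_acts.mean(axis=0) - failure_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def project(acts, direction):
    """Return the scalar coordinate of each activation along the axis."""
    return acts @ direction


def steer(acts, direction, alpha):
    """Shift every activation by alpha units along the axis."""
    return acts + alpha * direction


# Synthetic stand-in for hidden states: two clusters separated along
# one coordinate. Real activations would come from a model's layers.
rng = np.random.default_rng(0)
hidden = 16  # hypothetical hidden size, far smaller than a real model's
true_axis = np.zeros(hidden)
true_axis[0] = 1.0
success = rng.normal(size=(200, hidden)) + 2.0 * true_axis
failure = rng.normal(size=(200, hidden)) - 2.0 * true_axis

axis = unit_direction(success, failure)

# Reading: the projection acts as a scalar score of expected success.
scores_success = project(success, axis)
scores_failure = project(failure, axis)

# Writing: steering moves the score by exactly alpha because the axis
# has unit length.
steered = steer(failure, axis, alpha=3.0)
scores_steered = project(steered, axis)
```

In this sketch, reading the axis is a dot product and steering is a vector addition, which is what makes a linear representation convenient to both probe and intervene on. In a real model the addition would be applied to a chosen layer's hidden state during the forward pass.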