The Value Axis: Language Models Encode Whether They're on the Right Track

Nick Jiang, Isaac Kauvar, Jack Lindsey | June 15, 2026 | arXiv

Key Takeaway

Language models encode a linear representation of expected success that directly influences their confidence and decision-making. Understanding this representation could improve how we steer model behavior and diagnose when models are uncertain.

Summary

This paper finds that language models internally represent a 'value axis': a direction in activation space that tracks whether the model's current strategy will succeed. Analyzing Qwen3-8B, the authors show that this axis predicts confidence levels, code correctness, and backtracking behavior, and that steering along it causally changes whether the model keeps exploring or commits to a solution.
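To make "a direction in activation space" and "steering along it" concrete, here is a minimal sketch on synthetic data. It uses a difference-of-means direction, a common way to extract a linear feature, which is an assumption here and not necessarily the paper's procedure. The hidden size, the synthetic "success" and "failure" activations, and the helper names `fit_axis`, `project`, and `steer` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size, not taken from the paper

# Synthetic stand-ins for hidden activations (assumption: real activations
# would be read from a model layer on successful vs. failing attempts).
true_axis = rng.normal(size=d)
true_axis /= np.linalg.norm(true_axis)
success = rng.normal(size=(200, d)) + 2.0 * true_axis
failure = rng.normal(size=(200, d)) - 2.0 * true_axis


def fit_axis(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Difference-of-means direction, normalized to unit length."""
    direction = pos.mean(axis=0) - neg.mean(axis=0)
    return direction / np.linalg.norm(direction)


def project(h: np.ndarray, axis: np.ndarray) -> np.ndarray:
    """Scalar readout: how far an activation lies along the axis."""
    return h @ axis


def steer(h: np.ndarray, axis: np.ndarray, alpha: float) -> np.ndarray:
    """Shift an activation by alpha units along the axis."""
    return h + alpha * axis


axis = fit_axis(success, failure)
h = failure[0]
steered = steer(h, axis, alpha=4.0)
```

Because the axis has unit length, steering with `alpha=4.0` raises the projected value by exactly 4, whatever the starting activation. In a real model the shifted activation would be written back into the forward pass; that step is omitted here.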

reasoning alignment

Key Terms

mechanistic-interpretability activation-steering value-function direct-preference-optimization in-context-reinforcement-learning