The State-Prediction Separation Hypothesis

Giovanni Monea, Nathan Godey, Kianté Brantley, Yoav Artzi | July 1, 2026 | arXiv

Key Takeaway

Splitting Transformer computation into separate streams for token prediction and state maintenance improves both training efficiency and model performance. It is a simple architectural change with consistent gains across scales.

Summary

This paper proposes separating two functions in Transformers: predicting the next token and maintaining state for future predictions. The authors design a dual-stream architecture and show it improves language modeling efficiency and downstream task performance by 2-3% compared to standard Transformers.
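The summary does not say how the two streams are wired together, so the following is only a toy sketch of the general idea, not the authors' architecture. It assumes (hypothetically) that each position carries a state vector and a prediction vector, that both streams attend causally over the state stream, and that no position ever reads another position's prediction vector. The names `dual_stream_layer`, `W_state`, and `W_pred` are illustrative inventions.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]


def dual_stream_layer(state, pred, W_state, W_pred):
    """One toy layer. `state` and `pred` are lists of vectors, one per position.

    Assumption (not from the paper): keys and values always come from the
    state stream, so the prediction stream is write-only with respect to
    other positions.
    """
    new_state, new_pred = [], []
    for t in range(len(state)):
        context = state[: t + 1]  # causal: only positions <= t
        # State stream: updated from past states, carried forward.
        s_update = matvec(W_state, attend(state[t], context, context))
        new_state.append([a + b for a, b in zip(state[t], s_update)])
        # Prediction stream: reads the state stream, feeds only the output head.
        p_update = matvec(W_pred, attend(pred[t], context, context))
        new_pred.append([a + b for a, b in zip(pred[t], p_update)])
    return new_state, new_pred


# Tiny demo: 3 positions, width 4, both streams start from the same embeddings.
rng = random.Random(0)
dim, seq_len = 4, 3
W_state = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
W_pred = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
state_in = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
pred_in = [list(v) for v in state_in]
state_out, pred_out = dual_stream_layer(state_in, pred_in, W_state, W_pred)
```

In this sketch the state stream's output is independent of the prediction stream's input, which is the separation the summary describes at a high level; a real model would add multiple heads, MLP blocks, normalization, and an output head over `pred_out`.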

architecture efficiency training

Key Terms

transformer-architecture language-modeling compute-efficiency dual-stream-architecture state-representation