Splitting Transformer computation into separate streams for token prediction and state maintenance improves both training efficiency and model performance. The architectural change is simple, and the gains are consistent across scales.
This paper proposes separating two functions in Transformers: predicting the next token and maintaining state for future predictions. The authors design a dual-stream architecture and show it improves language modeling efficiency and downstream task performance by 2-3% compared to standard Transformers.
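
The summary does not say how the two streams are wired together, so the following is only a minimal sketch of one plausible reading, not the authors' design. It assumes a state stream that updates itself with causal self-attention and a prediction stream that reads from the state stream without writing back to it. The names `causal_attention`, `init_params`, `dual_stream_layer`, and all weight names are hypothetical.

```python
import numpy as np


def causal_attention(q, k, v):
    """Single-head attention where position t only sees positions <= t."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # True above the diagonal marks future positions, which are masked out.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def init_params(rng, d):
    """Random projection matrices for both streams (hypothetical layout)."""
    names = ["Wq_s", "Wk_s", "Wv_s", "Wq_p", "Wk_p", "Wv_p"]
    return {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in names}


def dual_stream_layer(state, pred, params):
    """One layer of an assumed dual-stream block.

    state: (T, d) stream that carries information forward for later positions.
    pred:  (T, d) stream used only to form the next-token prediction.
    """
    # State stream: ordinary causal self-attention with a residual connection.
    new_state = state + causal_attention(
        state @ params["Wq_s"], state @ params["Wk_s"], state @ params["Wv_s"]
    )
    # Prediction stream: queries come from pred, keys and values from the
    # state stream, so pred reads the state but never modifies it.
    new_pred = pred + causal_attention(
        pred @ params["Wq_p"], new_state @ params["Wk_p"], new_state @ params["Wv_p"]
    )
    return new_state, new_pred
```

In this sketch both streams would start from the same token embeddings, and only the final prediction stream would feed the output projection. Because the state stream never depends on the prediction stream's weights, each set of parameters serves a single function, which is the separation the summary describes.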