Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Ali Hatamizadeh, Yejin Choi, Jan Kautz|May 21, 2026arXiv

Key Takeaway

Decoupling erase and write operations in linear attention with separate gates improves language model performance, especially on long-context tasks, while maintaining constant-memory decoding.

Summary

This paper improves linear attention mechanisms by separating the control of what to forget from what to remember in compressed memory. Instead of using a single gate to control both erasing old information and writing new information, Gated DeltaNet-2 uses separate channel-wise gates for each operation, making memory updates more flexible and efficient.

architecture efficiency reasoning

Key Terms

linear-attention state-space-models channel-wise-decay fast-weight-update