Local attention, where each token attends only to a window of nearby predecessors instead of all previous tokens, isn't just an efficiency trick. This paper explains why it sometimes improves transformer performance: the authors prove that local and global attention recognize different classes of patterns, so local attention fundamentally expands what a transformer can learn, and combining both types yields the most expressive model.
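To make the contrast concrete, here is a minimal sketch in plain NumPy of the two attention patterns being compared: a global causal mask, where token `i` can attend to every earlier token, versus a sliding-window local mask, where it sees only its `window` most recent predecessors. The function names and the window size are illustrative choices, not taken from the paper.

```python
import numpy as np

def global_causal_mask(n: int) -> np.ndarray:
    # Global attention: position i may attend to all positions j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def local_causal_mask(n: int, window: int) -> np.ndarray:
    # Local attention: position i may attend only to positions j
    # with i - window < j <= i (its `window` nearest predecessors).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

n, window = 6, 3
g = global_causal_mask(n)
l = local_causal_mask(n, window)

# Every connection allowed locally is also allowed globally...
print(np.array_equal(g & l, l))   # True
# ...but global attention has strictly more connections once n > window.
print(int(g.sum()), int(l.sum())) # 21 15
```

Note that the local mask is not simply a weaker version of the global one: restricting the receptive field changes which positions compete in each softmax, which is the mechanism behind the paper's claim that the two attention types recognize different patterns.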