Local attention, where each token attends only to a window of nearby predecessors instead of all previous tokens, isn't just an efficiency trick. This paper explains why it sometimes improves transformer performance: the authors prove that local and global attention recognize different classes of patterns, so local attention fundamentally expands what a transformer can learn, and combining both types yields the most expressive model.
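To make the contrast concrete, here is a minimal sketch in plain NumPy of the two attention patterns being compared: a global causal mask, where token `i` can attend to every earlier token, versus a sliding-window local mask, where it sees only its `window` most recent predecessors. The function names and the window size are illustrative choices, not taken from the paper.

```python
import numpy as np

def global_causal_mask(n: int) -> np.ndarray:
    # Global attention: position i may attend to all positions j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def local_causal_mask(n: int, window: int) -> np.ndarray:
    # Local attention: position i may attend only to positions j
    # with i - window < j <= i (its `window` nearest predecessors).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

n, window = 6, 3
g = global_causal_mask(n)
l = local_causal_mask(n, window)

# Every connection allowed locally is also allowed globally...
print(np.array_equal(g & l, l))   # True
# ...but global attention has strictly more connections once n > window.
print(int(g.sum()), int(l.sum())) # 21 15
```

Note that the local mask is not simply a weaker version of the global one: restricting the receptive field changes which positions compete in each softmax, which is the mechanism behind the paper's claim that the two attention types recognize different patterns.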