Screening attention replaces global competition among keys with absolute relevance thresholds, achieving a 40% parameter reduction and 3.2× faster inference than standard Transformers.
This paper introduces Multiscreen, a language model architecture that replaces standard softmax attention with a 'screening' mechanism. Instead of distributing attention weights across all keys, screening evaluates each key against a threshold to decide which ones are relevant, eliminating the need for keys to compete with each other.
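The abstract does not give the exact formulation, but the idea of threshold-based screening (as opposed to softmax's normalized competition) can be sketched as follows. This is a hypothetical illustration: the scaled dot-product scores, the sigmoid per-key weights, and the threshold `tau` are assumptions, not the paper's actual mechanism.

```python
import numpy as np

def screening_attention(q, K, V, tau=0.0):
    """Hypothetical sketch of threshold-based 'screening' attention.

    q: query vector, shape (d,); K: keys, shape (n, d); V: values, shape (n, dv).
    Each key is evaluated independently against an absolute threshold,
    so keys do not compete through a softmax normalization.
    """
    # Per-key relevance scores (scaled dot product, as in standard attention).
    scores = K @ q / np.sqrt(q.shape[-1])
    # Screen: keep only keys whose score clears the absolute threshold tau.
    keep = scores > tau
    if not keep.any():
        # No key is relevant: return a zero context vector.
        return np.zeros(V.shape[-1])
    # Independent sigmoid weights for the surviving keys (no global softmax).
    w = 1.0 / (1.0 + np.exp(-scores[keep]))
    # Weighted average over the screened-in values only.
    return (w[:, None] * V[keep]).sum(axis=0) / w.sum()
```

Because each key is scored against a fixed threshold rather than against the other keys, adding or removing an irrelevant key leaves the output unchanged, which is the property the screening mechanism targets.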