You can replace attention with a linear-time polynomial mixer and get similar results with much faster inference—especially valuable for long sequences where attention becomes prohibitively expensive.
PoM (Polynomial Mixer) replaces the quadratic-cost attention mechanism in transformers with a polynomial-based token mixer that runs in time linear in sequence length. It compresses all tokens into a shared, learned polynomial representation, from which each token extracts the context it needs.
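To make the linear-time idea concrete, here is a minimal NumPy sketch of the general pattern: map keys and queries through a polynomial feature map, sum the keys' features against the values into one compact state in a single pass, and let each query read from that state. The feature map (`poly_features`), the degree-2 choice, and the normalization are illustrative assumptions, not PoM's actual parameterization.

```python
import numpy as np

def poly_features(x, degree=2):
    # Degree-2 polynomial feature map: [x, flattened outer product x⊗x].
    # Hypothetical choice; the real PoM feature map may differ.
    feats = [x]
    if degree >= 2:
        feats.append(np.einsum('td,te->tde', x, x).reshape(x.shape[0], -1))
    return np.concatenate(feats, axis=-1)

def polynomial_mixer(Q, K, V):
    # Compress all tokens into one polynomial state in a single pass
    # (O(T) in sequence length), then let each query read from it,
    # instead of forming the O(T^2) token-to-token attention matrix.
    phi_k = poly_features(K)                  # (T, F)
    phi_q = poly_features(Q)                  # (T, F)
    state = phi_k.T @ V                       # (F, d): the compact summary
    norm = phi_q @ phi_k.sum(axis=0)          # (T,): per-query normalizer
    return (phi_q @ state) / np.maximum(norm, 1e-6)[:, None]

T, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
Q, K = abs(Q), abs(K)  # nonnegative features keep the normalizer positive
out = polynomial_mixer(Q, K, V)
print(out.shape)  # (8, 4)
```

Because the state has a fixed size regardless of sequence length, each new token costs O(1) to mix in, which is where the inference speedup on long sequences comes from.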