A mechanism that lets a model focus on the relevant parts of its input by computing a relationship score between every pair of tokens. This captures context across the whole input, but memory use grows quadratically with input length.
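
As a rough illustration of the pairwise computation described above, here is a minimal pure-Python sketch of scaled dot-product self-attention. It makes simplifying assumptions that are not part of the definition: queries, keys, and values are all taken to be the raw token vectors (no learned projections), and the names `softmax`, `self_attention`, and the example `tokens` are hypothetical, chosen only for this sketch.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention, assuming queries = keys = values = tokens.

    tokens: list of n vectors (lists of floats), each of dimension d.
    Returns (outputs, weights), where weights is the n x n attention matrix.
    """
    n = len(tokens)
    d = len(tokens[0])
    scale = math.sqrt(d)
    # Score every token against every token: n * n entries,
    # which is where the quadratic memory cost comes from.
    scores = [[sum(q * k for q, k in zip(tokens[i], tokens[j])) / scale
               for j in range(n)] for i in range(n)]
    # Normalize each row so a token's attention weights sum to 1.
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all token vectors.
    outputs = [[sum(weights[i][j] * tokens[j][c] for j in range(n))
                for c in range(d)] for i in range(n)]
    return outputs, weights

# Three toy 2-dimensional "token" vectors (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
print(len(weights), len(weights[0]))  # 3 3
```

The `weights` matrix has one row and one column per token, so doubling the input length quadruples the number of scores that must be held in memory.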
Performance retention over long documents and conversations
Multi-step reasoning, logic puzzles, mathematical problem-solving