A transformer-based language model architecture whose disentangled attention mechanism represents each token with two separate vectors, one encoding its content and one encoding its relative position, and computes attention weights from both, so the model weighs parts of the input text by what they say as well as where they appear.
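
A minimal single-head sketch of this scoring is below, assuming the DeBERTa-style decomposition of the attention score into content-to-content, content-to-position, and position-to-content terms. The function name, argument names, and the relative-distance clipping scheme are illustrative assumptions, not a library API.

```python
import torch

def disentangled_attention(H, rel_pos_emb, W_q, W_k, W_v, W_qr, W_kr, rel_idx):
    """Single-head disentangled attention sketch (hypothetical names).

    H:           (n, d)  content hidden states
    rel_pos_emb: (2k, d) relative-position embedding table
    W_q, W_k, W_v, W_qr, W_kr: (d, d) projection matrices
    rel_idx:     (n, n)  long tensor; rel_idx[i, j] = clipped index of (i - j)
    """
    Qc, Kc, Vc = H @ W_q, H @ W_k, H @ W_v           # content projections
    Qr, Kr = rel_pos_emb @ W_qr, rel_pos_emb @ W_kr  # position projections

    # Content-to-content: standard query-key score.
    c2c = Qc @ Kc.T
    # Content-to-position: query i attends to the position of j relative to i.
    c2p = torch.gather(Qc @ Kr.T, 1, rel_idx)
    # Position-to-content: key j is scored against the position of i relative to j.
    p2c = torch.gather(Kc @ Qr.T, 1, rel_idx).T

    d = H.size(-1)
    scores = (c2c + c2p + p2c) / (3 * d) ** 0.5      # scale by sqrt(3d) for 3 terms
    return torch.softmax(scores, dim=-1) @ Vc

# Usage with toy shapes; distances are clipped to [-k, k) and shifted to [0, 2k).
n, d, k = 6, 16, 4
H = torch.randn(n, d)
rel_emb = torch.randn(2 * k, d)
W = [torch.randn(d, d) / d ** 0.5 for _ in range(5)]
dist = torch.arange(n)[:, None] - torch.arange(n)[None, :]
rel_idx = dist.clamp(-k, k - 1) + k
out = disentangled_attention(H, rel_emb, *W, rel_idx)  # (n, d)
```

Separating the content and position projections is what makes the attention "disentangled": the same relative-position table contributes to the score in two directions, rather than being folded into a single input embedding.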