How a model's attention mechanism distributes its focus across different input elements, such as image tokens and text tokens.
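
One way to make this concrete is to measure how much of a single query's softmax attention weight lands on each kind of token. The sketch below is a minimal illustration, assuming standard scaled dot-product attention; the function name `attention_share`, the boolean mask `is_image`, and the random toy data are all hypothetical and not taken from any particular model.

```python
import numpy as np

def attention_share(query, keys, is_image):
    """Return (image_share, text_share): the fraction of one query's
    attention weight that falls on image keys vs. text keys.

    Assumes scaled dot-product attention with a softmax over all keys.
    """
    d = keys.shape[-1]
    scores = keys @ query / np.sqrt(d)   # one score per key token
    scores = scores - scores.max()       # stabilise the softmax numerically
    weights = np.exp(scores)
    weights = weights / weights.sum()    # attention weights sum to 1
    image_share = float(weights[is_image].sum())
    return image_share, 1.0 - image_share

# Toy input (hypothetical): 4 image tokens followed by 2 text tokens, dimension 8.
rng = np.random.default_rng(0)
keys = rng.normal(size=(6, 8))
query = rng.normal(size=8)
is_image = np.array([True, True, True, True, False, False])

image_share, text_share = attention_share(query, keys, is_image)
print(f"image: {image_share:.2f}, text: {text_share:.2f}")
```

Because the weights are normalised, the two shares always sum to 1, so any attention a query gives to image tokens is attention it does not give to text tokens. In a real model the shares would be read per head and per layer rather than from a single query.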