The process by which a model resolves conflicting signals from different input modalities (e.g., audio vs. text).