The process of combining information from multiple modalities (e.g., vision and text) into a unified representation.
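
As a minimal sketch of one simple way this can be done, the example below fuses a vision feature vector and a text feature vector by normalizing each and concatenating them. Concatenation is only one of several possible fusion strategies, chosen here for illustration. The function names `l2_normalize` and `fuse_by_concatenation` and the toy feature values are hypothetical and do not come from any particular library or model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so each modality contributes on a comparable scale."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # Leave an all-zero vector unchanged to avoid dividing by zero.
        return list(vec)
    return [x / norm for x in vec]


def fuse_by_concatenation(vision_vec, text_vec):
    """Join the normalized per-modality vectors into a single unified vector."""
    return l2_normalize(vision_vec) + l2_normalize(text_vec)


# Toy feature vectors (made-up values, not outputs of any real encoder).
vision_features = [3.0, 4.0]
text_features = [0.0, 5.0, 0.0]

fused = fuse_by_concatenation(vision_features, text_features)
print(len(fused))  # dimensionality of the unified representation: 2 + 3 = 5
```

In practice the unified vector would typically be passed to a downstream model. Normalizing first is a design choice assumed here so that neither modality dominates purely because of its scale.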