The ability of a model to share semantic understanding between different input modalities like vision and text.