Alignment between the representations that models trained on different modalities (e.g., vision and language) form for the same stimulus.
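To make the definition concrete, the sketch below scores how closely two sets of embeddings of the same stimuli agree. It uses linear centered kernel alignment (CKA), which is only one possible measure and is not named in the definition above. The function names and the toy `vision_emb` and `language_emb` arrays are hypothetical, chosen purely for illustration. Each row is assumed to hold one model's embedding of one stimulus, with rows in the same stimulus order for both models.

```python
import math


def center_columns(rows):
    """Subtract the per-feature mean so each column sums to zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def gram(rows):
    """Stimulus-by-stimulus matrix of inner products between embeddings."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def frobenius_inner(a, b):
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(x, y):
    """Linear CKA between two embedding sets over the same n stimuli.

    x is n-by-d1 and y is n-by-d2; the feature dimensions may differ.
    """
    if len(x) != len(y):
        raise ValueError("both models must embed the same stimuli")
    k = gram(center_columns(x))
    l = gram(center_columns(y))
    denom = math.sqrt(frobenius_inner(k, k) * frobenius_inner(l, l))
    if denom == 0:
        raise ValueError("embeddings have no variance across stimuli")
    return frobenius_inner(k, l) / denom


# Hypothetical toy data: four stimuli embedded by two models.
vision_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
# The same geometry, rotated by 90 degrees and scaled by 3.
language_emb = [[0.0, 3.0], [-3.0, 0.0], [0.0, -3.0], [3.0, 0.0]]

score = linear_cka(vision_emb, language_emb)
print(f"linear CKA = {score:.3f}")
```

Because linear CKA ignores rotations and uniform rescaling of the embedding space, the two toy sets score as fully aligned even though their coordinates differ. That invariance is the reason for choosing it here: two models from different modalities have no shared coordinate system, so only the relational structure among stimuli can be compared.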