Computing relevance or similarity scores between information from different modalities (e.g., text and images).
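
As a rough illustration, one common way to produce such a score is cosine similarity between embedding vectors, assuming each modality has already been mapped into a shared vector space by some encoder. The sketch below is not tied to any particular model: the function names `cosine_similarity` and `rank_images`, the file names, and the three-dimensional embedding values are all hypothetical and chosen by hand for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must share the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot score a zero vector")
    return dot / (norm_a * norm_b)


def rank_images(
    text_vec: Sequence[float],
    image_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every image against the text and sort from best to worst match."""
    scored = [
        (name, cosine_similarity(text_vec, vec))
        for name, vec in image_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: in practice these would come from text and image
# encoders trained to share one vector space. The numbers here are made up.
text_embedding = [0.9, 0.1, 0.0]
image_embeddings = {
    "dog.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}

ranking = rank_images(text_embedding, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Cosine similarity ignores vector length and compares direction only, which is why it is a frequent choice when the two encoders produce embeddings of different magnitudes. Other scoring functions, such as a plain dot product or a learned scoring layer, would fit the same definition.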