When to Align, When to Predict: A Phase Diagram for Multimodal Learning

Ilay Kamai, Hugues Van Assel, Aviv Regev, Hagai B. Perets, Randall Balestriero | June 9, 2026 | arXiv

Key Takeaway

Before training a multimodal model, use the paper's diagnostic procedure to determine whether alignment, prediction, or neither will work for your data. Doing so avoids wasted effort on approaches that will fail or harm performance.

Summary

This paper explains when to use cross-modal alignment versus cross-modal prediction for multimodal learning. Using a mathematical framework, the authors identify four regimes that determine whether each approach works best, fails, or actively hurts performance. They also provide a practical diagnostic tool that tests a real dataset using only minimal labeled data, before committing to training.

multimodal evaluation training

Key Terms

cross-modal-alignment cross-modal-prediction multimodal-representation-learning nuisance-correlation phase-diagram