Before training a multimodal model, use the paper's diagnostic procedure to determine whether alignment, prediction, or neither will work for your specific data, so you avoid wasting effort on approaches that will fail or harm performance.
This paper explains when to use cross-modal alignment versus cross-modal prediction for multimodal learning. Using a mathematical framework, the authors identify four regimes that determine whether each approach works best, fails, or actively hurts performance. They also provide a practical diagnostic tool that tests a real dataset using minimal labeled data before you commit to training.