When to Align, When to Predict: A Phase Diagram for Multimodal Learning — ThinkLLM