When combining audio and text, align the two modalities indirectly through a shared joint embedding space rather than contrasting them against each other directly, and add structural consistency losses so that neither modality dominates the learned representation.
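The idea above can be sketched numerically. This is a minimal illustration, not HILBERT's actual implementation: the anchor set, projection matrices, dimensions, and the assumption that item i pairs with anchor i are all hypothetical. Each modality is contrasted against shared anchor embeddings (indirect alignment), and a structural term penalizes divergence between the intra-modal similarity matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, W):
    """Linearly project features into the shared joint space, L2-normalized."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def anchor_alignment_loss(z, anchors, tau=0.1):
    """Contrast a modality against shared anchors instead of the other modality.

    Hypothetical setup: item i is paired with anchor i.
    """
    logits = z @ anchors.T / tau
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(z))
    return -np.mean(log_probs[idx, idx])

def structural_consistency_loss(z_audio, z_text):
    """Match intra-modal similarity structure so neither modality dominates."""
    S_audio = z_audio @ z_audio.T
    S_text = z_text @ z_text.T
    return np.mean((S_audio - S_text) ** 2)

# Toy data: 4 paired audio/text items with made-up feature dimensions.
audio_feats = rng.normal(size=(4, 16))
text_feats = rng.normal(size=(4, 32))
W_audio = rng.normal(size=(16, 8))
W_text = rng.normal(size=(32, 8))
anchors = rng.normal(size=(4, 8))
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

z_audio = project(audio_feats, W_audio)
z_text = project(text_feats, W_text)
loss = (anchor_alignment_loss(z_audio, anchors)
        + anchor_alignment_loss(z_text, anchors)
        + structural_consistency_loss(z_audio, z_text))
print(float(loss))
```

In a real training loop the projection matrices (and possibly the anchors) would be learned by minimizing this combined loss; the sketch only evaluates it once on random inputs.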
HILBERT is a multimodal framework that learns document-level representations from long audio-text sequences in low-resource settings.