When combining audio and text, align the two modalities indirectly through a shared joint embedding space rather than contrasting them against each other directly, and add structural consistency losses so that neither modality dominates the learned representation.
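The idea above can be sketched numerically. This is a minimal illustration, not HILBERT's actual implementation: the anchor set, projection matrices, dimensions, and the assumption that item i pairs with anchor i are all hypothetical. Each modality is contrasted against shared anchor embeddings (indirect alignment), and a structural term penalizes divergence between the intra-modal similarity matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, W):
    """Linearly project features into the shared joint space, L2-normalized."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def anchor_alignment_loss(z, anchors, tau=0.1):
    """Contrast a modality against shared anchors instead of the other modality.

    Hypothetical setup: item i is paired with anchor i.
    """
    logits = z @ anchors.T / tau
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(z))
    return -np.mean(log_probs[idx, idx])

def structural_consistency_loss(z_audio, z_text):
    """Match intra-modal similarity structure so neither modality dominates."""
    S_audio = z_audio @ z_audio.T
    S_text = z_text @ z_text.T
    return np.mean((S_audio - S_text) ** 2)

# Toy data: 4 paired audio/text items with made-up feature dimensions.
audio_feats = rng.normal(size=(4, 16))
text_feats = rng.normal(size=(4, 32))
W_audio = rng.normal(size=(16, 8))
W_text = rng.normal(size=(32, 8))
anchors = rng.normal(size=(4, 8))
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

z_audio = project(audio_feats, W_audio)
z_text = project(text_feats, W_text)
loss = (anchor_alignment_loss(z_audio, anchors)
        + anchor_alignment_loss(z_text, anchors)
        + structural_consistency_loss(z_audio, z_text))
print(float(loss))
```

In a real training loop the projection matrices (and possibly the anchors) would be learned by minimizing this combined loss; the sketch only evaluates it once on random inputs.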
HILBERT is a multimodal framework that learns document-level representations from long audio-text sequences in low-resource settings.