FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

Jiaju Han, Ben Zhang, Xuemeng Sun, Qike Zhang, Yuxian Dong et al. | June 15, 2026 | arXiv

Key Takeaway

Combining infrared and RGB satellite imagery with modality-specific captions significantly improves vision-language models' ability to understand Earth observation data. Infrared-aware text supervision is essential for effective multi-modal learning.

Summary

FusionRS is the first large-scale dataset that pairs RGB and infrared satellite images with text descriptions, intended for training vision-language models that interpret both modalities jointly.

multimodal data applications

Key Terms

vision-language-models dual-modal-learning modality-specific-supervision remote-sensing