Combining infrared and RGB satellite imagery with modality-specific captions significantly improves vision-language models' ability to understand Earth observation data: infrared-aware text supervision is essential for effective multi-modal learning.
FusionRS is the first large-scale dataset pairing RGB and infrared satellite images with text descriptions for training AI models that understand both types of imagery together.