Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

Liyao Wang, Ruipu Wu, Haojun Xu, Lei Shi, Linjiang Huang et al. | June 29, 2026 | arXiv

Key Takeaway

Combining explicit 3D geometry (camera poses, spatial relationships) with visual matching dramatically improves cross-view localization and enables zero-shot transfer between ground and drone views without paired training data.

Summary

This paper tackles cross-view object geo-localization: finding a target object in satellite imagery given a ground-level or drone photo. The authors introduce a large dataset with 220K+ image pairs and geometric metadata, plus GAGeo, a unified single-stage framework that jointly predicts object locations, masks, and camera poses using 3D spatial understanding rather than appearance matching alone.

multimodal, evaluation, architecture

Key Terms

cross-view-matching, camera-pose-estimation, permutation-equivariant, zero-shot-generalization, contrastive-loss