Treating geo-localization as a sequential zooming problem over maps, rather than as image retrieval, achieves better results and avoids a limitation of contrastive learning approaches, which struggle with landmark visibility mismatches.
This paper tackles cross-view geo-localization: matching street-view photos to satellite maps to pinpoint a camera's location without GPS. Instead of the standard approach of comparing images in a shared embedding space, the authors propose a method that progressively zooms into a satellite map, making a sequence of decisions that each narrow down the location.
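
The zooming idea can be illustrated with a minimal sketch. This is not the authors' model. It assumes that each step splits the current map region into four quadrants and keeps the one a scoring function rates highest. The names `split_into_quadrants`, `zoom_localize`, and `make_toy_scorer` are hypothetical, and the toy scorer cheats by using a known target location where a learned model would compare the street-view photo against each map crop.

```python
from typing import Callable, List, Tuple

# A map region as (x_min, y_min, x_max, y_max) in arbitrary map units.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def split_into_quadrants(box: Box) -> List[Box]:
    """Split a region into four equal quadrants (an assumed zoom scheme)."""
    x0, y0, x1, y1 = box
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    return [
        (x0, y0, xm, ym),
        (xm, y0, x1, ym),
        (x0, ym, xm, y1),
        (xm, ym, x1, y1),
    ]


def zoom_localize(box: Box, score_fn: Callable[[Box], float], steps: int) -> Point:
    """Greedily zoom in: at each step keep the highest-scoring quadrant."""
    for _ in range(steps):
        box = max(split_into_quadrants(box), key=score_fn)
    x0, y0, x1, y1 = box
    # Report the centre of the final region as the location estimate.
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def make_toy_scorer(target: Point) -> Callable[[Box], float]:
    """Hypothetical stand-in for a learned scorer: prefers regions whose
    centre is closest to a known target location."""
    tx, ty = target

    def score(box: Box) -> float:
        x0, y0, x1, y1 = box
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        return -((cx - tx) ** 2 + (cy - ty) ** 2)

    return score


# Usage: eight zoom steps over a unit-square map.
estimate = zoom_localize((0.0, 0.0, 1.0, 1.0), make_toy_scorer((0.7, 0.2)), steps=8)
print(estimate)
```

Each step halves the region's side length, so the estimate sharpens exponentially with the number of decisions. The cost is that an early wrong choice cannot be undone in this greedy form.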