Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming
Authors: Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz
Photogrammetric Computer Vision Lab, The Ohio State University
{erzurumlu.1,kwag.3,yilmaz.15}@osu.edu

Abstract. Cross-view geo-localization (CVGL) is the task of estimating a camera's location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally pose CVGL as an image-retrieval problem in a contrastively trained embedding space. This formulation ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery, reflecting real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 < 50 m by 5.5% and Recall@1 < 100 m by 9.6% over the strongest contrastive-retrieval baseline, demonstrating the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
Keywords: Cross-view geo-localization · Autoregressive modeling · Visual localization · Satellite imagery

1 Introduction

Visual localization estimates a camera's location by matching a captured image to a known environment. Specifically, cross-view geo-localization (CVGL) tackles this problem by using geo-referenced satellite imagery to localize a street-view query image [18, 39]. Satellite imagery offers dense, large-scale coverage of outdoor environments and is widely available for many regions, making CVGL attractive for GPS-denied localization and navigation. The core difficulty in CVGL is bridging the extreme viewpoint differences between street-view and satellite imagery.

Fig. 1: Two paradigms for cross-view geo-localization. (a) Retrieval-based approaches maintain a large GPS-tagged reference database and use a contrastively trained encoder to perform nearest-neighbor search at query time. (b) Autoregressive geo-localization (ours) replaces global retrieval with sequential, coarse-to-fine zoom-in decisions over multi-scale satellite imagery, dramatically reducing dependence on exhaustive database search.

To handle these viewpoint differences, modern CVGL systems predominantly rely on contrastively trained retrieval models; given a street-view image, the system retrieves the most similar satellite tile from a large GPS-tagged reference database [5, 28, 43, 44]. In practice, they learn a single embedding space in which street-view queries and satellite tiles are pulled together if they correspond to the same location and pushed apart otherwise, commonly relying on many informative negatives (via large effective batch sizes and/or hard negative mining). At inference time, localization is performed by nearest-neighbor search in the embedding space over a large satellite database, as shown in Figure 1a.
Despite strong performance, this contrastive-retrieval formulation has three practical limitations for scalable, realistic CVGL. First, contrastive retrieval performance often benefits from many informative negatives, implemented via large effective batches and/or hard negative mining (HNM) [5, 43]. Maintaining and refreshing hard negatives can add significant training compute and engineering complexity [2]. Second, retrieval-based CVGL must store and search a large GPS-tagged reference database at inference time. Fine-grained localization over a city-scale region requires many densely sampled satellite tiles, so memory footprint grows roughly linearly with the number of tiles [5, 43, 44], making it expensive to extend to larger areas.

Fig. 2: Coverage mismatch. In the street-view image (a), the stadium and its distinctive gate are clearly visible. In the satellite crop (b) from the same region, this landmark falls outside of the patch, illustrating how small tiles can miss critical cues present in the street view. Images sourced from Google Street View and Google Maps [8, 9].

Third, the retrieval formulation treats the map as an unstructured collection of independent, fixed-size tiles, without explicit links between neighboring regions or coarser zoom levels. This tile-centric view ignores the underlying geographic hierarchy and can exacerbate a coverage mismatch between modalities: due to perspective and tile framing/misalignment, discriminative context visible in the street-view image may lie outside a candidate satellite patch even when the patch is geographically close (Figure 2). As a result, existing methods are optimized to rank isolated patches rather than to reason over spatial context and multi-scale structure in the satellite imagery.
To address these limitations, we reformulate cross-view geo-localization as autoregressive zooming rather than single-shot retrieval from a flat database. Our method, Just Zoom In (Figure 1b), localizes a ground query by producing a short sequence of zoom-in actions over a multi-scale satellite map, starting from a coarse view and refining the prediction until a terminal resolution. Motivated by 1D autoregressive prediction in NLP, we cast each zoom decision as a categorical token: at each of the N zoom levels, the current satellite tile is subdivided into K² candidates, and the model predicts the next patch index conditioned on the street-view image and previous decisions. We instantiate this framework with an off-the-shelf visual encoder and a Transformer decoder [37] trained with causal masking and a supervised next-action (next-index) prediction objective, avoiding contrastive losses, explicit hard-negative mining, and large effective batch sizes. The resulting coarse-to-fine process explicitly follows the geographic hierarchy of the map: early steps leverage wide-area context, while later steps focus on local detail, helping mitigate coverage mismatch between street-level and satellite imagery. Since each step compares the query to only a few candidates, inference cost scales with the number of zoom steps rather than the total number of tiles in a city-scale database, reducing reliance on expensive nearest-neighbor search while achieving strong localization accuracy under realistic capture conditions.

This sequential formulation calls for a benchmark tailored to multi-scale reasoning. To evaluate autoregressive zooming, we require satellite imagery with an explicit resolution hierarchy and street views that resemble real capture conditions, such as limited field-of-view (FoV), unknown orientation, and diverse quality.
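As a concrete sketch of the tile arithmetic implied by this K × K subdivision (a hypothetical helper, not the authors' code; K = 4 and the 10 km AOI follow the dataset description in Sec. 3.1), a sequence of zoom actions addresses a unique cell of the map:

```python
# Sketch (hypothetical helper): how a sequence of K*K-way zoom actions
# addresses progressively finer tiles of the area of interest (AOI).

def zoom_to_cell(actions, aoi_size_m=10_000.0, K=4):
    """Map zoom actions (each in 0..K*K-1) to the selected tile's
    top-left offset (x, y) in meters and its side length."""
    x = y = 0.0
    size = aoi_size_m
    for a in actions:
        assert 0 <= a < K * K
        size /= K                   # each step shrinks the tile by K
        row, col = divmod(a, K)     # action index -> grid position
        x += col * size
        y += row * size
    return (x, y), size

# Tile footprints traversed over a 10 km AOI with K = 4:
print([10_000.0 / 4**t for t in range(4)])  # [10000.0, 2500.0, 625.0, 156.25]
print(zoom_to_cell([0, 0, 0]))              # ((0.0, 0.0), 156.25)
```

Each additional action multiplies the number of addressable cells by K², so the search space shrinks geometrically while inference cost grows only linearly in the number of steps.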
Existing datasets [19, 39, 44], however, provide only single-scale satellite images paired with standardized 360° panoramas of known orientation and fixed intrinsics, which simplifies the task and departs from typical first-person or dash-cam imagery. Recent work further shows that performance can degrade substantially when moving from these idealized, orientation-aligned protocols to unknown-orientation and limited-FoV settings [23]. We therefore construct a new benchmark with multi-scale satellite images and randomly oriented, limited-FoV street views captured under diverse conditions, shown in Figure 3, so that both modalities match the demands of a realistic multi-scale zoom-in setting.

Scope and comparability. We target the retrieval stage of city-/AOI-scale CVGL: given a ground query and a large overhead search region, the goal is to select the correct overhead cell at a target resolution (Sec. 4.1). Unlike cross-view metric pose estimation, which assumes a coarse location prior (e.g., 20–50 m) and then refines 3-DoF pose, we assume no such prior and output a coarse cell intended to provide that prior for downstream refinement [17, 27, 40]. We further distinguish our goal from hierarchical end-to-end pipelines that couple coarse retrieval with metric refinement in a single system: our method targets and strengthens the retrieval component that such pipelines depend on, and can be used as their front-end [31]. We also depart from common retrieval-style CVGL protocols that rely on 360° panoramas and known/normalized orientation (e.g., North alignment), which can act as shortcuts; ConGeo shows performance can drop sharply, especially under limited-FoV, unknown-heading, and cross-area evaluation, highlighting that many models do not transfer to realistic user imagery [23].
Our focus is the within-AOI (same-area) setting, where the localization region is supervised during training, enabling region-specific learning over a multi-scale overhead hierarchy. Generalizing to unknown areas under realistic constraints remains substantially harder; recent active geo-localization work explores this direction, typically on smaller AOIs/coarser discretizations and with much heavier training machinery (billion-scale pretrained models and RL), whereas we show that supervised next-action zooming with a standard vision backbone is competitive in the supervised within-AOI regime [22, 26].

Contributions. Our main contributions are:
– Reformulating CVGL. Motivated by the limitations above, we reformulate cross-view geo-localization as autoregressive zooming.
– An autoregressive zoom-in model. We propose Just Zoom In, an autoregressive CVGL transformer trained via supervised next-action prediction. Our design achieves computational efficiency while explicitly leveraging hierarchical visual cues from coarse to fine satellite scales.
– A realistic, multi-scale benchmark. We introduce a cross-view dataset that pairs multi-scale satellite imagery with crowd-sourced street-view images exhibiting unknown orientation, limited and variable FoV, and diverse appearance conditions. Our approach achieves state-of-the-art performance on this benchmark, with +5.50% and +9.63% absolute gains at R@50m and R@100m over the state-of-the-art contrastive-retrieval baseline.

2 Related Works

The Dominant Retrieval Paradigm. Cross-view geo-localization addresses visual localization by leveraging geo-referenced satellite imagery, which is valued for its extensive global coverage, currency, and accessibility. Early studies in this area formulated the problem using contrastive learning and image-retrieval objectives [18, 39].
This formulation remains the dominant approach in the literature to date [3, 5, 13, 19, 25, 28, 29, 43–45]. Because satellite and street-level views differ dramatically, the task requires robust feature extraction [39, 43, 45], principled aggregation [1, 28], and well-designed sampling [3, 5]. On the feature-extraction front, Workman et al. [39] showed that CNN-learned features [15] outperform hand-crafted descriptors [18]. Building on this, TransGeo [43] reported further gains by employing Vision Transformers [6], and Sample4Geo [5] adopted ConvNeXt [20]. Regarding aggregators, CVM-Net [13] leverages NetVLAD [1]; SAFA [28] introduces a customized spatial feature aggregator; and SAIG [45] utilizes an MLP-Mixer [34] architecture. The largest performance boosts have stemmed from sampling: HNM proved transformative, and Sample4Geo [5] in particular demonstrated marked improvements with its sampling approach. However, HNM makes training both compute-heavy and memory-hungry, highlighting a central limitation of contrastive learning. In contrast, our formulation attains state-of-the-art performance without specialized sampling. Some studies attempted to close the viewpoint gap through polar transforms [28] or generative methods [25, 33], but these received little uptake owing to unrealistic assumptions, particularly the requirement that the camera position in the satellite imagery be known.

Benchmark and Dataset Limitations. Another crucial aspect of this task is the dataset and the realism of the benchmarks. Early datasets such as CVUSA [39] and CVACT [19] used one-to-one pairing for data generation, i.e., a single street-view image sampled at the center of each satellite tile. This renders the task both trivial and unrealistic: the reference database becomes sparsely populated over the test region, hindering the localization of novel photos.
VIGOR [44] addressed this limitation by proposing many-to-one sampling, in which multiple street-view images are sampled from random locations within each satellite tile. This setting is more realistic, since the exact location of the street-view capture is unknown, and provides better coverage. However, in VIGOR and other datasets, the street-view orientations are known and the FoVs are 360°, which simplifies the task and again departs from real-world conditions. Moreover, the dataset is single-sourced and highly standardized, which can yield models that generalize poorly to out-of-distribution data and makes the benchmark easier.

Table 1: Dataset comparisons for cross-view geo-localization. We compare our proposed multi-scale benchmark against prior datasets under key capture and supervision attributes. SV GPS: Aligned = center-aligned one-to-one pairing; Arbitrary = query location not constrained to the tile center. Gnd FoV: 360° pano = panorama queries; Crops = crops from panoramas; Perspective = native non-panoramic images. Heading: Fixed = known/normalized orientation; Arbitrary = unknown or random orientation. The last four columns list covered domain types.

Dataset          #Photos  Multi-Scale  SV GPS     Gnd FoV      Heading    Urban  Suburban  Rural  Offroad
CVUSA [39]       44K      ✗            Aligned    360° pano    Fixed      ✓      ✓         ✓      ✓
CVACT [19]       128K     ✗            Aligned    360° pano    Fixed      ✓      ✓         ✗      ✗
Vo & Hays [38]   450K     ✗            Aligned    Crops        Arbitrary  ✓      ✓         ✗      ✗
VIGOR [44]       105K     ✗            Arbitrary  360° pano    Fixed      ✓      ✗         ✗      ✗
Ours             300K     ✓            Arbitrary  Perspective  Arbitrary  ✓      ✓         ✓      ✓

By contrast, while we also adopt many-to-one sampling, we use random orientations and limited, randomly selected FoVs, and we draw street-view images from diverse sources, times of day, seasons, weather conditions, and quality levels, resulting in a more challenging and realistic benchmark, as illustrated in Figure 3.
Recent work considers limited FoV and unknown orientation [23, 29], but largely relies on panoramic training data with simulated FoV/heading, whereas our benchmark uses non-panoramic, randomly oriented, limited-FoV images; hence the settings are not directly comparable.

Cross-View Pose Estimation. A parallel line of work to CVGL is cross-view pose estimation, where a coarse location prior (typically 20–50 m) is available and the goal is to recover a 3-DoF camera pose with sub-meter accuracy [17, 27, 40]. Although recent methods target this setting [16, 35, 41], its practicality hinges on a reliable CVGL component to furnish the coarse prior and robust cross-view matches, precisely what our approach aims to provide.

3 Method

In this section we describe how we curated a high-quality multi-scale dataset (Sec. 3.1) and the theoretical and architectural details of the Just Zoom In model (Sec. 3.2).

3.1 Multi-Scale Satellite and Street-View Dataset

The development of our autoregressive geo-localization framework requires a bespoke dataset that satisfies two key criteria: (1) realistic variability in street-view imagery, and (2) multi-scale, high-resolution satellite imagery. Since no existing open-source cross-view geo-localization benchmark meets these requirements, we constructed a dedicated dataset by combining crowd-sourced street-view imagery from Mapillary [4] with government-sourced high-resolution aerial orthophotography [10].

Fig. 3: Dataset examples. Street-view samples from our cross-view image localization corpus illustrating the two defining characteristics: (i) limited FoV typical of first-person and dash-cam capture; and (ii) broad diversity across time of day, seasonal appearance, weather, scene type, and capture platforms.
This variability, together with viewpoint changes, occlusions, and motion blur, widens the appearance gap to overhead imagery and makes cross-view matching more challenging.

Street-view processing. Mapillary [4] is selected as the source of street-view query images due to its inherent representation of in-the-wild visual variability. Mapillary is a large-scale, crowd-sourced public dataset licensed under CC BY-SA, containing more than 2 billion images. The platform provides images that span a range of fields of view, captured by heterogeneous devices under diverse environmental conditions, including varying weather, seasons, time of day, and camera models, as shown in Figure 3. This diversity introduces realistic appearance changes that are often absent in curated or proprietary datasets, making it a suitable testbed for robust cross-view localization. We collect data from the Washington D.C. area, covering both dense urban centers and surrounding rural regions, resulting in approximately 300k street-view images. More details are included in the Appendix.

The crowd-sourced nature of Mapillary introduces noise such as low-resolution, outdated, or duplicate imagery. To convert raw sequences into a clean and geographically reliable dataset, we adopt a multi-stage filtering pipeline inspired by Map It Anywhere [12]. First, we discard images whose initial GPS coordinates significantly deviate from their Structure-from-Motion (SfM) refined poses, removing samples with inaccurate ground-truth locations. Second, to promote spatial diversity, we enforce a minimum spatial separation constraint by retaining only keyframes, ensuring that no two consecutive images are within 4.0 meters of each other. Third, we filter images based on temporal and device metadata to exclude outdated or unreliable camera sources, increasing the diversity of environmental and seasonal conditions represented.
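The keyframe-spacing step above can be sketched as follows (a minimal, hypothetical helper rather than the authors' pipeline; it assumes great-circle distances and the 4.0 m threshold stated in the text):

```python
import math

# Sketch of the minimum-spacing filter described above (hypothetical
# helper): keep a frame only if it lies at least `min_sep_m` meters
# from the last retained frame.

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def keyframes(coords, min_sep_m=4.0):
    """Filter a GPS sequence [(lat, lon), ...] to spatially separated keyframes."""
    kept = []
    for lat, lon in coords:
        if not kept or haversine_m(*kept[-1], lat, lon) >= min_sep_m:
            kept.append((lat, lon))
    return kept

track = [(38.9072, -77.0369), (38.90721, -77.0369), (38.9080, -77.0369)]
print(len(keyframes(track)))  # 2 (the ~1 m middle fix is dropped)
```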
Finally, we perform quality screening following the Map It Anywhere pipeline, discarding images with motion blur, lens distortion, heavy occlusion, or reflections. Non-panoramic images are standardized to a resolution of 512 × 384.

Fig. 4: Overview of Just Zoom In. (a) Image encoding. A shared, off-the-shelf vision encoder maps the street-view image I_g and the multi-scale satellite maps {M_t} to global representation tokens e_g and e_t, one token per image. (b) Autoregressive action modeling. The image tokens, interleaved with previously chosen action tokens, are fed to a causal transformer that predicts a distribution over the next zoom-in action a_t at each step. (c) Sample / zoom-in sequence. Beginning from M_0, the model selects the most probable patch, zooms to obtain M_{t+1}, and iterates until a terminal patch M_N; the center of M_N is taken as the location estimate.

Satellite-view processing. For aerial reference data, we use publicly available government-sourced orthophotos [10]. These datasets provide high-resolution imagery distributed under the permissive CC BY 4.0 license, ensuring reproducibility and scalability. The aerial data for the Washington D.C. region is obtained from the official open government data portal, similar to Fervers et al. [7], covering a 10 km × 10 km area that corresponds to the spatial distribution of street-view images from Mapillary. To support coarse-to-fine zooming, we organize the satellite imagery into a hierarchy of tiles. At each zoom level we subdivide the current tile into a fixed grid of K × K = 4 × 4 patches, and the model chooses one patch to zoom into. In all experiments we use N = 4 zoom steps (i.e., four actions), which correspond to nominal tile footprints of 10 km × 10 km, 2.5 km × 2.5 km, 625 m × 625 m, and 156.25 m × 156.25 m. After the final zoom decision, we take the center 39.06 m × 39.06 m region inside the selected 156.25 m tile as the output cell used for evaluation. This finest scale is only used to define the prediction footprint and to compute geodesic distances; it is not supplied to the model as an input image. We provide a comparison of our new dataset with existing ones in Table 1.

3.2 Just Zoom In

The autoregressive zooming process is modeled as a sequential decision-making problem. Given a street-view image I_g, the goal is to predict the correct satellite patch M_N at each scale by learning a policy π that produces a sequence of N zoom-in actions. Our autoregressive geo-localization framework consists of two main components. First, as shown in Figure 4a, the shared-weight vision encoder bridges the domain gap between street-view and satellite images, projecting them into a unified feature space.

Table 2: Cross-view geo-localization results on our proposed dataset. We reproduce all contrastive-based baselines using their official implementations. Our coarse-to-fine autoregressive approach outperforms retrieval-based methods across all evaluation thresholds. R@τm denotes Recall@1 within τ meters. Inf. Mem. (GB) reports the peak memory required to store and search the reference database at inference time, Train. Time (GPU hours) is the approximate end-to-end training time on one NVIDIA A6000 GPU, and HNM indicates whether the method relies on hard negative mining during training.

Method           R@40m↑  R@50m↑  R@100m↑  Inf. Mem. (GB)↓  Train. Time↓  HNM
SAIG-D [45]      39.36   47.52   64.17    18               >90           ✓
TransGeo [43]    45.97   54.55   67.61    11               >90           ✓
Sample4Geo [5]   52.85   60.81   71.30    16               >90           ✓
Ours (BS = 32)   51.08   61.49   77.43    9                61            ✗
Ours (BS = 64)   54.10   64.74   79.74    9                58            ✗
Ours (BS = 128)  55.74   66.31   80.93    9                52            ✗
Second, the autoregressive decoder [37] receives the street-view feature along with multi-scale satellite features and, under a causal masking strategy, predicts the next zoom action step by step, as illustrated in Figure 4. We discuss these components in more detail in the following sections.

Vision Encoder. The first stage of our model is responsible for bridging the significant domain gap between street-view and satellite views. To tackle this, we employ the DINOv2-Base model [24] as our vision encoder backbone. DINOv2 is a powerful foundation model pre-trained using self-supervised learning for general-purpose visual representations. We use a shared-weight DINOv2 encoder to process both the street-view image and all multi-scale satellite images. This weight-sharing encoder projects both modalities into a common embedding space, effectively reducing the domain gap caused by drastic viewpoint differences. For each street-view or satellite image, we extract only the [CLS] token embedding as the output representation. The [CLS] token is designed to aggregate global information from the entire image into a compact vector. Street-view and multi-scale satellite images are encoded and concatenated to form [I_g, M_0, M_1, ..., M_N] for the autoregressive model input. To preserve the strong pre-trained visual knowledge while allowing adaptation to our specific task domain, we employ a partial fine-tuning strategy. The encoder is initialized with pre-trained DINOv2 weights, and only the last four transformer blocks are unfrozen and fine-tuned during training, a parameter-efficient transfer approach shown to be effective for localization tasks [14].

Autoregressive Geo-Localization Decoder. The model predicts a discrete zoom-action sequence Y = {y_0, ..., y_{N-1}}, where each y_t ∈ {0, ..., K² − 1} selects one of the K² child tiles at zoom step t.
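To make the decoder input concrete, a minimal, framework-free sketch (hypothetical names; the placeholder strings stand in for [CLS] and action embeddings, and this is not the authors' implementation) of the interleaved sequence from Figure 4b and its 1D causal mask:

```python
# Sketch (hypothetical): build the interleaved decoder input
# [I_g, M_0, y_0, M_1, y_1, ..., M_{N-1}] and the causal mask that lets
# each position attend only to itself and earlier positions.

def build_sequence(street_tok, sat_toks, gt_actions):
    """street_tok: query token; sat_toks: [M_0..M_{N-1}];
    gt_actions: teacher-forced ground-truth actions [y_0..y_{N-2}]."""
    seq = [street_tok, sat_toks[0]]
    for y, m in zip(gt_actions, sat_toks[1:]):
        seq += [("act", y), m]      # action token, then the tile it selects
    return seq

def causal_mask(n):
    """mask[i][j] is True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

seq = build_sequence("I_g", ["M_0", "M_1", "M_2"], [5, 12])
print(seq)   # ['I_g', 'M_0', ('act', 5), 'M_1', ('act', 12), 'M_2']
print(causal_mask(3))
```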
We train with teacher forcing, maximizing the likelihood of the ground-truth action at each step (equivalently, minimizing cross-entropy over the K² action classes):

L = -\sum_{t=0}^{N-1} \log P(y_t^{*} \mid I_g, M_0, y_0^{*}, \dots, M_t),   (1)

where (·)* denotes ground-truth actions/tokens. The action tokens are interleaved with multi-scale satellite features to form the input sequence [I_g, M_0, y*_0, M_1, y*_1, ..., M_{N−1}]. This interleaving design ensures that each zoom decision is conditioned on the correct patch information, providing stronger guidance for subsequent predictions. The transformer decoder with 1D causal masking ensures that each token attends only to itself and the past elements, preventing information leakage from future zoom steps.

During inference, by the chain rule of probability, the joint objective factorizes into step-wise conditional probabilities:

P(Y \mid I_g, M_0) = \prod_{t=0}^{N-1} P(y_t \mid I_g, M_0, y_0, \dots, M_t).   (2)

The autoregressive nature requires that each prediction be conditioned on previous decisions. A key advantage of this setup is that the search space is progressively reduced at every zoom step. Figure 4b shows an overview of the sequential zoom-in mechanism.

4 Experiments

Training and evaluation are conducted on our curated dataset described in Section 3.1.

4.1 Metrics

Unlike standard image-retrieval tasks that rely on rank-based metrics, our iterative zooming approach produces a single final location prediction. Furthermore, simple correct/incorrect classification accuracy is insufficient for geospatial tasks, since predicting a neighboring patch is considerably more accurate than predicting a distant one. Therefore, we evaluate performance using distance-based Recall@1 (R@1 < d).
Let d(ĉ_i, c_i) denote the geodesic distance in meters between the predicted center ĉ_i and the ground-truth center c_i for sample i. Recall@1 with a success radius τ ∈ {40 m, 50 m, 100 m} is defined as:

R@\tau m = \frac{1}{S} \sum_{i=1}^{S} \mathbb{1}\big(d(\hat{c}_i, c_i) \le \tau\big) \times 100,   (3)

where the indicator \mathbb{1}(x) equals 1 if x is true and 0 otherwise.   (4)

Fig. 5: Focus at different zoom levels. We visualize similarity maps between street-view patch tokens and the satellite [CLS] embedding (blue = low similarity, red = high similarity) across three zoom levels. At the coarsest level, high similarity concentrates on far-away regions, whereas at finer zoom levels it shifts toward nearby objects and road/building details, indicating that the model learns to exploit different scale-specific visual cues.

4.2 Implementation Details

The autoregressive transformer blocks are built using modern components for training stability and performance. The model operates with a hidden dimension of 768, using 6 layers and 8 attention heads. We adopt a pre-normalization structure with RMSNorm [42], inspired by the LLaMA-3 family [11], where normalization is applied before the attention and feed-forward layers. Rotary positional embeddings (RoPE) [32] are applied in 1D to enforce the sequential structure of the prediction process. We use the AdamW [21] optimizer with a learning rate of 3 × 10⁻⁴ and apply gradient clipping with a norm threshold of 1.0 to prevent exploding gradients. Training is conducted on two NVIDIA A6000 GPUs with a batch size of 128 for 30 epochs.

4.3 Comparison with Contrastive Baselines

Table 2 presents a comparison between our autoregressive geo-localization approach and standard contrastive learning-based baselines over the 10 km × 10 km region. All baselines are reproduced using their official implementations, with additional training details provided in the Appendix.
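Eq. (3) amounts to the percentage of queries whose predicted center falls within τ meters of the ground truth; a minimal sketch (hypothetical helper, with geodesic errors assumed precomputed):

```python
# Sketch of the distance-based Recall@1 metric in Eq. (3): the percentage
# of samples whose predicted center lies within tau meters of the
# ground-truth center. Distances d(c_hat_i, c_i) are assumed precomputed.

def recall_at(distances_m, tau_m):
    hits = sum(1 for d in distances_m if d <= tau_m)
    return 100.0 * hits / len(distances_m)

# Toy example: 4 predictions with geodesic errors of 10, 45, 80, 150 m.
errors = [10.0, 45.0, 80.0, 150.0]
for tau in (40, 50, 100):
    print(f"R@{tau}m = {recall_at(errors, tau):.2f}")
# R@40m = 25.00, R@50m = 50.00, R@100m = 75.00
```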
Our autoregressive framework consistently outperforms contrastive retrieval methods across all distance thresholds. Notably, our method achieves a +5.5% absolute improvement in R@50m over the state-of-the-art Sample4Geo [5] baseline, with half the training time and without HNM. This performance advantage supports the core premise that an iterative, coarse-to-fine search strategy is more effective than single-shot retrieval for fine-grained localization. By progressively narrowing the search space, Just Zoom In is able to leverage contextual cues at the appropriate spatial scale, avoiding the "needle-in-a-haystack" failure mode that often arises when retrieval systems must search large geographic regions directly. The substantial margin at R@100m (nearly +10% over Sample4Geo [5]) further indicates that autoregressive geo-localization is highly robust, rarely producing catastrophic location errors that place the prediction in entirely incorrect neighborhoods.

Table 3: Vision encoder ablations. (a) Comparison of different vision backbones. We evaluate different vision encoder backbones for our autoregressive geo-localization task. Overall, DINOv2 achieves the strongest performance, outperforming more recent architectures. (b) Impact of partial backbone fine-tuning. We evaluate different degrees of layer freezing in the vision encoder for our autoregressive geo-localization task. Overall, unfreezing the last 4 layers yields the best performance, outperforming both fully frozen and fully fine-tuned configurations.

(a) Vision backbones.
Method          R@40m↑  R@50m↑  R@100m↑
SigLIP2-B [36]  28.42   33.23   45.43
DINOv3-B [30]   63.88   69.44   79.74
DINOv2-B [24]   64.13   69.65   79.95

(b) Partial backbone fine-tuning.
Method         R@40m↑  R@50m↑  R@100m↑
Fully Frozen   58.79   64.66   76.51
Last 4 Layers  66.98   72.25   83.08
Last 6 Layers  65.69   70.64   82.56
All Layers     59.45   64.31   74.78
Moreover, as Figure 5 shows, the model learns to utilize different visual cues from the street-view image at different zoom levels, as expected, effectively dealing with the coverage mismatch problem present in contrastive-retrieval methods. Figure 6 visualizes some of the results and zoom-in sequences.

Table 4: Comparison of different satellite image representations. We compare global [CLS] token features with variants that incorporate local patch tokens from the vision encoder. Our results show that relying on the global [CLS] representation yields the highest localization accuracy.

Method       R@40m↑  R@50m↑  R@100m↑
[CLS] + 256  63.38   68.25   79.83
[CLS] + 64   64.60   70.10   80.70
[CLS] + 32   65.80   72.10   83.02
[CLS] + 16   65.31   71.60   82.60
[CLS] + 8    65.31   70.76   82.12
[CLS] Only   66.98   72.25   83.08

4.4 Ablation Study

To evaluate the contribution of each component, we conducted ablation studies on our dataset. For efficiency, all ablation experiments are performed using a 4 × 4 satellite patch layout with N = 3 zoom steps under a 2 km × 2 km search setup.

Fig. 6: Results. Qualitative visualization of our model's autoregressive localization. The first column shows the input street-view query. The subsequent columns (Lvl 1–4) display the satellite map at each sequential zoom step. On each map, the model's predicted patch is highlighted by the grid overlay, the ground truth (GT) location is marked with a red dot, and the 50 m success radius is shown as a white dotted circle.

Vision encoder choices. The choice of vision encoder is critical for bridging the significant domain gap between street-view and satellite imagery. To assess the inherent capability of different state-of-the-art vision foundation models for this task, we evaluate them as frozen feature extractors, training only the subsequent autoregressive policy module (Table 3).
Self-supervised models such as DINOv2 [24] and DINOv3 [30] significantly outperform the language-aligned SigLIP2 [36]. Although SigLIP2 retains some transferable understanding, its performance is insufficient for the fine-grained spatial discrimination required for accurate geo-localization. Based on these results, we select DINOv2 as the default backbone for all subsequent experiments.

Vision encoder fine-tuning. Although foundation models such as DINOv2 provide strong general-purpose visual features, they are not natively aligned to the drastic viewpoint changes between street-view and satellite imagery. To assess the need for task-specific adaptation, we experimented with different numbers of frozen layers during training. This allows the highest-level semantic features to adapt to the domain shift between street-view and satellite representations. As shown in Table 3, strictly off-the-shelf features are insufficient for high-precision localization. Increasing the number of trainable layers beyond the final six leads to diminishing returns. This observation aligns with the intuition that earlier layers of foundation models encode general geometric and texture-level features, whereas later layers capture higher-level semantic information that benefits from adaptation to cross-view correspondence [14].

Global vs. local patch features. Our default model uses only the [CLS] token from the vision encoder, which serves as a global descriptor for the entire satellite patch. We also investigated whether incorporating local patch tokens from DINOv2 can improve fine-grained matching. To test this, we compare the single-[CLS] baseline against a variant that selectively includes the S patch tokens most similar to the street-view [CLS] token.
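The top-S selection described above can be sketched as a cosine-similarity top-k over patch tokens. This is a minimal illustration of the "[CLS] + S" variants in Table 4 under our own naming; the function name, shapes, and use of NumPy are our assumptions, not the paper's implementation.

```python
import numpy as np

def top_s_patch_tokens(street_cls: np.ndarray,   # (D,) street-view [CLS] embedding
                       sat_patches: np.ndarray,  # (P, D) satellite patch tokens
                       s: int) -> np.ndarray:
    """Return the (s, D) patch tokens with the highest cosine similarity
    to the street-view [CLS] token (illustrative sketch)."""
    q = street_cls / np.linalg.norm(street_cls)
    k = sat_patches / np.linalg.norm(sat_patches, axis=1, keepdims=True)
    sims = k @ q                       # (P,) cosine similarities
    idx = np.argsort(-sims)[:s]        # indices of the s best-matching patches
    return sat_patches[idx]

# Toy check: the patch pointing in the same direction as the query wins.
street = np.array([1.0, 0.0])
patches = np.array([[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
print(top_s_patch_tokens(street, patches, 1))  # [[2. 0.]]
```

The selected tokens would then be concatenated with the global [CLS] descriptor before being fed to the policy; per Table 4, this filtering still underperforms the [CLS]-only baseline.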
As shown in Table 4, incorporating local patch tokens, even when filtered based on semantic similarity, leads to a performance decline. This result suggests that the autoregressive zooming policy benefits from a high-level global representation when making sequential decisions, and that introducing fine-grained local details may introduce noise or over-fit to appearance variations rather than stable structural cues.

5 Conclusion

We introduced Just Zoom In, an autoregressive framework for cross-view geo-localization that replaces single-shot contrastive retrieval with a sequential, coarse-to-fine localization process. By jointly encoding street-view and multi-scale satellite imagery into a shared representation and predicting zoom actions with a causal transformer, our method progressively narrows the search region while preserving global spatial context. This design enables efficient inference that scales logarithmically with geographic area, avoiding the computational and representation bottlenecks of contrastive matching. We further curated a new multi-scale dataset composed of government-sourced orthophotography and crowd-sourced street-view imagery, providing a realistic and reproducible benchmark for cross-view localization. Extensive experiments and ablations demonstrate that our autoregressive formulation yields consistently higher localization accuracy and robustness compared to state-of-the-art contrastive baselines, particularly in challenging real-world conditions. We hope this work motivates a shift from retrieval-based matching toward sequential spatial reasoning in cross-view localization.

References

1. Arandjelovic, R., Gronát, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. CoRR abs/1511.07247 (2015)
2. Berton, G., Masone, C., Caputo, B.: Rethinking visual geo-localization for large-scale applications.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4878–4888 (June 2022)
3. Cai, S., Guo, Y., Khan, S., Hu, J., Wen, G.: Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 8390–8399 (2019). https://doi.org/10.1109/ICCV.2019.00848
4. Mapillary contributors: Mapillary. https://www.mapillary.com/ (2013)
5. Deuser, F., Habel, K., Oswald, N.: Sample4Geo: Hard negative sampling for cross-view geo-localisation. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 16801–16810 (2023). https://doi.org/10.1109/ICCV51070.2023.01545
6. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=YicbFdNTTy
7. Fervers, F., Bullinger, S., Bodensteiner, C., Arens, M., Stiefelhagen, R.: Statewide visual geolocalization in the wild. In: ECCV (2024)
8. Google Maps Platform: Maps Static API (Aug 2025), https://developers.google.com/maps/documentation/maps-static/overview, Google Maps Platform documentation
9. Google Maps Platform: Street View Static API (Aug 2025), https://developers.google.com/maps/documentation/streetview, Google Maps Platform documentation
10. Government of the District of Columbia: Open Data DC. https://opendata.dc.gov/
11. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
12.
Ho, C., Zou, J., Alama, O., Kumar, S.M.J., Chiang, B., Gupta, T., Wang, C., Keetha, N., Sycara, K., Scherer, S.: Map It Anywhere (MIA): Empowering bird's eye view mapping using large-scale public data. In: Advances in Neural Information Processing Systems (2024)
13. Hu, S., Feng, M., Nguyen, R.M.H., Lee, G.H.: CVM-Net: Cross-view matching network for image-based ground-to-aerial geo-localization. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7258–7267 (2018). https://doi.org/10.1109/CVPR.2018.00758
14. Izquierdo, S., Civera, J.: Optimal transport aggregation for visual place recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17658–17668 (2024)
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (eds.) Advances in Neural Information Processing Systems, vol. 25. Curran Associates, Inc. (2012), https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
16. Lee, W., Park, J., Hong, D., Sung, C., Seo, Y., Kang, D., Myung, H.: PIDLoc: Cross-view pose optimization network inspired by PID controllers. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21981–21990 (2025). https://doi.org/10.1109/CVPR52734.2025.02047
17. Lentsch, T., Xia, Z., Caesar, H., Kooij, J.F.P.: SliceMatch: Geometry-guided aggregation for cross-view pose estimation. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 17225–17234 (2023). https://doi.org/10.1109/CVPR52729.2023.01652
18. Lin, T.Y., Belongie, S., Hays, J.: Cross-view image geolocalization. In: 2014 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 891–898 (2014), https://faculty.cc.gatech.edu/~hays/papers/geolocalization_cvpr13_0421.pdf
19.
Liu, L., Li, H.: Lending orientation to neural networks for cross-view geo-localization. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
20. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11966–11976 (2022). https://doi.org/10.1109/CVPR52688.2022.01167
21. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=Bkg6RiCqY7
22. Mi, L., Béchaz, M., Chen, Z., Bosselut, A., Tuia, D.: GeoExplorer: Active geo-localization with curiosity-driven exploration. arXiv preprint (2025)
23. Mi, L., Xu, C., Castillo-Navarro, J., Montariol, S., Yang, W., Bosselut, A., Tuia, D.: ConGeo: Robust cross-view geo-localization across ground view variations. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 214–230. Springer Nature Switzerland, Cham (2025)
24. Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.Y., Xu, H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision (2023)
25. Regmi, K., Shah, M.: Bridging the domain gap for ground-to-aerial image matching. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 470–479 (2019). https://doi.org/10.1109/ICCV.2019.00056
26. Sarkar, A., Sastry, S., Pirinen, A., Zhang, C., Jacobs, N., Vorobeychik, Y.: GOMAA-Geo: Goal modality agnostic active geo-localization. arXiv preprint arXiv:2406.01917 (2024)
27.
Sarlin, P.E., DeTone, D., Yang, T.Y., Avetisyan, A., Straub, J., Malisiewicz, T., Bulo, S.R., Newcombe, R., Kontschieder, P., Balntas, V.: OrienterNet: Visual localization in 2D public maps with neural matching. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21632–21642 (2023). https://doi.org/10.1109/CVPR52729.2023.02072
28. Shi, Y., Liu, L., Yu, X., Li, H.: Spatial-aware feature aggregation for image based cross-view geo-localization. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019), https://proceedings.neurips.cc/paper_files/paper/2019/file/ba2f0015122a5955f8b3a50240fb91b2-Paper.pdf
29. Shugaev, M., Semenov, I., Ashley, K., Klaczynski, M., Cuntoor, N., Lee, M.W., Jacobs, N.: ArcGeo: Localizing limited field-of-view images using cross-view matching. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 208–217 (2024). https://doi.org/10.1109/WACV57701.2024.00028
30. Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (2025)
31. Song, Z., Zhang, Y., Li, K., Wang, L., Guo, Y.: A unified hierarchical framework for fine-grained cross-view geo-localization over large-scale scenarios (2025)
32. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., Liu, Y.: RoFormer: Enhanced transformer with rotary position embedding (2023), https://arxiv.org/abs/2104.09864
33.
Toker, A., Zhou, Q., Maximov, M., Leal-Taixé, L.: Coming down to earth: Satellite-to-street view synthesis for geo-localization. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6484–6493 (2021). https://doi.org/10.1109/CVPR46437.2021.00642
34. Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., Lucic, M., Dosovitskiy, A.: MLP-Mixer: An all-MLP architecture for vision. NIPS '21, Curran Associates Inc., Red Hook, NY, USA (2021)
35. Tong, S., Xia, Z., Alahi, A., He, X., Shi, Y.: GeoDistill: Geometry-guided self-distillation for weakly supervised cross-view localization (2025), https://arxiv.org/abs/2507.10935
36. Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., Zhai, X.: SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features (2025)
37. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. pp. 6000–6010. NIPS '17, Curran Associates Inc., Red Hook, NY, USA (2017)
38. Vo, N.N., Hays, J.: Localizing and orienting street views using overhead imagery. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. pp. 494–509. Springer International Publishing, Cham (2016)
39. Workman, S., Souvenir, R., Jacobs, N.: Wide-area image geolocalization with aerial reference imagery. In: 2015 IEEE International Conference on Computer Vision (ICCV). pp. 3961–3969. IEEE, Santiago, Chile (Dec 2015). https://doi.org/10.1109/ICCV.2015.451
40. Wu, H., Zhang, Z., Lin, S., Mu, X., Zhao, Q., Yang, M., Qin, T.: MapLocNet: Coarse-to-fine feature registration for visual re-localization in navigation maps.
In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 13198–13205 (2024). https://doi.org/10.1109/IROS58592.2024.10802757
41. Xia, Z., Alahi, A.: FG2: Fine-grained cross-view localization by fine-grained feature matching. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6362–6372 (2025). https://doi.org/10.1109/CVPR52734.2025.00596
42. Zhang, B., Sennrich, R.: Root mean square layer normalization. Curran Associates Inc., Red Hook, NY, USA (2019)
43. Zhu, S., Shah, M., Chen, C.: TransGeo: Transformer is all you need for cross-view image geo-localization. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1152–1161 (2022). https://doi.org/10.1109/CVPR52688.2022.00123
44. Zhu, S., Yang, T., Chen, C.: VIGOR: Cross-view image geo-localization beyond one-to-one retrieval. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5316–5325 (2021). https://doi.org/10.1109/CVPR46437.2021.00364
45. Zhu, Y., Yang, H., Lu, Y., Huang, Q.: Simple, effective and general: A new backbone for cross-view image geo-localization (2023)