Efficient Spike-driven Transformer for High-performance Drone-View Geo-Localization

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Traditional drone-view geo-localization (DVGL) methods based on artificial neural networks (ANNs) have achieved remarkable performance. However, ANNs rely on dense computation, which results in high power consumption. In contrast, spiking neural networks (SNNs) benefit from spike-driven computation and inherently offer low power consumption. Regrettably, the potential of SNNs for DVGL has yet to be thoroughly investigated. Moreover, in representation learning, the inherent sparsity of spike-driven computation causes loss of critical information and makes it difficult to learn long-range dependencies when aligning heterogeneous visual data sources. To address these challenges, we propose SpikeViMFormer, the first SNN framework designed for DVGL. In this framework, a lightweight spike-driven transformer backbone extracts coarse-grained features. To mitigate the loss of critical information, a spike-driven selective attention (SSA) block is designed, which uses a spike-driven gating mechanism to achieve selective feature enhancement and highlight discriminative regions. Furthermore, a spike-driven hybrid state space (SHS) block is introduced to learn long-range dependencies using a hybrid state space. Moreover, only the backbone is utilized during the inference stage to reduce computational cost. To ensure backbone effectiveness, a novel hierarchical re-ranking alignment learning (HRAL) strategy is proposed. It refines features via neighborhood re-ranking and maintains cross-batch consistency to directly optimize the backbone. Experimental results demonstrate that SpikeViMFormer outperforms state-of-the-art SNNs and achieves performance competitive with advanced ANNs. Our code is available at https://github.com/ISChenawei/SpikeViMFormer


💡 Research Summary

This paper introduces SpikeViMFormer, the first spiking neural network (SNN) framework designed for drone‑view geo‑localization (DVGL), a cross‑view image retrieval task where a drone‑captured image must be matched to a geo‑referenced satellite image. Traditional DVGL solutions rely on artificial neural networks (ANNs) that achieve high accuracy but require dense multiply‑and‑accumulate (MAC) operations, leading to substantial power consumption unsuitable for resource‑constrained platforms such as drones. While SNNs offer event‑driven, sparse computation and thus promise low energy usage, their binary spiking nature often discards critical visual details and struggles to capture long‑range dependencies needed for precise cross‑view alignment.

SpikeViMFormer addresses these challenges through three core components. First, a lightweight spike‑driven transformer backbone processes input images as spike sequences using Leaky‑Integrate‑Fire (LIF) neurons. The backbone extracts coarse‑grained features with only accumulate‑and‑compare (AC) operations, dramatically reducing computational load. Second, the Spike‑Driven Selective Attention (SSA) block introduces a spike‑driven gating mechanism that modulates attention both locally (channel‑wise) and globally (across the token sequence). By allowing only salient spikes to pass, SSA mitigates the information loss caused by the inherent sparsity of SNNs and highlights discriminative regions such as building edges or road intersections. Third, the Spike‑Driven Hybrid State Space (SHS) block learns long‑range dependencies by alternating feature representations between a sequential token format and a 2‑D spatial layout. This hybrid state space enables the model to capture global contextual relationships while preserving local spatial priors through reshaped convolutional operations, thereby reducing confusion between visually similar but geographically distinct areas.
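The LIF dynamics and spike-driven gating described above can be sketched in a few lines. This is a minimal illustration of the general mechanism, not the paper's implementation; the threshold, time constant, and gating function are hypothetical placeholders.

```python
import numpy as np

def lif_step(v, x, v_th=1.0, tau=2.0):
    """One Leaky-Integrate-and-Fire step: leak toward the input,
    emit a binary spike when the membrane potential crosses v_th,
    then hard-reset the neurons that fired."""
    v = v + (x - v) / tau                 # leaky integration
    spikes = (v >= v_th).astype(x.dtype)  # binary spike train
    v = v * (1.0 - spikes)                # reset fired neurons
    return v, spikes

def spike_gated_attention(feats, v_th=1.0):
    """Toy spike-driven gating: only tokens whose LIF neurons fire
    pass through, suppressing sub-threshold (non-salient) activations.
    On spike-driven hardware this gating needs no multiplications."""
    v = np.zeros_like(feats)
    _, gate = lif_step(v, feats, v_th=v_th)
    return feats * gate
```

Because the gate is binary, downstream layers only accumulate where spikes occur, which is the source of the AC-only energy profile mentioned above.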

To ensure that the backbone remains effective when the auxiliary SSA and SHS modules are removed during inference, the authors propose a Hierarchical Re‑ranking Alignment Learning (HRAL) strategy. HRAL recasts conventional re‑ranking (k‑reciprocal, Gaussian weighting, query‑expansion smoothing) as a supervised training signal. It generates refined features through neighborhood re‑ranking, enforces cross‑batch consistency, and aligns the backbone’s raw features with these refined counterparts via a dedicated loss term. Consequently, the backbone learns to produce high‑quality embeddings even without the auxiliary blocks at test time.
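The HRAL idea of turning neighborhood re-ranking into a training target can be sketched as follows. This is a simplified stand-in, assuming plain cosine-similarity k-reciprocal smoothing and an MSE alignment term; the paper's actual pipeline additionally uses Gaussian weighting and query-expansion smoothing across batches.

```python
import numpy as np

def k_reciprocal_refine(feats, k=2):
    """Smooth each feature by averaging its k-reciprocal neighbors:
    neighbor j of i is kept only if i is also among j's top-k.
    The result serves as a refined target for the backbone."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    knn = np.argsort(-sim, axis=1)[:, :k]   # top-k neighbors (incl. self)
    refined = np.empty_like(f)
    for i in range(len(f)):
        recip = [j for j in knn[i] if i in knn[j]]  # reciprocal check
        refined[i] = f[recip].mean(axis=0) if recip else f[i]
    return refined

def alignment_loss(raw, refined):
    """MSE term pulling raw backbone features toward re-ranked targets."""
    return float(np.mean((raw - refined) ** 2))
```

Training against such refined targets is what lets the backbone alone produce re-ranking-quality embeddings at test time, so the auxiliary blocks can be dropped.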

Extensive experiments on public UAV‑Sat datasets and a custom drone‑satellite pairing set demonstrate that SpikeViMFormer outperforms state‑of‑the‑art SNNs (e.g., Spiking‑ResNet, Spiking‑ViT) by 6–8 % in Top‑1 accuracy and improves mean average precision (mAP) by over 5 %. Compared with leading ANN‑based methods such as TransFG and MEAN, SpikeViMFormer achieves comparable accuracy (within 0.5 % gap) while delivering a 13.24× reduction in inference energy and an 8.4× reduction in parameter count. Neuromorphic hardware simulations (e.g., on the Speck ASIC) show inference power below 1 mW, indicating a substantial extension of drone battery life.
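The energy comparison between MAC-based ANNs and AC-based SNNs is typically a back-of-envelope estimate. The sketch below uses the 45 nm CMOS per-operation costs commonly cited in SNN papers (roughly 4.6 pJ per 32-bit MAC and 0.9 pJ per accumulate); the operation counts and spike rate are hypothetical placeholders, not figures from this paper.

```python
# Assumed 45 nm CMOS per-op energies widely used in SNN literature.
E_MAC_PJ = 4.6   # one 32-bit multiply-accumulate
E_AC_PJ = 0.9    # one 32-bit accumulate

def ann_energy_mj(num_macs):
    """ANN inference energy: every MAC fires on every input."""
    return num_macs * E_MAC_PJ * 1e-9   # pJ -> mJ

def snn_energy_mj(num_acs, spike_rate):
    """SNN inference energy: accumulates happen only on spikes,
    so effective ops scale with the firing rate."""
    return num_acs * spike_rate * E_AC_PJ * 1e-9
```

With, say, 10^9 operations and a 20% firing rate, this model predicts an energy advantage of 4.6 / (0.2 x 0.9), i.e. roughly 25x, illustrating how spike sparsity compounds the cheaper per-op cost.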

In summary, SpikeViMFormer demonstrates that SNNs can close the performance gap with ANNs on complex cross‑view geo‑localization tasks by integrating spike‑driven selective attention, hybrid state‑space modeling, and a novel re‑ranking‑based training paradigm. The work provides a practical pathway for deploying high‑accuracy, ultra‑low‑power visual localization on drones, autonomous robots, and other edge devices where energy efficiency is paramount.

