Bidirectional Cross-Perception for Open-Vocabulary Semantic Segmentation in Remote Sensing Imagery

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

High-resolution remote sensing imagery is characterized by densely distributed land-cover objects and complex boundaries, which places higher demands on both geometric localization and semantic prediction. Existing training-free open-vocabulary semantic segmentation (OVSS) methods typically fuse CLIP and vision foundation models (VFMs) using “one-way injection” and “shallow post-processing” strategies, making it difficult to satisfy these requirements. To address this issue, we propose a spatial-regularization-aware dual-branch collaborative inference framework for training-free OVSS, termed SDCI. First, during feature encoding, SDCI introduces a cross-model attention fusion (CAF) module, which guides collaborative inference by injecting each model's self-attention maps into the other. Second, we propose a bidirectional cross-graph diffusion refinement (BCDR) module that enhances the reliability of dual-branch segmentation scores through iterative random-walk diffusion. Finally, we incorporate low-level superpixel structures and develop a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism to further refine object boundaries. Experiments on multiple remote sensing semantic segmentation benchmarks demonstrate that our method achieves better performance than existing approaches. Moreover, ablation studies further confirm that traditional object-based remote sensing image analysis methods leveraging superpixel structures remain effective within deep learning frameworks. Code: https://github.com/yu-ni1989/SDCI.


💡 Research Summary

The paper introduces SDCI, a training‑free open‑vocabulary semantic segmentation framework specifically designed for high‑resolution remote sensing imagery, which is characterized by densely packed objects and intricate boundaries. Existing training‑free methods typically combine CLIP (a vision‑language model) and vision foundation models (VFMs) such as DINO in a “one‑way injection” or shallow post‑processing fashion, limiting their ability to simultaneously capture fine‑grained spatial detail and high‑level semantic meaning. SDCI addresses these limitations through three tightly integrated modules:

  1. Cross‑Model Attention Fusion (CAF) – During feature encoding, both CLIP and DINO process the same image in parallel. The self‑attention maps from each Transformer block are averaged, symmetrized, ℓ1‑normalized, and passed through a ReLU. The resulting attention map from CLIP is injected into DINO’s value vectors, while DINO’s final attention map is injected into CLIP’s last‑layer features. This bidirectional injection enables CLIP’s high‑level semantic cues to guide DINO’s structural representations and vice versa, producing initial logits that are both semantically accurate and spatially precise.

  2. Bidirectional Cross‑Graph Diffusion Refinement (BCDR) – The initial logits are refined globally by constructing two complementary graphs. A semantic graph is built from CLIP features: after ℓ2‑normalization, pairwise cosine similarities are sparsified to the K‑nearest neighbors, scaled by a temperature τ=0.07, and row‑normalized to form a transition matrix T_clip. A structural graph is analogously derived from DINO features, yielding T_dino. Random‑walk diffusion is performed on each graph, and the resulting refined scores are exchanged between the two branches iteratively. The semantic graph corrects structural inconsistencies, while the structural graph merges semantically fragmented regions, leading to globally consistent predictions.

  3. Convex‑Optimization‑Based Superpixel Collaborative Prediction (CSCP) – To sharpen object boundaries, the method generates low‑level superpixels (e.g., via an SLIC‑like algorithm) and formulates a binary labeling problem over superpixel regions. The energy function combines (i) the averaged class scores (logits) from both branches, (ii) a total variation term enforcing label smoothness within each superpixel, and (iii) a regularization term penalizing label discontinuities across neighboring superpixels. The resulting convex problem is solved efficiently with ADMM or a primal‑dual scheme, yielding a final segmentation map that aligns tightly with true object contours.
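The attention preparation described in step 1 can be sketched in NumPy. The function names `fuse_attention` and `inject` are hypothetical, and the paper's exact CAF recipe (e.g., where the ReLU sits relative to the normalization) may differ from this literal reading of the summary:

```python
import numpy as np

def fuse_attention(attn):
    """Prepare one branch's attention map for injection (hypothetical sketch):
    average over heads, symmetrize, l1-normalize rows, then ReLU, following
    the order listed in the summary; the paper's exact recipe may differ."""
    a = attn.mean(axis=0)                # (heads, N, N) -> (N, N)
    a = 0.5 * (a + a.T)                  # symmetrize
    a = a / np.clip(np.abs(a).sum(axis=1, keepdims=True), 1e-8, None)  # l1 rows
    return np.maximum(a, 0.0)            # ReLU

def inject(attn, values):
    """Re-aggregate the other branch's value vectors with the fused map."""
    return attn @ values
```

In this reading, `inject(fused_clip_attn, dino_values)` would realize the CLIP-to-DINO direction, and the symmetric call the reverse.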
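The graph construction and diffusion of step 2 can likewise be sketched in NumPy. `transition_matrix` and `bcdr` are illustrative names, not the authors' code, and the exchange rule shown (each branch diffuses on its own graph, then the refined scores swap branches) is only one plausible reading of the summary:

```python
import numpy as np

def transition_matrix(feats, k=8, tau=0.07):
    """Row-stochastic kNN graph from l2-normalized features (sketch)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                                    # pairwise cosine similarities
    idx = np.argsort(-sim, axis=1)[:, :k]            # keep k nearest neighbours
    mask = np.zeros_like(sim)
    np.put_along_axis(mask, idx, 1.0, axis=1)
    w = np.exp(sim / tau) * mask                     # temperature-scaled weights
    return w / w.sum(axis=1, keepdims=True)          # row-normalize

def bcdr(s_clip, s_dino, t_clip, t_dino, iters=5):
    """One plausible exchange rule: each branch diffuses its scores on its
    own graph, then the refined scores are handed to the other branch."""
    for _ in range(iters):
        r_clip = t_clip @ s_clip                     # semantic-graph diffusion
        r_dino = t_dino @ s_dino                     # structural-graph diffusion
        s_clip, s_dino = r_dino, r_clip              # exchange refined scores
    return 0.5 * (s_clip + s_dino)                   # fuse final scores
```

Here `t_clip` would come from CLIP features and `t_dino` from DINO features, each applied to per-pixel class-score matrices.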
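Step 3's convex labeling can be illustrated with a toy quadratic relaxation over per-superpixel label distributions, solved by simple Jacobi fixed-point updates. This is a stand-in for the ADMM / primal-dual solver the paper describes; the function name `cscp`, the energy weights, and the unary construction are all assumptions:

```python
import numpy as np

def cscp(unary, edges, lam=1.0, iters=50):
    """Toy convex relaxation of the superpixel labeling (illustrative only):
    minimize sum_i ||p_i - u_i||^2 + lam * sum_{(i,j)} ||p_i - p_j||^2
    over per-superpixel label distributions p_i, where u_i averages the
    dual-branch scores over superpixel i and (i,j) ranges over adjacent
    superpixels. The paper instead solves its energy with ADMM/primal-dual."""
    p = unary.astype(float).copy()
    deg = np.zeros(len(unary))
    for i, j in edges:                   # node degrees in the adjacency graph
        deg[i] += 1
        deg[j] += 1
    for _ in range(iters):
        nbr = np.zeros_like(p)
        for i, j in edges:               # sum of neighbouring distributions
            nbr[i] += p[j]
            nbr[j] += p[i]
        p = (unary + lam * nbr) / (1.0 + lam * deg[:, None])
    return p.argmax(axis=1)              # discrete label per superpixel
```

Even this simplified energy shows the intended effect: a superpixel whose unary scores weakly disagree with all of its neighbours is pulled to the neighbourhood's label, which is how boundary noise gets suppressed.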

Extensive experiments on three public remote‑sensing benchmarks—ISPRS Vaihingen, ISPRS Potsdam, and DeepGlobe—demonstrate that SDCI outperforms state‑of‑the‑art training‑free approaches (including CASS, SegEarth‑OV, and various CLIP‑only or DINO‑only baselines) by a substantial margin in mean Intersection‑over‑Union (mIoU) and Boundary F1 scores. Ablation studies reveal that removing any of the three modules leads to a noticeable performance drop, confirming that CAF, BCDR, and CSCP each contribute uniquely and synergistically. Notably, CSCP provides the most pronounced gains on classes with highly complex boundaries (e.g., buildings and roads), validating the importance of low‑level geometric priors in remote‑sensing segmentation.

In summary, SDCI delivers a hierarchical, fully training‑free pipeline that fuses high‑level semantic knowledge from CLIP with fine‑grained structural cues from DINO, refines them through bidirectional graph diffusion, and finally enforces superpixel‑level geometric consistency via convex optimization. This combination enables accurate, boundary‑precise open‑vocabulary segmentation of remote‑sensing imagery without any task‑specific fine‑tuning, and it showcases how classic object‑based analysis (superpixels) can be seamlessly integrated into modern deep‑learning frameworks for geospatial applications.

