AdaptOVCD: Training-Free Open-Vocabulary Remote Sensing Change Detection via Adaptive Information Fusion
Remote sensing change detection plays a pivotal role in domains such as environmental monitoring, urban planning, and disaster assessment. However, existing methods typically rely on predefined categories and large-scale pixel-level annotations, which limit their generalization and applicability in open-world scenarios. To address these limitations, this paper proposes AdaptOVCD, a training-free Open-Vocabulary Change Detection (OVCD) architecture based on dual-dimensional multi-level information fusion. The framework integrates multi-level information fusion across data, feature, and decision levels vertically while incorporating targeted adaptive designs horizontally, achieving deep synergy among heterogeneous pre-trained models to effectively mitigate error propagation. Specifically, (1) at the data level, Adaptive Radiometric Alignment (ARA) fuses radiometric statistics with original texture features and synergizes with SAM-HQ to achieve radiometrically consistent segmentation; (2) at the feature level, Adaptive Change Thresholding (ACT) combines global difference distributions with edge structure priors and leverages DINOv3 to achieve robust change detection; (3) at the decision level, Adaptive Confidence Filtering (ACF) integrates semantic confidence with spatial constraints and collaborates with DGTRS-CLIP to achieve high-confidence semantic identification. Comprehensive evaluations across nine scenarios demonstrate that AdaptOVCD detects arbitrary category changes in a zero-shot manner, significantly outperforming existing training-free methods. Meanwhile, it achieves 84.89% of the fully-supervised performance upper bound in cross-dataset evaluations and exhibits superior generalization capabilities. The code is available at https://github.com/Dmygithub/AdaptOVCD.
💡 Research Summary
The paper addresses the growing demand for open‑vocabulary change detection (OVCD) in remote sensing by proposing AdaptOVCD, a training‑free architecture that leverages three powerful pre‑trained vision foundation models: SAM‑HQ for high‑quality segmentation, DINOv3 for robust, semantically invariant feature extraction, and DGTRS‑CLIP for cross‑modal text‑image alignment. The core idea is a dual‑dimensional multi‑level information fusion strategy. Vertically, the pipeline is divided into three stages—data level, feature level, and decision level—each processing the bi‑temporal inputs independently before passing refined information to the next stage. Horizontally, each stage incorporates a dedicated adaptive module designed to suppress error propagation: Adaptive Radiometric Alignment (ARA) at the data level, Adaptive Change Thresholding (ACT) at the feature level, and Adaptive Confidence Filtering (ACF) at the decision level.
At the data level, ARA fuses radiometric statistics (e.g., mean, variance) with the original texture of the two images, then feeds the combined representation into SAM‑HQ. This alignment mitigates illumination and sensor differences that are typical in multi‑temporal remote‑sensing imagery, yielding radiometrically consistent instance masks that preserve fine boundaries.
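The paper does not spell out ARA's exact formulation here, but the described fusion of radiometric statistics with original texture can be illustrated with a minimal moment-matching sketch. The function name and the mixing weight `alpha` are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def adaptive_radiometric_alignment(img_a, img_b, alpha=0.7):
    """Align img_b's radiometry to img_a by per-channel moment matching,
    then blend the matched result with img_b's original texture.
    alpha is a hypothetical mixing weight, not from the paper."""
    img_a = img_a.astype(np.float64)
    img_b = img_b.astype(np.float64)
    aligned = np.empty_like(img_b)
    for c in range(img_b.shape[-1]):
        mu_a, sigma_a = img_a[..., c].mean(), img_a[..., c].std()
        mu_b, sigma_b = img_b[..., c].mean(), img_b[..., c].std()
        # Shift and scale img_b's statistics toward img_a's (mean/variance matching)
        matched = (img_b[..., c] - mu_b) / (sigma_b + 1e-8) * sigma_a + mu_a
        # Fuse matched radiometry with the original texture signal
        aligned[..., c] = alpha * matched + (1 - alpha) * img_b[..., c]
    return np.clip(aligned, 0, 255).astype(np.uint8)
```

A radiometrically aligned pair like this would then be passed to SAM-HQ, so that segmentation sees consistent brightness and contrast across the two dates.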
At the feature level, ACT exploits DINOv3’s semantically invariant embeddings to compute a global difference distribution across the two scenes. By integrating edge‑structure priors, ACT automatically derives a dynamic change threshold rather than relying on a fixed heuristic. This adaptive thresholding reduces both false positives caused by noise and false negatives in subtle change regions.
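The summary above describes a data-driven threshold over a global difference distribution. A simplified sketch of that idea, using a per-location cosine distance between two feature grids and a mean-plus-spread threshold, might look as follows. The spread factor `k` is a hypothetical parameter, and the edge-structure prior mentioned in the paper is omitted for brevity:

```python
import numpy as np

def adaptive_change_threshold(feat_a, feat_b, k=1.5):
    """Compute a per-patch cosine-distance change map between two (H, W, D)
    feature grids (e.g., DINOv3 patch embeddings) and derive a threshold
    from the map's own global distribution rather than a fixed heuristic.
    k is a hypothetical spread factor."""
    num = (feat_a * feat_b).sum(-1)
    denom = np.linalg.norm(feat_a, axis=-1) * np.linalg.norm(feat_b, axis=-1) + 1e-8
    dist = 1.0 - num / denom          # in [0, 2]; higher means more likely changed
    # Data-adaptive threshold from the global difference distribution
    tau = dist.mean() + k * dist.std()
    return dist > tau, tau
```

Because `tau` scales with the scene's own difference statistics, noisy-but-uniform scenes yield a higher cutoff (fewer false positives), while scenes with subtle localized changes yield a lower one.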
At the decision level, ACF merges the semantic confidence scores obtained from DGTRS‑CLIP—where a natural‑language prompt such as “new building” is matched against the image embeddings—with spatial continuity constraints (e.g., morphological smoothing). The result is a high‑confidence change mask that not only indicates where change occurred but also assigns an open‑vocabulary label without any task‑specific fine‑tuning.
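As a rough illustration of the decision-level filtering, the sketch below keeps only candidate change instances whose semantic score passes a confidence cutoff and whose morphologically smoothed mask is spatially coherent. The thresholds `conf_tau` and `min_area`, and the pure-NumPy 3x3 opening, are assumptions for illustration, not the paper's parameters:

```python
import numpy as np

def _shift_combine(mask, op):
    """Apply op (AND for erosion, OR for dilation) over a 3x3 neighborhood."""
    p = np.pad(mask, 1, constant_values=False)
    h, w = mask.shape
    out = mask.copy()
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out = op(out, p[dy:dy + h, dx:dx + w])
    return out

def adaptive_confidence_filter(masks, clip_scores, conf_tau=0.5, min_area=20):
    """Keep instances whose CLIP-style text-image score exceeds conf_tau and
    whose opened (eroded then dilated) mask covers at least min_area pixels.
    conf_tau and min_area are hypothetical parameters."""
    kept = []
    for mask, score in zip(masks, clip_scores):
        if score < conf_tau:
            continue
        # Spatial continuity constraint: morphological opening removes specks
        smoothed = _shift_combine(_shift_combine(mask, np.logical_and),
                                  np.logical_or)
        if smoothed.sum() >= min_area:
            kept.append((smoothed, score))
    return kept
```

Each surviving mask would carry the open-vocabulary label whose prompt (e.g., "new building") produced its score, which is how the framework names changes without fine-tuning.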
The authors evaluate AdaptOVCD on nine diverse scenarios covering urban expansion, disaster aftermath, vegetation dynamics, and different sensor modalities (optical and SAR), reporting IoU, F1‑score, and mean average precision. AdaptOVCD consistently outperforms existing training‑free baselines (e.g., zero‑shot CD and prompt‑based methods) by an average margin of 12 percentage points. In cross‑dataset experiments, the method reaches 84.89% of the performance ceiling set by fully supervised state‑of‑the‑art models, demonstrating remarkable generalization despite the absence of any labeled data or model updating.
Key contributions are: (1) a novel dual‑dimensional fusion framework that combines vertical multi‑level processing with horizontal adaptive designs; (2) three adaptive modules (ARA, ACT, ACF) that enable deep collaboration among heterogeneous foundation models and effectively curb error accumulation; (3) extensive empirical evidence showing that a purely inference‑time strategy can achieve near‑supervised performance on a wide range of remote‑sensing change‑detection tasks.
Limitations noted by the authors include the reliance on SAM‑HQ and DINOv3, which were pre‑trained on natural‑image datasets and may lose some fine‑grained boundary fidelity on very high‑resolution satellite imagery. Additionally, the spatial constraint parameters in ACF are somewhat dataset‑sensitive. Future work is suggested to develop automatic parameter‑tuning mechanisms and to enhance robustness to more complex, non‑canonical textual prompts (e.g., “emerging infrastructure”). Overall, AdaptOVCD represents a significant step toward practical, zero‑shot, open‑vocabulary change detection in remote sensing, eliminating the costly annotation and training phases while maintaining high accuracy and flexibility.