SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning
📝 Abstract
Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released at https://github.com/Event-AHU/SAM_ChangeCaptioning
💡 Analysis
1. Research Background and Motivation
- RS-CC has high practical value in urban planning, disaster response, and environmental monitoring, but faces four core challenges: multi-temporal registration, the complexity and diversity of change semantics, insufficient fine-grained semantic understanding, and misalignment between language generation and visual changes.
- Existing methods rely on global features or plain Siamese CNN/Transformer architectures, leaving region-level awareness and temporal alignment insufficiently addressed.
2. Core Ideas
| Component | Prior approaches | This paper's contribution |
|---|---|---|
| Visual features | Global CNN/Transformer features | Global features + SAM-based region features (semantic & motion) |
| Change region detection | Pixel-level change maps (FCN, etc.) | Instance masks from SAM plus SuperGlue matching for precise change regions |
| Domain knowledge | Implicit (from data) | Knowledge graph (LLM-generated) supplying object and relation priors |
| Feature fusion | Simple concatenation or attention | Cross-attention that dynamically weights multi-modal (global, region, KG) information |
| Caption generation | LSTM / Transformer decoder | Transformer decoder conditioned on visual and knowledge context |
3. Technical Components
Bi-Temporal Scene Consistency Encoder
- Siamese ResNet-101 → 2D positional embedding → N-layer self-/cross-attention → produces consistency and change priors (C, M).
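The consistency/change priors can be illustrated with a minimal NumPy sketch: tokens from one time step cross-attend to the other, and per-token cosine agreement between a token and its cross-attended counterpart yields a consistency prior C and a change prior M = 1 - C. This is an illustrative stand-in under simplifying assumptions (single head, no learned projections), not the paper's exact encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def consistency_priors(f1, f2):
    """f1, f2: (num_tokens, d) flattened bi-temporal feature maps.
    Returns a per-token consistency prior C and change prior M = 1 - C."""
    d = f1.shape[1]
    # cross-attention: t1 tokens query the t2 tokens (single head, no projections)
    attn = softmax(f1 @ f2.T / np.sqrt(d), axis=-1)
    a1 = attn @ f2                           # t1 tokens contextualized by t2
    num = (f1 * a1).sum(axis=1)
    den = np.linalg.norm(f1, axis=1) * np.linalg.norm(a1, axis=1) + 1e-8
    C = (num / den + 1.0) / 2.0              # cosine mapped from [-1, 1] to [0, 1]
    return C, 1.0 - C

rng = np.random.default_rng(0)
f1 = rng.standard_normal((16, 32))
f2 = f1.copy()
f2[:4] = rng.standard_normal((4, 32))        # simulate changed regions at t2
C, M = consistency_priors(f1, f2)            # changed tokens get a higher M
```

Unchanged tokens attend sharply to their own counterpart, so their change prior stays near zero, while genuinely changed tokens receive a diffuse attention mixture and a higher M.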
SAM-Guided Change Region Mining
- SAM extracts instance masks and a dense feature map.
- RoIAlign → convolutional head produces region descriptors.
- SuperGlue computes matching scores Z between the two time steps → separates matched from unmatched regions → yields motion-level change features.
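A simplified stand-in for the matching step, assuming per-region descriptors (e.g., pooled from SAM masks): mutual-nearest-neighbour matching on a cosine score matrix Z splits regions into matched pairs and unmatched candidates, the latter serving as motion-level change regions. The real SuperGlue is a learned graph neural network with Sinkhorn normalization; this sketch only mimics the interface.

```python
import numpy as np

def match_regions(desc_t1, desc_t2, thresh=0.9):
    """Mutual-nearest-neighbour matching on a cosine score matrix Z.
    desc_*: (n, d) region descriptors (e.g., from SAM masks + RoI pooling).
    Returns matched index pairs plus unmatched regions per time step."""
    d1 = desc_t1 / np.linalg.norm(desc_t1, axis=1, keepdims=True)
    d2 = desc_t2 / np.linalg.norm(desc_t2, axis=1, keepdims=True)
    Z = d1 @ d2.T                                   # matching scores
    nn12, nn21 = Z.argmax(axis=1), Z.argmax(axis=0)
    matches = [(i, j) for i, j in enumerate(nn12)
               if nn21[j] == i and Z[i, j] >= thresh]
    matched_t1 = {i for i, _ in matches}
    matched_t2 = {j for _, j in matches}
    changed_t1 = [i for i in range(len(d1)) if i not in matched_t1]
    changed_t2 = [j for j in range(len(d2)) if j not in matched_t2]
    return matches, changed_t1, changed_t2

rng = np.random.default_rng(1)
base = rng.standard_normal((3, 8))                  # three persistent regions
t2 = np.vstack([base + 0.01 * rng.standard_normal((3, 8)),
                rng.standard_normal((1, 8))])       # plus one new region at t2
matches, changed_t1, changed_t2 = match_regions(base, t2)
```

The new region at t2 fails the mutual-nearest-neighbour test and is flagged as a candidate change, while the three persistent regions match across time steps.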
Semantic Prompt & Grounding DINO
- High-level text prompts (buildings, roads, vegetation, etc.) → Grounding DINO produces bounding boxes → SAM refines them into precise masks.
- Top-q regions selected by text-image similarity → yields semantic-level change features.
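The top-q selection can be sketched as a cosine-similarity ranking, assuming the region and prompt embeddings live in a shared space (CLIP-style); the embedding dimensions and scoring rule here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def select_top_q(region_feats, text_feats, q=2):
    """Pick the q regions best aligned with any semantic prompt embedding.
    Assumes region and text features share an embedding space (CLIP-style)."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    scores = (r @ t.T).max(axis=1)     # best-matching prompt per region
    return np.argsort(-scores)[:q], scores

rng = np.random.default_rng(2)
prompts = rng.standard_normal((3, 16))      # e.g., 'building', 'road', 'vegetation'
regions = rng.standard_normal((5, 16))
regions[2] = prompts[0] + 0.05 * rng.standard_normal(16)  # region 2 matches a prompt
top, scores = select_top_q(regions, prompts, q=1)
```

Taking the max over prompts lets a region qualify by matching any category of interest, which is what makes the selected features semantic-level rather than purely visual.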
Knowledge Graph Construction
- Object/relation triples are generated from the caption corpus with an LLM → relation-aware message passing via R-GCN/CompGCN → yields semantic priors.
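One relation-aware message-passing step in the spirit of R-GCN can be sketched as below; the weight shapes, mean aggregation, and toy triples are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def rgcn_layer(H, triples, W_rel, W_self):
    """One relation-aware message-passing step over (head, rel, tail) triples.
    H: (n, d) entity embeddings; W_rel: (n_rels, d, d); W_self: (d, d)."""
    out = H @ W_self                       # self-loop term
    msgs = np.zeros_like(H)
    counts = np.zeros(len(H))
    for h, r, t in triples:
        msgs[t] += H[h] @ W_rel[r]         # relation-specific message head -> tail
        counts[t] += 1
    out += msgs / np.maximum(counts, 1)[:, None]   # mean aggregation per node
    return np.maximum(out, 0.0)            # ReLU

# toy graph: (building) -connected_to-> (road) -adjacent_to-> (vegetation)
rng = np.random.default_rng(3)
H = rng.standard_normal((3, 8))
W_rel = 0.1 * rng.standard_normal((2, 8, 8))
W_self = np.eye(8)
triples = [(0, 0, 1), (1, 1, 2)]
H_out = rgcn_layer(H, triples, W_rel, W_self)
```

Stacking such layers propagates object-relation priors through the graph, producing the semantic prior embeddings consumed by the fusion stage.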
Fusion & Caption Generation
- Global image embeddings, SAM region embeddings, and KG embeddings are integrated via cross-attention.
- A Transformer decoder generates the natural-language caption from the combined visual and knowledge context.
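The fusion step can be sketched as single-head cross-attention where global image tokens query the concatenated region and KG tokens; no learned projections are used here, so this is an illustrative stand-in for the paper's fusion module, and the fused tokens would then feed a Transformer decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(global_tok, region_tok, kg_tok):
    """Global image tokens (queries) attend over concatenated SAM-region and
    KG tokens (keys/values); a residual keeps the visual backbone signal."""
    kv = np.concatenate([region_tok, kg_tok], axis=0)
    d = global_tok.shape[1]
    attn = softmax(global_tok @ kv.T / np.sqrt(d), axis=-1)
    return global_tok + attn @ kv, attn

rng = np.random.default_rng(4)
g = rng.standard_normal((10, 32))     # global image tokens
r = rng.standard_normal((4, 32))      # SAM region embeddings
k = rng.standard_normal((6, 32))      # knowledge-graph node embeddings
fused, attn = cross_attention_fuse(g, r, k)
```

Because the attention weights are computed per query token, the model can lean on region evidence for changed areas and on KG priors for relational context, which is the "dynamic weighting" the digest describes.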
4. Experiments and Results
- Datasets: three public benchmarks: LEVIR-CC (large-scale), LEVIR-CCD, and Dubai-CCD.
- Metrics: BLEU-4, METEOR, ROUGE-L, CIDEr.
- Performance: exceeds prior state-of-the-art models (FCN-Siamese, Change3D, SFT, etc.) by 5-10% or more across all metrics.
- Ablation study:
- Without SAM → sharp performance drop (CIDEr ↓ ≈12%).
- Without the KG → reduced semantic consistency (METEOR ↓ ≈8%).
- Without cross-attention → imbalance between global and regional features.
5. Strengths
- Precise region awareness: the SAM + SuperGlue combination captures pixel-level changes and object-level semantic changes simultaneously.
- Domain knowledge: the KG supplies relational priors such as "building → connected to road," strengthening the factuality and consistency of captions.
- Modular design: each component (SAM, KG, cross-attention) can be replaced or upgraded independently.
- Zero-shot generalization: SAM's strong zero-shot capability allows application to new regions and sensors.
6. Limitations and Possible Improvements
| Limitation | Suggested improvement |
|---|---|
| SAM computational cost: mask generation on high-resolution satellite imagery consumes substantial GPU memory and time | Adopt a lightweight SAM variant (e.g., SAM-Lite) or a multi-scale prompting strategy |
| Dependence on automatic KG construction: performance varies with the accuracy of LLM-based triple extraction | Improve KG quality via a human-verification pipeline or augmented training |
| Temporal registration: currently uses a simple cosine-similarity-based prior | Strengthen registration with optical flow or a Transformer-based temporal encoder |
| Multi-temporal (>2) extension: only two time steps are handled | Extend to multi-temporal captioning with temporal Transformers or graph time-series models |
7. Future Research Directions
- Multi-modal prompting: combine text, audio, and map data for richer context.
- Real-time deployment: couple streaming satellite data with real-time segmentation models such as SAM-2.
- Domain adaptation: fine-tune SAM and extend the KG for non-optical sensors such as SAR and hyperspectral imagery.
- Explainability: develop an interpreter module that presents change regions and KG relations both visually and textually.
📄 Content
The goal of the remote sensing change captioning task [18] is to express, in natural language, the changes in objects of interest between two given remote sensing images captured at different times. This task can be widely applied in urban planning and land-use monitoring, disaster emergency response, environmental and ecological conservation, as well as military and security surveillance. Although some progress has been made in recent years, significant challenges remain, including difficulties in multi-temporal image alignment and registration, the complexity and diversity of change semantics, insufficient fine-grained semantic understanding, and misalignment between language generation and visual changes. Thus, high-quality remote sensing image change captioning remains an unsolved problem.
Current work is based on CNN [16], LSTM [11], and Transformer [39] networks, which have significantly advanced the research, as shown in Fig. 1. Specifically, Daudt et al. [10] proposed a fully convolutional network (FCN) architecture extended into a Siamese branch structure to learn pixel-level change maps from bi-temporal images. Lv et al. [23] proposed a CNN model based on a UNet backbone, integrating multi-scale attention and change-gradient modules to enhance change detection accuracy. Papadomanolaki et al. [25] proposed a hybrid model combining FCN for spatial feature extraction and LSTM for modeling temporal dependencies, enhancing urban change detection from multi-temporal Sentinel-2 data. Chen et al. [7] proposed a Transformer-based framework for bi-temporal images, using semantic tokens to model spatio-temporal context for improved change detection. Bandara and Patel [2] designed a Transformer-based Siamese architecture, with two branches processing bi-temporal inputs, leveraging multi-scale long-range attention to enhance detail perception in change detection.
Although these works have greatly advanced the field, we believe they are still limited in the following aspects: 1). Existing models primarily focus on employing general or hybrid network architectures, such as CNNs [16] or Transformers [39], to extract visual representations from given remote sensing images. However, few studies have considered incorporating semantic and temporal motion-related changes to provide richer contextual details for enhancing the final textual descriptions. 2). Common object categories in the scene and their relationships play a crucial role in remote sensing change description; for example, objects of interest may include buildings, roads, vegetation, bridges, etc. However, mainstream models fail to explicitly exploit and incorporate this information, often causing the models to focus on insignificant, fine-grained changes and significantly degrading the semantic accuracy of the final results. Thus, it is natural to raise the following question: "How can we effectively mine fine-grained change regions and distinguish the truly relevant changes from them?"
In this paper, we propose a novel remote sensing change captioning framework that synergistically combines pretrained foundation models, multi-level visual representations, and structured domain knowledge. Our key innovation lies in the integration of the Segment Anything Model (SAM) [17], a powerful vision foundation model, as a region-aware change analyzer. Unlike conventional methods that treat images holistically, we leverage SAM to explicitly delineate regions exhibiting semantic-level and motion-level changes between image pairs. This enables our model to answer not only "where did the change occur?" but also "what changed?" and crucially, "is this change relevant or of interest?", a capability essential for generating accurate and meaningful descriptions. In addition, our framework goes further by incorporating a specially constructed knowledge graph that encodes prior information about objects commonly involved in meaningful scene changes. This knowledge acts as a semantic prior, guiding the captioning process with contextual cues that pure visual analysis might miss. The heterogeneous signals are then effectively fused through a cross-attention mechanism, allowing the model to dynamically weigh visual evidence against semantic expectations. Finally, a Transformer-based decoder is adopted to generate fluent and precise natural language captions. Our work bridges the gap between low-level change detection and high-level semantic interpretation, offering a more robust and interpretable solution to this complex multi-modal task. An overview of our proposed framework can be found in Fig. 2.
To sum up, the main contributions of this paper can be summarized as follows:
1). We propose a novel remote sensing change captioning framework that leverages the Segment Anything Model to explicitly identify semantic- and motion-level change regions, enabling accurate localization of what changed and whether it matters.
2). We construct a semantic knowledge graph using large
This content is AI-processed based on ArXiv data.