SCA-Net: Spatial-Contextual Aggregation Network for Enhanced Small Building and Road Change Detection


Automated change detection in remote sensing imagery is critical for urban management, environmental monitoring, and disaster assessment. While deep learning models have advanced this field, they often struggle with challenges like low sensitivity to small objects and high computational costs. This paper presents SCA-Net, an enhanced architecture built upon the Change-Agent framework for precise building and road change detection in bi-temporal images. Our model incorporates several key innovations: a novel Difference Pyramid Block for multi-scale change analysis, an Adaptive Multi-scale Processing module combining shape-aware and high-resolution enhancement blocks, and multi-level attention mechanisms (PPM and CSAGate) for joint contextual and detail processing. Furthermore, a dynamic composite loss function and a four-phase training strategy are introduced to stabilize training and accelerate convergence. Comprehensive evaluations on the LEVIR-CD and LEVIR-MCI datasets demonstrate SCA-Net’s superior performance over Change-Agent and other state-of-the-art methods. Our approach achieves a 2.64-percentage-point improvement in mean Intersection over Union (mIoU) on LEVIR-MCI and a remarkable 57.9-percentage-point increase in IoU for small buildings, while reducing training time by 61%. This work provides an efficient, accurate, and robust solution for practical change detection applications.


💡 Research Summary

The paper introduces SCA‑Net (Spatial‑Contextual Aggregation Network), an enhanced change‑detection architecture built upon the Change‑Agent framework, specifically targeting the detection of small buildings and narrow roads in bi‑temporal remote‑sensing imagery. The backbone is a Siamese SegFormer‑B1 encoder that extracts four hierarchical feature maps from each timestamp. The authors augment this backbone with three major innovations.
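The four-scale pyramid produced by the shared-weight encoder can be sketched with simple arithmetic. This is an illustrative sketch only: the 256×256 input size and the stage strides (4, 8, 16, 32) are standard SegFormer conventions assumed here, not values quoted from the paper.

```python
# Sketch: spatial sizes of the four hierarchical feature maps from a
# Siamese SegFormer-B1 encoder. Input size and strides are assumptions.

def feature_pyramid(h, w, strides=(4, 8, 16, 32)):
    """Return the (height, width) of each encoder stage for an h x w input."""
    return [(h // s, w // s) for s in strides]

# Both timestamps pass through the SAME shared-weight encoder, so each
# yields an identical pyramid of shapes.
t1_shapes = feature_pyramid(256, 256)
t2_shapes = feature_pyramid(256, 256)
assert t1_shapes == t2_shapes  # Siamese property: shared weights, same shapes
print(t1_shapes)  # [(64, 64), (32, 32), (16, 16), (8, 8)]
```

The deepest (1/32) map carries coarse semantics and the shallowest (1/4) map carries fine detail; the modules below decide how change cues flow between these scales.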

  1. Difference Pyramid Block (DPB) – For each feature level the absolute difference between the two timestamps is computed, refined by a 1×1 convolution, and then progressively fused from the deepest (1/32 scale) to the shallowest (1/4 scale) through up‑sampling and 3×3 smoothing. This pyramid‑style refinement propagates high‑level semantic change cues down to finer resolutions, dramatically improving the localization of tiny objects.

  2. Adaptive Multi‑scale Processing – Low‑resolution levels (1/16, 1/32) are processed by a Multi‑Scale Shape Module that contains three dilated convolutions (rates 1, 2, 3) and two directional convolutions (1×5, 5×1) to capture extensive context and elongated structures such as roads. High‑resolution levels (1/4, 1/8) are handled by a High‑Res Enhance block that employs depthwise‑separable 3×3 convolutions followed by a pointwise 1×1 layer and an SE channel‑attention gate, enabling efficient extraction of fine‑grained details crucial for small‑building detection.

  3. Multi‑Level Attention (PPM + CSAGate) – A Pyramid Pooling Module aggregates global context from the deepest feature map using four pooling sizes (1×1, 2×2, 3×3, 6×6). In parallel, a Channel‑Spatial Attention Gate first applies channel‑wise attention and then spatial attention to re‑weight salient channels and locations. These attention streams are merged with the DPB output and fed into a top‑down decoder that uses bilinear up‑sampling and lateral connections to produce the final pixel‑wise change mask.
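The design trade-offs in items 2 and 3 reduce to simple arithmetic: dilated convolutions widen the receptive field without extra weights, and depthwise-separable convolutions cut parameter counts sharply. The sketch below checks both; the channel count (64) is an illustrative assumption, not a figure from the paper.

```python
# Back-of-envelope arithmetic behind the two branch designs.

def dilated_rf(kernel=3, rate=1):
    """Effective receptive field of a single dilated k x k convolution."""
    return kernel + (kernel - 1) * (rate - 1)

# Multi-Scale Shape Module: dilation rates 1, 2, 3 give growing context windows.
print([dilated_rf(3, r) for r in (1, 2, 3)])  # [3, 5, 7]

def conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def sep_conv_params(c_in, c_out, k=3):
    """Depthwise k x k followed by pointwise 1 x 1, as in High-Res Enhance."""
    return c_in * k * k + c_in * c_out

c = 64  # hypothetical channel width
print(conv_params(c, c))      # 36864
print(sep_conv_params(c, c))  # 4672 -> roughly 8x fewer weights
```

This is why the cheap separable branch can afford to run at the expensive 1/4 and 1/8 resolutions, while the dilated/directional branch handles the coarse levels.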

Training is stabilized by a four‑phase dynamic composite loss. Phase 1 (epochs 0‑10) emphasizes Cross‑Entropy; Phase 2 (10‑30) raises Dice and Contrastive weights to improve shape consistency; Phase 3 (30‑60) focuses on Lovász‑Softmax to directly optimize IoU; Phase 4 (60‑100) balances all components. Layer‑wise learning rates are applied (higher for newly added decoder/classification heads, lower for the pretrained backbone) and coordinated augmentations (MixUp, CutMix) are performed identically on both images of a pair to preserve temporal relationships.
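The phase schedule above can be expressed as a plain epoch-to-weights lookup. The phase boundaries follow the text; the weight values themselves are hypothetical, chosen only to illustrate the shifting emphasis, since the paper's exact coefficients are not reproduced in this summary.

```python
# Sketch of the four-phase dynamic composite loss schedule.
# Boundaries (10/30/60/100 epochs) are from the text; weights are illustrative.

def loss_weights(epoch):
    """Return (ce, dice, contrastive, lovasz) weights for a given epoch."""
    if epoch < 10:    # Phase 1: emphasize Cross-Entropy for stable warm-up
        return (1.0, 0.1, 0.1, 0.0)
    elif epoch < 30:  # Phase 2: raise Dice/Contrastive for shape consistency
        return (0.6, 0.5, 0.3, 0.1)
    elif epoch < 60:  # Phase 3: focus on Lovasz-Softmax to optimize IoU directly
        return (0.3, 0.3, 0.2, 0.8)
    else:             # Phase 4: balance all components
        return (0.5, 0.5, 0.5, 0.5)

# The composite loss is then the weighted sum of the four terms, e.g.:
#   total = sum(w * l for w, l in zip(loss_weights(epoch), losses))
print(loss_weights(5), loss_weights(45))
```

Note that coordinated MixUp/CutMix means the same crop box and mixing ratio are applied to both images of a pair, so a "changed" pixel stays changed after augmentation.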

Experiments on LEVIR‑MCI (10,077 image pairs, three classes) and LEVIR‑CD (637 pairs, binary) demonstrate substantial gains. Compared with the original Change‑Agent, SCA‑Net improves mean IoU by 2.64 percentage points (0.8654 → 0.8918) and raises small‑building IoU from 0.1724 to 0.7514, an absolute gain of 57.9 points. Training drops from 200 to 78 epochs, a 61% reduction, while inference speed remains comparable (~15 FPS). Against other state‑of‑the‑art methods (BIT, ChangeFormer, DMINet, SN‑UNet), SCA‑Net attains the highest F1‑Score (0.9462) and IoU (0.9052) on LEVIR‑CD.
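The headline numbers are absolute IoU differences (percentage points), which a reader can verify directly from the figures quoted above:

```python
# Reproducing the reported gains from the quoted per-metric values.

miou_gain = 0.8918 - 0.8654        # mean IoU, Change-Agent -> SCA-Net
small_bldg_gain = 0.7514 - 0.1724  # small-building IoU
epoch_reduction = 1 - 78 / 200     # fraction of training epochs saved

print(round(miou_gain, 4))        # 0.0264 -> 2.64 points
print(round(small_bldg_gain, 3))  # 0.579  -> 57.9 points
print(round(epoch_reduction, 2))  # 0.61   -> 61% fewer epochs
```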

Ablation studies confirm the incremental contribution of each component: the enhanced BI³ layer lifts recall; adaptive multi‑scale processing raises road IoU; the DPB adds semantic refinement; and the combined PPM + CSAGate yields the final performance boost. Qualitative visualizations show that SCA‑Net produces more continuous road segments and cleaner building outlines, with fewer false positives under challenging illumination and shadow conditions, thanks to the global context from the PPM and the robust interaction in the enhanced BI³ layer.

In summary, SCA‑Net effectively integrates a difference pyramid, adaptive multi‑scale feature processing, and hierarchical attention mechanisms to achieve high‑sensitivity, low‑cost change detection for small urban objects. The work sets a new benchmark for precise building and road change monitoring and opens avenues for extending the approach to multi‑spectral or SAR data and for real‑time deployment.

