DSFC-Net: A Dual-Encoder Spatial and Frequency Co-Awareness Network for Rural Road Extraction

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, refer to the original arXiv paper.

Accurate extraction of rural roads from high-resolution remote sensing imagery is essential for infrastructure planning and sustainable development. However, this task presents unique challenges in rural settings due to several factors. These include high intra-class variability and low inter-class separability from diverse surface materials, frequent vegetation occlusions that disrupt spatial continuity, and narrow road widths that exacerbate detection difficulties. Existing methods, primarily optimized for structured urban environments, often underperform in these scenarios as they overlook such distinctive characteristics. To address these challenges, we propose DSFC-Net, a dual-encoder framework that synergistically fuses spatial and frequency-domain information. Specifically, a CNN branch is employed to capture fine-grained local road boundaries and short-range continuity, while a novel Spatial-Frequency Hybrid Transformer (SFT) is introduced to robustly model global topological dependencies against vegetation occlusions. Distinct from standard attention mechanisms that suffer from frequency bias, the SFT incorporates a Cross-Frequency Interaction Attention (CFIA) module that explicitly decouples high- and low-frequency information via a Laplacian Pyramid strategy. This design enables the dynamic interaction between spatial details and frequency-aware global contexts, effectively preserving the connectivity of narrow roads. Furthermore, a Channel Feature Fusion Module (CFFM) is proposed to bridge the two branches by adaptively recalibrating channel-wise feature responses, seamlessly integrating local textures with global semantics for accurate segmentation. Comprehensive experiments on the WHU-RuR+, DeepGlobe, and Massachusetts datasets validate the superiority of DSFC-Net over state-of-the-art approaches.


Research Summary

The paper introduces DSFC‑Net, a dual‑encoder network specifically designed for extracting rural roads from high‑resolution remote sensing imagery. Rural road extraction poses three major challenges that differ from urban scenarios: (1) high intra‑class variability and low inter‑class separability caused by diverse surface materials (paved, unpaved, soil, crops), (2) frequent occlusions by vegetation that break spatial continuity, and (3) extremely narrow road widths that lead to severe foreground‑background imbalance. Existing CNN‑based methods excel at capturing local texture but lack global context, while transformer‑based methods provide long‑range dependencies but suffer from a bias toward low‑frequency information, resulting in over‑smoothed boundaries.

DSFC‑Net addresses these issues by combining two parallel encoders: a CNN branch built on ConvNeXt‑v2 and a novel Spatial‑Frequency Hybrid Transformer (SFT) branch. The CNN branch extracts fine‑grained local details, helping to discriminate roads from heterogeneous backgrounds. The SFT branch consists of three components: (i) Spatial Context Aggregator (SCA) that uses multi‑scale convolutions and window‑based self‑attention to gather long‑range spatial cues; (ii) Cross‑Frequency Interaction Attention (CFIA) that explicitly decomposes the feature map into high‑ and low‑frequency components via a Laplacian Pyramid, then lets the two frequency streams interact through a cross‑attention mechanism, thereby preserving high‑frequency edge information while leveraging low‑frequency global topology; (iii) Multi‑scale Feed‑Forward Network (MFFN) that further refines the fused representation. The outputs of SCA and CFIA are summed, passed through a point‑wise convolution, and added back to the input via a residual connection.
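The frequency split and cross-stream interaction described for CFIA can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `laplacian_split` and `cross_attention` are hypothetical, the pyramid has a single level, 2×2 average pooling stands in for the Gaussian blur of a classical Laplacian pyramid, the attention has one head and no learned projections, and the choice of the high-frequency stream as the query is an assumption.

```python
import numpy as np

def laplacian_split(x):
    """Split a (C, H, W) feature map into low- and high-frequency parts.

    Assumes H and W are even. Average pooling replaces the Gaussian
    blur + downsample step of a classical Laplacian pyramid (simplification).
    """
    c, h, w = x.shape
    pooled = x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    low = pooled.repeat(2, axis=1).repeat(2, axis=2)  # nearest-neighbour upsample
    high = x - low                                    # residual = high-frequency detail
    return low, high

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_map, context_map):
    """Single-head cross-attention with identity projections (assumption)."""
    c, h, w = query_map.shape
    q = query_map.reshape(c, h * w).T        # (N, C) tokens from the query stream
    k = context_map.reshape(c, h * w).T      # (N, C) tokens from the context stream
    v = k                                    # values share the context tokens
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)  # (N, N) attention weights
    return (attn @ v).T.reshape(c, h, w)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))      # toy feature map, C=8, H=W=16
low, high = laplacian_split(feat)
# High-frequency tokens query the low-frequency context; residual add keeps edges.
fused = high + cross_attention(high, low)
```

By construction `low + high` reproduces the input exactly, so the split loses no information; the attention step only decides how much low-frequency context each high-frequency location draws on.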

To merge the complementary information from the two branches, the authors propose a Channel Feature Fusion Module (CFFM). CFFM adopts a squeeze‑excitation‑style gating to recalibrate channel‑wise responses, dynamically weighting CNN‑derived texture features and transformer‑derived contextual features before they are passed to the decoder.
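A squeeze-excitation-style gate of the kind attributed to CFFM might look like the following NumPy sketch. The function name `se_fuse`, the concatenation of the two branches, the reduction ratio of 4, and the random weights are all illustrative assumptions; the summary does not specify how the paper combines the branches before gating.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_fuse(cnn_feat, trans_feat, w1, w2):
    """Channel-wise recalibration of concatenated branch features.

    cnn_feat, trans_feat: (C, H, W) arrays from the two encoders.
    w1: (2C/r, 2C) and w2: (2C, 2C/r) weight matrices (hypothetical, bias-free).
    """
    x = np.concatenate([cnn_feat, trans_feat], axis=0)  # (2C, H, W)
    squeezed = x.mean(axis=(1, 2))                      # squeeze: global average pool
    hidden = np.maximum(w1 @ squeezed, 0.0)             # excitation: bottleneck + ReLU
    gate = sigmoid(w2 @ hidden)                         # per-channel weight in (0, 1)
    return x * gate[:, None, None], gate

rng = np.random.default_rng(0)
channels, reduction = 8, 4
cnn_feat = rng.standard_normal((channels, 8, 8))
trans_feat = rng.standard_normal((channels, 8, 8))
w1 = 0.1 * rng.standard_normal((2 * channels // reduction, 2 * channels))
w2 = 0.1 * rng.standard_normal((2 * channels, 2 * channels // reduction))

fused, gate = se_fuse(cnn_feat, trans_feat, w1, w2)
```

The first `channels` entries of `gate` scale the CNN features and the remaining entries scale the transformer features, so the relative weighting of texture and context is decided per channel from global statistics of the input.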

The overall architecture follows a four‑stage encoder‑decoder design. After an initial stem block, each stage contains N_i CNN blocks and L_i SFT layers. Down‑sampling is performed with a 2×2 stride‑2 convolution, and up‑sampling uses transposed convolutions with skip connections that concatenate low‑level fused features with high‑level decoder features. A final segmentation head (1×1 convolution + sigmoid) produces the binary road mask.
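The spatial bookkeeping implied by this design can be checked with a few lines of arithmetic. The sketch below assumes one 2×2 stride-2 down-sampling between consecutive stages, a matching 2×2 stride-2 transposed convolution in the decoder, no padding, and a stem that preserves resolution; only the 2×2 stride-2 convolution is stated in the summary, and the helper names are hypothetical.

```python
def conv_out(size, kernel=2, stride=2, padding=0):
    """Output length of a strided convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def tconv_out(size, kernel=2, stride=2, padding=0):
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel

def trace_sizes(input_size, num_stages=4):
    # Encoder: one down-sampling between consecutive stages (assumption).
    encoder = [input_size]
    for _ in range(num_stages - 1):
        encoder.append(conv_out(encoder[-1]))
    # Decoder: mirror the encoder with transposed convolutions.
    decoder = [encoder[-1]]
    for _ in range(num_stages - 1):
        decoder.append(tconv_out(decoder[-1]))
    return encoder, decoder

encoder_sizes, decoder_sizes = trace_sizes(512)
print(encoder_sizes)  # [512, 256, 128, 64]
print(decoder_sizes)  # [64, 128, 256, 512]
```

Because the decoder sizes are the encoder sizes in reverse, each skip connection can concatenate feature maps of identical height and width without cropping or interpolation.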

Extensive experiments were conducted on three public datasets: WHU‑RuR+ (a large, multi‑country rural road benchmark), DeepGlobe (global satellite imagery), and Massachusetts (high‑resolution aerial images). DSFC‑Net consistently outperformed state‑of‑the‑art methods, including ResU‑Net, D‑LinkNet, Swin‑Unet, and BT‑RoadNet. On the challenging WHU‑RuR+ dataset, DSFC‑Net achieved an F1‑score of 69.93% and an IoU of 53.77%, an absolute improvement of roughly 3–5% over the best prior work. Similar gains were observed on DeepGlobe and Massachusetts, especially in scenarios with narrow, fragmented roads and heavy vegetation occlusion.
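For readers unfamiliar with the two metrics, the sketch below computes them from pixel counts and shows how they relate. When F1 and IoU are computed from the same true-positive, false-positive, and false-negative counts, F1 = 2·IoU / (1 + IoU). Whether the paper aggregates counts over the whole test set or averages per image is not stated here, so the consistency check is only indicative; the function names are illustrative.

```python
def iou(tp, fp, fn):
    """Intersection over union for the road class."""
    return tp / (tp + fp + fn)

def f1(tp, fp, fn):
    """F1-score (harmonic mean of precision and recall)."""
    return 2 * tp / (2 * tp + fp + fn)

def f1_from_iou(iou_value):
    """F1 implied by an IoU computed from the same pixel counts."""
    return 2 * iou_value / (1 + iou_value)

# Toy counts: 50 correct road pixels, 25 false alarms, 25 misses.
toy_iou = iou(50, 25, 25)   # 0.5
toy_f1 = f1(50, 25, 25)     # 2/3, equal to f1_from_iou(0.5)

# Reported WHU-RuR+ IoU of 53.77% implies an F1 close to the reported 69.93%.
implied_f1 = f1_from_iou(0.5377)
print(f"implied F1: {implied_f1:.3f}")
```

The implied value agrees with the reported F1 to within rounding of the IoU, which suggests the two reported figures derive from the same aggregate counts.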

Ablation studies confirmed the contribution of each component: removing CFIA reduced IoU by ~2%, omitting CFFM caused a ~1.5% drop, and replacing the dual‑encoder with a single CNN or transformer encoder degraded performance markedly. The model’s parameter count and FLOPs are comparable to those of existing transformer‑based approaches, which suggests that the Laplacian‑pyramid‑based frequency decomposition adds little computational overhead.

The authors release the source code and pretrained weights on GitHub, facilitating reproducibility and future research. They suggest extensions such as incorporating multispectral bands (e.g., NIR), lightweight variants for real‑time inference on edge devices, and integration with graph‑based road network reconstruction for downstream GIS applications.

In summary, DSFC‑Net demonstrates that explicitly modeling both spatial and frequency domains through a dual‑encoder architecture can effectively overcome the unique challenges of rural road extraction, delivering superior accuracy, robustness to occlusion, and preservation of narrow road connectivity compared with existing CNN or transformer‑only solutions.

