Multi-encoder ConvNeXt Network with Smooth Attentional Feature Fusion for Multispectral Semantic Segmentation
This work proposes MeCSAFNet, a multi-branch encoder-decoder architecture for land cover segmentation in multispectral imagery. The model separately processes visible and non-visible channels through dual ConvNeXt encoders, followed by individual decoders that reconstruct spatial information. A dedicated fusion decoder integrates intermediate features at multiple scales, combining fine spatial cues with high-level spectral representations. The feature fusion is further enhanced with CBAM attention, and the ASAU activation function contributes to stable and efficient optimization. The model is designed to process different spectral configurations, including a 4-channel (4c) input combining RGB and NIR bands, as well as a 6-channel (6c) input incorporating NDVI and NDWI indices. Experiments on the Five-Billion-Pixels (FBP) and Potsdam datasets demonstrate significant performance gains. On FBP, MeCSAFNet-base (6c) surpasses U-Net (4c) by +19.21%, U-Net (6c) by +14.72%, SegFormer (4c) by +19.62%, and SegFormer (6c) by +14.74% in mIoU. On Potsdam, MeCSAFNet-large (4c) improves over DeepLabV3+ (4c) by +6.48%, DeepLabV3+ (6c) by +5.85%, SegFormer (4c) by +9.11%, and SegFormer (6c) by +4.80% in mIoU. The model also achieves consistent gains over several recent state-of-the-art approaches. Moreover, compact variants of MeCSAFNet deliver notable performance with lower training time and reduced inference cost, supporting their deployment in resource-constrained environments.
💡 Research Summary
The paper introduces MeCSAFNet (Multi‑encoder ConvNeXt Network with Smooth Attentional Feature Fusion), a novel architecture designed specifically for land‑cover segmentation of multispectral remote‑sensing imagery. The authors argue that most existing methods treat all spectral bands as a single homogeneous input, ignoring the distinct characteristics of visible (RGB) and non‑visible (NIR, plus derived indices such as NDVI and NDWI) channels. To address this, MeCSAFNet splits the input into two modality‑specific streams: one for the visible bands and another for the non‑visible bands. Each stream is processed by an independent ConvNeXt encoder (Tiny, Small, Base, or Large variants), leveraging ConvNeXt's modern design that blends convolutional efficiency with transformer‑inspired components (depth‑wise convolutions, LayerNorm, and GELU activation).
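As a toy illustration of the modality split (not the actual ConvNeXt stages), the sketch below divides a 6‑channel input into visible and non‑visible streams and runs each through a placeholder encoder; the band ordering and the 1×1 channel‑mixing stage are assumptions of this sketch, standing in for full ConvNeXt blocks.

```python
import numpy as np

def tiny_encoder(x, w):
    """Placeholder encoder stage: 1x1 'conv' (channel mixing) plus a smooth
    SiLU-style gate. This is a stand-in for a full ConvNeXt stage, which the
    sketch does not reproduce."""
    # x: (C_in, H, W), w: (C_out, C_in)
    y = np.tensordot(w, x, axes=([1], [0]))   # mix channels at every pixel
    return y * (1.0 / (1.0 + np.exp(-y)))     # smooth nonlinearity

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 32, 32))          # 6c input, assumed order: R,G,B,NIR,NDVI,NDWI

x_vis, x_nonvis = x[:3], x[3:]                # modality-specific split
w_vis = rng.standard_normal((16, 3)) * 0.1    # toy weights for the visible branch
w_nonvis = rng.standard_normal((16, 3)) * 0.1 # toy weights for the non-visible branch

f_vis = tiny_encoder(x_vis, w_vis)            # visible-branch feature map
f_nonvis = tiny_encoder(x_nonvis, w_nonvis)   # non-visible-branch feature map
print(f_vis.shape, f_nonvis.shape)            # (16, 32, 32) (16, 32, 32)
```

Each branch thus produces its own feature pyramid; in the real model these branch features feed both the per‑branch decoders and the fusion decoder described next.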
After encoding, each branch feeds a dedicated decoder that reconstructs spatial details through skip connections and progressive up‑sampling. The core of the network is a separate “Fusion Decoder” that aggregates multi‑scale feature maps from both branches in a pyramid fashion. At each scale, a Convolutional Block Attention Module (CBAM) is applied to model channel‑wise and spatial attention jointly, allowing the network to dynamically emphasize complementary information from the two spectral groups while suppressing irrelevant activations. This attention‑guided fusion is more expressive than naïve concatenation or averaging, as it captures complex inter‑spectral relationships essential for distinguishing spectrally similar land‑cover classes.
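The CBAM gating at a single fusion scale can be sketched in plain NumPy as follows. The MLP width, kernel size, and random weights here are illustrative placeholders; the actual fusion decoder learns these parameters and repeats the module at every pyramid level.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f, w1, w2):
    # f: (C, H, W). Shared two-layer MLP applied to avg- and max-pooled
    # channel descriptors, as in CBAM's channel branch.
    avg = f.mean(axis=(1, 2))
    mx = f.max(axis=(1, 2))
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    return sigmoid(mlp(avg) + mlp(mx))            # per-channel gate in (0, 1)

def spatial_attention(f, k):
    # Stack channel-wise mean/max maps, then apply one small conv kernel k
    # of shape (2, kh, kw) -- a naive loop convolution for clarity.
    maps = np.stack([f.mean(axis=0), f.max(axis=0)])   # (2, H, W)
    kh, kw = k.shape[1:]
    p = np.pad(maps, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    H, W = f.shape[1:]
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(p[:, i:i + kh, j:j + kw] * k)
    return sigmoid(out)                                # per-pixel gate in (0, 1)

def cbam_fuse(f_vis, f_nonvis, w1, w2, k):
    f = np.concatenate([f_vis, f_nonvis], axis=0)      # concat the two branches...
    f = f * channel_attention(f, w1, w2)[:, None, None]  # ...reweight channels...
    return f * spatial_attention(f, k)[None]           # ...then reweight locations

rng = np.random.default_rng(1)
f_vis, f_nonvis = rng.standard_normal((2, 4, 8, 8))
w1, w2 = rng.standard_normal((2, 8)) * 0.1, rng.standard_normal((8, 2)) * 0.1
k = rng.standard_normal((2, 3, 3)) * 0.1
fused = cbam_fuse(f_vis, f_nonvis, w1, w2, k)
print(fused.shape)   # (8, 8, 8)
```

Because both gates lie strictly in (0, 1), the module can only attenuate, never amplify, individual activations; the expressiveness comes from attenuating the two spectral groups differently at each channel and pixel.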
To further improve training stability, the authors replace the standard activation function with ASAU (Adaptive Smooth Activation Unit). ASAU provides a smoother transition in the activation space, mitigating gradient explosion or vanishing problems that can arise when processing high‑dimensional multispectral data, especially when additional derived indices are included. Empirical results show that ASAU contributes to faster convergence and higher final mIoU, particularly in the 6‑channel configuration (RGB + NIR + NDVI + NDWI).
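The paper's exact ASAU formulation is not reproduced here. As a hedged illustration of what a smooth, everywhere‑differentiable activation buys, the sketch below uses a Swish‑like gate x·σ(βx) with a tunable sharpness β; this specific form is an assumption of the example, not the authors' definition.

```python
import numpy as np

def smooth_act(x, beta=1.0):
    # x * sigmoid(beta * x): smooth everywhere, ReLU-like for large |x|.
    # Larger beta sharpens the transition toward a hard ReLU.
    # NOTE: illustrative stand-in for ASAU, not the paper's exact formula.
    return x / (1.0 + np.exp(-beta * x))

def smooth_act_grad(x, beta=1.0):
    # Closed-form derivative: sigma(bx) + beta * x * sigma(bx) * (1 - sigma(bx)).
    # Finite and continuous for all x -- no kink at the origin.
    s = 1.0 / (1.0 + np.exp(-beta * x))
    return s + beta * x * s * (1.0 - s)
```

The continuous gradient at zero is the point: unlike ReLU's hard kink, gradients stay well behaved across the many concatenated, high‑dynamic‑range channels of a 6c input.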
The method is evaluated on two large‑scale benchmarks: Five‑Billion‑Pixels (FBP), a large‑scale Gaofen‑2 land‑cover dataset comprising more than five billion labeled pixels, and the ISPRS Potsdam dataset, a high‑resolution urban scene collection. Experiments are conducted with both 4‑channel (RGB + NIR) and 6‑channel inputs. On FBP, MeCSAFNet‑Base (6c) outperforms U‑Net (4c) by +19.21% mIoU, U‑Net (6c) by +14.72%, SegFormer (4c) by +19.62%, and SegFormer (6c) by +14.74%. On Potsdam, MeCSAFNet‑Large (4c) surpasses DeepLabV3+ (4c) by +6.48%, DeepLabV3+ (6c) by +5.85%, SegFormer (4c) by +9.11%, and SegFormer (6c) by +4.80% in mIoU.
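The two derived indices used in the 6‑channel configuration have standard closed forms, which the following sketch computes from a 4‑band image; the R, G, B, NIR band ordering is an assumption of this example.

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    # Normalized Difference Vegetation Index: (NIR - R) / (NIR + R).
    # High for healthy vegetation (strong NIR reflectance, low red).
    return (nir - red) / (nir + red + eps)

def ndwi(green, nir, eps=1e-6):
    # Normalized Difference Water Index (McFeeters): (G - NIR) / (G + NIR).
    # High for open water (NIR is strongly absorbed by water).
    return (green - nir) / (green + nir + eps)

def to_six_channels(img):
    # img: (4, H, W), assumed band order R, G, B, NIR.
    r, g, b, nir = img
    return np.stack([r, g, b, nir, ndvi(nir, r), ndwi(g, nir)])
```

The small `eps` term guards against division by zero on dark pixels; both indices land in [-1, 1], so they can be stacked with reflectance bands without extra normalization.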
The authors also present lightweight variants (Tiny, Small) that reduce parameters and FLOPs by 30‑45% while still delivering 2‑4 percentage‑point gains over comparable lightweight baselines. Training time is cut by roughly 20% and inference speed reaches near‑real‑time performance on modern GPUs, making the models suitable for deployment on resource‑constrained platforms such as edge devices or satellite onboard processors.
A thorough literature review situates MeCSAFNet among prior dual‑branch and multi‑scale approaches, highlighting that many earlier works either duplicated generic backbones without spectral specialization or relied on simple fusion mechanisms. By combining modality‑specific ConvNeXt encoders, attention‑enhanced multi‑scale fusion, and a smooth activation function, MeCSAFNet demonstrates a balanced trade‑off between accuracy and efficiency.
The paper concludes with a discussion of limitations (e.g., increased architectural complexity compared to single‑branch models, potential sensitivity to the choice of derived indices) and suggests future directions such as incorporating self‑supervised pre‑training on large unlabeled multispectral archives, exploring dynamic branch weighting, and extending the framework to hyperspectral data. All code and training scripts are released publicly, ensuring reproducibility and encouraging further research in multispectral semantic segmentation.