DAS-SK: An Adaptive Model Integrating Dual Atrous Separable and Selective Kernel CNN for Agriculture Semantic Segmentation
Semantic segmentation in high-resolution agricultural imagery demands models that strike a careful balance between accuracy and computational efficiency to enable deployment in practical systems. In this work, we propose DAS-SK, a novel lightweight architecture that retrofits selective kernel convolution (SK-Conv) into the dual atrous separable convolution (DAS-Conv) module to strengthen multi-scale feature learning. The model further enhances the atrous spatial pyramid pooling (ASPP) module, enabling the capture of fine-grained local structures alongside global contextual information. Built upon a modified DeepLabV3 framework with two complementary backbones (MobileNetV3-Large and EfficientNet-B3), the DAS-SK model mitigates limitations associated with large dataset requirements, limited spectral generalization, and the high computational cost that typically restricts deployment on UAVs and other edge devices. Comprehensive experiments across three benchmarks (LandCover.ai, VDD, and PhenoBench) demonstrate that DAS-SK consistently achieves state-of-the-art performance while being more efficient than CNN-, transformer-, and hybrid-based competitors. Notably, DAS-SK requires up to 21x fewer parameters and 19x fewer GFLOPs than top-performing transformer models. These findings establish DAS-SK as a robust, efficient, and scalable solution for real-time agricultural robotics and high-resolution remote sensing, with strong potential for broader deployment in other vision domains.
💡 Research Summary
The paper introduces DAS‑SK, a lightweight yet highly accurate semantic segmentation network designed for high‑resolution agricultural imagery. Recognizing the trade‑off between the efficiency of CNNs and the global context modeling of Transformers, the authors propose a novel module—DAS‑SKConv—that fuses Dual Atrous Separable Convolution (DAS‑Conv) with Selective Kernel Convolution (SK‑Conv). DAS‑Conv consists of two parallel branches: a depthwise‑separable atrous branch for efficient channel mixing and a standard atrous branch for expanding the receptive field. Their concatenated output is fed into SK‑Conv, which learns adaptive channel‑wise attention across multiple receptive fields, thereby dynamically emphasizing the most informative scale for each spatial location.
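The two-branch design above can be sketched as a small PyTorch module. This is a minimal, hypothetical reading of DAS‑SKConv, not the authors' reference code: the class name, layer choices, and the SK gate that softmax-weights the two branch outputs (rather than operating on an explicit concatenation, as the paper describes) are our assumptions.

```python
# Hypothetical sketch of a DAS-SKConv-style block; layer choices and the
# exact fusion order are assumptions, not the paper's reference code.
import torch
import torch.nn as nn


class DASSKConv(nn.Module):
    """Dual atrous branches fused by a selective-kernel (SK) channel gate."""

    def __init__(self, in_ch: int, out_ch: int, dilation: int = 4, reduction: int = 4):
        super().__init__()
        # Branch 1: depthwise-separable atrous conv (cheap channel mixing).
        self.sep = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation,
                      groups=in_ch, bias=False),       # depthwise atrous
            nn.Conv2d(in_ch, out_ch, 1, bias=False),   # pointwise
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # Branch 2: standard atrous conv (larger effective receptive field).
        self.std = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation,
                      bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # SK-style gate: global squeeze, then per-branch channel logits.
        hidden = max(out_ch // reduction, 8)
        self.squeeze = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, hidden, 1), nn.ReLU(inplace=True))
        self.gate = nn.Conv2d(hidden, out_ch * 2, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u1, u2 = self.sep(x), self.std(x)
        s = self.squeeze(u1 + u2)                        # fused global descriptor
        logits = self.gate(s).view(x.size(0), 2, -1, 1, 1)
        w = torch.softmax(logits, dim=1)                 # branch-wise channel attention
        return w[:, 0] * u1 + w[:, 1] * u2
```

A block like this preserves spatial resolution, so `DASSKConv(64, 64, dilation=8)` maps a `(B, 64, H, W)` tensor to the same shape, which is what lets the paper stack six of them inside ASPP.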
The architecture adopts a dual‑backbone encoder: MobileNetV3‑Large (≈2.9 M parameters) serves as the primary backbone, while a truncated EfficientNet‑B3 (≈2.3 M parameters) provides complementary features. Features from both backbones (total 960 channels) are processed by an enhanced Atrous Spatial Pyramid Pooling (ASPP) module. This ASPP contains a 1×1 convolution for channel reduction, six DAS‑SKConv blocks with dilation rates {4, 8, 12, 18, 22, 26}, and a strip‑pooling branch that captures long‑range horizontal and vertical dependencies—particularly useful for structured agricultural patterns. The concatenated ASPP output is compressed and passed to a hierarchical decoder that employs skip connections, separable convolutions, and progressive up‑sampling to recover fine‑grained spatial detail.
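The strip-pooling branch is the least standard piece of that ASPP, so a sketch may help. The following is one plausible form (loosely after Hou et al.'s Strip Pooling), assuming the branch gates the feature map with broadcast row/column statistics; the class and layer names are ours, not the paper's.

```python
# Minimal strip-pooling branch: pool each row and column into 1-D strips,
# refine them, broadcast back, and gate the input. A plausible sketch of
# the long-range branch described above, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StripPooling(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1-D convs refine the vertical (h, 1) and horizontal (1, w) strips.
        self.h_conv = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), bias=False)
        self.v_conv = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), bias=False)
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        hs = self.h_conv(F.adaptive_avg_pool2d(x, (h, 1)))  # (b, c, h, 1): row means
        vs = self.v_conv(F.adaptive_avg_pool2d(x, (1, w)))  # (b, c, 1, w): column means
        # Broadcast both strips over the full map and gate the input with them.
        attn = torch.sigmoid(self.fuse(hs.expand(-1, -1, h, w) + vs.expand(-1, -1, h, w)))
        return x * attn
```

Because each output pixel sees an entire row and column of the input, a single such branch captures the long horizontal and vertical dependencies (crop rows, field boundaries) that square atrous kernels cover only at large dilation rates.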
Training uses AdamW with cosine annealing, and performance is measured by mean Intersection‑over‑Union (mIoU) and a composite “Efficiency” metric defined as ΔmIoU·log(Params)/GFLOPs·100 %. Experiments on three public benchmarks—LandCover.ai (5 classes, 512×512), VDD (7 classes, 4000×3000), and PhenoBench (3 classes, 1024×1024)—show that DAS‑SK consistently outperforms state‑of‑the‑art CNN, Transformer, and hybrid models. It achieves mIoU scores of 78.4 %, 71.2 %, and 69.5 % respectively, while using only about 5.2 M parameters and 1.3 GFLOPs. Compared to the best Transformer‑based competitor, DAS‑SK reduces parameter count by up to 21× and FLOPs by up to 19×, yet still delivers higher segmentation accuracy. Ablation studies confirm that the SK‑Conv attention contributes an average 2 % mIoU gain over DAS‑Conv alone, highlighting the importance of adaptive multi‑scale channel weighting.
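Read literally, the composite Efficiency metric can be computed as below. Note the grouping of the expression, the natural-log base, and the units (parameters in millions) are assumptions on our part; the paper may define them differently.

```python
# Hedged sketch of the composite "Efficiency" metric quoted above,
# read as (ΔmIoU * log(Params) / GFLOPs) * 100. Grouping, log base,
# and units (params in millions) are assumptions, not the paper's spec.
import math


def efficiency(delta_miou: float, params_millions: float, gflops: float) -> float:
    return delta_miou * math.log(params_millions) / gflops * 100.0


# Hypothetical example with the DAS-SK budget quoted above
# (5.2 M parameters, 1.3 GFLOPs) and an illustrative +2.0 mIoU gain:
score = efficiency(2.0, 5.2, 1.3)
```

Under this reading, a model scores higher when it converts a small parameter/compute budget into a large mIoU improvement, which matches how the summary uses the metric to compare DAS‑SK against heavier Transformer baselines.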
Real‑time inference tests on edge hardware (NVIDIA Jetson Nano, Raspberry Pi 4) demonstrate frame rates of 15 FPS and 12 FPS respectively, confirming suitability for UAV‑mounted or robotic platforms that require on‑board processing. The authors acknowledge current limitations, such as reliance on RGB inputs and the need for validation on multispectral or hyperspectral data. Future work will explore multimodal extensions, temporal modeling, and lightweight post‑processing (e.g., CRF) to further refine boundary accuracy. Overall, DAS‑SK offers a compelling balance of accuracy, efficiency, and scalability, positioning it as a practical solution for precision agriculture and potentially other high‑resolution remote‑sensing domains.