MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba’s latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a spatial block processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability. The code is available at https://github.com/spiderforest/MixerCSeg.
💡 Research Summary
The paper addresses the challenging problem of pixel‑level road crack segmentation, where fine textures, thin structures, and irregular geometries make accurate detection difficult. Existing approaches fall into three families: CNN‑based models excel at local texture extraction but suffer from limited receptive fields; Transformer‑based models capture long‑range dependencies through self‑attention but incur quadratic computational cost; and Mamba‑based models provide linear‑complexity global context yet struggle to fully exploit it in a single forward pass. To bridge these gaps, the authors propose MixerCSeg, a hybrid “mixer” architecture that integrates the strengths of all three paradigms within a single encoder.
The core component, TransMixer, is built upon an analytical decomposition of Mamba’s hidden‑state attention. By sorting the token‑wise decay factor Δt across the channel dimension, the model separates the top‑γ fraction of channels as “global tokens” and the remainder as “local tokens.” Global tokens are fed into a conventional self‑attention block, enriching long‑range correlations, while local tokens undergo a lightweight Local Refinement module consisting of a 1×1 convolution, max‑pooling, and a sigmoid gate. This decoupling preserves Mamba’s linear complexity while granting the encoder explicit pathways for both global and fine‑grained information.
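The decoupling described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the learned 1×1 weights are replaced by an identity matrix, `delta` stands in for the per-channel decay factor, and `gamma` is the global-token fraction.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def trans_mixer_split(x, delta, gamma=0.25):
    """Decoupled-pathway sketch: x is an (N, C) token matrix, delta a
    per-channel decay score (C,). The top-gamma fraction of channels is
    routed through self-attention (global path); the rest go through a
    lightweight local refinement (pointwise mix + sigmoid gate)."""
    N, C = x.shape
    k = max(1, int(gamma * C))
    order = np.argsort(delta)[::-1]      # sort channels by decay factor
    g_idx, l_idx = order[:k], order[k:]

    # Global path: plain single-head self-attention over selected channels.
    g = x[:, g_idx]
    attn = softmax(g @ g.T / np.sqrt(k))
    g_out = attn @ g

    # Local path: pointwise (1x1) mix, then a gate from max-pooling.
    l = x[:, l_idx]
    w = np.eye(C - k)                    # identity stands in for learned 1x1 weights
    l_mixed = l @ w
    gate = 1.0 / (1.0 + np.exp(-l_mixed.max(axis=1, keepdims=True)))
    l_out = l_mixed * gate

    # Scatter both paths back into the original channel order.
    out = np.empty_like(x)
    out[:, g_idx] = g_out
    out[:, l_idx] = l_out
    return out
```

Because attention runs only over the γ-fraction of channels, the quadratic cost is confined to a small slice of the representation, which is the point of the decoupling.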
To further improve edge sensitivity, the Direction‑guided Edge Gated Convolution (DEGConv) module partitions feature maps into non‑overlapping views, computes Sobel‑based horizontal and vertical gradients, and converts them into orientation histograms per spatial cell. Directional embeddings are then injected via horizontal (1×k) and vertical (k×1) convolutions, followed by a gating mechanism that selectively amplifies edge‑relevant features. This design specifically targets the multi‑directional branching patterns typical of cracks, enhancing boundary delineation without a substantial increase in FLOPs.
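A minimal single-channel sketch of the DEGConv pipeline follows, assuming averaging kernels in place of the learned 1×k / k×1 convolutions and a simple magnitude-based sigmoid gate; the cell size, bin count, and kernel length are illustrative parameters, not values from the paper.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D cross-correlation, NumPy only."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
SOBEL_Y = SOBEL_X.T

def deg_conv(feat, k=5, bins=8, cell=4):
    """DEGConv sketch: Sobel gradients -> per-cell orientation histograms,
    directional 1xk / kx1 mixing, sigmoid gate on gradient magnitude."""
    gx = conv2d(np.pad(feat, 1, mode='edge'), SOBEL_X)
    gy = conv2d(np.pad(feat, 1, mode='edge'), SOBEL_Y)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi     # orientation folded into [0, pi)

    # Magnitude-weighted orientation histogram per non-overlapping cell.
    H, W = feat.shape
    hists = []
    for i in range(0, H - cell + 1, cell):
        for j in range(0, W - cell + 1, cell):
            h, _ = np.histogram(ang[i:i + cell, j:j + cell], bins=bins,
                                range=(0, np.pi),
                                weights=mag[i:i + cell, j:j + cell])
            hists.append(h)
    hists = np.array(hists)

    # Directional embeddings: averaging kernels stand in for the learned
    # horizontal (1xk) and vertical (kx1) convolutions.
    pad = k // 2
    h_emb = conv2d(np.pad(feat, ((0, 0), (pad, pad)), mode='edge'),
                   np.ones((1, k)) / k)
    v_emb = conv2d(np.pad(feat, ((pad, pad), (0, 0)), mode='edge'),
                   np.ones((k, 1)) / k)

    # Gate: amplify edge-relevant responses via the gradient magnitude.
    gate = 1.0 / (1.0 + np.exp(-(mag - mag.mean())))
    return (h_emb + v_emb) * gate, hists
```

The separable horizontal/vertical kernels keep the added cost linear in k, consistent with the paper's claim of minimal FLOP overhead for direction-aware gating.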
Finally, the Spatial Refinement Multi-Level Fusion (SRF) module aggregates multi-scale features from the encoder. Low-resolution, globally aware representations are up-sampled and concatenated with high-resolution local features, then refined through a simple 1×1 convolution. The SRF avoids additional attention layers, keeping the overall computational budget low while still providing cross-scale semantic guidance.
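The fusion step reduces to three operations: upsample, concatenate, channel-mix. A minimal NumPy sketch, with a uniform matrix standing in for the learned 1×1 weights:

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling for a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def srf_fuse(low, high, w=None):
    """SRF sketch: up-sample the low-resolution, globally aware features,
    concatenate with high-resolution local features along channels, then
    mix with a single 1x1 convolution (a channel-axis matrix multiply)."""
    c_low, h_low, _ = low.shape
    c_high, h_high, _ = high.shape
    up = upsample_nearest(low, h_high // h_low)
    fused = np.concatenate([up, high], axis=0)   # (c_low + c_high, H, W)
    c_in = c_low + c_high
    if w is None:                                # stand-in for learned weights
        w = np.full((c_high, c_in), 1.0 / c_in)
    return np.einsum('oc,chw->ohw', w, fused)
```

Since a 1×1 convolution touches each spatial position once, the fusion cost grows only linearly with resolution, which is how SRF stays within the 2.05 GFLOP budget.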
Extensive experiments on several public crack segmentation benchmarks (e.g., Crack500, DeepCrack) demonstrate that MixerCSeg achieves state‑of‑the‑art performance, improving mean IoU and F1 scores by 1–2 percentage points over strong baselines such as CrackFormer, SCSegamba, and RestorMixer. Remarkably, this performance is obtained with only 2.05 GFLOPs and 2.54 M parameters, enabling real‑time inference on edge devices.
In summary, the paper’s contributions are threefold: (1) a novel token‑level decoupling strategy (TransMixer) that leverages Mamba’s latent attention to provide dedicated global and local pathways; (2) a direction‑aware edge gating mechanism (DEGConv) that embeds geometric priors to better capture irregular crack structures; and (3) an efficient multi‑scale fusion block (SRF) that refines spatial details without extra cost. Together, these innovations constitute a compelling solution that advances both accuracy and efficiency in crack segmentation, and the released code promises reproducibility and practical adoption.