Samba+: General and Accurate Salient Object Detection via A More Unified Mamba-based Framework
Existing salient object detection (SOD) models are generally constrained by the limited receptive fields of convolutional neural networks (CNNs) and the quadratic computational complexity of Transformers. Recently, the emerging state-space model Mamba has shown great potential in balancing global receptive fields and computational efficiency. As a solution, we propose Saliency Mamba (Samba), a pure Mamba-based architecture that flexibly handles various distinct SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), RGB-D VSOD, and visible-depth-thermal SOD. Specifically, we rethink Mamba's scanning strategy for SOD and introduce a saliency-guided Mamba block (SGMB) featuring a spatial neighborhood scanning (SNS) algorithm that preserves the spatial continuity of salient regions. A context-aware upsampling (CAU) method is also proposed to promote hierarchical feature alignment and aggregation by modeling contextual dependencies. Going one step further, to avoid the “task-specific” problem of previous SOD solutions, we develop Samba+, which is empowered by training Samba in a multi-task joint manner, leading to a more unified and versatile model. Two crucial components are investigated that collaboratively tackle the challenges posed by inputs of arbitrary modalities and by continual adaptation: a hub-and-spoke graph attention (HGA) module facilitates adaptive cross-modal interactive fusion, and a modality-anchored continual learning (MACL) strategy alleviates inter-modal conflicts together with catastrophic forgetting. Extensive experiments demonstrate that Samba individually outperforms existing methods across six SOD tasks on 22 datasets at lower computational cost, while Samba+ achieves even better results on these tasks and datasets with a single trained versatile model. Additional results further demonstrate the potential of our Samba framework.
💡 Research Summary
The paper introduces a novel, unified framework for salient object detection (SOD) that leverages the recent state‑space model (SSM) known as Mamba. Existing SOD methods based on convolutional neural networks (CNNs) suffer from limited receptive fields, while transformer‑based approaches incur quadratic computational cost due to self‑attention. Mamba offers a linear‑complexity alternative that can capture long‑range dependencies, yet it has not been applied to SOD before.
The authors first propose Samba, a pure‑Mamba architecture designed to handle a wide range of SOD tasks (RGB, RGB‑D, RGB‑T, video SOD, RGB‑D video SOD, and visible‑depth‑thermal SOD). Two key innovations address the specific challenges of using Mamba for dense prediction:
- Saliency‑guided Mamba Block (SGMB) – Traditional Mamba decoders flatten 2‑D feature maps into 1‑D sequences using fixed scanning orders (e.g., “Z”, “S”, diagonal). This destroys the spatial continuity of salient patches, which is crucial for accurate segmentation. SGMB introduces a Spatial Neighborhood Scanning (SNS) algorithm that dynamically generates scanning paths so that neighboring patches in the image remain neighbors in the sequence. By preserving spatial continuity, the SSM can exploit its causal modeling while still benefiting from global context.
- Context‑aware Upsampling (CAU) – Existing decoders upsample low‑resolution features with nearest‑neighbor interpolation before merging with high‑resolution features, leading to misalignment and a lack of learnability. CAU pairs patches from shallow and deep layers into subsequences, concatenates them, and feeds the combined sequence to the SSM. The causal nature of the SSM enables deep features to learn the distribution of shallow features and expand to the same spatial shape, effectively aligning hierarchical representations and modeling inter‑level contextual dependencies.
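The paper's exact SNS procedure is not reproduced here, but the core idea — a dynamically generated scanning path in which consecutive sequence positions stay spatially adjacent — can be illustrated with a small greedy scan. `sns_path` is a hypothetical helper name, and the greedy-with-fallback policy is an assumption for illustration, not Samba's actual algorithm:

```python
def sns_path(saliency, h, w):
    """Greedy neighborhood scan over an h x w patch grid (row-major).

    Starts at the most salient patch and repeatedly steps to the most
    salient unvisited 4-neighbor, falling back to the nearest unvisited
    patch (Manhattan distance) when boxed in.  This keeps consecutive
    sequence positions spatially adjacent wherever possible, unlike a
    fixed "Z"/"S" raster order.
    """
    def neighbors(i):
        r, c = divmod(i, w)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w:
                yield rr * w + cc

    start = max(range(h * w), key=lambda i: saliency[i])
    visited, path = {start}, [start]
    while len(path) < h * w:
        cand = [n for n in neighbors(path[-1]) if n not in visited]
        if cand:
            nxt = max(cand, key=lambda i: saliency[i])
        else:
            # No adjacent unvisited patch: jump to the closest one.
            r0, c0 = divmod(path[-1], w)
            nxt = min((i for i in range(h * w) if i not in visited),
                      key=lambda i: abs(divmod(i, w)[0] - r0)
                                  + abs(divmod(i, w)[1] - c0))
        visited.add(nxt)
        path.append(nxt)
    return path
```

On a 4×4 grid with a salient patch in the interior, the path begins at that patch and covers every patch exactly once while preferring adjacent moves.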
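The pairing idea behind CAU can be sketched as a sequence layout: each low-resolution (deep) token leads a subsequence containing the high-resolution (shallow) tokens of its spatial region, so a causal SSM reading the sequence can propagate deep semantics into the aligned shallow positions. `cau_sequence`, the 2× resolution gap, and the row-major ordering are assumptions for illustration, not the paper's implementation:

```python
def cau_sequence(deep, shallow, dh, dw, scale=2):
    """Interleave deep and shallow tokens into CAU-style subsequences.

    `deep` is a row-major list of dh*dw tokens; `shallow` is a row-major
    list of (dh*scale)*(dw*scale) tokens.  For each deep token we emit
    [deep_token, its scale x scale block of shallow tokens], so every
    shallow position is causally preceded by its deep counterpart.
    """
    sh_w = dw * scale
    seq = []
    for r in range(dh):
        for c in range(dw):
            sub = [deep[r * dw + c]]  # deep token leads its subsequence
            for dr in range(scale):
                for dc in range(scale):
                    sub.append(shallow[(r * scale + dr) * sh_w
                                       + (c * scale + dc)])
            seq.extend(sub)
    return seq
```

For a 2×2 deep map and 4×4 shallow map this yields 4 subsequences of 5 tokens each, e.g. the first is the top-left deep token followed by shallow tokens 0, 1, 4, 5.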
Building on Samba, the authors develop Samba+, a single model jointly trained on all six SOD tasks. To make multi‑modal, multi‑task training feasible, two additional components are introduced:
- Hub‑and‑Spoke Graph Attention (HGA) – A learnable hub node aggregates information from modality‑specific “spokes”. This dynamic graph attention replaces task‑specific fusion modules, allowing the same architecture to process any combination of modalities (single‑modal or multi‑modal) with minimal parameter overhead.
- Modality‑Anchored Continual Learning (MACL) – MACL maintains modality‑specific anchor parameters and adds a KL‑divergence regularization that penalizes deviation from previously learned modality distributions. This mitigates inter‑modality interference and catastrophic forgetting, enabling stable joint optimization in both task‑incremental and modality‑incremental scenarios.
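To make the hub-and-spoke aggregation concrete, here is a toy single-head sketch in which a hub vector attends over an arbitrary number of modality spokes; because the weights are computed per available spoke, the same routine handles single-modal and multi-modal inputs. Names, shapes, and the dot-product scoring are assumptions, not the paper's implementation:

```python
import math

def hub_aggregate(spokes, hub):
    """Fuse modality 'spoke' vectors via a hub query.

    `spokes` is a list of feature vectors (one per available modality);
    `hub` is a learnable query vector of the same dimension.  Scaled
    dot-product scores are softmax-normalized into attention weights,
    and the fused feature is the weighted sum of the spokes.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(hub, s) / math.sqrt(len(hub)) for s in spokes]
    m = max(scores)                      # subtract max for stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    fused = [sum(wi * s[k] for wi, s in zip(w, spokes))
             for k in range(len(hub))]
    return fused, w
```

With a single spoke the weight collapses to 1.0 and the spoke passes through unchanged, which is why no task-specific fusion branch is needed for single-modal inputs.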
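The KL-divergence regularization in MACL can be illustrated with a toy penalty term that grows as the jointly trained model's predictive distribution for a modality drifts from its stored anchor. The function name, the KL direction (anchor relative to current), and the weight `lam` are assumptions, since the summary does not give the exact formulation:

```python
import math

def macl_penalty(current_probs, anchor_probs, lam=1.0):
    """KL(anchor || current) regularizer sketch.

    `anchor_probs` is the distribution recorded for a previously learned
    modality; `current_probs` is the model's present distribution.  The
    penalty is zero when they match and increases with drift, discouraging
    catastrophic forgetting during joint multi-task optimization.
    """
    eps = 1e-12  # numerical floor to avoid log(0)
    kl = sum(a * math.log((a + eps) / (c + eps))
             for a, c in zip(anchor_probs, current_probs))
    return lam * kl
```

In training, one such term per anchored modality would be added to the task loss, so gradients trade off new-task fit against stability on earlier modalities.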
Extensive experiments on 22 public datasets covering the six SOD tasks demonstrate that:
- Samba outperforms state‑of‑the‑art (SOTA) methods on each individual task while using fewer FLOPs and parameters, confirming the efficiency of the Mamba backbone combined with SGMB and CAU.
- Samba+ achieves equal or superior performance to the best task‑specific models across all tasks, despite being a single trained network. The advantage is especially pronounced for complex modality combinations such as visible‑depth‑thermal (VDT) and RGB‑D video SOD, where previous prompt‑based or modality‑specific fusion approaches suffer large drops.
- Additional evaluations on tasks emphasizing spatial continuity—camouflaged object detection and skin lesion segmentation—show that the SNS‑driven SGMB indeed preserves spatial coherence, leading to higher-quality masks.
In summary, the paper pioneers the application of state‑space modeling to salient object detection, proposes a set of principled design choices (SNS, SGMB, CAU, HGA, MACL) that together deliver a highly efficient and accurate SOD solution, and demonstrates that a truly unified model can replace a plethora of task‑specific architectures. The approach opens avenues for extending Mamba‑based designs to other segmentation problems, medical imaging, and robotics, where multi‑modal, real‑time performance is essential.