vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition


Capturing long-range dependencies (LRD) efficiently is a core challenge in visual recognition, and state-space models (SSMs) have recently emerged as a promising alternative to self-attention for addressing it. However, adapting SSMs into CNN-based bottlenecks remains challenging, as existing approaches require complex pre-processing and multiple SSM replicas per block, limiting their practicality. We propose vGamba, a hybrid vision backbone that replaces the standard bottleneck convolution with a single lightweight SSM block, the Gamba cell, which incorporates 2D positional awareness and an attentive spatial context (ASC) module for efficient LRD modeling. Results on diverse downstream vision tasks demonstrate competitive accuracy against SSM-based models such as VMamba and ViM, while achieving significantly improved computation and memory efficiency over the Bottleneck Transformer (BotNet). For example, at $2048 \times 2048$ resolution, vGamba is $2.07 \times$ faster than BotNet and reduces peak GPU memory by 93.8% (1.03 GB vs. 16.78 GB), scaling near-linearly with resolution, comparable to ResNet-50. These results demonstrate that the Gamba bottleneck effectively overcomes the memory and compute constraints of BotNet's global modeling, establishing it as a practical and scalable backbone for high-resolution vision tasks.


💡 Research Summary

The paper addresses the long‑standing challenge of efficiently capturing long‑range dependencies (LRD) in visual recognition. While convolutional neural networks (CNNs) excel at local feature extraction, their limited receptive fields hinder global context modeling. Vision Transformers (ViTs) overcome this with self‑attention (SA) but suffer from quadratic computational and memory complexity with respect to the number of image patches, making them impractical for high‑resolution inputs. Recent work has shown that state‑space models (SSMs), particularly the Mamba architecture, can model LRD with linear complexity in sequence length, offering a promising alternative to SA. However, existing visual SSM adaptations such as VMamba and ViM require multiple SSM replicas per block (four and two, respectively) and complex preprocessing (cross‑scan, cross‑merge, bidirectional passes). This adds considerable architectural overhead and memory consumption, preventing a drop‑in replacement of the standard 3×3 convolution in CNN bottlenecks.
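To make the quadratic-vs-linear contrast concrete, here is a rough back-of-envelope calculation. The patch size of 16 is an assumption for illustration (the summary does not specify one); the point is only how fast the attention-matrix size outgrows the token count:

```python
def num_tokens(resolution: int, patch: int = 16) -> int:
    """Tokens for a square image split into non-overlapping patches."""
    return (resolution // patch) ** 2

def attention_pairs(m: int) -> int:
    """Entries in the M x M self-attention matrix (quadratic in M)."""
    return m * m

for res in (224, 1024, 2048):
    m = num_tokens(res)
    print(f"{res}x{res}: {m} tokens, {attention_pairs(m):,} attention entries")
```

Going from 224×224 to 2048×2048, the token count grows by roughly 84× but the attention matrix by roughly 7000×, which is why an O(M) scan becomes attractive at high resolution.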

The authors propose vGamba, a hybrid backbone that replaces the 3×3 convolution in the final ResNet bottleneck stage with a single lightweight SSM block called the Gamba cell. The Gamba cell integrates three key components:

  1. 2‑D Relative Positional Embedding (RPE) – After reshaping the feature map B×C×H×W into a sequence B×(HW)×C, separate row‑wise and column‑wise positional encodings are summed and added to the token embeddings. This restores spatial awareness lost during flattening and guides the causal SSM scan with explicit 2‑D location cues.

  2. Mamba‑style SSM – The core SSM follows the original Mamba formulation, using dynamic B, C parameters and the selective‑scan implementation. It processes the position‑enriched sequence in a causal manner, achieving O(M) time and memory, where M = H·W.

  3. Attentive Spatial Context (ASC) module – To compensate for the lack of bidirectional interaction inherent in a causal scan, ASC performs separate pooling along height and width, applies per‑channel learnable gates (α) and axis‑wise biases (b_h, b_w), and fuses the two streams. This yields channel‑specific, asymmetric weighting of horizontal and vertical context, enriching the SSM output with fine‑grained local information without re‑running the SSM.
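The three components above can be illustrated with a toy numpy sketch. This is not the paper's implementation: the diagonal first-order recurrence stands in for Mamba's selective scan, and the exact shapes of the gates and the sigmoid fusion in ASC are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, H, W = 1, 8, 4, 4
x = rng.standard_normal((B, C, H, W))

# (1) 2-D relative positional embedding: separate row-wise and column-wise
# encodings (hypothetical learnable parameters) summed onto the tokens.
row_pe = rng.standard_normal((C, H, 1))
col_pe = rng.standard_normal((C, 1, W))
x_pos = x + row_pe + col_pe                  # restores 2-D location cues

# Flatten B x C x H x W -> B x (HW) x C for the causal scan.
seq = x_pos.reshape(B, C, H * W).transpose(0, 2, 1)

# (2) Toy diagonal SSM recurrence (stand-in for the selective scan):
# h_t = a * h_{t-1} + b * u_t,  y_t = c * h_t, processed causally in O(M).
a, b, c = 0.9, 1.0, 1.0
h = np.zeros((B, C))
y = np.empty_like(seq)
for t in range(seq.shape[1]):
    h = a * h + b * seq[:, t]
    y[:, t] = c * h
y_map = y.transpose(0, 2, 1).reshape(B, C, H, W)

# (3) ASC: pool along height and width, apply per-channel gates and
# axis-wise biases, fuse, and reweight the SSM output.
pool_h = y_map.mean(axis=3, keepdims=True)   # B x C x H x 1
pool_w = y_map.mean(axis=2, keepdims=True)   # B x C x 1 x W
alpha = np.full((1, C, 1, 1), 0.5)           # hypothetical learnable gate
b_h = np.zeros((1, C, 1, 1))                 # hypothetical axis-wise biases
b_w = np.zeros((1, C, 1, 1))
context = alpha * (pool_h + b_h) + (1 - alpha) * (pool_w + b_w)
out = y_map * (1.0 / (1.0 + np.exp(-context)))  # sigmoid-weighted refinement

print(out.shape)  # (1, 8, 4, 4)
```

Note that the height and width streams broadcast against each other when fused, so every position receives channel-specific, asymmetric horizontal and vertical context without a second SSM pass.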

Memory efficiency is further boosted by the IO‑aware selective scan, which keeps state updates in SRAM and writes back only the final activations, reducing bandwidth from O(B·M·E·N) to O(B·M·E + E·N). Because the Gamba cell is placed only in the fourth stage (where the feature map is downsampled by a factor of 32, i.e., 7×7 for a typical 224×224 input), the sequence length remains modest, allowing linear‑complexity SSM processing without exploding memory.
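The bandwidth reduction is easy to see numerically. The snippet below plugs in assumed, Mamba-typical values (not taken from the paper) for batch B, sequence length M = H·W, expanded channel dimension E, and state size N:

```python
# Illustrative memory-traffic comparison for the IO-aware selective scan.
B, H, W, E, N = 8, 64, 64, 512, 16
M = H * W

naive_io = B * M * E * N        # materialize every state update in HBM
fused_io = B * M * E + E * N    # keep states in SRAM, write outputs once

print(f"naive: {naive_io:,}  fused: {fused_io:,}  "
      f"reduction: {naive_io / fused_io:.1f}x")
```

With these values the fused scan moves roughly N× (here about 16×) less data, since the O(E·N) term is negligible next to O(B·M·E).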

Experimental results span ImageNet‑1K classification, COCO object detection, and ADE20K semantic segmentation. vGamba‑B (the Gamba‑cell version of a ResNet‑50‑style backbone) matches or exceeds the accuracy of VMamba‑B and ViM‑B (+1.8% and +0.6% Top‑1, respectively) while using fewer parameters (6.11 M and 2.60 M fewer) and lower FLOPs (≈0.3 G and 0.2 G fewer). At ultra‑high resolution (2048×2048), vGamba is 2.07× faster than BotNet and reduces peak GPU memory from 16.78 GB to 1.03 GB, a 93.8% saving. Ablation studies confirm that removing RPE degrades global reasoning, while omitting ASC harms local detail recovery; increasing the number of SSM replicas recovers performance but dramatically raises memory and compute cost.

In summary, vGamba demonstrates that a single, lightweight SSM block equipped with 2‑D positional encoding and a modest attention‑like refinement module can serve as a drop‑in replacement for the conventional convolutional bottleneck. It delivers near‑CNN scaling efficiency, linear‑time global context modeling, and substantial memory savings, making it a practical backbone for high‑resolution vision tasks. The work positions SSMs as a viable, more efficient alternative to self‑attention in modern visual architectures and opens avenues for further scaling and integration into diverse computer‑vision pipelines.

