Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data
Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
💡 Research Summary
The paper addresses two major drawbacks of current Segment Anything Models (SAM): the need for massive pre‑training datasets and the exclusive reliance on RGB inputs, which limits performance in texture‑poor or ambiguous boundary regions. While recent lightweight variants such as FastSAM, MobileSAM, and EfficientViT‑SAM reduce computational cost, they still require large‑scale training and do not exploit geometric cues.
To overcome these issues, the authors propose Depth‑Aware EfficientViT‑SAM, a lightweight RGB‑D fusion framework built on top of EfficientViT‑SAM. The pipeline works as follows:

1. A pretrained monocular depth estimator (DepthAnything) generates a depth map for each input image.
2. The depth map is duplicated across three channels and fed into a dedicated depth encoder that mirrors the EfficientViT architecture (early Conv‑Block → MBConv → EfficientViT modules).
3. The RGB encoder (the original EfficientViT‑SAM encoder) and the depth encoder produce mid‑level feature tensors F_rgb and F_depth.
4. These tensors are fused by a simple additive operation with a learnable scaling factor α: F_fuse = F_rgb + α·F_depth.

This design keeps the dominant RGB representation while allowing the network to adaptively leverage depth cues. The fused embedding is then passed unchanged to SAM's existing prompt encoder and mask decoder, preserving the interactive segmentation capabilities (point, box, and text prompts).
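The additive fusion step can be sketched as a small PyTorch module. This is a minimal illustration of F_fuse = F_rgb + α·F_depth, not the authors' implementation; the class name and the initial value of α are assumptions.

```python
import torch
import torch.nn as nn


class DepthAwareFusion(nn.Module):
    """Additive mid-level RGB-D fusion with a learnable scalar alpha.

    Sketch of F_fuse = F_rgb + alpha * F_depth; names and the
    initial alpha value are illustrative assumptions.
    """

    def __init__(self, init_alpha: float = 0.5):
        super().__init__()
        # Learnable scalar controlling how much the depth features contribute.
        self.alpha = nn.Parameter(torch.tensor(init_alpha))

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        # Both feature maps must share the same (B, C, H, W) shape,
        # which holds here because the depth encoder mirrors the RGB encoder.
        return f_rgb + self.alpha * f_depth
```

Because α is a single learnable scalar, the network can start close to the pure RGB representation and gradually increase the depth contribution during training.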
The loss function combines binary cross‑entropy (BCE) and Dice loss (λ_mask = 20, λ_dice = 1) with three auxiliary terms: an IoU regression loss (λ_iou = 1), a direct supervision loss on intermediate predictions (λ_direct = 0.5), and a boundary‑focused loss (λ_aux = 0.2). This multi‑task objective encourages pixel‑wise accuracy, overlap quality, reliable confidence estimation, and fine‑grained boundary precision.
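A minimal sketch of the main terms of this objective is shown below, using the stated weights λ_mask = 20, λ_dice = 1, and λ_iou = 1. The auxiliary direct-supervision (λ_direct = 0.5) and boundary (λ_aux = 0.2) terms are omitted for brevity but would be added analogously; the function and argument names are assumptions, and the IoU regression is sketched here as an MSE loss.

```python
import torch
import torch.nn.functional as F


def dice_loss(mask_logits: torch.Tensor, target: torch.Tensor,
              eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss over the spatial dimensions."""
    pred = torch.sigmoid(mask_logits)
    num = 2.0 * (pred * target).sum(dim=(-2, -1)) + eps
    den = pred.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1)) + eps
    return (1.0 - num / den).mean()


def composite_loss(mask_logits: torch.Tensor, target_mask: torch.Tensor,
                   pred_iou: torch.Tensor, true_iou: torch.Tensor,
                   lambda_mask: float = 20.0, lambda_dice: float = 1.0,
                   lambda_iou: float = 1.0) -> torch.Tensor:
    """BCE + Dice + IoU-regression terms of the multi-task objective.

    Sketch only: the direct-supervision and boundary losses from the
    paper are not reproduced here.
    """
    bce = F.binary_cross_entropy_with_logits(mask_logits, target_mask)
    dice = dice_loss(mask_logits, target_mask)
    iou = F.mse_loss(pred_iou, true_iou)  # confidence-calibration term
    return lambda_mask * bce + lambda_dice * dice + lambda_iou * iou
```

The BCE term drives pixel-wise accuracy, the Dice term rewards overlap quality even for small masks, and the IoU regression trains the model to predict its own mask quality, which SAM-style models use to rank candidate masks.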
Training proceeds in two stages. First, the depth encoder is randomly initialized and trained alone for two epochs on a tiny subset of SA‑1B (11.2k images). Second, the whole network is fine‑tuned end‑to‑end for four epochs using the full composite loss. The batch size is 4, the optimizer is AdamW (β1 = 0.9, β2 = 0.999), and training completes in under five hours on two NVIDIA A6000 GPUs.
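The two-stage schedule amounts to freezing the pretrained weights in stage one and unfreezing everything in stage two. A minimal sketch, assuming the model exposes a `depth_encoder` submodule (a hypothetical attribute name):

```python
import torch


def configure_stage(model: torch.nn.Module, stage: int) -> torch.optim.Optimizer:
    """Stage 1: train only the depth encoder; stage 2: fine-tune end-to-end.

    Sketch only; assumes `model.depth_encoder` exists, and omits the
    learning-rate schedule, which the summary does not specify.
    """
    if stage == 1:
        # Freeze the pretrained RGB encoder, prompt encoder, and decoder...
        for p in model.parameters():
            p.requires_grad = False
        # ...and train only the randomly initialized depth encoder.
        for p in model.depth_encoder.parameters():
            p.requires_grad = True
    else:
        # Stage 2: unfreeze everything for end-to-end fine-tuning.
        for p in model.parameters():
            p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, betas=(0.9, 0.999))
```

Warming up the depth encoder in isolation prevents large random gradients from disturbing the pretrained EfficientViT‑SAM weights before the joint fine‑tuning begins.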
Experimental results demonstrate that the proposed model, despite roughly doubling the parameter count (61.3 M → 118.7 M) and FLOPs (69 G → 137 G), remains far more lightweight than the original SAM‑ViT‑H (>600 M parameters, ~3000 G FLOPs). On a single A6000 GPU (fp16, no TensorRT) it processes 31.9 images/s for 1024×1024 inputs, roughly half the 62.8 images/s of the baseline EfficientViT‑SAM‑L2.
Quantitatively, on COCO and LVIS benchmarks the depth‑aware model consistently outperforms EfficientViT‑SAM‑L2 across box‑prompted and point‑prompted settings. For instance, with three point clicks the model achieves 71.6 % mIoU on COCO (versus 67.2 % for the baseline) and 70.4 % on LVIS. Notably, gains are most pronounced for small objects, where depth provides crucial separation cues. When evaluated with automatically generated boxes from ViTDet, YOLOv8, and Grounding‑DINO, the depth‑aware variant still shows higher mean IoU, confirming robustness to imperfect prompts.
Qualitative visualizations illustrate sharper object boundaries and better handling of texture‑less regions, confirming that monocular depth priors act as strong geometric regularizers.
The authors acknowledge limitations: the depth encoder roughly doubles model size and computational load, and performance depends on the quality of the monocular depth estimator, which may degrade under challenging lighting or reflective surfaces. Future work could explore more parameter‑efficient depth encoders, multi‑scale depth feature integration, or uncertainty‑aware depth modeling to mitigate these issues.
In summary, the paper demonstrates that integrating monocular depth cues into a lightweight SAM backbone yields substantial accuracy improvements while requiring only a fraction (≈0.1 %) of the original training data. This work validates depth‑aware fusion as an effective strategy for data‑efficient, geometry‑aware segmentation, opening avenues for deploying SAM‑style models on edge devices and in domains where annotated data are scarce.