SAM3-I: Segment Anything with Instructions

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3’s existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.


💡 Research Summary

Segment Anything Model 3 (SAM 3) introduced Promptable Concept Segmentation (PCS), allowing users to segment all instances of a short noun‑phrase (e.g., “soccer player”). While this was a major step toward open‑vocabulary segmentation, real‑world interactions often involve richer natural‑language instructions that combine attributes, spatial relations, functions, actions, states, and implicit reasoning. The current SAM 3 solution handles such instructions by delegating them to external multimodal large language models (MLLMs) that rewrite the long instruction into a short noun‑phrase and then perform iterative mask filtering. This pipeline adds computational overhead, separates linguistic reasoning from visual segmentation, and, most importantly, collapses detailed instructions into overly coarse concepts that cannot precisely ground a specific target instance.

The paper proposes SAM 3‑I, an enhanced framework that unifies concept‑level understanding and instruction‑level reasoning within the SAM family. SAM 3‑I introduces Promptable Instruction Segmentation (PIS), a new paradigm that directly accepts three hierarchical instruction types: (i) Concept instructions (short noun‑phrases, identical to PCS), (ii) Simple instructions (noun‑phrase plus explicit attributes, spatial cues, or local context), and (iii) Complex instructions (no explicit noun‑phrase; the model must infer the target from functional descriptions, actions, affordances, or multi‑step reasoning). To support this hierarchy without altering the original SAM 3 backbone, the authors design an Instruction‑aware Cascaded Adapter inserted into each layer of the text encoder. The cascade consists of two lightweight adapters:

  • S‑Adapter – learns to encode attribute, position, and relation semantics, handling simple instructions where the target noun‑phrase is still present.
  • C‑Adapter – builds on the S‑Adapter and is responsible for complex instructions that lack an explicit noun‑phrase, requiring multi‑hop or contextual reasoning.

Both adapters use a bottleneck (down‑projection → ReLU → up‑projection) and incorporate a multi‑head self‑attention (MHSA) block to capture long‑range textual dependencies. Because the SAM 3 visual encoder and detector remain frozen, the adapters inject instruction‑following capability in a parameter‑efficient manner, preserving the strong concept recall of the original model.
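A minimal PyTorch sketch of this cascaded adapter design is given below. The bottleneck width, head count, and the exact placement of the residual connection are illustrative assumptions, not values taken from the paper:

```python
import torch
import torch.nn as nn

class InstructionAdapter(nn.Module):
    """One lightweight adapter (the S- or C-Adapter): MHSA followed by a
    bottleneck (down-projection -> ReLU -> up-projection), with a residual
    connection so the frozen backbone's features pass through unchanged
    at initialization time."""

    def __init__(self, dim: int, bottleneck: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.down = nn.Linear(dim, bottleneck)  # down-projection
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)    # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # MHSA captures long-range dependencies across instruction tokens.
        h, _ = self.attn(x, x, x)
        # Bottleneck keeps the adapter parameter-efficient.
        h = self.up(self.act(self.down(h)))
        return x + h

class CascadedAdapterLayer(nn.Module):
    """The cascade inserted into each text-encoder layer: the C-Adapter
    builds on the S-Adapter's output, mirroring the simple -> complex
    instruction hierarchy."""

    def __init__(self, dim: int):
        super().__init__()
        self.s_adapter = InstructionAdapter(dim)  # attributes, positions, relations
        self.c_adapter = InstructionAdapter(dim)  # reasoning over implicit targets

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c_adapter(self.s_adapter(x))
```

Because only these adapter parameters are trainable, instruction-following capability is injected without touching the frozen SAM 3 encoder and detector weights.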

Training proceeds in three curriculum stages:

  1. Stage 1 – Simple‑instruction learning – SAM 3 is frozen; only the S‑Adapter is trained on a dataset of simple instructions, establishing grounding of categories, attributes, and spatial relations.

  2. Stage 2 – Complex‑instruction reasoning – The S‑Adapter is fine‑tuned, and the C‑Adapter (initialized from the S‑Adapter) is trained on complex instructions, endowing the system with functional and reasoning‑level grounding.

  3. Stage 3 – Joint alignment refinement – Both adapters are activated together and jointly fine‑tuned using two alignment losses in addition to the original segmentation loss:

    • Distribution Alignment (KL divergence) forces the mask probability distributions from the simple‑branch (p_simple) and complex‑branch (p_complex) to be consistent, anchoring the new branches to SAM 3’s original detector.
    • Uncertainty‑aware Hard‑Region Supervision computes a per‑pixel uncertainty map based on Jensen–Shannon divergence between the two branches; this map weights an auxiliary cross‑entropy loss, encouraging the model to focus on ambiguous or reasoning‑intensive regions (e.g., occlusions, relational cues).

The overall training objective is L_train = L_seg + L_align + L_hard. This progressive curriculum mirrors the natural hierarchy of linguistic difficulty, stabilizes optimization, and prevents catastrophic forgetting of the original concept‑driven abilities.
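The two alignment terms can be sketched as follows, treating each branch's mask output as a per-pixel Bernoulli probability. Whether the paper uses one-sided or symmetric KL, and the relative weighting of the loss terms, are assumptions here:

```python
import torch
import torch.nn.functional as F

def bern_kl(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-pixel KL divergence between Bernoulli mask probabilities p and q."""
    return (p * ((p + eps) / (q + eps)).log()
            + (1 - p) * ((1 - p + eps) / (1 - q + eps)).log())

def distribution_alignment(p_simple: torch.Tensor, p_complex: torch.Tensor) -> torch.Tensor:
    """L_align: KL-based consistency between the simple- and complex-branch
    mask distributions (symmetrized here, which is an assumption)."""
    return 0.5 * (bern_kl(p_simple, p_complex) + bern_kl(p_complex, p_simple)).mean()

def js_uncertainty(p_simple: torch.Tensor, p_complex: torch.Tensor) -> torch.Tensor:
    """Per-pixel Jensen-Shannon divergence: high where the two branches
    disagree, e.g. at occlusions or relational cues."""
    m = 0.5 * (p_simple + p_complex)
    return 0.5 * (bern_kl(p_simple, m) + bern_kl(p_complex, m))

def hard_region_loss(logits: torch.Tensor, target: torch.Tensor,
                     uncertainty: torch.Tensor) -> torch.Tensor:
    """L_hard: auxiliary cross-entropy weighted by the uncertainty map,
    focusing gradient on ambiguous, reasoning-intensive pixels."""
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (uncertainty.detach() * ce).mean()
```

The total objective then combines these with the original segmentation loss, L_train = L_seg + L_align + L_hard, with L_align anchoring the new branches to SAM 3's original detector behavior.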

A major contribution is the scalable data construction pipeline that transforms existing open‑vocabulary segmentation datasets into a rich instruction‑mask corpus. The pipeline has three stages:

  • Automatic instruction generation – A first MLLM produces multiple positive and negative instructions per target instance, covering both declarative and question formats, and spanning the three taxonomy levels (concept, simple, complex). Positive instructions describe the target using attributes, relations, or functional cues; negative instructions deliberately contradict visual semantics to act as contrastive distractors.
  • Agentic quality inspection – A second MLLM evaluates each generated instruction, filtering out incoherent or mismatched pairs.
  • Human‑assisted correction – Human annotators resolve the remaining ambiguous cases, ensuring high fidelity between language and mask.

The resulting dataset contains thousands of instruction‑mask pairs per image, with four positive and four negative variants for each of the simple and complex levels, providing the diversity needed for robust instruction following.
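The three-stage pipeline can be sketched as a simple orchestration loop. The MLLM generation and inspection calls are stubbed with hypothetical placeholders, since the summary does not specify which models or prompts are used:

```python
from dataclasses import dataclass

@dataclass
class InstructionSample:
    image_id: str
    mask_id: str
    level: str        # "simple" or "complex" (concept prompts come from PCS)
    text: str
    is_positive: bool

def generate_instructions(image_id: str, mask_id: str) -> list[InstructionSample]:
    """Stage 1: a first MLLM would produce four positive and four negative
    instructions per level; stubbed here with template strings."""
    samples = []
    for level in ("simple", "complex"):
        for i in range(4):
            samples.append(InstructionSample(
                image_id, mask_id, level, f"positive {level} instruction {i}", True))
            samples.append(InstructionSample(
                image_id, mask_id, level, f"negative {level} instruction {i}", False))
    return samples

def inspect(sample: InstructionSample) -> bool:
    """Stage 2: a second MLLM scores coherence between instruction and mask;
    stubbed here as accept-all."""
    return True

def build_corpus(pairs: list[tuple[str, str]]) -> list[InstructionSample]:
    corpus = []
    for image_id, mask_id in pairs:
        candidates = generate_instructions(image_id, mask_id)
        kept = [s for s in candidates if inspect(s)]
        # Stage 3 (human-assisted correction) would review rejected or
        # ambiguous cases before they enter the final dataset.
        corpus.extend(kept)
    return corpus
```

The negative instructions serve as contrastive distractors at training time, teaching the model to reject instructions that contradict the visual semantics of the target.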

Experiments evaluate SAM 3‑I on standard referring segmentation benchmarks (RefCOCO, RefCOCO+, RefCOCOg) and on a newly curated set of complex instructions. Compared to the baseline SAM 3 + external agents, SAM 3‑I achieves 5–12% higher Intersection‑over‑Union (IoU) and accuracy across all metrics, while eliminating the need for multi‑step agent calls and reducing inference latency. Ablation studies confirm that both adapters, the alignment losses, and the curriculum stages contribute positively to performance. The authors also release lightweight fine‑tuning scripts that enable domain‑specific adaptation (e.g., medical imaging, industrial inspection) with minimal computational resources.

In summary, SAM 3‑I demonstrates that open‑vocabulary segmentation models can be extended to follow rich natural‑language instructions without sacrificing their original concept‑grounding strength. The hierarchical adapter design, curriculum‑based training, and large‑scale instruction‑centric data pipeline together provide a practical, efficient, and scalable solution for instruction‑driven visual understanding, opening avenues for applications such as home robotics, autonomous driving, and augmented reality where systems must interpret and act upon complex linguistic commands.

