TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models
Audio-Visual Segmentation (AVS) faces a fundamental challenge of effectively aligning audio and visual modalities. While recent approaches leverage foundation models to address data scarcity, they often rely on single-modality knowledge or combine foundation models in an off-the-shelf manner, failing to address the cross-modal alignment challenge. In this paper, we present TAViS, a novel framework that **couples** the knowledge of multimodal foundation models (ImageBind) for cross-modal alignment and a segmentation foundation model (SAM2) for precise segmentation. However, effectively combining these models poses two key challenges: the difficulty of transferring knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only segmentation loss for supervision. To address these challenges, we introduce a text-bridged design with two key components: (1) a text-bridged hybrid prompting mechanism where pseudo text provides class prototype information while retaining modality-specific details from both audio and visual inputs, and (2) an alignment supervision strategy that leverages text as a bridge to align shared semantic concepts within audio-visual modalities. Our approach achieves superior performance on single-source, multi-source, and semantic datasets, and excels in zero-shot settings.
💡 Research Summary
TAViS tackles the core difficulty of audio‑visual segmentation (AVS): aligning the semantic content of audio and visual streams so that the sounding objects can be precisely segmented at the pixel level. Existing AVS methods either rely on a single‑modality foundation model (e.g., a visual SAM or an audio model) or combine multiple foundation models in an off‑the‑shelf fashion, which leaves the cross‑modal alignment problem unsolved and limits performance, especially under data‑scarce conditions.
The proposed framework strategically couples two state‑of‑the‑art foundation models: ImageBind, a multimodal model that learns a shared embedding space for image, text, audio, and video, and SAM2, a segmentation model that excels at generating accurate masks given diverse prompts. Directly merging these models is non‑trivial because their feature spaces differ. TAViS resolves this by introducing text as an intermediate bridge, which serves both as a prompt generator and as a supervision signal.
Key components:
- **ImageBind-Guided Query Decomposition (IBQD).**
- The raw audio signal is encoded by ImageBind’s audio encoder, yielding a trunk feature fₐ and a class token tₐ.
- A set of learnable queries t_W (N of them) interacts with fₐ via multi‑head cross‑attention, producing object‑level biases that are added to tₐ. The result, t′ₐ, retains ImageBind’s well‑aligned audio‑visual semantics while providing the object‑centric queries required by SAM2.
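The IBQD step above can be sketched in PyTorch. This is a hypothetical reconstruction from the description, not the paper's code: the dimensions, head count, and module names (`IBQDSketch`) are assumptions, and only the query-decomposition logic (learnable queries cross-attending to f_a, biases added to t_a) follows the text.

```python
import torch
import torch.nn as nn

class IBQDSketch(nn.Module):
    """Sketch of ImageBind-Guided Query Decomposition (hypothetical shapes/names).

    N learnable queries t_W attend to the ImageBind audio trunk feature f_a;
    the resulting object-level biases are added to the audio class token t_a,
    producing object-centric tokens t'_a for SAM2-style prompting.
    """

    def __init__(self, num_queries: int = 4, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # t_W
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_a: torch.Tensor, t_a: torch.Tensor) -> torch.Tensor:
        # f_a: (B, L, D) audio trunk tokens; t_a: (B, D) audio class token
        B = f_a.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)  # (B, N, D)
        bias, _ = self.cross_attn(q, f_a, f_a)           # object-level biases
        return t_a.unsqueeze(1) + bias                   # t'_a: (B, N, D)
```

The key design point is that t'_a stays in ImageBind's embedding space (it is t_a plus a learned offset), so the well-aligned audio-visual semantics are preserved while each of the N tokens can specialize to a different sounding object.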
- **Text-Bridged Hybrid Prompting.**
- Sparse Prompt: t′ₐ and tₐ are transformed by an MLP and fed into ImageBind’s text encoder to obtain a pseudo‑text embedding p_t that carries class‑prototype information (e.g., “A dog”). Simultaneously, an audio‑only prompt p_a is generated from the same audio tokens via another MLP. The two embeddings are concatenated and passed through a final MLP to produce the sparse prompt p supplied to SAM2’s mask decoder.
- Dense Prompt: The visual frame is processed by ImageBind’s image encoder to obtain an image class token, which is tiled across all pixel locations and injected as a dense prompt t_v into SAM2’s decoder. This aligns the visual features with the same textual semantics used for audio.
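A minimal sketch of the hybrid prompting step, assuming hypothetical dimensions and module names; the pass through ImageBind's text encoder is stubbed out (only the MLP mappings, the concatenation into the sparse prompt, and the tiling of the image class token into the dense prompt follow the description):

```python
import torch
import torch.nn as nn

def mlp(dim_in: int, dim_out: int) -> nn.Sequential:
    """Small two-layer MLP used for each prompt branch (assumed architecture)."""
    return nn.Sequential(nn.Linear(dim_in, dim_out), nn.GELU(), nn.Linear(dim_out, dim_out))

class HybridPromptSketch(nn.Module):
    """Hypothetical sketch of text-bridged hybrid prompting for SAM2's decoder."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.to_text = mlp(dim, dim)    # audio tokens -> pseudo-text branch p_t
        self.to_audio = mlp(dim, dim)   # audio tokens -> audio-only branch p_a
        self.fuse = mlp(2 * dim, dim)   # final MLP producing the sparse prompt p

    def forward(self, audio_tokens, image_cls, h, w):
        # audio_tokens: (B, N, D) object-centric tokens; image_cls: (B, D)
        p_t = self.to_text(audio_tokens)   # pseudo-text embedding (text-encoder pass omitted)
        p_a = self.to_audio(audio_tokens)  # modality-specific audio prompt
        sparse = self.fuse(torch.cat([p_t, p_a], dim=-1))         # (B, N, D)
        # Tile the image class token across all spatial locations as t_v.
        dense = image_cls[:, :, None, None].expand(-1, -1, h, w)  # (B, D, h, w)
        return sparse, dense
```

Keeping both p_t and p_a in the sparse prompt is what lets the prompt carry class-prototype information (via the text bridge) without discarding modality-specific acoustic detail.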
- **Text-Bridged Alignment Supervision.**
- Audio-to-Text loss (Lₐ₂t): The pseudo-text embedding derived from the audio queries is forced to match the embedding of the ground-truth textual label (template "A [class]", e.g., "A dog").
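One way to realize this matching objective is a cosine-distance loss between the two embeddings. The summary does not specify the exact loss form, so the following is an illustrative sketch under that assumption, with hypothetical names:

```python
import torch
import torch.nn.functional as F

def audio_to_text_loss(pseudo_text: torch.Tensor, gt_text: torch.Tensor) -> torch.Tensor:
    """Illustrative L_a2t: pull the audio-derived pseudo-text embedding toward
    the ground-truth label embedding via cosine distance (the paper's exact
    formulation may differ, e.g., a contrastive variant).

    pseudo_text, gt_text: (B, D) embeddings in ImageBind's shared space.
    """
    p = F.normalize(pseudo_text, dim=-1)
    g = F.normalize(gt_text, dim=-1)
    return (1.0 - (p * g).sum(dim=-1)).mean()  # 0 when embeddings are parallel
```

Because both embeddings live in ImageBind's shared space, supervising through the text label transfers class semantics to the audio branch without needing pixel-level labels for that signal.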