LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. Additionally, for enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks. The code will be made available soon.


💡 Research Summary

The paper addresses two fundamental shortcomings of current open‑vocabulary semantic segmentation (OVSS) systems: the limited expressiveness of textual prompts and the insufficient pixel‑level detail in visual features derived from CLIP‑based vision‑language models. To overcome these issues, the authors propose LMSeg, a framework that synergistically combines large language models (LLMs) and the Segment‑Anything Model (SAM) with a learnable fusion strategy.

First, GPT‑4 is employed to generate rich, attribute‑level descriptions for each target class. By prompting the LLM to enumerate visual attributes such as color, shape, size, texture, material, position, pattern, action/state, and contextual relationships, the system obtains a set of nine candidate attributes per class. For each attribute, GPT‑4 produces a concise natural‑language sentence (≤ 77 tokens, matching CLIP’s tokenizer limit). These attribute‑specific sentences replace the conventional “a photo of {class}” template, yielding text embeddings that capture fine‑grained semantic nuances and reduce lexical ambiguity (e.g., distinguishing “bat” the animal from “bat” the sports equipment). An offline selection process evaluates the contribution of each attribute, and the top‑k attributes are combined into a final enriched prompt for each class.
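The prompt-enrichment step above can be sketched as follows. This is a minimal illustration, not the authors' code: the attribute names follow the summary, but the per-attribute sentences and selection scores are hypothetical placeholders standing in for GPT‑4 outputs and the offline attribute-evaluation step.

```python
# Sketch of the enriched-prompt pipeline. The attribute list matches
# the summary; the descriptions and scores below are hypothetical
# stand-ins for GPT-4 generations and the offline selection scores.

ATTRIBUTES = [
    "color", "shape", "size", "texture", "material",
    "position", "pattern", "action/state", "context",
]

def build_enriched_prompt(descriptions, scores, top_k=3):
    """Combine the top-k highest-scoring attribute sentences
    into a single enriched prompt for one class."""
    ranked = sorted(ATTRIBUTES, key=lambda a: scores[a], reverse=True)
    return " ".join(descriptions[a] for a in ranked[:top_k])

# Hypothetical GPT-4 outputs for the class "zebra".
desc = {a: f"A zebra with distinctive {a}." for a in ATTRIBUTES}
desc["pattern"] = "A zebra covered in black-and-white stripes."
score = {a: 0.1 for a in ATTRIBUTES}
score.update({"pattern": 0.9, "color": 0.8, "shape": 0.6})

prompt = build_enriched_prompt(desc, score, top_k=3)
print(prompt)
# Starts with the highest-scoring attribute sentence (pattern).
```

In the real pipeline each attribute sentence would additionally be truncated to CLIP's 77-token limit before encoding; that step is omitted here for brevity.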

Second, the visual side is enhanced by integrating SAM’s frozen image encoder with CLIP’s visual encoder. SAM provides high‑resolution, region‑aware features that complement CLIP’s globally‑biased embeddings. A lightweight adapter projects SAM features into the same dimensional space as CLIP features. A learnable weight generator, consisting of local and global branches, produces a per‑layer scalar α that balances the contributions of CLIP (α) and SAM (1‑α) features: Eₖ = α·F_c,ₖ + (1‑α)·F_s,ₖ. The fused representation Eₖ is processed by a Swin‑Transformer block to enrich spatial context, followed by a linear transformer that aligns class‑level textual information with each pixel. An up‑sampling stage then produces the final segmentation map.
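The per-layer fusion Eₖ = α·F_c,ₖ + (1−α)·F_s,ₖ can be illustrated with a minimal NumPy sketch. In the actual model α is produced by a learnable weight generator with local and global branches, and the adapter is trained; here α is a fixed scalar and the adapter is an untrained random linear projection, purely to show the shapes and the weighting.

```python
import numpy as np

rng = np.random.default_rng(0)

def adapter(f_sam, w_proj):
    """Project SAM features into CLIP's channel dimension.
    (H, W, C_sam) @ (C_sam, C_clip) -> (H, W, C_clip)"""
    return f_sam @ w_proj

def fuse(f_clip, f_sam_proj, alpha):
    """Weighted fusion E_k = alpha * F_c,k + (1 - alpha) * F_s,k."""
    return alpha * f_clip + (1.0 - alpha) * f_sam_proj

# Toy feature maps; channel sizes are illustrative, not the paper's.
H, W, C_clip, C_sam = 16, 16, 512, 256
f_clip = rng.standard_normal((H, W, C_clip))
f_sam = rng.standard_normal((H, W, C_sam))
w_proj = rng.standard_normal((C_sam, C_clip)) / np.sqrt(C_sam)

e_k = fuse(f_clip, adapter(f_sam, w_proj), alpha=0.7)
print(e_k.shape)  # (16, 16, 512)
```

Downstream, Eₖ would be passed through the Swin-Transformer block and the linear transformer described above; those trained components are beyond a shape-level sketch.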

Third, to mitigate the computational burden of handling a large number of class tokens, the authors introduce a Category Filtering Module (CFM). After computing the initial cost map M(i, j, n) as the cosine similarity between visual features and text embeddings, the module extracts the maximum response per class across spatial dimensions, selects the top‑q tokens, normalizes them, and recomputes a refined cost map M′ using only these filtered tokens. This reduces memory usage, accelerates inference, and suppresses noisy or irrelevant class activations.
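The filtering logic can be sketched as below, under stated assumptions: visual features of shape (H, W, D) and N class-text embeddings of shape (N, D) are L2-normalized so their dot product is the cosine-similarity cost map M(i, j, n). The function names and toy dimensions are illustrative, not from the paper.

```python
import numpy as np

def category_filter(visual, text, q):
    """Sketch of the Category Filtering Module: build the cosine cost
    map, keep the top-q classes by maximum spatial response, and
    recompute a refined cost map from those tokens only."""
    v = visual / np.linalg.norm(visual, axis=-1, keepdims=True)
    t = text / np.linalg.norm(text, axis=-1, keepdims=True)
    cost = np.einsum("hwd,nd->hwn", v, t)          # initial cost map M
    peak = cost.max(axis=(0, 1))                   # max response per class
    keep = np.argsort(peak)[::-1][:q]              # indices of top-q tokens
    refined = np.einsum("hwd,nd->hwn", v, t[keep])  # refined cost map M'
    return refined, keep

rng = np.random.default_rng(1)
visual = rng.standard_normal((8, 8, 64))   # toy pixel features
text = rng.standard_normal((100, 64))      # toy embeddings for 100 classes
m_refined, kept = category_filter(visual, text, q=10)
print(m_refined.shape, kept.shape)  # (8, 8, 10) (10,)
```

Because the refined map only carries q class channels instead of N, the subsequent decoding layers operate on a much smaller tensor, which is where the memory and latency savings come from.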

Extensive experiments on benchmark datasets such as PC‑459, COCO‑Stuff, and PASCAL‑Context demonstrate that LMSeg consistently outperforms prior state‑of‑the‑art methods (e.g., ZegFormer, OV‑Seg, CAT‑Seg, SED). Notably, LMSeg achieves a new best mean Intersection‑over‑Union (mIoU) of 20.3% while maintaining lower inference latency, indicating suitability for real‑time applications. Ablation studies confirm that each component—LLM‑generated enriched prompts, SAM‑CLIP feature fusion, and category filtering—contributes significantly to the overall performance gain.

In summary, LMSeg establishes a novel paradigm for open‑vocabulary semantic segmentation by jointly leveraging large‑scale language understanding for richer textual cues and high‑resolution visual features from SAM, all integrated through a learnable, efficient fusion mechanism. This work highlights the importance of balanced multimodal refinement and sets a new benchmark for future research in open‑vocabulary visual perception.

