ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task, where only images without annotations are available. However, we observe that when CLIP is adopted for such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works do not explicitly model this bias, which largely constrains segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias. To avoid interference, the two kinds of bias are first encoded independently into different features, i.e., the Reference feature and the positional feature. A matrix multiplication between the Reference feature and the positional feature then generates a bias logit map that explicitly represents both kinds of bias. We rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder that takes the feature of CLIP and the rectified logits as input and outputs a rectified segmentation mask with the help of a Gumbel-Softmax operation. A contrastive loss based on the masked visual features and the text features of different classes is imposed, which makes the bias modeling and rectification process meaningful and effective. Extensive experiments on various benchmarks, including PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff, demonstrate that our method performs favorably against previous state-of-the-art methods. The implementation is available at: https://github.com/dogehhh/ReCLIP.
💡 Research Summary
The paper addresses a critical issue that arises when applying the CLIP model to unsupervised semantic segmentation (USS): CLIP exhibits two systematic biases that degrade pixel‑level performance. The first, class‑preference bias, causes CLIP to confuse semantically related classes (e.g., “sheep” often being classified as “cow”). The second, space‑preference bias, makes CLIP far more accurate on objects near the image centre than on objects close to the borders. Existing CLIP‑based USS methods such as MaskCLIP and CLIP‑S4 do not explicitly model these biases, limiting their segmentation quality.
ReCLIP++ proposes a principled framework to detect and rectify both biases. For each target class the authors introduce two textual prompts: a fixed "Query" prompt that serves as the conventional CLIP classifier, and a learnable "Reference" prompt whose parameters are optimized during training. The Reference prompt is passed through CLIP's frozen text encoder, producing a class-level feature matrix W_r that captures class-preference bias. Simultaneously, the positional embeddings of CLIP's Vision Transformer are linearly projected to obtain a patch-level positional feature matrix W_p that encodes space-preference bias. Multiplying W_r (size C × D) with the transpose of W_p (size N × D) yields a bias logit map M_b of size C × N, where C is the number of classes and N the number of image patches.
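The bias-encoding step above can be sketched in PyTorch. This is an illustrative reconstruction, not the authors' code: the module name `BiasEncoder` and all layer sizes are assumptions, and the Reference features are modeled here as a directly learnable matrix rather than prompts passed through CLIP's frozen text encoder.

```python
# Hypothetical sketch of the bias logit map M_b = W_r @ W_p^T.
# Shapes follow the summary: C classes, N patches, embedding dim D.
import torch
import torch.nn as nn

C, N, D = 21, 196, 512  # e.g. VOC classes, 14x14 ViT patches, CLIP embed dim

class BiasEncoder(nn.Module):
    def __init__(self, clip_pos_embed):
        super().__init__()
        # Learnable class-level Reference features; in the paper these are
        # produced by learnable prompts fed through the frozen text encoder.
        self.ref_feat = nn.Parameter(torch.randn(C, D) * 0.02)
        # Frozen ViT positional embeddings, one per patch.
        self.register_buffer("pos_embed", clip_pos_embed)       # (N, D_vit)
        # Learnable projection that turns them into positional features.
        self.pos_proj = nn.Linear(clip_pos_embed.shape[-1], D)

    def forward(self):
        W_r = self.ref_feat                  # (C, D): class-preference bias
        W_p = self.pos_proj(self.pos_embed)  # (N, D): space-preference bias
        return W_r @ W_p.t()                 # bias logit map M_b: (C, N)

enc = BiasEncoder(torch.randn(N, 768))  # 768 = ViT-B positional dim (assumed)
M_b = enc()
print(M_b.shape)  # torch.Size([21, 196])
```

Keeping the two bias sources in separate factors (W_r per class, W_p per patch) mirrors the paper's point that the biases are encoded independently before being combined by a single matrix product.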
The original CLIP segmentation logits are obtained in the usual way: the fixed Query prompt yields a weight matrix W_q that is multiplied with the visual patch features Z to produce the query logit map M_q. Rectification is then a simple element-wise subtraction, M = M_q - M_b, which directly removes the learned class- and space-preference bias components from the raw CLIP predictions.
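The rectification step reduces to a few tensor operations. A minimal sketch, with random tensors standing in for the actual CLIP features and the learned bias map:

```python
# Query logits minus bias logits, per the summary above.
# W_q: frozen Query-prompt text features; Z: CLIP visual patch features;
# M_b: bias logit map from the learnable branch. All values are placeholders.
import torch

C, N, D = 21, 196, 512
W_q = torch.randn(C, D)   # fixed Query text features
Z   = torch.randn(N, D)   # visual patch features
M_b = torch.randn(C, N)   # learned bias logit map

M_q = W_q @ Z.t()         # query logit map, (C, N)
M   = M_q - M_b           # rectified logits: element-wise subtraction
print(M.shape)            # torch.Size([21, 196])
```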
To turn the rectified logits into high-quality masks, ReCLIP++ introduces a mask decoder. The decoder receives both the rectified logit map M and the visual feature map Z (concatenated along the channel dimension) and processes them through a series of convolutional and up-sampling layers. The final output is passed through a Gumbel-Softmax, which yields a differentiable approximation of a discrete segmentation mask. This design produces smoother boundaries and richer contextual consistency than the raw CLIP output.
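A decoder of this shape might look as follows. The layer count, hidden width, and upsampling factor are guesses; only the overall structure (concatenate M and Z, convolve, upsample, Gumbel-Softmax over classes) follows the summary.

```python
# Illustrative mask decoder with a differentiable discrete output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskDecoder(nn.Module):
    def __init__(self, num_classes, feat_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_classes + feat_dim, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(hidden, num_classes, 3, padding=1),
        )

    def forward(self, M, Z, tau=1.0):
        x = torch.cat([M, Z], dim=1)   # (B, C + D, H, W)
        logits = self.net(x)           # (B, C, 2H, 2W)
        # Hard one-hot mask over the class dimension, with gradients
        # flowing through the soft Gumbel-Softmax relaxation.
        return F.gumbel_softmax(logits, tau=tau, hard=True, dim=1)

dec = MaskDecoder(num_classes=21, feat_dim=512)
M = torch.randn(2, 21, 14, 14)    # rectified logits reshaped to a patch grid
Z = torch.randn(2, 512, 14, 14)   # visual feature map
mask = dec(M, Z)
print(mask.shape)                 # torch.Size([2, 21, 28, 28])
```

`hard=True` uses the straight-through estimator, so the decoder emits a discrete per-pixel class assignment while remaining trainable end-to-end.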
Training is guided by a contrastive loss that operates on masked visual features. For each image a multi‑label hypothesis is generated; the corrected mask is applied to the visual features, producing class‑specific masked vectors. These vectors are then contrasted against the corresponding class text embeddings (from the Query prompt) using a temperature‑scaled dot‑product loss. The loss encourages masked visual features to be close to their true class embeddings while pushing them away from other class embeddings, thereby reinforcing the bias‑removal process. Importantly, CLIP’s weights remain frozen; only the Reference prompts, positional projection parameters, and decoder weights are updated.
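The masked contrastive objective can be sketched as below. The pooling scheme, temperature value, and function name are assumptions; the essential idea (pool features under each class mask, then contrast against the Query-prompt text embeddings with cross-entropy over classes) follows the description above.

```python
# Hedged sketch of the masked-feature contrastive loss.
import torch
import torch.nn.functional as F

def masked_contrastive_loss(mask, Z, text_feats, present, tau=0.07):
    # mask: (C, N) class masks; Z: (N, D) patch features;
    # text_feats: (C, D) Query-prompt text features;
    # present: indices of classes hypothesized to appear in the image.
    v = F.normalize(mask @ Z, dim=-1)        # (C, D) masked visual features
    t = F.normalize(text_feats, dim=-1)      # (C, D) class text features
    logits = v @ t.t() / tau                 # (C, C) scaled similarities
    target = torch.tensor(present)
    # Pull each present class's masked feature toward its own text
    # embedding, push it away from the other classes' embeddings.
    return F.cross_entropy(logits[target], target)

C, N, D = 21, 196, 512
mask = torch.softmax(torch.randn(C, N), dim=0)
loss = masked_contrastive_loss(mask, torch.randn(N, D),
                               torch.randn(C, D), present=[3, 8])
```

Only the Reference prompts, positional projection, and decoder receive gradients in the actual method; the text features here would come from CLIP's frozen encoder.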
The authors evaluate ReCLIP++ on five standard segmentation benchmarks: PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff. Across all datasets, ReCLIP++ outperforms previous CLIP-based USS methods by a large margin (e.g., +15.4% mIoU over MaskCLIP+ on VOC). Detailed analyses show that the space-preference bias is effectively neutralized: mIoU no longer drops as objects move away from the image centre. Likewise, the class-preference bias is mitigated, as evidenced by confusion matrices with dramatically reduced off-diagonal entries. Ablation studies confirm that each component (Reference prompts, positional projection, logit subtraction, mask decoder, and contrastive loss) contributes substantially; removing any of them leads to notable performance degradation.
In summary, ReCLIP++ offers a clean, computationally inexpensive solution to a previously overlooked problem in CLIP‑based unsupervised segmentation. By explicitly modeling and subtracting bias, and by refining predictions with a dedicated decoder and contrastive supervision, the method achieves state‑of‑the‑art results without requiring any pixel‑level annotations or external distillation stages. The work opens avenues for extending bias‑rectification techniques to other vision‑language models and for incorporating multi‑scale spatial cues to further improve segmentation of peripheral objects.