Subspace Projection for Debiasing Vision-Language Models


📝 Abstract

Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, resulting in biased associations and misaligned predictions in downstream tasks. Such behavior undermines fairness and distorts the intended alignment between vision and language. Recent post-hoc approaches attempt to mitigate bias by replacing the most attribute-correlated embedding coordinates with neutral values. However, our systematic analysis reveals three critical failures of this coordinate-wise approach: feature entanglement, poor cross-dataset generalization, and incomplete bias removal. We find that bias is not localized to a few coordinates but is instead distributed across a few linear subspaces. To address these limitations, we propose $\textbf{S}$ubspace $\textbf{P}$rojection $\textbf{D}$ebiasing ($\textbf{SPD}$), a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation validate the effectiveness of SPD: our method achieves more robust debiasing with an average improvement of $18.5\%$ across four fairness metrics, while maintaining minimal loss in task performance compared to the best debiasing baseline.


📄 Content

Vision-Language Models (VLMs) have rapidly become central to modern multimodal AI, powering image-text retrieval, visual question answering, image captioning, and text-to-image generation [22,27,30,35,50]. Their broad generalization and emergent cross-modal alignment make them indispensable foundation models [2,8,27,28,31,47]. However, an expanding body of work shows that VLMs also inherit and amplify demographic and social biases present in large-scale web data [15,16,23,51]. These biases appear as gendered profession associations [11,18,43,46], racially skewed retrieval rankings [15,26,32], or stereotypical captions [6,36,51,52], thereby undermining fairness, reliability, and the trustworthiness of model predictions [53]. Beyond ethical concerns, biased internal representations induce spurious correlations that degrade robustness and cross-domain generalization, as models exploit socially biased features rather than semantically relevant ones [12,38,44,45]. Effective debiasing is therefore essential not only for social responsibility but also for preserving the generalization integrity of multimodal systems.

Existing efforts to debias VLMs follow two lines. Training-based methods fine-tune models to suppress sensitive attributes, which can reduce bias but are computationally heavy, sensitive to hyperparameters, and often tailored to binary attributes with limited transferability. Post-hoc methods operate on frozen embeddings and avoid fine-tuning costs, yet they lack a unified understanding of how bias is represented within VLM embeddings and often rely on inconsistent or fragile assumptions about its underlying geometric structure in high-dimensional spaces. Despite these efforts, how to properly conceptualize and model the geometric structure of bias in multimodal representations remains an open question.

Recent work by Jung et al. [24] introduces Selective Feature Imputation for Debiasing (SFID), a post-hoc method that identifies the embedding dimensions most predictive of sensitive attributes and replaces them with neutral values derived from low-confidence samples. This coordinate-wise approach is model-agnostic, training-free, and uniformly applicable across VLM components, achieving fairness improvements on multiple benchmarks. Yet SFID's design implicitly assumes that bias is localized within a small subset of coordinates, that the same dimensions encode a given attribute across different datasets, and that replacing these coordinates does not discard semantically relevant information.
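
The coordinate-wise scheme can be sketched as follows. This is an illustrative reconstruction, not SFID's actual implementation: the function name `sfid_style_impute` is hypothetical, a logistic-regression probe stands in for whatever importance estimator the original method uses, and a binary attribute is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sfid_style_impute(embeddings, attr_labels, m=10, neutral_frac=0.1):
    """Rank dimensions by attribute predictiveness, then overwrite the
    top-m coordinates with a neutral value from low-confidence samples."""
    # Linear probe; per-dimension weight magnitudes proxy for importance.
    clf = LogisticRegression(max_iter=1000).fit(embeddings, attr_labels)
    importance = np.abs(clf.coef_).sum(axis=0)
    top_m = np.argsort(importance)[-m:]
    # "Neutral" samples: those the probe is least confident about
    # (binary attribute assumed, so decision_function is 1-D).
    confidence = np.abs(clf.decision_function(embeddings))
    k = max(1, int(len(embeddings) * neutral_frac))
    neutral = embeddings[np.argsort(confidence)[:k]]
    debiased = embeddings.copy()
    debiased[:, top_m] = neutral[:, top_m].mean(axis=0)
    return debiased, top_m
```

Note that only the selected coordinates change; everything else in the embedding is left untouched, which is exactly the locality assumption the next section tests.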

However, our systematic reproduction study (Sec. 3) reveals that these assumptions do not hold in practice. We find that the most representative embedding dimensions for different attributes exhibit substantial overlap. This leads to feature entanglement: removing one attribute's dimensions unintentionally distorts representations of others. We show that the indices of important dimensions for specific attributes shift across datasets, undermining SFID's cross-dataset transferability. Finally, we demonstrate that replacing the top-m most important coordinates leaves measurable residual bias, as attribute information is spread sparsely across far more dimensions than the imputation process targets [3].
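
The residual-bias claim can be checked with a simple probing experiment. The sketch below is an illustration of the general idea (the helper name and probe choice are assumptions): ablate the supposedly most predictive dimensions, refit a probe on the rest, and see whether the attribute is still linearly decodable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def residual_bias_check(X, y, top_m_dims):
    """Zero out the top-m attribute dimensions, then refit a probe.
    Held-out accuracy well above chance means attribute information
    survives in the remaining coordinates."""
    X_ablated = X.copy()
    X_ablated[:, top_m_dims] = 0.0
    X_tr, X_te, y_tr, y_te = train_test_split(X_ablated, y, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```

When attribute signal is distributed over many dimensions, ablating a few of them barely dents probe accuracy, which is the failure mode motivating a subspace-level intervention.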

Building on these findings, we propose a subspace-projection debiasing framework, Subspace Projection Debiasing (SPD), that directly addresses SFID's limitations by moving beyond its coordinate-wise editing toward a continuous and geometrically principled operation. We explicitly learn bias directions using Iterative Null-space Projection (INLP) [37] and project embeddings onto their orthogonal complement, thereby removing attribute-specific components. This approach is more robust and achieves more thorough debiasing than simple coordinate-level interventions. To preserve semantic fidelity, we reinsert a neutral mean from low-confidence samples, which recenters the embeddings without reintroducing attribute-specific variance. This stabilization mitigates overcorrection and improves generalization across datasets and downstream tasks.
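
An INLP-style version of this pipeline might look like the sketch below. This is a minimal reconstruction under stated assumptions, not the authors' exact configuration: the function name is hypothetical, logistic regression serves as the probe, the number of iterations is fixed rather than chosen by a stopping rule, and the precise form of the mean-reinsertion step is a guess consistent with the description above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def spd_style_debias(embeddings, attr_labels, n_iters=5, neutral_frac=0.1):
    """Iteratively remove linearly decodable bias directions, then add
    back a neutral mean component to recenter the embeddings."""
    X = embeddings.astype(float).copy()
    d = X.shape[1]
    P = np.eye(d)  # composed projection onto the kept subspace
    for _ in range(n_iters):
        clf = LogisticRegression(max_iter=1000).fit(X, attr_labels)
        w = clf.coef_[0]
        norm = np.linalg.norm(w)
        if norm < 1e-8:
            break  # attribute no longer linearly decodable
        w = w / norm
        # Rank-1 projector removing the current bias direction.
        P_w = np.eye(d) - np.outer(w, w)
        X = X @ P_w
        P = P @ P_w
    # Neutral mean from low-confidence samples in the original space
    # (binary attribute assumed, so decision_function is 1-D).
    clf0 = LogisticRegression(max_iter=1000).fit(embeddings, attr_labels)
    conf = np.abs(clf0.decision_function(embeddings))
    k = max(1, int(len(embeddings) * neutral_frac))
    mu = embeddings[np.argsort(conf)[:k]].mean(axis=0)
    # Reinsert only the component of the neutral mean that the
    # projection removed, so no attribute variance is reintroduced.
    return X + (mu - mu @ P)
```

Because the reinserted term is a single constant vector, it shifts every embedding identically: it restores the lost mean without restoring any attribute-correlated variance.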

We evaluate our framework across three representative downstream tasks: multi-class zero-shot classification, text-to-image retrieval, and text-to-image generation, using multiple VLM backbones. Empirical results show consistently lower demographic-parity gaps and misclassification disparities than SFID and other baselines, while maintaining comparable or higher accuracy and perceptual quality.
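
For concreteness, a demographic-parity gap of the kind reported here can be computed as the largest difference in positive-prediction rate between demographic groups. This is one common formulation, shown for illustration; the paper's exact metric definitions may differ.

```python
import numpy as np

def demographic_parity_gap(preds, groups):
    """Max absolute difference in positive-prediction rate across
    demographic groups; 0.0 means parity, 1.0 means maximal disparity."""
    preds = np.asarray(preds, dtype=float)
    groups = np.asarray(groups)
    rates = [preds[groups == g].mean() for g in np.unique(groups)]
    return float(max(rates) - min(rates))
```

A debiasing method lowers this gap when its edited embeddings yield prediction rates that no longer track group membership.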

Our study provides a unified, VLM-training-free, and interpretable approach to post-hoc debiasing of VLMs that supports multi-attribute bias mitigation and advances both fairness and generalization across modalities and tasks.

Training-based Debiasing Methods Prior work on VLM debiasing includes some methods that rely on additional training or fine-tuning to suppress sensitive attribute information from learned representations [1,4
