Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Machine learning models trained on real-world data often inherit and amplify biases against certain social groups, raising urgent concerns about their deployment at scale. While numerous bias mitigation methods have been proposed, comparing their effectiveness remains difficult due to heterogeneous datasets, inconsistent fairness metrics, isolated evaluation of vision versus multi-modal models, and insufficient hyperparameter tuning that undermines fair comparisons. We introduce NH-Fair, a unified benchmark for fairness without harm that spans both vision models and large vision-language models (LVLMs) under standardized data, metrics, and training protocols, covering supervised and zero-shot regimes. Our key contributions are: (1) a systematic ERM tuning study that identifies training choices with large influence on both utility and disparities, yielding empirically grounded guidelines that help practitioners narrow the expensive hyperparameter search space while achieving strong fairness and accuracy; (2) evidence that many debiasing methods do not reliably outperform a well-tuned ERM baseline, whereas a composite data-augmentation method consistently delivers parity gains without sacrificing utility, emerging as a promising practical strategy; and (3) an analysis showing that while LVLMs achieve higher average accuracy, they still exhibit subgroup disparities, and gains from scaling are typically smaller than those from architectural or training-protocol choices. NH-Fair provides a reproducible, tuning-aware pipeline for rigorous, harm-aware fairness evaluation.


💡 Research Summary

The paper introduces NH‑Fair, a unified benchmark designed to evaluate “fairness without harm” across both conventional vision models and large vision‑language models (LVLMs). The authors identify three major obstacles in current bias‑mitigation research: heterogeneous datasets, inconsistent fairness metrics, and isolated evaluation of vision versus multimodal models, all compounded by insufficient hyper‑parameter tuning that obscures true performance differences. NH‑Fair addresses these gaps by providing a standardized suite of seven publicly available image‑based datasets (CelebA, UTKFace, FairFace, Facet, HAM10000, Fitz17k, Waterbirds) and a common set of twelve bias‑mitigation methods, split into data‑centric (e.g., RandAugment, Mixup, Resampling, Bias Mimicking, FIS) and algorithmic (e.g., Decoupled Classifier, LAFTR, FSCL, GapReg, MCDP, GroupDRO, DFR, OxonFair) categories.

A central conceptual contribution is the formalization of “fairness without harm,” which augments traditional group‑fairness constraints (overall accuracy parity, demographic parity, equalized odds, max‑min fairness) with a no‑harm condition: for every protected group, the risk of the fairness‑enhanced model must not exceed that of an unconstrained empirical risk minimization (ERM) baseline. This ensures that fairness interventions do not degrade any group’s performance, aligning with ethical principles of beneficence and non‑maleficence.
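The no-harm condition described above reduces to a per-group risk comparison against the ERM baseline. The sketch below is an illustrative rendering of that check, not the paper's actual implementation; the function name, dict-based data layout, and `tol` slack parameter are all assumptions.

```python
def satisfies_no_harm(erm_risk, fair_risk, tol=0.0):
    """Check the no-harm condition: for every protected group g,
    the fairness-enhanced model's risk must not exceed the
    unconstrained ERM baseline's risk (up to an optional slack tol).

    erm_risk, fair_risk: dicts mapping group id -> empirical risk,
    e.g. per-group error rate measured on a held-out set.
    """
    return all(fair_risk[g] <= erm_risk[g] + tol for g in erm_risk)

# Example: debiasing helps group "b" without hurting group "a".
erm = {"a": 0.10, "b": 0.25}
fair = {"a": 0.10, "b": 0.18}
print(satisfies_no_harm(erm, fair))  # prints True: no group's risk increased

# Counterexample: parity achieved by degrading group "a" violates no-harm.
leveled_down = {"a": 0.18, "b": 0.18}
print(satisfies_no_harm(erm, leveled_down))  # prints False
```

The second call illustrates why the condition matters: a model can satisfy a parity constraint by "leveling down" the better-off group, which this criterion explicitly rules out.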

The first set of experiments conducts an exhaustive ERM hyper‑parameter sweep, revealing that optimizer choice (SGD, Adam, AdamW, Adagrad) and learning rate have the largest impact on both utility and disparity, while model depth, batch size, and weight decay are comparatively minor. This finding demonstrates that much of the perceived advantage of sophisticated bias‑mitigation methods may actually stem from under‑tuned ERM baselines.
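Guided by that finding, a practitioner's tuning loop can concentrate its budget on the influential axes (optimizer and learning rate) while pinning the minor ones to defaults. The sketch below is a minimal illustration of that pruned sweep; the specific grids, default values, and the `train_and_eval` stub are hypothetical, not NH-Fair's actual protocol.

```python
from itertools import product

# Influential axes get a full grid; minor axes are pinned to defaults.
optimizers = ["sgd", "adam", "adamw", "adagrad"]
learning_rates = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]
fixed = {"batch_size": 128, "weight_decay": 1e-4, "arch": "resnet18"}

def train_and_eval(config):
    """Hypothetical stand-in for a real training run; a real
    implementation would train a model and return its held-out
    (accuracy, worst_group_gap)."""
    return 0.0, 1.0

best = None
for opt, lr in product(optimizers, learning_rates):
    config = {"optimizer": opt, "lr": lr, **fixed}
    acc, gap = train_and_eval(config)
    # Prefer higher accuracy; break ties by smaller subgroup gap.
    if best is None or (acc, -gap) > (best[0], -best[1]):
        best = (acc, gap, config)

n_configs = len(optimizers) * len(learning_rates)
print(n_configs)  # 20 runs, versus hundreds for a full grid over all axes
```

Restricting the sweep this way keeps the baseline well tuned at a fraction of the cost, which is exactly what makes ERM a demanding comparison point.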

Subsequently, each of the twelve mitigation techniques is evaluated under the same training protocol and hyper‑parameter budget. The results show that most algorithmic methods fail to consistently outperform a well‑tuned ERM. In contrast, a composite data‑augmentation strategy—combining diverse augmentations and balanced resampling—consistently improves group parity while preserving or even enhancing overall accuracy. The authors thus recommend prioritizing data‑centric approaches before exploring more complex algorithmic interventions, especially when computational resources are limited.
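The resampling half of that composite strategy can be sketched as inverse-frequency weights over (label, group) cells, which a weighted sampler can then use so minority cells are drawn as often as majority ones. The pure-Python example below is an illustrative sketch of this idea, not the paper's exact scheme.

```python
from collections import Counter

def balanced_sample_weights(labels, groups):
    """Assign each example a weight inversely proportional to the
    size of its (label, group) cell, so every cell contributes
    equal expected mass under weighted sampling."""
    cells = list(zip(labels, groups))
    counts = Counter(cells)
    n_cells = len(counts)
    total = len(cells)
    return [total / (n_cells * counts[c]) for c in cells]

labels = [0, 0, 0, 1]          # label-imbalanced toy data
groups = ["a", "a", "b", "b"]  # group-imbalanced within label 0
weights = balanced_sample_weights(labels, groups)
# Cells: (0,"a") x2, (0,"b") x1, (1,"b") x1 -> rarer cells get larger weights.
```

In practice these weights would feed something like PyTorch's `WeightedRandomSampler`, and the augmentation half of the strategy (e.g., RandAugment or Mixup) is applied on top of the rebalanced stream.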

The benchmark also extends to LVLMs, assessing models such as LLaVA and MiniGPT‑4 in both supervised fine‑tuning and zero‑shot settings. Although LVLMs achieve higher average accuracy due to massive pre‑training corpora, they still exhibit notable subgroup disparities. Moreover, scaling model size yields only modest fairness gains; architectural choices (e.g., image‑text encoder design) and training protocols (prompt engineering, fine‑tuning strategies) have a stronger influence on disparity reduction. This challenges the assumption that larger foundation models are inherently fairer.
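Measuring the subgroup disparities discussed above comes down to per-group accuracy and the gap between the best- and worst-performing protected groups (the max-min view of fairness). A minimal sketch, assuming a flat list-of-predictions data layout rather than NH-Fair's actual evaluation format:

```python
from collections import defaultdict

def subgroup_accuracy_gap(preds, targets, groups):
    """Return per-group accuracy and the gap between the best-
    and worst-performing protected groups."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, t, g in zip(preds, targets, groups):
        total[g] += 1
        correct[g] += int(p == t)
    acc = {g: correct[g] / total[g] for g in total}
    return acc, max(acc.values()) - min(acc.values())

acc, gap = subgroup_accuracy_gap(
    preds=[1, 1, 0, 1], targets=[1, 0, 0, 1], groups=["a", "a", "b", "b"]
)
# acc == {"a": 0.5, "b": 1.0}; gap == 0.5
```

Applied identically to a fine-tuned LVLM and a zero-shot one, this metric makes the paper's point observable: average accuracy can rise with scale while the gap barely moves.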

NH‑Fair’s codebase, hyper‑parameter logs, and evaluation scripts are publicly released, enabling reproducibility and facilitating future research to benchmark new methods under identical conditions. By unifying vision and multimodal evaluation, formalizing a no‑harm fairness criterion, and empirically demonstrating the outsized role of ERM tuning and data augmentation, the paper provides both a practical toolkit and actionable insights for researchers and practitioners aiming to deploy fair, high‑performing visual AI systems.

