Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis
Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms. Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
💡 Research Summary
This paper tackles the pressing problem of demographic bias in AI‑driven glaucoma diagnosis from fundus photographs. While large vision‑language models (VLMs) such as Qwen2.5‑VL have demonstrated expert‑level performance on medical imaging tasks, they often exhibit substantial accuracy gaps across sensitive groups (race, ethnicity, gender), which can exacerbate health inequities. The authors propose a fairness‑aware fine‑tuning framework that couples parameter‑efficient Low‑Rank Adaptation (LoRA) with explicit fairness objectives, requiring only 0.24 % of the model’s parameters to be updated.
The core technical contribution is the differentiable MaxAccGap loss. MaxAccGap measures the maximum difference between group‑wise accuracies, directly reflecting the clinical notion of “equal diagnostic accuracy for all patients.” Because the hard accuracy indicator (arg max) is non‑differentiable, the authors approximate it with soft accuracy—i.e., the model’s predicted probability for the true class. The soft MaxAccGap is then incorporated as a regularization term, enabling end‑to‑end gradient‑based optimization.
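The soft MaxAccGap described above can be illustrated with a minimal sketch (not the authors' code): per-group soft accuracy is the mean predicted probability of the true class, and the gap is the difference between the best and worst group.

```python
# Illustrative sketch of the soft MaxAccGap objective on toy data.
# Soft accuracy per group = mean predicted probability of the true class;
# MaxAccGap_soft = max group soft-accuracy minus min group soft-accuracy.

def soft_maxaccgap(probs_true_class, groups):
    """probs_true_class: model probability of the true class, per sample.
    groups: parallel list of demographic-group labels."""
    by_group = {}
    for p, g in zip(probs_true_class, groups):
        by_group.setdefault(g, []).append(p)
    soft_acc = {g: sum(ps) / len(ps) for g, ps in by_group.items()}
    return max(soft_acc.values()) - min(soft_acc.values())

# Example: two groups with different average confidence on the true class.
probs = [0.9, 0.8, 0.4, 0.5]
groups = ["A", "A", "B", "B"]
gap = soft_maxaccgap(probs, groups)  # 0.85 - 0.45 = 0.40
```

In FR-LoRA this quantity would be scaled by λ and added to the cross-entropy loss; because every term is a differentiable function of the predicted probabilities, gradients flow through it end to end.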
Three training strategies are explored:

- FR‑LoRA (Fairness‑Regularized LoRA) adds λ·MaxAccGap_soft to the standard cross‑entropy loss. The gradient from the MaxAccGap term pushes samples from the worst‑performing group to increase their predicted probabilities while applying opposite pressure to the best‑performing group; λ controls the fairness‑accuracy trade‑off.
- GR‑LoRA (Group‑Reweighted LoRA) balances gradient contributions by weighting each group's cross‑entropy loss with the inverse of its frequency (clipped at a maximum of 10). This indirect approach does not explicitly minimize MaxAccGap but empirically reduces it by forcing the model to learn features useful for minority groups.
- Hybrid‑LoRA combines both mechanisms, applying group‑wise weighting and the MaxAccGap regularizer simultaneously, thereby addressing data imbalance and performance imbalance together.
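The inverse-frequency weighting used by GR-LoRA can be sketched as follows. This is a hypothetical reconstruction, not the authors' code; in particular, normalizing so that an equal-frequency group receives weight 1.0 is an assumption.

```python
# Hypothetical sketch of per-group loss weights for GR-LoRA:
# inverse group frequency, clipped at a maximum of 10 as described above.
from collections import Counter

def group_weights(groups, max_weight=10.0):
    counts = Counter(groups)
    n = len(groups)       # total samples
    k = len(counts)       # number of groups
    # weight ∝ 1 / frequency, normalized so an equal-frequency group gets 1.0
    return {g: min(n / (k * c), max_weight) for g, c in counts.items()}

# Example: a 90/10 split over two groups.
groups = ["majority"] * 90 + ["minority"] * 10
w = group_weights(groups)
# majority weight ≈ 0.56, minority weight = 5.0
```

Each sample's cross-entropy loss would then be multiplied by its group's weight, so minority-group gradients are amplified while extreme imbalance (e.g. the 90.3 % vs 4.3 % ethnicity split) cannot blow up a weight beyond the clip.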
Experiments are conducted on the Harvard Glaucoma Fairness Dataset (10 k fundus images with demographic annotations). The dataset is highly imbalanced, especially for ethnicity (Non‑Hispanic 90.3 % vs Hispanic 4.3 %). The VLM is fine‑tuned with LoRA (rank r = 32, scaling α = 64) applied to all attention projection matrices, yielding roughly 20 M trainable parameters. The vision encoder is frozen; only the language side is adapted. Training uses AdamW (lr = 1e‑5), an effective batch size of 8 (achieved via gradient accumulation), and three epochs.
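Assuming the Hugging Face `peft` library, the reported LoRA setup might be expressed roughly as below; the target module names are an assumption about Qwen2.5‑VL's attention projections, not the authors' released configuration.

```python
# Hypothetical config fragment mirroring the reported LoRA hyperparameters.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,              # LoRA rank, as reported
    lora_alpha=64,     # scaling α = 64
    # Attention projection matrices (assumed module names for Qwen2.5-VL):
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,  # assumption; dropout is not reported
    bias="none",
    task_type="CAUSAL_LM",
)
# Training: AdamW (lr = 1e-5), effective batch size 8 via gradient
# accumulation, 3 epochs; vision encoder frozen, language side adapted.
```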
Key results: GR‑LoRA achieves a 69 % reduction in MaxAccGap (from 3.80 % to 1.17 %) while preserving an overall accuracy of 53.15 %. FR‑LoRA and Hybrid‑LoRA also improve fairness, achieving similar or slightly higher overall accuracies (≈ 54‑55 %) with 55‑60 % gap reductions. Ablation studies show that stronger regularization (higher λ) yields larger gap reductions at the cost of modest accuracy loss, confirming the expected trade‑off. Moreover, optimizing MaxAccGap also improves other fairness metrics such as Equalized Odds, suggesting that accuracy parity is a strong proxy for broader fairness in medical diagnosis.
The paper's contributions are threefold:

- Clinical interpretability – MaxAccGap directly quantifies diagnostic parity, a metric that clinicians and patients can readily understand, unlike abstract statistical parity measures.
- Parameter efficiency – By leveraging LoRA, the approach avoids full model fine‑tuning, reducing over‑fitting risk on small medical datasets and enabling deployment in resource‑constrained settings.
- Generalizability – The framework is model‑agnostic and can be applied to other multimodal medical tasks or larger VLMs, offering a practical pathway toward equitable AI‑assisted healthcare.
Future work may explore multi‑task extensions, incorporation of additional sensitive attributes, and real‑world clinical integration to validate the impact on patient outcomes. Overall, the study provides a compelling, technically sound solution for mitigating demographic bias in high‑capacity vision‑language models used for glaucoma diagnosis.