Stride-Net: Fairness-Aware Disentangled Representation Learning for Chest X-Ray Diagnosis
Deep neural networks for chest X-ray classification achieve strong average performance, yet often underperform for specific demographic subgroups, raising critical concerns about clinical safety and equity. Existing debiasing methods frequently yield inconsistent improvements across datasets or attain fairness by degrading overall diagnostic utility, treating fairness as a post hoc constraint rather than a property of the learned representation. In this work, we propose Stride-Net (Sensitive Attribute Resilient Learning via Disentanglement and Learnable Masking with Embedding Alignment), a fairness-aware framework that learns disease-discriminative yet demographically invariant representations for chest X-ray analysis. Stride-Net operates at the patch level, using a learnable stride-based mask to select label-aligned image regions while suppressing sensitive attribute information through adversarial confusion loss. To anchor representations in clinical semantics and discourage shortcut learning, we further enforce semantic alignment between image features and BioBERT-based disease label embeddings via Group Optimal Transport. We evaluate Stride-Net on the MIMIC-CXR and CheXpert benchmarks across race and intersectional race-gender subgroups. Across architectures including ResNet and Vision Transformers, Stride-Net consistently improves fairness metrics while matching or exceeding baseline accuracy, achieving a more favorable accuracy-fairness trade-off than prior debiasing approaches. Our code is available at https://github.com/Daraksh/Fairness_StrideNet.
💡 Research Summary
The paper introduces Stride‑Net, a fairness‑aware representation learning framework for chest X‑ray (CXR) diagnosis that simultaneously improves diagnostic accuracy and reduces performance disparities across demographic groups such as race and gender. The authors argue that existing debiasing techniques—typically applied as pre‑processing, in‑processing, or post‑processing steps—often yield inconsistent gains or sacrifice overall utility, especially in medical imaging where sensitive attributes are subtly embedded in the data.
Core Architecture
- Patch‑level visual encoding – A Vision Transformer (ViT) backbone splits each CXR into non‑overlapping patches and produces a set of d‑dimensional embeddings {eᵢᵖ}. This granularity allows the model to focus on localized anatomical regions (e.g., lung fields) rather than global image statistics that may carry demographic cues.
- Semantic label embedding – Disease labels are encoded with a pre‑trained BioBERT model, yielding embeddings {eⱼˡ} in the same latent space as the visual patches. Aligning visual and textual semantics provides weak supervision that steers the network toward clinically meaningful features.
- Learnable stride mask – A parameter matrix Mθ computes relevance scores between each patch embedding and each label embedding. The mask selects a sparse subset of patches that are most aligned with the target disease, forming the latent representation Z. This “stride‑select” operation enforces sparsity, improves interpretability, and reduces reliance on demographic shortcuts.
- Group‑Optimal Transport (GOT) loss – GOT consists of two terms: (i) a transport cost c(eᵖ, eˡ) that pulls selected patches toward their corresponding label embeddings, and (ii) a group regularizer that penalizes distributional differences of Z across sensitive attribute groups. A trade‑off parameter λ balances accuracy‑driven alignment against fairness‑driven uniformity.
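The patch-selection step above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact parameterization: the use of cosine similarity as the relevance score, the top-k selection, and all tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def stride_select(patch_emb, label_emb, k):
    """Select the k patches most aligned with the disease label embedding.

    patch_emb: (N, d) patch embeddings {e_i^p} from the ViT backbone
    label_emb: (d,)   BioBERT embedding e_j^l of the target disease
    Returns the sparse latent representation Z of shape (k, d).
    """
    # Relevance score between every patch and the label (cosine similarity
    # here stands in for the learned scoring matrix M_theta).
    scores = F.cosine_similarity(patch_emb, label_emb.unsqueeze(0), dim=-1)  # (N,)
    # Keep only the top-k patches; all others are masked out.
    topk = torch.topk(scores, k=k).indices
    return patch_emb[topk]

# Toy usage: 196 patches (a 14x14 grid) of dimension 768, keep 16.
patches = torch.randn(196, 768)
label = torch.randn(768)
Z = stride_select(patches, label, k=16)
print(Z.shape)  # torch.Size([16, 768])
```

A hard top-k is used here for clarity; a learnable mask would typically use a differentiable relaxation (e.g. a temperature-controlled softmax) so gradients flow to the scoring parameters.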
Adversarial Disentanglement
Two auxiliary classifiers are attached to Z. The first predicts the sensitive attribute s with a standard cross‑entropy loss Lₛ, encouraging the network to be aware of demographic information. The second, an adversary trained via a Gradient Reversal Layer, maximizes confusion (loss L_conf) so that the feature extractor learns representations that are uninformative about s.
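The adversarial branch hinges on the Gradient Reversal Layer. A minimal sketch of a standard GRL is shown below; the scaling factor `gamma` and its value are assumptions, and the paper's adversary head is omitted.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, gamma):
        ctx.gamma = gamma
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient pushes the feature extractor AWAY from
        # representations that let the adversary predict the sensitive attribute.
        return -ctx.gamma * grad_output, None

def grad_reverse(x, gamma=1.0):
    return GradReverse.apply(x, gamma)

# Check: forward is identity, backward flips the sign of the gradient.
z = torch.ones(3, requires_grad=True)
out = grad_reverse(z, gamma=0.5).sum()
out.backward()
print(z.grad)  # tensor([-0.5000, -0.5000, -0.5000])
```

In practice the adversary's classifier sits on top of `grad_reverse(Z)`, so minimizing its cross-entropy simultaneously trains the adversary and confuses the encoder.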
Joint Optimization
The total objective is:
L_total = L_c + α·L_GOT + β·L_s − γ·L_conf,

where L_c is the disease cross-entropy, and α, β, γ weight semantic alignment, attribute supervision, and adversarial debiasing, respectively. This end‑to‑end loss forces the model to be discriminative for disease, semantically grounded, and invariant to sensitive attributes.
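Assembling the joint objective is straightforward; the sketch below uses toy scalar losses and assumed hyperparameter values purely to show how the weights enter.

```python
import torch

def total_loss(l_c, l_got, l_s, l_conf, alpha=0.5, beta=0.1, gamma=0.1):
    # Disease CE + semantic alignment + attribute supervision
    # minus the adversarial confusion term (its gradient is reversed upstream).
    return l_c + alpha * l_got + beta * l_s - gamma * l_conf

l = total_loss(torch.tensor(0.7), torch.tensor(0.2),
               torch.tensor(0.4), torch.tensor(1.1))
print(l.item())  # ≈ 0.73
```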
Experimental Setup
The authors evaluate on two large‑scale CXR datasets: MIMIC‑CXR and CheXpert. They focus on the binary “No Finding” task, which is clinically critical because false negatives delay care while false positives cause unnecessary follow‑ups. Sensitive attributes (race, gender) are annotated following prior work; samples lacking these annotations are excluded. Images are resized to 224×224, and standard augmentations are applied. Training uses Adam (lr = 1e‑4), batch size 64, for 20 epochs on an RTX A6000 GPU.
Baselines
- ERM ResNet‑18 (standard training)
- UBAIA (a recent fairness‑aware method targeting underdiagnosis)
- CheXclusion (removes spurious correlations via exclusion)
Metrics
- Predictive Quality Disparity (PQD) – ratio of worst to best subgroup accuracy.
- Equality of Opportunity Measure (EOM) – average across disease classes of the ratio of minimum to maximum true‑positive rates across subgroups.
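Under the definitions above, both metrics reduce to min/max ratios and can be computed directly. The subgroup accuracies and TPRs below are illustrative values, not numbers from the paper.

```python
def pqd(subgroup_acc):
    """Predictive Quality Disparity: worst-to-best ratio of subgroup accuracies."""
    return min(subgroup_acc) / max(subgroup_acc)

def eom(tpr_per_class):
    """Equality of Opportunity Measure: mean over disease classes of the
    min/max ratio of true-positive rates across subgroups."""
    return sum(min(t) / max(t) for t in tpr_per_class) / len(tpr_per_class)

accs = [0.82, 0.79, 0.76]            # accuracy per demographic subgroup
tprs = [[0.90, 0.81], [0.70, 0.84]]  # per-class TPRs across two subgroups
print(round(pqd(accs), 3))  # 0.927
print(round(eom(tprs), 3))  # 0.867
```

Both metrics equal 1.0 under perfect parity, so higher is fairer; this matches the direction of the improvements reported in the Results section.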
Results
Stride‑Net consistently outperforms all baselines. On MIMIC‑CXR, average accuracy rises from 0.789 (UBAIA) to 0.805, while PQD improves from 0.850 to 0.922 and EOM from 0.680 to 0.870. Similar gains appear on CheXpert (accuracy +0.8 pp, EOM +5.8 pp). Intersectional race‑gender groups also see higher fairness scores without sacrificing overall performance. The authors highlight that the stride mask yields interpretable attention maps focusing on lung fields, and the GOT alignment prevents the model from exploiting demographic shortcuts.
Contributions and Impact
- Introduces a novel stride‑mask mechanism for label‑aligned patch selection.
- Leverages BioBERT disease embeddings and Group‑Optimal Transport to ground visual features in clinical semantics.
- Combines supervised and adversarial attribute learning to achieve demographic invariance.
- Demonstrates robust improvements across two major CXR benchmarks and multiple sensitive subgroups, especially reducing underdiagnosis of “No Finding”.
The code is publicly released, facilitating reproducibility and future extensions to other imaging modalities (CT, MRI) or additional protected attributes (age, socioeconomic status). By embedding fairness directly into representation learning rather than as a post‑hoc constraint, Stride‑Net offers a promising pathway toward trustworthy, equitable AI systems for radiology.