Investigating Demographic Bias in Brain MRI Segmentation: A Comparative Study of Deep-Learning and Non-Deep-Learning Methods

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Deep-learning-based segmentation algorithms have substantially advanced medical image analysis, particularly the delineation of structures in MRI. However, intrinsic bias in the data is an important consideration: concerns about unfairness, such as performance disparities tied to sensitive attributes like race and sex, are increasingly urgent. In this work, we evaluate three deep-learning segmentation models (UNesT, nnU-Net, and CoTr) and a traditional atlas-based method (ANTs), applied to segment the left and right nucleus accumbens (NAc) in MR images. We use a dataset comprising four demographic subgroups (Black female, Black male, White female, and White male) with manually labeled gold-standard segmentations for training and testing. The study has two parts: the first assesses the segmentation performance of the models, while the second analyzes the volumes they produce to evaluate the effects of race, sex, and their interaction. Fairness is quantified with a metric designed for segmentation performance, and linear mixed models are used to analyze the impact of demographic variables on segmentation accuracy and derived volumes. Training on the same race as the test subjects yields significantly better segmentation accuracy for some models: ANTs and UNesT improve notably when trained and tested on race-matched data, whereas nnU-Net performs robustly independent of demographic matching. Finally, we examine sex and race effects on NAc volume using segmentations from the manual rater and from our biased models. The sex effects observed with manual segmentation persist across the biased models, whereas the race effects disappear in all but one model.


💡 Research Summary

This paper investigates demographic bias in the automated segmentation of the nucleus accumbens (NAc) from brain MRI, comparing three state‑of‑the‑art deep‑learning models (UNesT, nnU‑Net, CoTr) with a traditional atlas‑based method (ANTs). Using the Human Connectome Project Young Adult cohort, the authors curated a balanced dataset comprising four demographic subgroups—Black female, Black male, White female, and White male. For each subgroup, 30–33 subjects were used for training and 19–20 for testing, and manual expert segmentations served as the gold‑standard reference.

Four models were trained separately on each demographic group, deliberately creating “biased” training conditions. UNesT employs a hierarchical transformer encoder operating on 3‑D patches; nnU‑Net follows an automated pipeline that self‑optimizes preprocessing, architecture, and loss functions; CoTr combines convolutional and transformer blocks; and ANTs performs multi‑atlas joint label fusion. All models were run with their default configurations to reflect typical real‑world usage.

Performance was assessed with two conventional accuracy metrics, the overlap‑based Dice similarity coefficient (DSC) and the boundary‑based Normalized Surface Dice (NSD), and with a fairness‑oriented metric, Equity‑Scaled Segmentation Performance (ESSP), introduced by Tian et al. (2024). ESSP penalizes overall accuracy by the sum of absolute deviations of each subgroup’s accuracy from the overall mean, thus rewarding models that are both accurate and equitable.
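As a rough illustration, an equity-scaled metric of this form can be sketched in a few lines of Python. This is a minimal sketch of the penalization idea described above, not the authors' evaluation code; the subgroup names and the use of per-subject Dice lists are illustrative assumptions, and the exact formulation in Tian et al. (2024) may differ in detail.

```python
def essp(scores_by_group):
    """Equity-scaled score: overall mean accuracy divided by one plus the
    sum of absolute deviations of each subgroup mean from the overall mean.

    scores_by_group: dict mapping subgroup name -> list of per-subject
    scores (e.g., Dice coefficients in [0, 1]).
    """
    all_scores = [s for group in scores_by_group.values() for s in group]
    overall = sum(all_scores) / len(all_scores)
    deviation = sum(
        abs(overall - sum(group) / len(group))
        for group in scores_by_group.values()
    )
    return overall / (1.0 + deviation)
```

When all subgroups score identically, the penalty vanishes and ESSP equals the overall mean; any subgroup disparity inflates the denominator and lowers the score, which is why a model can have a high average Dice but a low ESSP.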

Results show that UNesT and ANTs benefit markedly from race‑matched training: when the training set’s race matches that of the test subjects, both DSC and NSD improve by 3–5 % and ESSP scores rise substantially. This indicates that these methods are sensitive to anatomical variations linked to race. In contrast, nnU‑Net’s performance is largely invariant to race matching, maintaining a stable DSC (~0.88) and NSD (~0.85) across all test groups, and its ESSP shows minimal fluctuation. CoTr delivers intermediate results but exhibits higher variability in boundary‑focused NSD.

Beyond segmentation accuracy, the authors examined whether the derived NAc volumes preserve known demographic effects. Linear mixed‑effects models revealed that manual segmentations display a significant sex effect (larger volumes in males) with negligible race effects. UNesT and ANTs replicate the sex effect while the race effect disappears, whereas nnU‑Net attenuates the sex effect and also eliminates race differences. Thus, the choice of segmentation algorithm influences downstream volumetric analyses, potentially propagating or mitigating bias.
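The fixed-effect structure of such an analysis can be illustrated with a small pure-Python sketch that computes sex, race, and interaction contrasts from cell means in a balanced 2×2 design. This is not the paper's analysis code: a full linear mixed model would also include random effects (e.g., a per-subject term for the left and right hemispheres), and the volume values used below are synthetic, not data from the study.

```python
def demographic_effects(cell_means):
    """Fixed-effect contrasts for a balanced 2x2 sex-by-race design.

    cell_means maps (sex, race) -> mean volume in mm^3, with sex in
    {"F", "M"} and race in {"B", "W"} (illustrative coding).
    """
    m = cell_means
    # Main effect of sex: male minus female, averaged over race.
    sex = ((m[("M", "B")] + m[("M", "W")]) - (m[("F", "B")] + m[("F", "W")])) / 2
    # Main effect of race: Black minus White, averaged over sex.
    race = ((m[("M", "B")] + m[("F", "B")]) - (m[("M", "W")] + m[("F", "W")])) / 2
    # Interaction: does the sex difference depend on race?
    interaction = (m[("M", "B")] - m[("F", "B")]) - (m[("M", "W")] - m[("F", "W")])
    return sex, race, interaction
```

In this framing, the paper's finding amounts to the sex contrast remaining significantly nonzero for volumes derived from manual and most automated segmentations, while the race contrast stays near zero.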

The discussion emphasizes three key insights. First, model architecture determines susceptibility to demographic bias: transformer‑heavy UNesT and atlas‑based ANTs rely heavily on the composition of the training atlas, while nnU‑Net’s self‑configuring pipeline confers robustness across demographics. Second, fairness metrics such as ESSP are essential because average Dice can mask subgroup disparities; ESSP provides a single figure that balances accuracy and equity. Third, the study highlights the practical importance of diverse, balanced training data, especially for models that are not inherently robust to demographic shifts.

Limitations include the modest sample size per subgroup, the focus on a single subcortical structure, and the absence of bias‑mitigation strategies (e.g., data augmentation, domain adaptation, fairness‑aware loss functions). Future work should extend the analysis to additional brain regions and larger multi‑site cohorts, and explore algorithmic interventions that explicitly reduce demographic disparities.

In conclusion, the paper demonstrates that (1) deep‑learning and traditional segmentation methods can exhibit distinct patterns of demographic bias, (2) nnU‑Net offers comparatively stable performance regardless of race matching, and (3) incorporating fairness‑aware evaluation is crucial for the responsible deployment of AI‑based neuroimaging tools. These findings provide actionable guidance for researchers and clinicians seeking equitable automated brain‑MRI segmentation.

