Evidence for Phenotype-Driven Disparities in Freezing of Gait Detection and Approaches to Bias Mitigation


Freezing of gait (FOG) is a debilitating symptom of Parkinson’s disease (PD) and a common cause of injurious falls. Recent advances in wearable-based human activity recognition (HAR) enable FOG detection, but bias and fairness in these models remain understudied. Bias refers to systematic errors leading to unequal outcomes, while fairness refers to consistent performance across subject groups. Biased models could systematically underserve patients with specific FOG phenotypes or demographics, potentially widening care disparities. We systematically evaluated bias and fairness of state-of-the-art HAR models for FOG detection across phenotypes and demographics using multi-site datasets. We assessed four mitigation approaches: conventional methods (threshold optimization and adversarial debiasing) and transfer learning approaches (multi-site transfer and fine-tuning large pretrained models). Fairness was quantified using demographic parity ratio (DPR) and equalized odds ratio (EOR). HAR models exhibited substantial bias (DPR & EOR < 0.8) across age, sex, disease duration, and critically, FOG phenotype. Phenotype-specific bias is particularly concerning as tremulous and akinetic FOG require different clinical management. Conventional bias mitigation methods failed: threshold optimization (DPR=-0.126, EOR=+0.063) and adversarial debiasing (DPR=-0.008, EOR=-0.001) showed minimal improvement. In contrast, transfer learning from multi-site datasets significantly improved fairness (DPR=+0.037, p<0.01; EOR=+0.045, p<0.01) and performance (F1-score=+0.020, p<0.05). Transfer learning across diverse datasets is essential for developing equitable HAR models that reliably detect FOG across all patient phenotypes, ensuring wearable-based monitoring benefits all individuals with PD.


💡 Research Summary

Freezing of gait (FOG) is a sudden, episodic motor symptom of Parkinson’s disease (PD) that dramatically increases fall risk. Because FOG often does not appear during brief clinical examinations, wearable sensor–based human activity recognition (HAR) models have been proposed to detect FOG continuously in real‑world settings. While recent work has demonstrated high overall sensitivity and specificity, little attention has been paid to whether these models perform equitably across patient sub‑groups, especially across the two major FOG phenotypes—tremulous and akinetic—whose underlying pathophysiology and therapeutic needs differ.

In this study, Odonga et al. assembled four publicly available multi‑site FOG datasets (Daphnet, De Souza, DeFOG, and tDCS‑FOG), comprising 145 PD participants (31 % female, mean age = 67 ± 9 y, disease duration = 10 ± 6 y). The datasets vary in sensor placement (ankle, lower back, shank, thigh), sampling frequency (64–128 Hz), and FOG‑provoking protocols (Timed‑Up‑and‑Go, turning tasks, dual‑task walking). Across all recordings, non‑FOG data dominate (≈86 % of samples) and tremulous FOG episodes are far more common than akinetic ones (≈82 % vs 18 %). Importantly, each dataset provides episode‑level annotations that enable phenotype‑specific analysis.

The authors implemented five state‑of‑the‑art HAR classifiers: (A1) Random Forest, (A2) DeepConvLSTM, (A3) Masked Transformer, (B1) a recent CNN‑based architecture (Yuan 2024), and (B2) a newer Transformer‑style model (Ruan 2025). All models were trained from scratch on sliding windows (3 s for three datasets, 4.5 s for Daphnet) with min‑max scaling, and evaluated using a 5‑fold cross‑validation scheme.
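The preprocessing pipeline described above (sliding windows plus min-max scaling) can be sketched in a few lines of NumPy. Note that the 50 % window overlap and the per-channel scaling choice here are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def sliding_windows(signal, fs, win_s, step_s):
    """Cut a (T, channels) accelerometer stream into overlapping windows."""
    win, step = int(win_s * fs), int(step_s * fs)
    starts = range(0, len(signal) - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])

def minmax_scale(windows):
    """Per-channel min-max scaling to [0, 1] before model input."""
    lo = windows.min(axis=(0, 1), keepdims=True)
    hi = windows.max(axis=(0, 1), keepdims=True)
    return (windows - lo) / (hi - lo + 1e-8)

# Example: 60 s of 3-axis data at 64 Hz, 3 s windows, 50 % overlap (assumed)
x = np.random.randn(60 * 64, 3)
w = minmax_scale(sliding_windows(x, fs=64, win_s=3.0, step_s=1.5))
```

Each of the resulting windows would then be passed to one of the five classifiers within the 5-fold cross-validation loop.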

To quantify fairness, three group‑fairness notions were adopted. Demographic Parity Ratio (DPR) measures equality of positive‑prediction rates across protected groups; Equalized Odds Ratio (EOR) takes the worst case between true‑positive parity (TPPR) and false‑positive parity (FPRR); Equality of Opportunity Difference (EOD) is the absolute difference in true‑positive rates between groups. Protected attributes examined were age, sex, disease duration, and, crucially, FOG phenotype (tremulous vs akinetic). A model was deemed fair when both DPR and EOR were ≥ 0.8.
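Under the standard Fair-ML definitions, DPR and EOR follow directly from per-group selection, true-positive, and false-positive rates. The sketch below assumes binary labels and predictions and uses min/max-over-groups ratios; the authors' exact implementation may differ:

```python
import numpy as np

def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, TPR and FPR for a binary FOG detector."""
    rates = {}
    for g in np.unique(groups):
        m = groups == g
        yt, yp = y_true[m], y_pred[m]
        sel = yp.mean()                                   # P(pred = 1 | group)
        tpr = yp[yt == 1].mean()                          # true-positive rate
        fpr = yp[yt == 0].mean()                          # false-positive rate
        rates[g] = (sel, tpr, fpr)
    return rates

def fairness_ratios(y_true, y_pred, groups):
    """DPR = min/max selection rate; EOR = worst case of TPR and FPR ratios."""
    r = list(group_rates(y_true, y_pred, groups).values())
    sel, tpr, fpr = (np.array(c) for c in zip(*r))
    dpr = sel.min() / sel.max()
    eor = min(tpr.min() / tpr.max(), fpr.min() / fpr.max())
    return dpr, eor
```

With these ratio definitions, a value of 1.0 means perfect parity and the paper's fairness criterion corresponds to both ratios being at least 0.8.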

Baseline results showed substantial bias: every model fell below the 0.8 fairness threshold for at least one attribute, with the most pronounced disparity for phenotype (detection rates for akinetic FOG were markedly lower). Conventional bias‑mitigation techniques (post‑hoc threshold optimization and adversarial debiasing) failed to improve fairness: threshold optimization changed DPR by ‑0.126 and EOR by +0.063, while adversarial debiasing changed DPR by ‑0.008 and EOR by ‑0.001.
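For concreteness, one simple form of post-hoc threshold optimization picks a separate decision threshold per group so that group TPRs line up with a reference value. The matching criterion below is an assumption for illustration, not necessarily the variant evaluated in the paper:

```python
import numpy as np

def groupwise_thresholds(scores, y_true, groups,
                         grid=np.linspace(0.05, 0.95, 19)):
    """Choose, per group, the threshold whose TPR is closest to the
    overall TPR at the default 0.5 cut-off (one simple variant of
    post-hoc threshold optimization)."""
    target = (scores >= 0.5)[y_true == 1].mean()  # reference TPR
    thr = {}
    for g in np.unique(groups):
        m = (groups == g) & (y_true == 1)
        tprs = np.array([(scores[m] >= t).mean() for t in grid])
        thr[g] = grid[np.argmin(np.abs(tprs - target))]
    return thr
```

Because this only reshuffles an already-biased score distribution, it cannot repair representations that encode phenotype-specific error patterns, which is consistent with the negligible gains reported above.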

In contrast, transfer‑learning strategies that leveraged the multi‑site data dramatically reduced bias. After pre‑training on the combined datasets and fine‑tuning on each target site, DPR increased by +0.037 (p < 0.01) and EOR by +0.045 (p < 0.01). Overall classification performance also improved, with a mean F1‑score gain of +0.020 (p < 0.05). Fine‑tuning large, publicly available pretrained time‑series models produced comparable gains, especially boosting true‑positive rates for akinetic episodes.
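The pretrain-then-fine-tune recipe can be illustrated at toy scale by warm-starting a simple classifier on pooled "multi-site" data before adapting it to a small target site. The NumPy logistic-regression stand-in below is, of course, far simpler than the DeepConvLSTM and Transformer models used in the paper; the synthetic data and hyperparameters are assumptions for illustration:

```python
import numpy as np

def train_logreg(X, y, w=None, lr=0.1, epochs=200):
    """Minimal logistic-regression trainer; passing `w` warm-starts from
    pretrained weights, mimicking pretrain-then-fine-tune at toy scale."""
    w = np.zeros(X.shape[1]) if w is None else w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(0)
# "Multi-site" pool: synthetic sites sharing one underlying labeling rule
true_w = np.array([1.5, -2.0, 0.5])
X_pool = rng.normal(size=(600, 3))
y_pool = (X_pool @ true_w + rng.normal(scale=0.5, size=600) > 0).astype(float)
# Small "target site" with few labeled windows
X_tgt = rng.normal(size=(40, 3))
y_tgt = (X_tgt @ true_w > 0).astype(float)

w_pre = train_logreg(X_pool, y_pool)        # pre-train on pooled sites
w_ft = train_logreg(X_tgt, y_tgt, w=w_pre)  # fine-tune on the target site
acc = ((1 / (1 + np.exp(-X_tgt @ w_ft)) > 0.5) == y_tgt).mean()
```

The intuition is that the pooled pre-training stage sees heterogeneous inputs, so the fine-tuned model starts from weights that already generalize across sites rather than overfitting the small target sample.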

The authors argue that the success of transfer learning stems from exposure to heterogeneous sensor placements, task protocols, and demographic mixes, which forces the model to learn more robust, phenotype‑agnostic representations. They also emphasize that evaluating fairness through multiple complementary metrics (DPR, TPPR, FPRR, EOD) provides a richer picture than a single accuracy figure and should become standard practice for clinical AI.

Limitations include persistent label imbalance (few akinetic episodes), reliance on a limited set of sensor locations, and the absence of real‑time deployment assessments (latency, battery consumption). Ethical considerations around defining “protected attributes” in a clinical context are also highlighted.

In summary, this work is the first systematic application of Fair‑ML concepts to wearable‑based FOG detection. It demonstrates that phenotype‑specific bias is a real and clinically relevant problem, that traditional post‑processing debiasing methods are insufficient, and that multi‑site transfer learning offers a practical, effective pathway to equitable, high‑performing FOG monitoring tools. Future research should expand the diversity of data sources, explore additional sensor modalities, and integrate fairness constraints directly into model training to further close performance gaps across all PD patient sub‑populations.

