Fully-automated sleep staging: multicenter validation of a generalizable deep neural network for Parkinson's disease and isolated REM sleep behavior disorder


Isolated REM sleep behavior disorder (iRBD) is a key prodromal marker of Parkinson’s disease (PD), and video-polysomnography (vPSG) remains the diagnostic gold standard. However, manual sleep staging is particularly challenging in neurodegenerative diseases due to EEG abnormalities and fragmented sleep, making PSG assessments a bottleneck for deploying new RBD screening technologies at scale. We adapted U-Sleep, a deep neural network, for generalizable sleep staging in PD and iRBD. A pretrained U-Sleep model, based on a large, multisite non-neurodegenerative dataset (PUB; 19,236 PSGs across 12 sites), was fine-tuned on research datasets from two centers (Lundbeck Foundation Parkinson’s Disease Research Center (PACE) and the Cologne-Bonn Cohort (CBC); 112 PD, 138 iRBD, 89 age-matched controls). The resulting model was evaluated on an independent dataset from the Danish Center for Sleep Medicine (DCSM; 81 PD, 36 iRBD, 87 sleep-clinic controls). A subset of PSGs with low agreement between the human rater and the model (Cohen’s $κ$ < 0.6) was re-scored by a second blinded human rater to identify sources of disagreement. Finally, we applied confidence-based thresholds to optimize REM sleep staging. The pretrained model achieved mean $κ$ = 0.81 in PUB, but $κ$ = 0.66 when applied directly to PACE/CBC. By fine-tuning the model, we developed a generalized model with $κ$ = 0.74 on PACE/CBC (p < 0.001 vs. the pretrained model). In DCSM, mean and median $κ$ increased from 0.60 to 0.64 (p < 0.001) and 0.64 to 0.69 (p < 0.001), respectively. In the interrater study, PSGs with low agreement between the model and the initial scorer showed similarly low agreement between human scorers. Applying a confidence threshold increased the proportion of correctly identified REM sleep epochs from 85% to 95.5%, while preserving sufficient (> 5 min) REM sleep for 95% of subjects.


💡 Research Summary

The paper addresses a critical bottleneck in the clinical assessment of Parkinson’s disease (PD) and isolated REM sleep behavior disorder (iRBD): the labor‑intensive, error‑prone manual scoring of video‑polysomnography (vPSG) data. While manual staging is already challenging in healthy populations, neurodegenerative patients present additional difficulties due to EEG abnormalities, fragmented sleep architecture, and frequent arousals. To overcome this, the authors adapted U‑Sleep, a deep neural network originally trained on a massive, multisite, non‑neurodegenerative dataset (the PUB cohort, 19,236 PSGs from 12 sites), and evaluated its generalizability to PD and iRBD cohorts across several centers.

Model pre‑training and baseline performance
U‑Sleep is a U‑Net‑style 1‑D convolutional architecture that predicts the five standard sleep stages (W, N1, N2, N3, REM) on 30‑second epochs. When applied to the PUB data, the pretrained model achieved a mean Cohen’s κ of 0.81, confirming excellent agreement with human raters in a heterogeneous but neurologically “normal” population.
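Since every result in this summary is reported as Cohen's κ, a minimal sketch of how κ can be computed for one recording may help. It compares two per-epoch hypnograms (for example, human scorer versus model). The function name and the toy label sequences are illustrative assumptions, not taken from the paper or from the U-Sleep code base.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of per-epoch stage labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of epochs where both raters chose the same stage.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal stage frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[s] * count_b[s] for s in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical 10-epoch hypnograms (30-second epochs), for illustration only.
human = ["W", "N1", "N2", "N2", "N3", "N3", "REM", "REM", "W", "N2"]
model = ["W", "N2", "N2", "N2", "N3", "N2", "REM", "REM", "W", "N2"]
kappa = cohens_kappa(human, model)
print(f"kappa = {kappa:.3f}")
```

In this toy case the raters agree on 8 of 10 epochs (observed agreement 0.80) and chance agreement is 0.25, so κ is about 0.73. This shows why κ is lower than raw percent agreement.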

Transfer to neurodegenerative cohorts
Direct application of the pretrained model to the Parkinson’s and iRBD research datasets from the Lundbeck Foundation Parkinson’s Disease Research Center (PACE) and the Cologne‑Bonn Cohort (CBC) resulted in a substantial drop in performance (κ = 0.66). This highlighted the model’s sensitivity to disease‑specific EEG patterns and sleep fragmentation. The authors therefore fine‑tuned the network using 339 PSG recordings (112 PD, 138 iRBD, 89 age‑matched controls). Fine‑tuning was limited to the last three decoder blocks and batch‑normalisation parameters to avoid over‑fitting given the modest sample size. After this adaptation, κ rose to 0.74 (p < 0.001 versus the pretrained model), with the most pronounced gains in REM staging (κ = 0.62 → 0.71).

External multicenter validation
An independent test set from the Danish Center for Sleep Medicine (DCSM) comprised 81 PD, 36 iRBD, and 87 sleep‑clinic controls (total = 204). On this cohort, mean κ rose from 0.60 with the pretrained model to 0.64 with the fine‑tuned model, and median κ rose from 0.64 to 0.69 (both p < 0.001). This indicates that the gains from fine‑tuning transferred to a center that contributed no training data.
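The mean and median figures above are aggregates over per-recording κ values. The sketch below shows that aggregation for a pretrained and a fine-tuned model scored on the same recordings. The κ values are invented for illustration and are not the DCSM data. No significance test is included, because the summary does not say which test the authors used.

```python
from statistics import mean, median

# Hypothetical per-recording kappa values, same recordings in the same order.
kappa_pretrained = [0.55, 0.62, 0.70, 0.48, 0.66]
kappa_finetuned = [0.60, 0.68, 0.71, 0.55, 0.65]


def summarize(kappas):
    """Return (mean, median) of a list of per-recording kappa values."""
    return mean(kappas), median(kappas)


mean_pre, median_pre = summarize(kappa_pretrained)
mean_ft, median_ft = summarize(kappa_finetuned)

# Paired view: how many recordings improved after fine-tuning.
n_improved = sum(ft > pre for pre, ft in zip(kappa_pretrained, kappa_finetuned))

print(f"pretrained: mean={mean_pre:.3f} median={median_pre:.2f}")
print(f"fine-tuned: mean={mean_ft:.3f} median={median_ft:.2f}")
print(f"improved recordings: {n_improved}/{len(kappa_pretrained)}")
```

Reporting both statistics is useful because a few very poorly scored recordings can pull the mean down while leaving the median largely unchanged.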

Human‑model disagreement analysis
To investigate residual errors, the authors identified PSGs with low agreement between the model and the original human scorer (κ < 0.6). A second blinded human scorer re‑rated a subset of these recordings. Agreement between the two human scorers on these recordings was similarly low, suggesting that much of the disagreement stems from recordings that are intrinsically hard to score rather than from model failure alone. This finding underscores the importance of considering human scoring variability when benchmarking AI systems.
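The selection step in this analysis can be sketched as follows. It flags recordings whose model-versus-scorer κ falls below the 0.6 cutoff, then compares model-human and human-human agreement on the flagged set. The recording IDs and κ values are hypothetical. Only the 0.6 cutoff comes from the abstract.

```python
from statistics import mean

LOW_AGREEMENT_KAPPA = 0.6  # cutoff stated in the abstract


def select_for_rescoring(model_vs_rater1, cutoff=LOW_AGREEMENT_KAPPA):
    """Return sorted IDs of recordings whose model-vs-human kappa is below the cutoff."""
    return sorted(rec for rec, k in model_vs_rater1.items() if k < cutoff)


# Hypothetical kappa between the model and the original scorer, per recording.
model_vs_rater1 = {"psg_01": 0.72, "psg_02": 0.41, "psg_03": 0.58, "psg_04": 0.80}
flagged = select_for_rescoring(model_vs_rater1)

# Hypothetical kappa between the second (blinded) scorer and the original scorer.
rater2_vs_rater1 = {"psg_02": 0.45, "psg_03": 0.55}

mean_model_human = mean(model_vs_rater1[rec] for rec in flagged)
mean_human_human = mean(rater2_vs_rater1[rec] for rec in flagged)
print(flagged, round(mean_model_human, 3), round(mean_human_human, 3))
```

If the two means are close, as in this toy data, the low model agreement on those recordings is no worse than the agreement between two human scorers.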

Confidence‑based REM optimization
U‑Sleep outputs a probability distribution over sleep stages for each epoch, which the authors leveraged as a confidence score. By imposing a threshold (≥ 0.8) on this score for REM epochs, they filtered out low‑confidence predictions. This post‑processing step boosted the proportion of correctly identified REM epochs from 85 % to 95.5 % while still preserving at least five minutes of REM sleep in 95 % of subjects. The trade‑off gives up some REM epochs in exchange for more reliable ones, which benefits downstream REM‑focused analyses such as RBD screening.
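The thresholding step can be sketched as below. It keeps only epochs that the model labels REM with probability at or above a threshold, then reports how many of the kept epochs the human reference also scored as REM and how much REM time remains. Reading "correctly identified REM epochs" as this precision-style fraction is an assumption. The function names and toy data are also assumptions, and the 0.8 threshold is taken from the summary text above.

```python
EPOCH_SECONDS = 30  # standard scoring epoch length


def confident_rem_epochs(predicted, rem_prob, threshold):
    """Indices of epochs predicted as REM whose REM probability is >= threshold."""
    return [
        i
        for i, (stage, p) in enumerate(zip(predicted, rem_prob))
        if stage == "REM" and p >= threshold
    ]


def rem_precision(kept, reference):
    """Fraction of kept epochs that the human reference also scored as REM."""
    if not kept:
        return None  # nothing kept, precision is undefined
    return sum(reference[i] == "REM" for i in kept) / len(kept)


def rem_minutes(kept):
    """Total duration of the kept epochs in minutes."""
    return len(kept) * EPOCH_SECONDS / 60.0


# Hypothetical model output and human reference for six epochs.
predicted = ["REM", "REM", "REM", "REM", "N2", "W"]
rem_prob = [0.95, 0.90, 0.60, 0.85, 0.10, 0.05]
reference = ["REM", "REM", "N1", "REM", "N2", "W"]

all_rem = confident_rem_epochs(predicted, rem_prob, threshold=0.0)
kept = confident_rem_epochs(predicted, rem_prob, threshold=0.8)

print("unfiltered precision:", rem_precision(all_rem, reference))
print("filtered precision:", rem_precision(kept, reference))
print("retained REM minutes:", rem_minutes(kept))
```

In practice, the retained REM minutes would be checked per subject against the 5-minute minimum mentioned in the abstract, so that a stricter threshold does not leave too little REM sleep to analyze.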

Implications and future directions
The study demonstrates a pragmatic pipeline: (1) pre‑train on a large, heterogeneous dataset; (2) fine‑tune on disease‑specific data; (3) validate across independent centers. This approach yields a model that is both accurate and robust to variations in hardware and scoring conventions, making it suitable for large‑scale screening programs. Clinically, automated, high‑confidence REM detection could accelerate the identification of iRBD, a prodromal marker for PD, thereby facilitating earlier intervention trials.

Limitations include the predominance of European/North‑American sites, which may limit applicability to other populations, and the residual errors in lighter sleep stages (N1/N2) that still require improvement. Future work could explore real‑time deployment, integration of additional physiological signals (e.g., heart‑rate variability, actigraphy), and extension to other neurodegenerative disorders such as Alzheimer’s disease.

In summary, by fine‑tuning a deep neural network originally trained on a massive non‑neurodegenerative PSG cohort, the authors produced a generalized, high‑performing automated sleep staging system for PD and iRBD. The model achieves κ values comparable to expert human scorers, retains performance across multiple centers, and, with confidence‑based filtering, delivers highly reliable REM staging—paving the way for scalable, AI‑driven sleep diagnostics in neurodegenerative disease research and clinical practice.

