A Dual Pipeline Machine Learning Framework for Automated Multi Class Sleep Disorder Screening Using Hybrid Resampling and Ensemble Learning


Accurate classification of sleep disorders, particularly insomnia and sleep apnea, is important for reducing long-term health risks and improving patient quality of life. However, clinical sleep studies are resource-intensive and difficult to scale for population-level screening. This paper presents a Dual Pipeline Machine Learning Framework for multi-class sleep disorder screening using the Sleep Health and Lifestyle dataset. The framework consists of two parallel processing streams: a statistical pipeline that targets linear separability using Mutual Information and Linear Discriminant Analysis, and a wrapper-based pipeline that applies Boruta feature selection with an autoencoder for nonlinear representation learning. To address class imbalance, we use the hybrid SMOTETomek resampling strategy. In experiments, Extra Trees and K-Nearest Neighbors achieved an accuracy of 98.67%, outperforming recent baselines on the same dataset. Statistical testing using the Wilcoxon Signed-Rank Test indicates that the improvement over baseline configurations is significant, and inference latency remains below 400 milliseconds. These results suggest that the proposed dual-pipeline design supports accurate and efficient automated screening for non-invasive sleep disorder risk stratification.


💡 Research Summary

This paper addresses the pressing need for scalable, non‑invasive screening of sleep disorders—particularly insomnia and obstructive sleep apnea (OSA)—by proposing a Dual‑Pipeline Machine Learning Framework (DPMLF) built on the publicly available “Sleep Health and Lifestyle” dataset (374 records, 13 features). Recognizing that conventional polysomnography (PSG) is costly and unsuitable for population‑level screening, the authors design a system that processes the same data through two parallel feature‑engineering streams.

The statistical pipeline first computes Mutual Information between each feature and the target, selects the most informative variables, scales them with RobustScaler, and applies Linear Discriminant Analysis (LDA) to obtain a low‑dimensional space optimized for linear separability. The wrapper‑based pipeline, in contrast, employs Boruta (a Random‑Forest‑based importance selector) to retain only the most predictive features, scales them with MinMaxScaler, and feeds them into an autoencoder that learns compact nonlinear representations. Both pipelines generate complementary feature sets that are later concatenated or fed into separate classifiers.
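The first step of the statistical pipeline, ranking features by mutual information with the target, can be illustrated with a minimal standard-library sketch. It assumes features that are already discrete or binned; the toy columns, the function names, and the choice of `k` are hypothetical and do not come from the paper. A real implementation would more likely use a library estimator such as scikit-learn's `mutual_info_classif`, which also handles continuous features.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """I(X;Y) in nats for two equally long sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log(p(x,y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def top_k_features(columns, target, k):
    """Score every column against the target and keep the k highest."""
    scores = {name: mutual_information(vals, target) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Made-up toy data, for illustration only (not the Sleep Health and Lifestyle dataset).
target = ["None", "None", "Insomnia", "Insomnia", "Apnea", "Apnea"]
columns = {
    "bmi_category": ["normal", "normal", "normal", "normal", "obese", "obese"],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "stress_band": ["low", "low", "high", "high", "mid", "mid"],
}

selected, scores = top_k_features(columns, target, k=2)
print(selected)
```

In this toy example `stress_band` determines the label completely, so its score equals the entropy of the target, while `gender` is independent of the label and scores zero. The selected columns would then be scaled and projected with LDA, as the paragraph above describes.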

To mitigate the class imbalance (219 “None”, 78 “Sleep Apnea”, 77 “Insomnia”, roughly 2.8 to 1 between the majority class and each minority class), the authors adopt a hybrid resampling strategy, SMOTETomek. SMOTE synthesizes new minority‑class samples by interpolating between existing ones, while Tomek Links removal prunes ambiguous instances near the class boundary; the combination is intended to reduce noise and limit over‑fitting.
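The two halves of SMOTETomek can be sketched from scratch with the standard library. This is a simplified illustration, not the authors' code: it interpolates toward the single nearest minority neighbour (standard SMOTE picks one of k neighbours at random), it drops both members of each Tomek link, and all data and function names are made up. In practice the `SMOTETomek` class from imbalanced-learn would be the usual choice.

```python
import math
import random


def nearest(i, X, candidates):
    """Index in `candidates` (other than i) closest to X[i] in Euclidean distance."""
    return min((j for j in candidates if j != i), key=lambda j: math.dist(X[i], X[j]))


def smote(X, y, minority, n_new, rng):
    """Create n_new synthetic points on segments between minority neighbours."""
    idx = [i for i, label in enumerate(y) if label == minority]
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(idx)
        j = nearest(i, X, idx)  # simplification: 1 nearest neighbour instead of k
        t = rng.random()        # interpolation weight in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(X[i], X[j])))
    return synthetic


def tomek_links(X, y):
    """Pairs (i, j) that are mutual nearest neighbours with different labels."""
    all_idx = range(len(X))
    nn = [nearest(i, X, all_idx) for i in all_idx]
    return [(i, j) for i, j in enumerate(nn) if i < j and nn[j] == i and y[i] != y[j]]


def smote_tomek(X, y, minority, n_new, seed=0):
    """Oversample the minority class, then drop both members of every Tomek link."""
    rng = random.Random(seed)
    X_res = list(X) + smote(X, y, minority, n_new, rng)
    y_res = list(y) + [minority] * n_new
    drop = {k for pair in tomek_links(X_res, y_res) for k in pair}
    keep = [k for k in range(len(X_res)) if k not in drop]
    return [X_res[k] for k in keep], [y_res[k] for k in keep]


# Made-up 2-D toy data: four majority points and two minority points.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (1.0, 1.0), (1.2, 1.1)]
y = ["None"] * 4 + ["Insomnia"] * 2
X_bal, y_bal = smote_tomek(X, y, minority="Insomnia", n_new=2)

# Made-up 1-D example where points 1 and 2 straddle the class boundary.
links = tomek_links([(0.0,), (1.0,), (1.1,), (2.0,)], ["a", "a", "b", "b"])
print(len(X_bal), links)
```

In the 2-D example the classes are well separated, so no Tomek link is found and only the oversampling step changes the data. The 1-D example shows the cleaning step on its own: the two points closest to the boundary are each other's nearest neighbour and carry different labels, so they form a link.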

A suite of classifiers is evaluated; Extra Trees (Extremely Randomized Trees) and K‑Nearest Neighbors (KNN) emerge as the top performers. Under 5‑fold cross‑validation, these classifiers reach an average accuracy of 98.67%, with precision, recall, and F1‑scores all exceeding 0.98. The Wilcoxon Signed‑Rank Test indicates that the improvement over recent baselines (the best of which reported 96.88% accuracy) is statistically significant (p < 0.01). Inference latency remains below 400 ms, which suggests the approach could be practical for near‑real‑time use on mobile or wearable platforms.
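The paired comparison behind a Wilcoxon signed-rank test can be sketched with the standard library as follows. This is an exact two-sided version that assumes no zero differences and no tied absolute differences; the per-run counts of correct predictions are invented for illustration, and the summary does not state how many paired scores entered the authors' test. `scipy.stats.wilcoxon` is the usual library routine for this.

```python
from itertools import product


def wilcoxon_signed_rank(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Assumes no zero differences and no ties among absolute differences.
    Returns (statistic, p_value), where statistic = min(W+, W-).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = {i: r for r, i in enumerate(order, start=1)}
    w_plus = sum(ranks[i] for i, d in enumerate(diffs) if d > 0)
    w_minus = sum(ranks[i] for i, d in enumerate(diffs) if d < 0)
    stat = min(w_plus, w_minus)
    # Under H0 every rank gets a + or - sign with probability 1/2,
    # so enumerate all 2^n sign patterns and count those at least as extreme.
    total = n * (n + 1) // 2
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= stat:
            extreme += 1
    return stat, extreme / 2 ** n


# Hypothetical correct-prediction counts over 8 paired evaluation runs.
proposed = [74, 73, 75, 74, 73, 75, 74, 75]
baseline = [71, 72, 70, 68, 66, 73, 70, 67]
stat8, p8 = wilcoxon_signed_rank(proposed, baseline)

# The same test restricted to 5 pairs, as in a single 5-fold run.
stat5, p5 = wilcoxon_signed_rank(proposed[:5], baseline[:5])
print(stat8, p8, stat5, p5)
```

One design point worth noting: with only five paired scores, the smallest two-sided exact p-value is 2/32 = 0.0625, even when every fold favours the same model. Reaching p < 0.01 with this exact test therefore requires at least eight pairs, for example from repeated cross-validation or from comparisons across several configurations.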

Key contributions include: (1) a novel dual‑pipeline architecture that separately optimizes linear and nonlinear feature relationships, reducing information loss during preprocessing; (2) the integration of SMOTETomek to robustly address class imbalance; (3) state‑of‑the‑art classification performance with low computational overhead; and (4) rigorous statistical validation to ensure reproducibility. Ethical considerations such as data privacy, bias assessment, and the necessity of clinician oversight are discussed. Future work will involve external validation on larger multi‑center cohorts, incorporation of explainable AI techniques, and seamless integration with wearable sensor streams for continuous, real‑world sleep health monitoring.

