Automated Dysphagia Screening Using Noninvasive Neck Acoustic Sensing
Pharyngeal health plays a vital role in essential human functions such as breathing, swallowing, and vocalization. Early detection of swallowing abnormalities, collectively known as dysphagia, is crucial for timely intervention. However, current diagnostic methods often rely on radiographic imaging or invasive procedures. In this study, we propose an automated framework for detecting dysphagia using portable, noninvasive acoustic sensing coupled with machine learning. By capturing subtle acoustic signals from the neck during swallowing tasks, we aim to identify patterns associated with abnormal physiological conditions. Our approach achieves promising abnormality detection performance on held-out patients, with an AUC-ROC of 0.904 across five independent train-test splits. This work demonstrates the feasibility of noninvasive acoustic sensing as a practical and scalable tool for pharyngeal health monitoring.
💡 Research Summary
Dysphagia, affecting up to 20% of adults over 50 and a majority of patients with Parkinson’s disease, stroke, or head‑and‑neck cancer, remains a major public‑health concern because delayed diagnosis can lead to aspiration pneumonia and prolonged hospitalization. Current gold‑standard assessments—flexible endoscopic evaluation of swallowing (FEES) and videofluoroscopic swallowing studies (VFSS)—are invasive or involve radiation exposure, require specialized clinicians, and are costly, limiting their routine use for screening at‑risk patients.
To address these shortcomings, the authors propose a fully automated, non‑invasive dysphagia screening system that captures subtle acoustic signals from the neck during swallowing using a portable digital stethoscope (3M Littmann Core). The study enrolled 49 participants (25 F/24 M) from the UC San Diego Center for Airway, Voice and Swallowing, each undergoing a standard FEES protocol with 8–10 trials of varied bolus consistencies. While the laryngoscope recorded video, the digital stethoscope recorded audio from a position lateral to the thyroid cartilage. After discarding recordings contaminated by speech or other non‑swallow sounds, the authors segmented 617 individual swallow events (average duration ≈ 0.64 s) using a custom amplitude‑threshold and gap‑time algorithm tuned against the video ground truth.
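The paper's segmentation code is custom, but the idea maps naturally onto an energy-based splitter. The sketch below is a minimal, hypothetical implementation using librosa; the default parameters (`top_db=20`, `gap_time=0.6`) are borrowed from the fixed-parameter configuration reported later in the summary, and the authors' actual algorithm may differ.

```python
import numpy as np
import librosa

def segment_swallows(path, top_db=20, gap_time=0.6):
    """Return (start, end) times in seconds for candidate swallow events."""
    y, sr = librosa.load(path, sr=None, mono=True)
    # Keep regions whose amplitude is within `top_db` dB of the waveform peak,
    # then convert sample indices to seconds.
    intervals = librosa.effects.split(y, top_db=top_db) / sr
    events = []
    for start, end in intervals:
        # Merge bursts separated by less than the gap time into one swallow.
        if events and start - events[-1][1] < gap_time:
            events[-1][1] = end
        else:
            events.append([start, end])
    return [tuple(e) for e in events]
```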
Feature extraction proceeded along three parallel streams: (1) domain‑informed acoustic descriptors derived from signal processing—the five dominant FFT frequencies, average/median frequency, peak and mean amplitude, area under the waveform (trapezoidal integration), swallow length, and swallow count; (2) a comprehensive set of over 6,000 features from the OpenSMILE toolkit; and (3) embeddings from three pre‑trained audio models (OPERA, AST, CLAP). Demographic variables (age, gender) were appended to all feature vectors.
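As a concrete illustration, the domain-informed descriptors for a single segmented swallow could be computed roughly as follows. Variable names and the exact normalization are my own assumptions, not the authors' published code.

```python
import numpy as np

def domain_features(y, sr):
    """Domain-informed descriptors for one segmented swallow waveform `y`."""
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    top5 = freqs[np.argsort(spectrum)[-5:]]         # five strongest FFT components
    avg_freq = np.average(freqs, weights=spectrum)  # amplitude-weighted average frequency
    cdf = np.cumsum(spectrum) / np.sum(spectrum)
    median_freq = freqs[np.searchsorted(cdf, 0.5)]  # half the spectral mass lies below this
    env = np.abs(y)                                 # rectified waveform
    return {
        "top5_freqs": top5,
        "avg_freq": avg_freq,
        "median_freq": median_freq,
        "peak_amp": env.max(),
        "mean_amp": env.mean(),
        "auc": np.trapz(env, dx=1.0 / sr),          # trapezoidal area under the waveform
        "length_s": len(y) / sr,
        # Swallow count is a per-recording feature (number of segmented events),
        # so it is appended outside this per-swallow function.
    }
```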
For classification, the authors evaluated Random Forest (RFC) and Support Vector Machine (SVM) models; performance was comparable, and RFC was selected for consistency across experiments. The data were split at the patient level into five independent train‑test partitions, preserving class balance and swallow distribution, to emulate a realistic clinical deployment where the model must generalize to unseen patients. Primary evaluation metrics were area under the receiver‑operating‑characteristic curve (AUC‑ROC) and balanced accuracy.
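A minimal sketch of this patient-level protocol is shown below, assuming scikit-learn and hypothetical arrays `X` (per-swallow features), `y` (binary labels), and `patient_ids` (subject identifiers). Note that `GroupShuffleSplit` only guarantees subject disjointness; the authors' splits additionally preserved class balance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

def evaluate(X, y, patient_ids, n_splits=5, seed=0):
    aucs, baccs = [], []
    # Grouping by patient ensures no subject appears in both train and test,
    # emulating deployment on unseen patients.
    splitter = GroupShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed)
    for train, test in splitter.split(X, y, groups=patient_ids):
        clf = RandomForestClassifier(n_estimators=300, random_state=seed)
        clf.fit(X[train], y[train])
        prob = clf.predict_proba(X[test])[:, 1]
        aucs.append(roc_auc_score(y[test], prob))
        baccs.append(balanced_accuracy_score(y[test], prob > 0.5))
    return np.mean(aucs), np.std(aucs), np.mean(baccs), np.std(baccs)
```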
In the binary abnormality detection task (normal vs. any dysphagia), the domain‑informed feature set alone achieved an AUC‑ROC of 0.904 ± 0.015 and balanced accuracy of 0.913 ± 0.075, markedly outperforming the OPERA embeddings (AUC‑ROC ≈ 0.65) and OpenSMILE (AUC‑ROC ≈ 0.78). Adding OpenSMILE features to the domain set actually reduced performance to an AUC‑ROC of 0.804, suggesting that many OpenSMILE descriptors capture irrelevant or noisy aspects of the signal. In the three‑class severity classification (normal, penetration, aspiration), performance dropped (AUC‑ROC ≈ 0.61) due to limited per‑class samples and class imbalance, highlighting the need for larger, more balanced datasets for fine‑grained risk stratification.
Because real‑world recordings contain multiple swallows per trial, the authors also investigated automatic segmentation strategies. A fixed‑parameter method (optimized thresholds, gap time = 0.6 s, top dB = 20) achieved an intersection‑over‑union of 0.4775, with 65.8% sensitivity and 87.6% specificity. When aggregating swallow‑level predictions to the patient level, the “max‑risk” strategy (taking the highest predicted risk among a patient’s swallows) yielded the highest AUC‑ROC of 0.942 under the fixed‑parameter segmentation. A simpler sliding‑window approach (1‑second windows with 50% overlap) performed better under mean‑risk and mode‑risk aggregations, offering a computationally lightweight solution suitable for real‑time deployment. Human‑annotated swallow segments served as an upper bound, achieving an AUC‑ROC of 0.967, confirming that the acoustic signal contains strong diagnostic information and that improvements in automatic segmentation could close the gap.
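The three aggregation rules are simple to express. Below is a sketch assuming `scores` maps each patient ID to that patient's list of per-swallow risk scores; the "mode" rule is interpreted here as a majority vote over binarized predictions, which is an assumption on my part.

```python
from statistics import mode
import numpy as np

def aggregate_patient_risk(scores, strategy="max"):
    """Collapse per-swallow risk scores into one score per patient."""
    rules = {
        "max": max,                                   # max-risk: most suspicious swallow
        "mean": np.mean,                              # mean-risk: average over swallows
        "mode": lambda s: mode(round(v) for v in s),  # mode-risk: majority vote (assumed)
    }
    rule = rules[strategy]
    return {pid: float(rule(s)) for pid, s in scores.items()}
```

For example, `aggregate_patient_risk({"p1": [0.2, 0.9]}, "max")` returns `{"p1": 0.9}`, flagging the patient if any single swallow looks abnormal.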
SHAP (SHapley Additive exPlanations) analysis identified the most influential predictors: older age and male gender increased dysphagia risk, consistent with epidemiological literature. Among acoustic features, average amplitude, average frequency, area under the curve, and swallow count were top contributors; weaker, shorter swallows were associated with abnormal findings. This dual importance of demographic and signal‑based features underscores the value of integrating clinical context with raw acoustic data.
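For readers wanting to reproduce this kind of attribution analysis, a typical SHAP workflow for a fitted Random Forest looks like the following; `clf` and `X_test` (a pandas DataFrame with named feature columns) are carried over from the earlier evaluation sketch, not from the authors' code.

```python
import shap

# `clf` is the fitted RandomForestClassifier; `X_test` a pandas DataFrame
# whose columns include the demographic and acoustic feature names.
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)
if isinstance(shap_values, list):      # older shap API: one array per class
    shap_values = shap_values[1]
elif shap_values.ndim == 3:            # newer API: (samples, features, classes)
    shap_values = shap_values[..., 1]
shap.summary_plot(shap_values, X_test) # per-feature impact on the abnormal class
```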
The study acknowledges several limitations. The cohort size (49 subjects) and single‑center origin restrict generalizability across diverse populations and disease etiologies. Automatic segmentation, while promising, still lags behind expert manual labeling, which may affect real‑world sensitivity, especially for low‑severity cases. Moreover, the current evaluation was confined to the controlled environment of a hospital FEES session; translation to at‑home or telehealth settings will require robust noise‑rejection and possibly alternative sensor placements (e.g., smartphone microphones).
Future work outlined by the authors includes expanding the dataset through multi‑institution collaborations, refining segmentation algorithms (potentially via deep learning sequence models such as CNN‑RNN or Transformers), and conducting prospective trials in community and home environments. The authors also suggest exploring end‑to‑end deep models that can jointly learn feature representations and classification, which may further boost performance, especially for the multi‑class severity task.
In conclusion, this research demonstrates that a low‑cost, non‑invasive neck acoustic sensor, combined with carefully engineered domain features and conventional machine‑learning classifiers, can reliably detect dysphagia with an AUC‑ROC exceeding 0.90. The approach offers a scalable alternative to radiographic or endoscopic assessments, paving the way for continuous, real‑time monitoring of pharyngeal health in both clinical and remote care contexts.