Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis
This paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a novel system that leverages both audio and video data to automatically extract key video segments and metrics from raw laryngeal videostroboscopic videos for assisted clinical assessment. The system integrates video-based glottis detection with an audio keyword spotting method to analyze both video and audio data, identifying patient vocalizations and refining video highlights to ensure optimal inspection of vocal fold movements. Beyond key video segment extraction from the raw laryngeal videos, MLVAS generates effective audio and visual features for Vocal Fold Paralysis (VFP) detection. Pre-trained audio encoders encode the patient's voice to produce the audio features. Visual features are generated by measuring the angle deviation of both the left and right vocal folds from the estimated glottal midline on the segmented glottis masks. To improve mask quality, we introduce a diffusion-based refinement that follows traditional U-Net segmentation to reduce false positives. We conducted several ablation studies to demonstrate the effectiveness of each module and modality in the proposed MLVAS. The experimental results on a public segmentation dataset show the effectiveness of our proposed segmentation module. In addition, unilateral VFP classification results on a real-world clinical dataset demonstrate MLVAS's ability to provide reliable and objective metrics, as well as visualizations, for assisted clinical diagnosis.
💡 Research Summary
This paper introduces the Multimodal Laryngoscopic Video Analyzing System (MLVAS), a comprehensive framework that jointly exploits audio and video streams from laryngeal videostroboscopic examinations to automate the detection of vocal fold paralysis (VFP). The authors identify two major shortcomings in prior work: (1) reliance on either visual or acoustic data alone, which limits diagnostic precision, and (2) the need for manual selection of phonation cycles from long raw recordings, which is time‑consuming for clinicians.
MLVAS addresses these issues through a three‑stage pipeline. First, an audio keyword spotting (KWS) module processes the extracted audio using short‑time Fourier transform (STFT) to generate spectrograms. A lightweight convolutional network with residual blocks (detailed in Table 1) scans spectrogram chunks with a sliding‑window approach to detect the target phoneme “/E:/”, which corresponds to the patient’s sustained vowel during endoscopic examination. The resulting binary mask pinpoints temporal regions where the patient is vocalizing.
Second, the video module refines these temporal masks. A YOLO‑v5 based vocal‑fold detector, pre‑trained on the public BAGLS dataset, localizes the glottal region in each frame. An initial segmentation mask is produced by a U‑Net architecture, then a diffusion‑based refinement step is applied to suppress false positives and improve mask smoothness. This two‑step segmentation achieves a higher Intersection‑over‑Union (IoU) than a standalone U‑Net, as demonstrated on a public glottis segmentation benchmark.
Third, feature extraction proceeds on the refined segments. Visual features are derived by measuring the angular deviation of the left and right vocal folds relative to an estimated glottal midline, yielding Left‑Vocal‑Fold‑Dynamics (LVFDyn) and Right‑Vocal‑Fold‑Dynamics (RVFDyn). These angles are computed frame‑wise and aggregated into time‑series dynamics that capture both static separation and temporal motion patterns, enabling discrimination of unilateral (left‑ vs. right‑sided) paralysis.
Audio features are obtained from Dasheng, a state‑of‑the‑art self‑supervised audio encoder built on a Masked Audio Encoder (MAE) paradigm and trained on four large‑scale speech, event, and music corpora. By fine‑tuning Dasheng on the limited clinical dataset, the system extracts robust high‑dimensional embeddings without over‑fitting.
The multimodal classifier concatenates the audio embeddings with the visual angle and dynamics vectors and feeds them into either a multilayer perceptron or a lightweight transformer. Experimental evaluation on a real‑world clinical dataset shows that the multimodal model outperforms single‑modality baselines: accuracy improves by 4–7 percentage points, and the F1 score for unilateral VFP reaches 93.5 %. Ablation studies confirm that removing any component—KWS, diffusion refinement, or audio features—degrades performance, underscoring the complementary nature of the modalities.
Beyond quantitative results, MLVAS provides a visualization interface that displays the extracted key video highlights together with plots of LVFDyn and RVFDyn over time, giving clinicians an objective, interpretable view of vocal‑fold behavior. This addresses the subjectivity inherent in traditional endoscopic assessment and reduces the manual workload associated with frame selection.
The authors acknowledge limitations, including the relatively small and demographically narrow dataset and the need for further optimization for real‑time deployment. Nonetheless, the proposed system demonstrates that integrating pre‑trained audio encoders, diffusion‑enhanced segmentation, and bilateral vocal‑fold dynamics yields a powerful, clinically relevant tool for assisted diagnosis of vocal‑fold paralysis.