Dynamic Facial Expressions Analysis Based Parkinson's Disease Auxiliary Diagnosis


Parkinson’s disease (PD), a prevalent neurodegenerative disorder, significantly affects patients’ daily functioning and social interactions. To provide a more efficient and accessible diagnostic approach for PD, we propose an auxiliary diagnosis method based on dynamic facial expression analysis. The method targets hypomimia, a characteristic clinical symptom of PD, through two of its manifestations: reduced facial expressivity and facial rigidity. We develop a multimodal facial expression analysis network to extract expression intensity features while patients perform various facial expressions. This network leverages the CLIP architecture to integrate visual and textual features while preserving the temporal dynamics of facial expressions. The expression intensity features are then processed and fed into an LSTM-based classification network for PD diagnosis. Our method achieves an accuracy of 93.1%, outperforming other in-vitro PD diagnostic approaches. This technique offers a more convenient detection option for potential PD patients and improves their diagnostic experience.


💡 Research Summary

This paper presents a novel auxiliary diagnostic method for Parkinson’s disease (PD) based on the analysis of dynamic facial expressions. The research focuses on hypomimia, a characteristic clinical symptom of PD that manifests as both reduced facial expressivity and facial muscle rigidity. The core premise is that analyzing videos of patients performing facial expressions provides a more comprehensive assessment of hypomimia compared to static image analysis, capturing the dimension of rigidity and impaired movement dynamics.

The proposed framework operates in two main stages. First, a multimodal dynamic facial expression analysis network extracts expression intensity features. This network is built upon the CLIP (Contrastive Language-Image Pre-training) architecture. The visual processing stream samples frames from input videos, extracts features using a Vision Transformer (ViT) enhanced with a Multi-Head Self-Attention (MHSA) module, and then employs a Transformer-based temporal model to preserve sequential information. The textual processing stream uses descriptive prompts (e.g., “a face with a happy expression”) rather than simple class labels, fed into a text encoder. The visual and textual features are fused via cosine similarity to compute a final expression intensity score for each frame sequence.
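The visual–textual fusion step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Transformer-based temporal model is replaced by simple mean pooling, and the feature dimensions, the number of prompts, and the softmax normalization over expression classes are all assumptions.

```python
import numpy as np

def expression_intensity(frame_features, text_features):
    """CLIP-style fusion: cosine similarity between a pooled video embedding
    and one text embedding per descriptive prompt.

    frame_features: (T, D) per-frame visual embeddings (shapes assumed)
    text_features:  (K, D) one embedding per prompt, e.g. K=4 expressions
    Returns a (K,) vector of softmax-normalized intensity scores.
    """
    # Temporal pooling: a simplified stand-in for the paper's
    # Transformer-based temporal model.
    video = frame_features.mean(axis=0)
    video = video / np.linalg.norm(video)
    text = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    sims = text @ video                 # cosine similarity per prompt
    exp = np.exp(sims - sims.max())     # softmax over expression classes
    return exp / exp.sum()
```

With descriptive prompts such as "a face with a happy expression" encoded into `text_features`, each video thus yields one intensity score per expression class.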

The second stage is the PD classification network. For diagnosis, each subject is asked to perform four basic expressions: neutral, happiness, surprise, and anger, resulting in four videos. The expression analysis network processes these, yielding a 16-dimensional feature vector (4 intensities per video). A key innovation is a dedicated data processing step applied to these raw intensity features. The features are grouped by the input expression label, and for each group, eight statistical measures are calculated: Highlight Value (the intensity corresponding to the intended expression), Mean, Standard Deviation, Z-Score, Percentage Difference, Range, Difference from Minimum, and Difference from Maximum. This processing amplifies the subtle differences between PD patients and healthy controls (HC). The processed features are then fed into a classification network based on Long Short-Term Memory (LSTM) with a residual structure to mitigate gradient vanishing, which outputs the final PD/HC diagnosis.
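The statistical processing of one expression group (the four intensity scores produced for a single video) can be sketched as below. The eight measure names come from the paper, but the exact formulas, in particular for Z-Score and Percentage Difference, are assumptions for illustration.

```python
import numpy as np

def process_group(intensities, target_idx):
    """Compute the eight statistics for one expression video.

    intensities: four intensity scores (one per expression class)
    target_idx:  index of the expression the subject was asked to perform
    Formula details beyond the measure names are assumptions.
    """
    x = np.asarray(intensities, dtype=float)
    hi = x[target_idx]                  # Highlight Value: intended expression
    mean = x.mean()
    std = x.std()
    z = (hi - mean) / std if std > 0 else 0.0            # Z-Score of highlight
    pct = (hi - mean) / mean * 100.0 if mean != 0 else 0.0  # Percentage Difference
    rng_ = x.max() - x.min()            # Range
    d_min = hi - x.min()                # Difference from Minimum
    d_max = hi - x.max()                # Difference from Maximum
    return [hi, mean, std, z, pct, rng_, d_min, d_max]
```

Applied to all four expression groups, this turns the raw 16-dimensional intensity vector into 4 × 8 processed features for the classifier.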

The models were trained on the DFEW (Dynamic Facial Expressions in-the-Wild) dataset and evaluated on a custom-collected PD-FEV (Parkinson’s Disease Facial Expression Videos) dataset, created in collaboration with a hospital, containing videos from 97 PD patients and 76 HC subjects. Experiments showed that the expression analysis network itself achieved lower recognition accuracy on PD patients (58.2%) compared to HC (80.9%), quantitatively confirming their impaired expressivity. Visualization via t-SNE plots revealed significant feature overlap, especially between neutral and angry expressions in the PD group, indicating their difficulty in producing distinct expressions.

Ablation studies demonstrated the critical importance of the data processing step, which significantly boosted the performance of sequential models like LSTM and GRU. The final proposed system—combining the CLIP-based dynamic expression analyzer, the statistical feature processing, and the residual LSTM classifier—achieved a diagnostic accuracy of 93.1% using five-fold cross-validation, outperforming other classifiers such as SVM, Random Forest, and GRU.
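A minimal PyTorch sketch of such a residual LSTM classifier is shown below. The layer sizes and the sequence layout (four expression groups presented as a length-4 sequence of eight statistics each) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualLSTMClassifier(nn.Module):
    """Residual LSTM for PD/HC classification (hypothetical sizes)."""

    def __init__(self, feat_dim=8, hidden=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # logits for PD vs HC

    def forward(self, x):
        # x: (batch, 4, feat_dim) — one statistics vector per expression group
        h = self.proj(x)
        out, _ = self.lstm(h)
        out = out + h                     # residual skip to ease gradient flow
        return self.head(out[:, -1])      # classify from the last time step
```

The skip connection around the LSTM is one plausible reading of "LSTM with a residual structure to mitigate gradient vanishing"; the paper may place the residual differently.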

In conclusion, this work establishes an effective, non-invasive, and quantitative framework for PD auxiliary diagnosis by dynamically analyzing facial expressions. By targeting hypomimia through a multimodal video analysis approach, it captures both the reduction and rigidity aspects of the symptom. The method’s reliance on smartphone-capturable video data positions it as a highly accessible tool for potential early screening and remote diagnosis, potentially improving diagnostic experiences and enabling earlier intervention.

