Explainable Parkinson's Disease Gait Recognition Using Multimodal RGB-D Fusion and Large Language Models

Reading time: 6 minutes

📝 Abstract

Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinson's disease (PD), yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatial-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning compared with single-input baselines. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinson's disease gait analysis. Code: https://github.com/manaralnaasan/RGB-D_parkinson-LLM

📄 Content

Parkinson’s disease (PD) is a progressive neurodegenerative disorder that affects motor control, resulting in distinctive gait abnormalities such as reduced arm swing, short stride length, forward trunk flexion, and turning difficulties. Early detection of these gait-related symptoms is vital for timely clinical intervention and monitoring disease progression.

In recent years, computer vision-based gait analysis has emerged as a promising tool for automated PD screening.

However, most existing approaches rely primarily on single-modality inputs, such as RGB video, silhouettes, or inertial sensor data, which often fail to capture the full complexity of human motion, especially under real-world environmental conditions. Despite notable progress in deep learning-based gait recognition, two major challenges remain. First, the use of single-modality visual cues limits robustness against variations in lighting, clothing, and viewpoint, reducing generalization to clinical or daily settings. Second, many current models operate as black boxes, providing accurate predictions but lacking interpretability, an essential requirement for medical applications where clinical trust and reasoning transparency are critical. Consequently, there is a growing need for gait recognition systems that are not only accurate and robust but also capable of offering human-understandable explanations of their decisions.

To address these limitations, this study proposes an explainable multimodal Parkinson’s disease gait recognition framework that integrates RGB and Depth (RGB-D) information to enhance spatial-temporal feature representation.

The proposed architecture employs dual YOLOv11-based encoders to extract complementary modality-specific features, followed by a new feature-aware Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to model both fine-grained limb movements and overall gait patterns. To achieve interpretability, a frozen Large Language Model (LLM) is incorporated to convert fused visual embeddings and structured metadata (e.g., detected gait abnormalities) into textual clinical explanations, thereby bridging the gap between machine perception and medical reasoning.
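The dual-encoder pipeline described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the class names (`ModalityEncoder`, `RGBDGaitModel`) are hypothetical, the backbones are simple convolutional stand-ins for the YOLOv11 encoders, and the fusion step is plain concatenation rather than the paper's Cross-Spatial Neck Fusion.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Stand-in for a YOLOv11-style backbone: conv stem + global pooling."""
    def __init__(self, in_ch, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.net(x).flatten(1)  # (B, feat_dim)

class RGBDGaitModel(nn.Module):
    """Two modality-specific streams fused into one gait representation."""
    def __init__(self, feat_dim=64, n_classes=2):
        super().__init__()
        self.rgb_enc = ModalityEncoder(3, feat_dim)    # RGB stream
        self.depth_enc = ModalityEncoder(1, feat_dim)  # Depth stream
        # Simplified "neck fusion": concatenation + linear projection.
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, rgb, depth):
        z = torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=1)
        z = torch.relu(self.fuse(z))
        # Return logits and the fused embedding that would feed the LLM stage.
        return self.head(z), z

model = RGBDGaitModel()
rgb = torch.randn(2, 3, 128, 128)
depth = torch.randn(2, 1, 128, 128)
logits, embedding = model(rgb, depth)
print(logits.shape, embedding.shape)  # torch.Size([2, 2]) torch.Size([2, 64])
```

Keeping the two streams separate until the neck lets each encoder specialize in its modality's statistics (texture for RGB, geometry for depth) before fusion.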

Extensive experiments on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves superior accuracy and robustness compared to RGB-only or silhouette-based baselines. Moreover, the integration of the LLM enables clinically meaningful, explainable reporting of Parkinsonian gait subtypes, such as reduced arm swing or short stride, directly from visual evidence.
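The LLM reporting stage serializes the model's structured outputs into a textual prompt. The snippet below is a hypothetical sketch of that serialization step; the function name, metadata keys, and prompt wording are illustrative assumptions, not the paper's actual reporting pipeline.

```python
def build_clinical_prompt(metadata):
    """Serialize detected gait findings into a prompt for a frozen LLM.
    `metadata` is an assumed dict with 'abnormalities' and 'speed_mps' keys."""
    findings = ", ".join(metadata["abnormalities"]) or "no gait abnormalities"
    return (
        f"Patient gait analysis. Detected findings: {findings}. "
        f"Walking speed: {metadata['speed_mps']:.2f} m/s. "
        "Write a brief clinical interpretation of these Parkinsonian gait signs."
    )

prompt = build_clinical_prompt(
    {"abnormalities": ["reduced arm swing", "short stride"], "speed_mps": 0.82}
)
print(prompt)
```

Because the LLM is frozen, all task adaptation happens in this prompt-construction step and in the visual-embedding projection, which avoids fine-tuning costs and preserves the LLM's general clinical language ability.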

The key contributions of this work are summarized as follows:

• A novel multimodal RGB-D gait recognition framework is introduced, designed specifically for enhanced analysis of Parkinsonian gait under realistic and challenging conditions. To the best of our knowledge, it is the first framework to jointly exploit RGB-D fusion for explainable Parkinson’s gait assessment.

• A cross-spatial Neck Fusion mechanism and a Multi-Scale Local-Global Extraction (MLGE) module are proposed, enabling robust capture of both fine-grained local motion cues and global gait dynamics, even under occlusion, illumination changes, and variability in walking behavior.

• This study represents the first adaptation of state-of-the-art medical report generation, traditionally dominated by radiology, to the dynamic domain of gait analysis. By aligning RGB-D fusion features with a frozen LLM-based clinical reporting pipeline, the framework produces coherent, clinically meaningful explanations directly from multimodal gait representations.

• Extensive experiments demonstrate superior robustness, accuracy, and interpretability, with the proposed system outperforming unimodal baselines and existing multimodal methods across diverse evaluation settings, especially in visually challenging or label-limited scenarios. By unifying visual recognition and language reasoning, this research establishes a new paradigm for explainable Parkinson’s disease gait analysis, paving the way toward transparent and clinically actionable AI systems in neurodegenerative disease monitoring.
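The local-global idea behind the MLGE module in the second contribution can be sketched as parallel convolutions at different kernel sizes (local limb cues) combined with a broadcast global-pooling branch (whole-body gait context). This is an assumed, simplified reading of the module, not its published design.

```python
import torch
import torch.nn as nn

class MLGE(nn.Module):
    """Illustrative multi-scale local-global block (hypothetical design):
    small and large convolutions capture local motion detail at two scales,
    while a globally pooled branch injects whole-body context at every pixel."""
    def __init__(self, ch):
        super().__init__()
        self.local3 = nn.Conv2d(ch, ch, 3, padding=1)   # fine-scale local cues
        self.local5 = nn.Conv2d(ch, ch, 5, padding=2)   # coarser local cues
        self.global_fc = nn.Linear(ch, ch)              # global context branch
        self.proj = nn.Conv2d(3 * ch, ch, 1)            # fuse back to ch

    def forward(self, x):
        b, c, h, w = x.shape
        g = self.global_fc(x.mean(dim=(2, 3)))          # (B, C) global summary
        g = g.view(b, c, 1, 1).expand(-1, -1, h, w)     # broadcast over space
        out = torch.cat([self.local3(x), self.local5(x), g], dim=1)
        return torch.relu(self.proj(out))               # same shape as input

x = torch.randn(2, 16, 32, 32)
y = MLGE(16)(x)
print(y.shape)  # torch.Size([2, 16, 32, 32])
```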

Gait recognition has long served as a core biometric technology, and its methodologies are increasingly being adapted for clinical movement analysis. Over the past decade, the field has evolved from purely 2D appearance-based representations to more robust 3D and multimodal formulations. Early work relied heavily on RGB silhouettes, with the Gait Energy Image (GEI) [1], a compact spatio-temporal template obtained by averaging silhouettes across a gait cycle. Although these template-based approaches proved influential, they remained sensitive to covariates. Later studies reframed gait as an unordered set of silhouettes [2], enabling more flexible aggregation strategies such as horizontal pyramid pooling. Further improvements came from local detail extraction [3], which introduced part-based feature ma
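The GEI template mentioned above is simply the pixel-wise mean of aligned binary silhouettes over one gait cycle, which can be computed in a few lines of NumPy (function name and frame sizes here are illustrative):

```python
import numpy as np

def gait_energy_image(silhouettes):
    """GEI: pixel-wise mean of aligned binary silhouettes over a gait cycle.
    `silhouettes` is a list of (H, W) arrays with values in {0, 1}."""
    frames = np.stack(silhouettes).astype(np.float32)  # (T, H, W)
    return frames.mean(axis=0)                         # (H, W), values in [0, 1]

# Toy two-frame cycle: a pixel covered in half the frames gets intensity 0.5.
cycle = [np.zeros((64, 44)), np.ones((64, 44))]
gei = gait_energy_image(cycle)
print(gei.shape, gei.min(), gei.max())  # (64, 44) 0.5 0.5
```

Bright GEI regions mark body parts that are static across the cycle (torso, head), while mid-intensity regions trace limb motion, which is exactly the averaging-induced covariate sensitivity the later set-based methods [2] sought to avoid.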

This content is AI-processed based on ArXiv data.
