Clinical-Prior Guided Multi-Modal Learning with Latent Attention Pooling for Gait-Based Scoliosis Screening
Adolescent Idiopathic Scoliosis (AIS) is a prevalent spinal deformity whose progression can be mitigated through early detection. Conventional screening methods are often subjective, difficult to scale, and reliant on specialized clinical expertise. Video-based gait analysis offers a promising alternative, but current datasets and methods frequently suffer from data leakage, where performance is inflated by repeated clips from the same individual, or employ oversimplified models that lack clinical interpretability. To address these limitations, we introduce ScoliGait, a new benchmark dataset comprising 1,572 gait video clips for training and 300 fully independent clips for testing. Each clip is annotated with radiographic Cobb angles and descriptive text based on clinical kinematic priors. We propose a multi-modal framework that integrates a clinical-prior-guided kinematic knowledge map for interpretable feature representation, alongside a latent attention pooling mechanism to fuse video, text, and knowledge map modalities. Our approach establishes a new state of the art, showing a significant performance gain on a realistic, subject-independent benchmark. This work provides a robust, interpretable, and clinically grounded foundation for scalable, non-invasive AIS assessment.
💡 Research Summary
Adolescent idiopathic scoliosis (AIS) is a prevalent spinal deformity that benefits greatly from early detection, yet current screening methods—forward bending tests and scoliometers—are subjective, labor‑intensive, and difficult to scale. Video‑based gait analysis has emerged as a promising non‑invasive alternative, but existing works suffer from three major shortcomings: (1) data leakage, where multiple clips from the same subject appear in both training and test sets, inflating performance; (2) black‑box deep models that provide little clinical insight; and (3) simplistic multimodal fusion that fails to capture the nuanced relationships between gait biomechanics and auxiliary clinical information.
To overcome these issues, the authors present three core contributions. First, they introduce ScoliGait, a new benchmark dataset specifically designed to eliminate data leakage. The dataset contains 1,572 training clips derived from 550 adolescents and a held‑out test set of 300 clips, each belonging to a unique individual unseen during training. Every clip is annotated with a radiographically measured Cobb angle (the clinical gold standard) and a descriptive text prompt generated from statistical analysis of clinically relevant kinematic priors. The labels were verified by senior orthopaedic specialists, ensuring high reliability.
Second, they construct a clinical‑prior‑guided kinematic knowledge map. This structured representation comprises 238 features across three domains: motion space (140 features such as joint trajectories), self‑skeleton space (32 features like inter‑joint distances), and signal cross‑correlation (66 features capturing temporal coordination such as limb synchronization). During inference, the model’s attention scores are projected back onto this map, producing a transparent, time‑resolved heatmap that directly corresponds to clinically meaningful variables (e.g., asymmetric arm swing, pelvic tilt). This mapping transforms the model’s internal reasoning into a human‑readable dictionary, bridging the gap between algorithmic decisions and clinical interpretation.
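The three feature domains above can be sketched as a single frame-by-feature matrix that an image-style encoder can then patchify. This is a minimal illustrative sketch, not the paper's implementation: the joint count, window, and feature definitions (and therefore the 140/32/66 split) are assumptions for the demo.

```python
import numpy as np

def build_knowledge_map(joints: np.ndarray) -> np.ndarray:
    """joints: (T, J, 2) array of 2D joint trajectories over T frames.
    Returns a (T, F) knowledge map stacking the three domains."""
    T, J, _ = joints.shape
    # Motion space: raw joint trajectories, flattened per frame.
    motion = joints.reshape(T, J * 2)
    # Self-skeleton space: pairwise inter-joint distances per frame.
    pairs = [(i, j) for i in range(J) for j in range(i + 1, J)]
    skeleton = np.stack(
        [np.linalg.norm(joints[:, i] - joints[:, j], axis=-1) for i, j in pairs],
        axis=1,
    )
    # Signal cross-correlation: per-pair correlation of vertical trajectories
    # (a crude proxy for limb synchronization), repeated along the frame axis
    # so all domains share the same time dimension.
    y = joints[..., 1]                      # (T, J) vertical coordinates
    yc = y - y.mean(axis=0)
    denom = np.linalg.norm(yc, axis=0)
    corr = np.array([yc[:, i] @ yc[:, j] / (denom[i] * denom[j] + 1e-8)
                     for i, j in pairs])
    cross = np.tile(corr, (T, 1))
    return np.concatenate([motion, skeleton, cross], axis=1)

km = build_knowledge_map(np.random.default_rng(0).normal(size=(60, 6, 2)))
print(km.shape)  # (60, 42): 12 motion + 15 skeleton + 15 cross-correlation
```

Because every column corresponds to a named kinematic variable, attention weights over this matrix can later be read off directly as clinical features.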
Third, the authors propose a latent attention pooling (LAP) mechanism for multimodal fusion. Separate encoders process each modality: a Vision Transformer (ViT) for video frames, another ViT for the knowledge map (with a different patch embedding strategy), and a MiniLM‑based Sentence‑Transformer for the textual prompts. A learnable set of latent tokens serves as a dictionary; each token attends to the concatenated multimodal sequence via cross‑attention, effectively summarizing the most expressive aspects of the input. Compared with standard average pooling or simple concatenation, LAP yields richer embeddings while keeping parameter count modest.
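The pooling step can be sketched as cross-attention from a small learnable query dictionary onto the concatenated token sequence. The sketch below assumes single-head attention and illustrative token counts and dimensions; the paper's exact configuration may differ.

```python
import numpy as np

def latent_attention_pool(tokens: np.ndarray, latents: np.ndarray) -> np.ndarray:
    """tokens: (N, d) concatenated video/knowledge-map/text tokens;
    latents: (K, d) learnable query dictionary. Returns (K, d) summaries."""
    d = tokens.shape[-1]
    scores = latents @ tokens.T / np.sqrt(d)            # (K, N) similarities
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)            # row-wise softmax
    return attn @ tokens                                # attention-weighted pooling

rng = np.random.default_rng(0)
video = rng.normal(size=(16, 64))   # hypothetical ViT video tokens
kmap = rng.normal(size=(16, 64))    # hypothetical knowledge-map tokens
text = rng.normal(size=(4, 64))     # hypothetical sentence-embedding tokens
seq = np.concatenate([video, kmap, text], axis=0)       # (36, 64) sequence
pooled = latent_attention_pool(seq, rng.normal(size=(8, 64)))
print(pooled.shape)  # (8, 64)
```

In contrast to average pooling, which collapses the sequence to one mean vector, each latent token can specialize in a different aspect of the multimodal input, while the parameter cost is only the K x d latent matrix (plus projection layers in a full implementation).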
Experimental results focus on binary AIS screening (Cobb angle ≥ 10°). In single‑modality experiments, the knowledge‑map model outperforms the raw‑video model, achieving 1.7 % higher accuracy and 3.2 % higher F1, demonstrating that structured clinical features are more discriminative than raw pixels. Multimodal fusion further boosts performance: knowledge‑map + video improves accuracy from 0.64 to 0.70 and F1 from 0.38 to 0.44; the full three‑modality model (knowledge‑map + video + text) with LAP attains the best results (accuracy = 70.0 %, overall F1 = 61.9 %).
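The screening task itself reduces to thresholding the Cobb angle at 10 degrees and scoring the binary predictions. The angles and predictions below are made-up demo values, not results from the paper.

```python
import numpy as np

def screening_labels(cobb_deg: np.ndarray, threshold: float = 10.0) -> np.ndarray:
    """Binary AIS screening label: 1 if Cobb angle >= threshold (degrees)."""
    return (cobb_deg >= threshold).astype(int)

def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Positive-class F1 from scratch (harmonic mean of precision/recall)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

y_true = screening_labels(np.array([4.0, 12.5, 22.0, 8.0]))
print(y_true.tolist())                           # [0, 1, 1, 0]
print(f1_score(y_true, np.array([0, 1, 0, 0])))  # precision 1.0, recall 0.5 -> 0.666...
```

Because scoliosis-positive cases are the minority class, the positive-class F1 and recall reported above are far more informative than accuracy alone.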
When compared against the state‑of‑the‑art ScoNet‑MT (trained on the older Scoliosis1K dataset), the proposed method matches overall accuracy but dramatically improves positive‑class recall (82 % vs. < 10 % for ScoNet‑MT), a critical factor for clinical deployment.
Interpretability analyses show that attention maps over the knowledge map provide fine‑grained, temporally aligned insights into which kinematic variables drive the decision at each gait cycle. For example, a high attention weight on “asymmetric arm swing in thoracic curve types” aligns with established physical examination findings, allowing clinicians to validate the AI’s reasoning directly. This level of transparency surpasses conventional post‑hoc saliency maps that only highlight image regions without clinical context.
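Mechanically, this projection amounts to slicing the knowledge-map portion out of the latent-to-token attention matrix and reshaping it back onto the map's patch grid. The sketch below assumes row-major patchification and an illustrative grid size; it is not the paper's code.

```python
import numpy as np

def attention_heatmap(attn: np.ndarray, km_slice: slice, grid: tuple) -> np.ndarray:
    """attn: (K, N) latent-to-token attention weights; km_slice: positions of
    the knowledge-map tokens in the concatenated sequence; grid: (rows, cols)
    patch grid of the map. Returns a (rows, cols) time-by-feature heatmap."""
    km_weights = attn[:, km_slice].mean(axis=0)  # average over latent tokens
    return km_weights.reshape(grid)              # back onto the patch grid

rng = np.random.default_rng(0)
attn = rng.random((8, 36))                       # 8 latents over 36 tokens
attn /= attn.sum(axis=-1, keepdims=True)         # rows sum to 1, like softmax
heat = attention_heatmap(attn, slice(16, 32), (4, 4))
print(heat.shape)  # (4, 4)
```

Each cell of the resulting heatmap names a specific kinematic variable at a specific phase of the gait cycle, which is what lets a clinician check the model's stated evidence against a physical examination.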
Ablation studies confirm the superiority of the LAP fusion (Cat+Latent) over simple concatenation (Cat) and concatenation with a standard attention layer (Cat+Att). Moreover, aligning positional embeddings between video and knowledge‑map streams yields additional gains, underscoring the importance of temporal synchronization across modalities.
In summary, the paper delivers a comprehensive solution for video‑based AIS screening: a leakage‑free, radiographically validated dataset; a clinically grounded kinematic knowledge map that renders model decisions interpretable; and a novel latent attention pooling strategy that efficiently fuses heterogeneous data. The approach not only raises screening accuracy but also provides actionable explanations, making it a viable candidate for real‑world, large‑scale deployment on mobile devices. Future work may explore model compression for on‑device inference and extension of the framework to other musculoskeletal disorders where gait abnormalities are diagnostic cues.