A generalizable large-scale foundation model for musculoskeletal radiographs

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Artificial intelligence (AI) has shown promise in detecting and characterizing musculoskeletal diseases from radiographs. However, most existing models remain task-specific, annotation-dependent, and limited in generalizability across diseases and anatomical regions. Although a generalizable foundation model trained on large-scale musculoskeletal radiographs is clinically needed, publicly available datasets remain limited in size and lack sufficient diversity to enable training across a wide range of musculoskeletal conditions and anatomical sites. Here, we present SKELEX, a large-scale foundation model for musculoskeletal radiographs, trained using self-supervised learning on 1.2 million diverse, condition-rich images. The model was evaluated on 12 downstream diagnostic tasks and generally outperformed baselines in fracture detection, osteoarthritis grading, and bone tumor classification. Furthermore, SKELEX demonstrated zero-shot abnormality localization, producing error maps that identified pathologic regions without task-specific training. Building on this capability, we developed an interpretable, region-guided model for predicting bone tumors, which maintained robust performance on independent external datasets and was deployed as a publicly accessible web application. Overall, SKELEX provides a scalable, label-efficient, and generalizable AI framework for musculoskeletal imaging, establishing a foundation for both clinical translation and data-efficient research in musculoskeletal radiology.


💡 Research Summary

This paper introduces SKELEX, the first large‑scale foundation model specifically designed for musculoskeletal (MSK) radiographs. The authors assembled an unprecedented dataset of 1,296,540 unlabeled X‑ray images collected from Seoul National University Hospital, covering 15 anatomical regions and 89 distinct MSK conditions. Using a two‑stage self‑supervised learning pipeline, they first initialized a masked autoencoder (MAE) with ImageNet‑1K weights and then fine‑tuned it on the MSK dataset with a high masking ratio, forcing the network to reconstruct missing patches. This training strategy enables the model to learn both low‑level visual features and high‑level pathological cues directly from clinical data, without any manual annotations.
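The masked-autoencoder pretraining step described above can be sketched in a few lines. This is a minimal numpy illustration of the masking-and-reconstruction objective, not the authors' implementation: the 16-pixel patch size and 75% masking ratio are assumptions chosen to match typical MAE setups, and the encoder/decoder themselves are omitted.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W) image into non-overlapping p x p patches, flattened."""
    h, w = img.shape
    return img.reshape(h // p, p, w // p, p).swapaxes(1, 2).reshape(-1, p * p)

def random_mask(n_patches, ratio, rng):
    """Randomly choose which patches the encoder sees vs. must reconstruct."""
    n_mask = int(n_patches * ratio)
    idx = rng.permutation(n_patches)
    return idx[n_mask:], idx[:n_mask]   # (visible, masked)

def reconstruction_loss(pred, target, masked_idx):
    """MAE-style objective: mean-squared error on the masked patches only."""
    diff = pred[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))

# Toy radiograph-sized input: 224 x 224 -> 196 patches of 256 pixels each.
rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224))
patches = patchify(img, 16)
vis, msk = random_mask(len(patches), 0.75, rng)
loss = reconstruction_loss(np.zeros_like(patches), patches, msk)
```

With a high masking ratio, the encoder sees only a quarter of the image, which is what forces it to learn anatomy-level structure rather than local texture copying.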

The pretrained backbone was evaluated on twelve downstream diagnostic tasks spanning fracture detection (FracAtlas), bone tumor classification (BTXRD), osteoarthritis grading (OAI), pes planus identification, and several other conditions, using seven publicly available datasets. SKELEX consistently outperformed two strong baselines, a ResNet-101 pretrained on ImageNet-1K and a ViT-L pretrained on ImageNet-21K, by an average of 2.40% (vs. ResNet) and 3.89% (vs. ViT) in AUROC. Notably, in bone-tumor classification SKELEX achieved an AUROC of 0.954, well above the roughly 0.90 scores of the baselines. Performance remained robust across anatomical sites and disease subtypes, even for rare categories where the baseline models faltered.

A key contribution is the demonstration of label efficiency. By progressively reducing the amount of annotated data, the authors showed that SKELEX can reach baseline performance with only 10 % of the original training labels, highlighting its ability to learn effectively from limited supervision.

Beyond classification, the model can generate zero‑shot abnormality localization maps. Because the MAE learns to reconstruct masked regions, the pixel‑wise difference between the reconstructed and original image highlights areas that deviate from learned normal anatomy. These “error maps” were produced without any additional training and successfully highlighted fractures, tumors, and osteoarthritic changes across multiple datasets, with statistically higher mean error values for abnormal images.
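The error-map idea reduces to a pixel-wise residual between the input and its reconstruction. A minimal numpy sketch follows; the near-perfect reconstruction and the injected bright patch are synthetic stand-ins for a frozen MAE's output and a focal lesion, not real model outputs:

```python
import numpy as np

def error_map(original, reconstruction):
    """Pixel-wise absolute residual; large values flag regions the model
    could not reproduce from its learned notion of normal anatomy."""
    return np.abs(original.astype(float) - reconstruction.astype(float))

def mean_error(original, reconstruction):
    """Scalar abnormality score: mean residual over the whole image."""
    return float(error_map(original, reconstruction).mean())

# Toy illustration: a "normal" image the model reconstructs well, plus the
# same image with a local perturbation the reconstruction cannot explain.
rng = np.random.default_rng(1)
normal = rng.random((64, 64))
recon = normal + rng.normal(0.0, 0.01, (64, 64))  # near-perfect reconstruction
abnormal = normal.copy()
abnormal[20:30, 20:30] += 0.5                     # simulated focal abnormality
```

The statistical finding reported above corresponds to the scalar score: abnormal images yield a higher mean residual, while the 2-D map localizes where the deviation occurs.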

Building on this capability, the authors devised a region‑guided, multi‑head framework for bone‑tumor diagnosis. The system first detects anatomical regions (e.g., femur, pelvis) and then applies region‑specific classifiers that simultaneously predict the presence of tumors, fractures, and implants. A label‑masking strategy was employed to handle missing annotations in public datasets. The multi‑head model achieved AUROC = 0.998 for region detection and >0.95 for each abnormality class in five‑fold cross‑validation, and it generalized well to external datasets (BTXRD‑Center 2/3, Radiopaedia, MedPix). Importantly, the model inferred detailed findings (e.g., fractures) that were absent from the original dataset labels, and these inferences aligned closely with expert orthopedic notes, demonstrating interpretability and clinical relevance.
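The label-masking strategy (compute the classification loss only where a dataset actually provides an annotation) can be sketched as follows. The three heads and the NaN-for-missing convention are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def masked_bce(probs, labels):
    """Binary cross-entropy averaged over labeled entries only.

    `labels` uses NaN to mark annotations a given dataset does not provide,
    so unsupervised heads contribute nothing to the loss or its gradient.
    """
    valid = ~np.isnan(labels)
    p = np.clip(probs[valid], 1e-7, 1 - 1e-7)
    y = labels[valid]
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

# Heads: [tumor, fracture, implant]; this dataset labels only tumors.
probs = np.array([0.9, 0.3, 0.5])
labels = np.array([1.0, np.nan, np.nan])
loss = masked_bce(probs, labels)   # depends only on the tumor head
```

Because the unlabeled heads are excluded rather than treated as negatives, the model can still learn fracture and implant heads from other datasets that do annotate them.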

To showcase translational potential, the authors deployed a web‑based interface where users can upload an X‑ray and receive probability scores for bone tumors and fractures together with the corresponding error maps. This prototype illustrates how a foundation model can be turned into an accessible diagnostic aid.
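The shape of such a service can be sketched with a stdlib-only WSGI endpoint. Everything here is hypothetical: the `predict` stub returns fixed placeholder scores where the deployed application would run SKELEX and attach error maps.

```python
import io
import json

def predict(image_bytes):
    """Hypothetical stub for the deployed model; the real service would run
    SKELEX on the uploaded radiograph. Scores here are placeholders."""
    return {"bone_tumor": 0.12, "fracture": 0.05}

def app(environ, start_response):
    """Minimal WSGI endpoint: POST an X-ray, receive JSON probability scores."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(size)
    payload = json.dumps(predict(body)).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [payload]

# Exercise the endpoint in-process, without starting a server.
environ = {"CONTENT_LENGTH": "4", "wsgi.input": io.BytesIO(b"xray")}
statuses = []
response = b"".join(app(environ, lambda s, h: statuses.append(s)))
scores = json.loads(response)
```

Returning per-condition probabilities alongside the error map keeps the tool interpretable: the clinician sees both the score and the region that drove it.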

In discussion, the authors argue that domain‑specific self‑supervised pretraining on massive, uncurated clinical data yields representations that are more suitable for MSK imaging than transferring from natural‑image models. They acknowledge limitations: the primary dataset originates from a single Korean institution, external validation on more diverse international cohorts is needed, and hyper‑parameter choices (masking ratio, loss weighting) were not exhaustively explored. Future work should address multi‑center validation, fine‑tune masking strategies, and investigate integration with other imaging modalities.

Overall, SKELEX demonstrates that a large‑scale, self‑supervised foundation model can achieve state‑of‑the‑art performance across a wide spectrum of musculoskeletal radiographic tasks, provide zero‑shot localization, operate efficiently with scarce labels, and be packaged into a usable clinical tool. This work paves the way for broader adoption of foundation‑model paradigms in orthopedic imaging and beyond.

