DINO-LG: Enhancing Vision Transformers with Label Guidance for Coronary Artery Calcium Detection
Coronary artery disease (CAD), one of the leading causes of mortality worldwide, necessitates effective risk assessment strategies, with coronary artery calcium (CAC) scoring via computed tomography (CT) being a key method for prevention. Traditional methods, primarily based on U-Net architectures implemented on pre-built models, face challenges such as the scarcity of annotated CT scans containing CAC and imbalanced datasets, leading to reduced performance in segmentation and scoring tasks. In this study, we address these limitations by introducing DINO-LG, a novel label-guided extension of DINO (self-distillation with no labels) that incorporates targeted augmentation on annotated calcified regions during self-supervised pre-training. Our three-stage pipeline integrates Vision Transformer (ViT-Base/8) feature extraction via DINO-LG trained on 914 CT scans comprising 700 gated and 214 non-gated acquisitions, linear classification to identify calcified slices, and U-Net segmentation for CAC quantification and Agatston scoring. DINO-LG achieved 89% sensitivity and 90% specificity for detecting CAC-containing CT slices, compared to standard DINO’s 79% sensitivity and 77% specificity, reducing false-negative and false-positive rates by 49% and 57%, respectively. The integrated system achieves 90% accuracy in CAC risk classification on 45 test patients, outperforming standalone U-Net segmentation (76% accuracy) while processing only the relevant subset of CT slices. This targeted approach enhances CAC scoring accuracy by feeding the U-Net model with relevant slices, improving diagnostic precision while lowering healthcare costs by minimizing unnecessary tests and treatments.
💡 Research Summary
Coronary artery disease (CAD) remains one of the leading causes of mortality worldwide, and coronary artery calcium (CAC) scoring from computed tomography (CT) scans is a cornerstone of risk stratification. Existing automated CAC scoring solutions rely heavily on supervised deep‑learning models, typically U‑Net architectures, which suffer from two major drawbacks: a scarcity of annotated slices (fewer than 10 % of all CT slices contain calcium) and the severe class imbalance that results, both of which degrade segmentation and scoring performance. Moreover, current pipelines often require separate models for ECG‑gated and non‑gated scans, increasing development and maintenance overhead.
In response, the authors introduce DINO‑LG, a novel label‑guided extension of the self‑distillation with no labels (DINO) self‑supervised learning framework. The key innovation is the incorporation of targeted augmentation on annotated calcified regions during the pre‑training phase. By explicitly emphasizing calcium‑rich patches—through scaling, contrast enhancement, and spatial transformations—the student‑teacher ViT model learns representations that are more sensitive to the minute, high‑density voxels characteristic of CAC.
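One way to picture label guidance is as biased crop placement: when a slice carries a calcium annotation, some of the crops fed to the student and teacher are positioned so that they contain an annotated pixel. The sketch below illustrates only that idea. It is not the authors' implementation; the function name `label_guided_crop`, the parameter `p_guided`, and the list-of-rows mask format are all assumptions made for illustration, and the scaling and contrast steps mentioned above are omitted.

```python
import random


def label_guided_crop(mask, crop_size, rng, p_guided=0.5):
    """Pick the (top, left) corner of a square crop in a 2-D slice.

    mask      : list of rows of 0/1, where 1 marks an annotated calcium pixel.
    crop_size : side length of the crop in pixels (must fit inside the slice).
    rng       : a random.Random instance, passed in for reproducibility.
    p_guided  : hypothetical hyperparameter, the probability that a crop is
                forced to contain an annotated pixel when one exists.
    """
    height, width = len(mask), len(mask[0])
    max_top, max_left = height - crop_size, width - crop_size
    labeled = [(r, c) for r, row in enumerate(mask)
               for c, value in enumerate(row) if value]

    if labeled and rng.random() < p_guided:
        # Guided crop: choose an annotated pixel, then choose a window
        # position that contains it and stays inside the slice.
        r, c = rng.choice(labeled)
        top = rng.randint(max(0, r - crop_size + 1), min(r, max_top))
        left = rng.randint(max(0, c - crop_size + 1), min(c, max_left))
    else:
        # Unguided crop: uniform placement, as in ordinary random cropping.
        top = rng.randint(0, max_top)
        left = rng.randint(0, max_left)
    return top, left


# Toy 16 x 16 slice with a single annotated pixel at row 10, column 3.
rng = random.Random(0)
mask = [[0] * 16 for _ in range(16)]
mask[10][3] = 1
top, left = label_guided_crop(mask, crop_size=4, rng=rng, p_guided=1.0)
print(top, left)
```

With `p_guided=1.0` every crop of an annotated slice covers the lesion; lowering it mixes guided and ordinary crops, which is one plausible way to keep the representation from collapsing onto calcium alone.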
The study leverages a sizable dataset of 914 chest CT scans, comprising 700 ECG‑gated and 214 non‑gated acquisitions. A Vision Transformer (ViT‑Base/8) is pre‑trained with DINO‑LG, producing 768‑dimensional token embeddings for each 2‑D slice. These embeddings feed a simple linear classifier that filters slices into “calcified” or “non‑calcified” categories. The classifier achieves 89 % sensitivity and 90 % specificity, markedly outperforming standard DINO (79 % sensitivity, 77 % specificity) and reducing false‑negative and false‑positive rates by 49 % and 57 % respectively.
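The relative reductions follow from the error rates implied by the reported figures: the false‑negative rate is 1 − sensitivity and the false‑positive rate is 1 − specificity. The short calculation below uses only the rounded percentages quoted above; the helper name `error_rate_reduction` is introduced here for illustration.

```python
def error_rate_reduction(baseline_rate, improved_rate):
    """Relative reduction of an error rate, as a fraction of the baseline."""
    return (baseline_rate - improved_rate) / baseline_rate


# Rounded values as quoted in the text.
sens_dino, spec_dino = 0.79, 0.77   # standard DINO
sens_lg, spec_lg = 0.89, 0.90       # DINO-LG

# False-negative rate = 1 - sensitivity; false-positive rate = 1 - specificity.
fn_reduction = error_rate_reduction(1 - sens_dino, 1 - sens_lg)
fp_reduction = error_rate_reduction(1 - spec_dino, 1 - spec_lg)

print(f"False-negative rate reduction: {fn_reduction:.1%}")
print(f"False-positive rate reduction: {fp_reduction:.1%}")
```

From the rounded inputs this gives about 48 % and 57 %. The false‑positive figure matches the reported value; the one‑point gap on false negatives is consistent with the reported 49 % having been computed from unrounded sensitivities, although the text does not state this.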
Only the slices flagged as calcified are forwarded to a downstream U‑Net segmentation module, which delineates calcium deposits and computes Agatston scores. The integrated three‑stage pipeline (ViT feature extraction → slice classification → U‑Net segmentation) attains 90 % accuracy in CAC risk‑group classification (four categories) on a held‑out test set of 45 patients, a substantial gain over a standalone U‑Net (76 % accuracy). By processing merely the relevant subset of slices, the system cuts GPU memory usage and inference time by roughly 60 %, making it more suitable for clinical deployment.
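For readers unfamiliar with the final step, the sketch below shows how an Agatston score is conventionally computed from segmented lesions and then mapped to a risk group. The 130 HU threshold and the density weights 1 to 4 are the standard Agatston definition. The 1 mm² minimum lesion area and the four cut‑offs (0, 1–100, 101–400, >400) are common conventions assumed here, since the summary does not list the categories the authors used. The `(area_mm2, peak_hu)` input format and all function names are likewise illustrative.

```python
def density_weight(peak_hu):
    """Standard Agatston density factor from a lesion's peak attenuation (HU)."""
    if peak_hu >= 400:
        return 4
    if peak_hu >= 300:
        return 3
    if peak_hu >= 200:
        return 2
    if peak_hu >= 130:
        return 1
    return 0  # below the 130 HU calcium threshold


def agatston_score(lesions, min_area_mm2=1.0):
    """Sum area x density weight over lesions.

    lesions      : iterable of (area_mm2, peak_hu), one entry per lesion per slice.
    min_area_mm2 : lesions smaller than this are ignored as likely noise
                   (assumed value; conventions vary between implementations).
    """
    return sum(area * density_weight(peak_hu)
               for area, peak_hu in lesions
               if area >= min_area_mm2)


def risk_category(score):
    """Map a score to one of four groups (assumed cut-offs, see lead-in)."""
    if score == 0:
        return "0"
    if score <= 100:
        return "1-100"
    if score <= 400:
        return "101-400"
    return ">400"


# Toy example: three candidate lesions, the last one too small to count.
lesions = [(2.0, 150), (3.0, 450), (0.5, 500)]
score = agatston_score(lesions)
print(score, risk_category(score))
```

Because the score is a sum over slices, a calcified slice missed by the upstream classifier can only lower the total, which is why slice‑level sensitivity matters for the final risk group.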
Beyond performance metrics, DINO‑LG demonstrates a unified approach for both gated and non‑gated CT protocols, eliminating the need for protocol‑specific models. This versatility addresses a practical clinical need, as many institutions acquire mixed‑protocol scans for various indications. The authors also discuss the broader applicability of label‑guided self‑supervision to other medical imaging tasks where annotations are sparse (e.g., micro‑tumor detection, vascular plaque identification).
Limitations are acknowledged: the targeted augmentation requires a modest amount of pixel‑level calcium annotations, so the method is not fully self‑supervised and cannot be applied to completely unlabeled datasets. Additionally, all experiments were conducted on data from a single academic center, raising questions about generalizability across diverse scanners, populations, and acquisition parameters. Future work is proposed to explore automatic calibration of label‑guidance strength, meta‑learning for cross‑domain adaptation, and extension to 3‑D ViT architectures.
In summary, DINO‑LG offers a compelling solution to the annotation scarcity problem in medical imaging by steering self‑supervised learning toward clinically relevant regions. Its superior detection, segmentation, and risk‑classification performance, combined with computational efficiency and protocol‑agnostic design, positions it as a promising foundation for next‑generation automated CAC scoring and potentially for other low‑prevalence pathology detection tasks.