Multimodal Deep Learning for Stroke Prediction and Detection using Retinal Imaging and Clinical Data
Stroke is a major public health problem, affecting millions worldwide. Deep learning has recently shown promise for improving stroke diagnosis and risk prediction, but existing methods rely on costly medical imaging modalities such as computed tomography. Recent studies suggest that retinal imaging could offer a cost-effective alternative for cerebrovascular health assessment, owing to the shared anatomical and physiological pathways between the retina and the brain. Hence, this study explores the value of combining retinal images and clinical data for stroke detection and risk prediction. We propose a multimodal deep neural network that processes Optical Coherence Tomography (OCT) and infrared reflectance retinal scans together with clinical data such as demographics, vital signs, and diagnosis codes. We pretrained our model with a self-supervised learning framework on a real-world dataset of 37,000 scans, then fine-tuned and evaluated it on a smaller labeled subset. Our empirical findings establish the predictive ability of the considered modalities both in detecting lasting retinal effects associated with acute stroke and in forecasting future risk within a specific time horizon. The experimental results demonstrate the effectiveness of the proposed framework, which achieves a 5% AUROC improvement over a unimodal image-only baseline and an 8% improvement over an existing state-of-the-art foundation model. In conclusion, our study highlights the potential of retinal imaging for identifying high-risk patients and improving long-term outcomes.
💡 Research Summary
This paper introduces RetStroke, a multimodal deep learning framework that combines optical coherence tomography (OCT) and infrared reflectance retinal images with routinely collected electronic health record (EHR) data to predict and detect stroke. Recognizing that conventional stroke diagnostics (CT, MRI) are expensive and often unavailable, the authors leverage the anatomical and physiological link between the retina and the brain, hypothesizing that retinal micro‑structural changes can serve as proxies for cerebrovascular health.
The study uses a real‑world dataset from Cleveland Clinic Abu Dhabi spanning March 2015 to July 2023. It comprises 37 000 OCT volumes (each containing 25–49 B‑scans) and corresponding infrared reflectance images, together with 34 static clinical features (demographics, vital signs, ICD‑derived comorbidities). Patients under 18 were excluded; stroke cases were identified via ICD‑10‑CM codes (I60, I61, I62, I63, G45.9) and confirmed with medication orders or length‑of‑stay criteria. Each OCT study was labeled positive if the scan occurred within 365 days of a stroke encounter, negative otherwise. The final split (80 % train, 20 % test) was performed at the patient level to avoid leakage.
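The 365-day labeling window described above can be sketched as a small helper. This is an illustrative reconstruction, not the authors' code: the function name, signature, and the symmetric before/after window are assumptions based on the summary's description.

```python
from datetime import date, timedelta

def label_oct_study(scan_date: date, stroke_dates: list, horizon_days: int = 365) -> int:
    """Label an OCT study positive (1) if any stroke encounter falls within
    `horizon_days` of the scan date (before or after the scan), else 0.

    Hypothetical sketch of the paper's 365-day labeling rule; names and
    signature are illustrative, not taken from the authors' implementation.
    """
    window = timedelta(days=horizon_days)
    return int(any(abs(scan_date - d) <= window for d in stroke_dates))

# A scan 100 days before a stroke encounter is labeled positive;
# a scan more than a year away is labeled negative.
print(label_oct_study(date(2020, 1, 1), [date(2020, 4, 10)]))  # 1
print(label_oct_study(date(2020, 1, 1), [date(2022, 4, 10)]))  # 0
```

Splitting at the patient level, as the authors do, ensures that scans from the same patient never appear in both train and test folds.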
RetStroke’s architecture consists of four modules: (i) a visual encoder f_oct (ResNet‑18) that processes OCT and infrared images, (ii) an EHR encoder f_ehr (two‑layer MLP with batch normalization and ReLU) that embeds the clinical vector, (iii) a non‑parametric late‑fusion operator ⊕ that concatenates modality‑specific predictions, and (iv) a final prediction head g_fuse that outputs a stroke probability. Both encoders have their own auxiliary heads (g_oct, g_ehr) to encourage modality‑specific learning before fusion. The model is trained with binary cross‑entropy loss.
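The forward pass through these four modules can be sketched in a few lines. The following numpy mock-up replaces the real encoders (a ResNet-18 and a two-layer MLP) with random linear maps; all dimensions and weight initializations are illustrative assumptions, and only the data flow — two encoders, two auxiliary heads, late fusion of the modality-specific predictions, and a final head — mirrors the description above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical sizes: 512-d image features, 34 static clinical features,
# 64-d embeddings. Real encoders (ResNet-18, MLP) are stubbed as linear maps.
d_img, d_ehr, d_emb = 512, 34, 64
W_oct = rng.normal(scale=0.05, size=(d_img, d_emb))  # stand-in for f_oct
W_ehr = rng.normal(scale=0.05, size=(d_ehr, d_emb))  # stand-in for f_ehr
w_goct = rng.normal(scale=0.05, size=d_emb)          # auxiliary head g_oct
w_gehr = rng.normal(scale=0.05, size=d_emb)          # auxiliary head g_ehr
w_fuse = rng.normal(scale=0.05, size=2)              # fusion head g_fuse

def forward(x_img, x_ehr):
    z_oct = np.maximum(x_img @ W_oct, 0)       # visual embedding (ReLU)
    z_ehr = np.maximum(x_ehr @ W_ehr, 0)       # clinical embedding (ReLU)
    p_oct = sigmoid(z_oct @ w_goct)            # modality-specific prediction
    p_ehr = sigmoid(z_ehr @ w_gehr)
    fused = np.stack([p_oct, p_ehr], axis=-1)  # non-parametric fusion ⊕
    return sigmoid(fused @ w_fuse)             # final stroke probability

p = forward(rng.normal(size=(4, d_img)), rng.normal(size=(4, d_ehr)))
print(p.shape)  # one probability per sample in the batch
```

Because ⊕ simply concatenates the two auxiliary predictions, the fusion step itself adds no learnable parameters; only g_fuse is trained on the combined signal.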
Given the limited labeled data, a two‑stage training strategy is employed. First, the visual encoder is pretrained in a self‑supervised manner using SimCLR on 1.1 million unlabeled image patches (all OCT slices plus infrared frames). Strong augmentations (random resized crop, color jitter, Gaussian blur, flips) and a learnable temperature τ are used to maximize agreement between positive pairs while minimizing it for negatives. AdamW optimization, cosine annealing, and early stopping (patience = 10) guide the 200‑epoch pretraining. In the second stage, the pretrained encoder is fine‑tuned together with the EHR encoder on the labeled cohort. Hyperparameters (learning rate, weight decay) are searched via random sampling (10 runs).
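The SimCLR objective used in pretraining is the NT-Xent (normalized temperature-scaled cross-entropy) loss. A minimal numpy sketch of that loss follows; it uses a fixed temperature for simplicity, whereas the paper learns τ, and the function name and shapes are illustrative.

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """NT-Xent loss over two augmented views z1, z2, each of shape (N, d).

    Minimal sketch of SimCLR's contrastive objective: each sample's other
    view is its positive; all remaining 2N - 2 samples are negatives.
    The paper uses a learnable temperature; here tau is fixed.
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
    sim = z @ z.T / tau
    n = z1.shape[0]
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    # The positive for row i is its other view at index (i + n) mod 2n.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()

rng = np.random.default_rng(0)
views = rng.normal(size=(8, 16))
# Perfectly aligned views yield a lower loss than mismatched ones.
print(nt_xent_loss(views, views) < nt_xent_loss(views, rng.normal(size=(8, 16))))
```

Minimizing this loss pulls the two augmented views of each patch together in embedding space while pushing apart views of different patches, which is what lets the encoder learn useful retinal features without stroke labels.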
Evaluation covers three tasks: (1) overall binary classification (stroke before or after scan), (2) risk prediction (stroke occurring after the OCT), and (3) detection of lasting effects (stroke occurring before the OCT). Performance is reported across four prediction horizons (90, 180, 270, 365 days) using AUROC, AUPRC, accuracy, sensitivity, and specificity. RetStroke outperforms an image‑only baseline by +5 % AUROC and surpasses the state‑of‑the‑art foundation model RetFound by +8 % AUROC. The multimodal approach also exceeds a clinical‑only model, demonstrating that retinal imaging adds complementary information beyond traditional risk factors. Feature importance analysis reveals that visual embeddings capture macular thickness and vascular density patterns, while EHR embeddings encode hypertension, diabetes, atrial fibrillation, and smoking status.
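AUROC, the headline metric here, has a useful probabilistic reading: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A pure-Python sketch of that pairwise (Mann-Whitney) formulation, added for illustration only:

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs where the positive outscores the
    negative, with ties counting half. Illustrative sketch only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of the 4 positive/negative pairs are correctly ordered.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Reporting AUPRC alongside AUROC matters here because stroke-positive scans are a minority class, and AUPRC is more sensitive to performance on the rare positives.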
Limitations include reliance on a single‑center dataset, potential label noise from ICD‑code based definitions, and possible confounding from ocular diseases or prior eye procedures that affect OCT quality. The authors acknowledge the need for external validation on multi‑institutional, multi‑ethnic cohorts, as well as incorporation of longitudinal EHR trajectories and explainable AI techniques to increase clinical trust.
In conclusion, the study provides strong evidence that a cost‑effective, non‑invasive retinal imaging modality, when fused with routine clinical data, can accurately predict future stroke risk and identify residual retinal signatures of past strokes. This multimodal paradigm holds promise for scalable screening programs, especially in low‑ and middle‑income settings where conventional neuroimaging resources are scarce.