Automated Lesion Segmentation of Stroke MRI Using nnU-Net: A Comprehensive External Validation Across Acute and Chronic Lesions
Accurate and generalisable segmentation of stroke lesions from magnetic resonance imaging (MRI) is essential for advancing clinical research, prognostic modelling, and personalised interventions. Although deep learning has improved automated lesion delineation, many existing models are optimised for narrow imaging contexts and generalise poorly to independent datasets, modalities, and stroke stages. Here, we systematically evaluated stroke lesion segmentation using the nnU-Net framework across multiple heterogeneous, publicly available MRI datasets spanning acute and chronic stroke. Models were trained and tested on diffusion-weighted imaging (DWI), fluid-attenuated inversion recovery (FLAIR), and T1-weighted MRI, and evaluated on independent datasets. Across stroke stages, models showed robust generalisation, with segmentation accuracy approaching reported inter-rater reliability. Performance varied with imaging modality and training data characteristics. In acute stroke, DWI-trained models consistently outperformed FLAIR-based models, with only modest gains from multimodal combinations. In chronic stroke, increasing training set size improved performance, with diminishing returns beyond several hundred cases. Lesion volume was a key determinant of accuracy: smaller lesions were harder to segment, and models trained on restricted volume ranges generalised poorly. MRI image quality further constrained generalisability: models trained on lower-quality scans transferred poorly, whereas those trained on higher-quality data generalised well to noisier images. Discrepancies between predictions and reference masks were often attributable to limitations in manual annotations. Together, these findings show that automated lesion segmentation can approach human-level performance while identifying key factors governing generalisability and informing the development of lesion segmentation tools.
💡 Research Summary
**
This study systematically evaluates the performance and generalizability of the nnU‑Net framework for automated segmentation of stroke lesions across both acute and chronic phases using a wide range of publicly available MRI datasets. The authors trained separate models on diffusion‑weighted imaging (DWI) with accompanying apparent diffusion coefficient (ADC) maps, fluid‑attenuated inversion recovery (FLAIR), and T1‑weighted scans, then tested each model on completely independent cohorts.
For the acute setting, two large datasets were used: the Stroke Onset Optimization Project (SOOP, N = 1 456) and ISLES (N = 250). Models that received DWI + ADC as a two‑channel input achieved a mean Dice coefficient of 0.78 and a 95th‑percentile Hausdorff distance (HD95) of 4.2 mm, substantially outperforming the FLAIR‑only models (Dice 0.71, HD95 6.1 mm). Adding FLAIR to DWI (multimodal input) yielded a modest gain (Dice ≈ 0.80, HD95 ≈ 4.0 mm) but incurred higher computational cost, indicating that DWI already captures the dominant contrast for acute ischemic lesions.
In the chronic phase, three datasets were employed: ATLAS v2.0 (N = 655), the Aphasia Recovery Cohort (ARC, N = 228), and the Cambridge Cognitive Neuroscience Research Panel (CCNRP, N = 204). All models used only the T1‑weighted volume. By progressively increasing the number of training cases (≈ 100, 300, 600), the authors observed a clear improvement in Dice scores (0.73 → 0.78 → 0.81) with diminishing returns after several hundred cases, consistent with the logarithmic learning curves reported for deep segmentation networks.
A detailed analysis of lesion volume revealed that small lesions (< 1 mL) were consistently under‑segmented (Dice < 0.5), whereas larger lesions (> 10 mL) achieved Dice > 0.85. Importantly, models trained on datasets with a narrow volume distribution performed poorly on out‑of‑distribution lesion sizes, underscoring the necessity of a balanced volume spectrum during training.
Image quality was another decisive factor. Models trained on high‑quality, high‑SNR scans (e.g., 3 T Siemens Prisma) generalized well to lower‑quality test data, maintaining Dice ≈ 0.77. Conversely, models trained on noisy, low‑resolution scans failed to transfer to high‑quality images (Dice ≈ 0.65), indicating that nnU‑Net’s automatic preprocessing cannot fully compensate for fundamental signal‑to‑noise differences.
Laterality (left vs. right hemisphere) did not affect performance, suggesting that the network learns symmetric representations without bias.
The authors also inspected failure cases and found that many discrepancies between predictions and reference masks stemmed from inconsistencies or limitations in the manual annotations themselves (e.g., ambiguous borders, missed tiny lesions). This observation highlights that automated segmentation can reach, and in some aspects surpass, the reliability of human raters.
Methodologically, the study adhered to the standard nnU‑Net pipeline (automatic preprocessing, architecture selection between full‑resolution 3D U‑Net and a residual encoder variant, Dice + cross‑entropy loss, five‑fold cross‑validation, post‑processing to remove spurious components). Evaluation employed Dice and HD95, the latter capped at 256 mm to avoid outlier distortion.
Overall, the findings demonstrate that nnU‑Net, despite being a “no‑new‑architecture” solution, achieves human‑level segmentation accuracy across diverse stroke stages, imaging modalities, and acquisition sites. The work identifies three key determinants of generalizability: (1) sufficient and diverse training sample size, (2) inclusion of a broad lesion‑volume range, and (3) training on high‑quality images. The authors release all trained models, code, and documentation under an open‑source license, facilitating reproducibility and encouraging community‑driven improvements.
In conclusion, automated stroke lesion segmentation with nnU‑Net is ready for broader clinical and research adoption, provided that developers curate large, heterogeneous, high‑quality training datasets that reflect the full spectrum of lesion sizes and imaging conditions.
Comments & Academic Discussion
Loading comments...
Leave a Comment