Placenta Accreta Spectrum (PAS) is a serious obstetric condition that can be challenging to diagnose with Magnetic Resonance Imaging (MRI) due to variability in radiologists' interpretations. To overcome this challenge, this study proposes a hybrid 3D deep learning model for automated PAS detection from volumetric MRI scans. The model integrates a 3D DenseNet121 to capture local features and a 3D Vision Transformer (ViT) to model global spatial context. It was developed and evaluated on a retrospective dataset of 1,133 MRI volumes, and multiple 3D deep learning architectures were evaluated for comparison. On an independent test set, the DenseNet121-ViT model achieved the highest performance, with a five-run average accuracy of 84.3%. These results highlight the strength of hybrid CNN-Transformer models as computer-aided diagnosis tools. The model's performance demonstrates clear potential to assist radiologists by providing robust decision support, improving diagnostic consistency across interpretations, and ultimately enhancing the accuracy and timeliness of PAS diagnosis.
Placenta Accreta Spectrum (PAS) represents a group of life-threatening obstetric conditions defined by abnormal placental invasion of the uterine wall. Its incidence has risen dramatically in recent decades due to the rise in cesarean deliveries and the resulting scar tissue [1,2]. Estimates indicate that PAS cases have doubled over the last two decades, making PAS a growing health challenge [3]. The condition carries high risks, including severe hemorrhage, infection, and a frequent need for peripartum hysterectomy, contributing significantly to maternal morbidity and mortality [2]. Early and precise prenatal diagnosis is vital for reducing these risks and enabling multidisciplinary management, which improves outcomes [1].
Current diagnostic practice combines clinical risk assessment with imaging, primarily ultrasound (US) and magnetic resonance imaging (MRI) [4]. Although US imaging is commonly used for initial screening, it has limitations in accurately assessing invasion depth and extent, especially with a posterior placenta or bowel involvement. MRI offers complementary value through superior soft-tissue contrast and a larger field of view, providing detailed evaluation of invasion depth and adjacent organ involvement [2], [4]. Among MRI sequences, T2-weighted imaging (T2WI) is most common for placental assessment, while T1-weighted imaging (T1WI) reflects bleeding [5]. Key MRI signs of PAS include T2-dark intraplacental bands, focal interruption of the myometrial border, and abnormal vascularity (Fig. 1) [6,7]. However, the interpretation of these complex imaging features remains qualitative, requires significant radiological expertise, and is prone to inter-observer variability [8], underscoring the need for objective diagnostic methods. Deep learning methods such as convolutional neural networks (CNNs) have shown notable success in learning complex patterns from raw imaging data, often exceeding human performance in classification and segmentation tasks [9]. Although CNNs perform well at capturing local features, they often struggle to grasp global context. Vision Transformers (ViTs), on the other hand, use self-attention mechanisms to effectively model these global relationships [10]. This has motivated hybrid CNN-Transformer models that combine the complementary strengths of each approach and show strong performance in complex medical imaging tasks [11,12,13,14].
While deep learning has been applied to PAS detection from MRI, systematic studies of 3D architectures remain limited, and the optimal model design for capturing both local and global 3D PAS markers is still undefined. This study addresses this gap by proposing a tailored 3D hybrid DenseNet121-ViT model for end-to-end volumetric PAS detection. A systematic comparison of six modern 3D deep learning models was conducted on a novel 3D MRI dataset. The study demonstrates that integrating dense local feature extraction with global self-attention yields superior diagnostic performance.
Computational MRI analysis for PAS has evolved from handcrafted feature engineering to end-to-end deep learning frameworks, with research moving toward automated feature extraction [15]. This reflects broader trends in medical imaging toward automated diagnostic tools. This section reviews prior studies on PAS detection and highlights the performance of hybrid deep learning models in related medical imaging fields.
Radiomics refers to the automated extraction of quantitative imaging features [15]. Studies show that radiomic features from T2WI MRI can predict PAS with high accuracy. These features, including tissue texture, shape, and intensity, are extracted using expert-defined mathematical formulas to train machine learning (ML) models. Romeo et al. analyzed MRI-derived texture features from 64 scans (20 PAS) using four ML algorithms, with a k-nearest neighbors classifier achieving 98.1% accuracy [16]. Leitch et al. used MRI radiomic features to predict PAS and hysterectomy risk on 241 scans (141 PAS); after manual uterine and placental segmentation, 17 ML algorithms were tested, achieving 88-92% accuracy [17]. A meta-analysis of seven radiomics studies (672 patients) reported a pooled sensitivity of 87%, specificity of 92%, and area under the receiver operating characteristic curve (AUC) of 0.93, highlighting strong diagnostic potential [18]. However, feature robustness across different segmentation methods remains a challenge: one study found that using a 3D volume of interest of the retroplacental myometrium resulted in the best PAS prediction, confirming the importance of segmentation choice [19], [5].
Recent approaches have employed deep learning for both segmentation and classification using 3D MRI. While Wang et al. achieved a high AUC of 0.897 using a 2D DenseNet-based PAS classifier on 540 cases (170 PAS), their approach relied on inputs from a separate 3D segmentation step with a 3D nnU-Net and a placental position classifier [22]. Another study used a 3D U-Net 3+ for automatic segmentation of the placenta and uterine cavity on 244 MRI scans, achieving Dice scores of 82.7% and 91.8%, respectively [23]. Further research classified PAS subtypes with a two-branch CNN on 414 MRI scans, achieving an AUC of 0.8 for multi-class prediction [24]. Another study developed a CascadeNet model integrating 3D MRI, radiomic, and topographic features to predict hysterectomy risk, with an AUC of 0.878 and 83.3% accuracy on 241 MRI volumes [25]. These studies highlight the potential of fully automated deep learning systems for PAS assessment.
More recent advances aim to overcome limitations of traditional CNNs through hybrid CNN-ViT models, combining the strengths of both [10]. These have shown promise in complex medical imaging tasks using MRI, including Alzheimer’s diagnosis [11,12] and brain tumor classification [13,14].
While deep learning has proven effective for MRI-based PAS detection, most studies rely on small data samples and 2D CNNs. Given the inherently volumetric nature of PAS, this study proposes an end-to-end 3D hybrid CNN-Transformer model that analyzes entire MRI volumes to capture local and global spatial relationships that 2D slice-based or pure 3D CNN analysis may miss. The aim is to assess whether a fully volumetric end-to-end approach can perform competitively without requiring prior segmentation. Additionally, a comparison with other 3D architectures is performed to validate the superiority of the proposed approach.
This study employed a retrospective MRI dataset to develop and evaluate 3D deep learning models for PAS classification. The following subsections describe the dataset and model architectures used.
This retrospective study used patient data from the Fetal Medicine Department at King Abdulaziz University Hospital, with all procedures adhering to strict institutional guidelines and ethical standards. Formal ethical approval was obtained from the hospital’s Research Ethics Committee.
An initial query of the hospital's radiology information system identified a large number of patients who underwent MRI evaluation for suspected PAS. Cases were screened based on diagnostic confirmation and image quality; exclusion criteria included incomplete imaging, motion-degraded scans, and uncertain diagnostic outcomes. The final dataset consisted of 1,133 T2WI MRI scans, comprising 853 confirmed normal (non-PAS) and 280 confirmed PAS cases, ensuring representation of both diagnostic classes for robust and generalizable deep learning model development. Fig. 2 illustrates example slices from the dataset.
To extract both local and global contextual features of PAS markers from volumetric MRI, a hybrid 3D DenseNet121-ViT model was developed; a 2D DenseNet-Transformer combination has previously shown promise in medical image classification [13]. The proposed hybrid architecture is shown in Fig. 3. The 3D DenseNet121 component processes the MRI volume to extract local and textural features through a series of dense blocks. A global average pooling layer is applied to the final feature map, followed by a fully connected layer to produce a 128-dimensional embedding. In parallel, the 3D ViT branch models global spatial context through self-attention, and the embeddings of the two branches are combined for the final classification. A systematic evaluation of multiple 3D deep learning architectures was performed for comparison with the proposed architecture. The tested 3D models included ResNet18, DenseNet121, EfficientNet-B0, and a Swin-Transformer, all known for their effectiveness in medical imaging tasks. The 3D ResNet18 model was pretrained using MedicalNet [26]. In addition, another hybrid architecture, 3D ResNet18-Swin, was developed, selected based on evidence from recent studies [11,12]. To reduce overfitting, dropout regularization was applied, with dropout rates between 10% and 50% evaluated; the optimal rates were 50% for the 3D DenseNet121-ViT and 10% for the 3D ResNet18.
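As a concrete illustration, the sketch below shows one plausible way to assemble such a hybrid with MONAI building blocks. The fusion strategy (concatenating the two 128-dimensional embeddings before a linear head), the ViT patch size, and all hyperparameters other than the 50% dropout are assumptions for illustration, not the exact configuration used in this study.

```python
import torch
import torch.nn as nn
from monai.networks.nets import DenseNet121, ViT

class DenseNet121ViT(nn.Module):
    """Hypothetical sketch of a hybrid 3D DenseNet121-ViT classifier."""
    def __init__(self, dropout=0.5):
        super().__init__()
        # CNN branch: 3D DenseNet121 yielding a 128-dim local/textural embedding
        self.cnn = DenseNet121(spatial_dims=3, in_channels=1, out_channels=128)
        # Transformer branch: 3D ViT yielding a 128-dim global-context embedding
        # (patch size 16 is an assumption; 128x128x64 matches the input size)
        self.vit = ViT(in_channels=1, img_size=(128, 128, 64),
                       patch_size=(16, 16, 16), classification=True,
                       num_classes=128)
        self.dropout = nn.Dropout(dropout)   # 50% dropout was found optimal
        self.head = nn.Linear(128 + 128, 2)  # non-PAS vs. PAS logits

    def forward(self, x):                    # x: (B, 1, 128, 128, 64)
        local_feat = self.cnn(x)             # (B, 128)
        global_feat, _ = self.vit(x)         # (B, 128); 2nd output: hidden states
        fused = torch.cat([local_feat, global_feat], dim=1)
        return self.head(self.dropout(fused))

model = DenseNet121ViT()
logits = model(torch.randn(2, 1, 128, 128, 64))  # -> shape (2, 2)
```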
This section provides an overview of the experimental workflow, including data preprocessing, model training, and evaluation. The overall workflow for MRI-based PAS diagnosis is illustrated in Fig. 4.
Raw 3D MRI data were initially stored in the Digital Imaging and Communications in Medicine (DICOM) format as a series of 2D slices. Each DICOM series was converted into the Neuroimaging Informatics Technology Initiative (NIfTI) format to create a single 3D volumetric file suitable for 3D CNN input. Volumes were reoriented to a standard (height, width, depth) order to eliminate variability from differing scan orientations. All volumes were resized to a fixed dimension of 128×128×64 voxels using cubic interpolation, with zero-padding to preserve aspect ratio. Finally, voxel intensities were normalized to the range [0, 1] using per-scan min-max scaling to reduce scanner-dependent variability while preserving relative intensity patterns within each volume. This standardized preprocessing ensures consistent input sizes for all scans (see Fig. 5). A noticeable class imbalance was present in the dataset, with 853 normal (denoted as 0) and 280 PAS (denoted as 1) scans. The dataset was split into training (70%, n=793), validation (10%, n=113), and independent test (20%, n=227) sets using stratified sampling to prevent sampling bias and preserve the original class distribution across all subsets. It was also ensured that no patient appeared in multiple subsets. The detailed distribution is presented in Table 1. The class imbalance in the training set was addressed by augmenting and oversampling the minority class (PAS), increasing the PAS samples from 196 to 597 and yielding a balanced training set of 1,194 volumes. The geometric augmentations included random flips along the height and width axes, rotations (90°, 180°, 270°), and zooming (factors of 1.1-1.3).
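A minimal MONAI-based sketch of this pipeline is given below, under stated assumptions: `Resize` with its default interpolation stands in for the cubic-interpolation-plus-padding step described above, RAS reorientation stands in for the standard (height, width, depth) ordering, and the filename is hypothetical.

```python
from monai.transforms import (
    Compose, LoadImage, EnsureChannelFirst, Orientation,
    Resize, ScaleIntensity, RandFlip, RandRotate90, RandZoom,
)

# Deterministic preprocessing applied to every scan
preprocess = Compose([
    LoadImage(image_only=True),            # read the DICOM-derived NIfTI volume
    EnsureChannelFirst(),                  # -> (C, H, W, D)
    Orientation(axcodes="RAS"),            # standard orientation
    Resize(spatial_size=(128, 128, 64)),   # fixed 128x128x64 input size
    ScaleIntensity(minv=0.0, maxv=1.0),    # per-scan min-max normalization
])

# Geometric augmentations used to oversample the PAS (minority) class
augment = Compose([
    RandFlip(prob=0.5, spatial_axis=0),           # flip along height
    RandFlip(prob=0.5, spatial_axis=1),           # flip along width
    RandRotate90(prob=0.5, spatial_axes=(0, 1)),  # 90/180/270-degree rotations
    RandZoom(prob=0.5, min_zoom=1.1, max_zoom=1.3),
])

volume = preprocess("scan_0001.nii.gz")  # hypothetical filename
```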
All experiments were conducted using the PyTorch and MONAI frameworks [27]. Models were trained for 100 epochs, and validation performance was monitored after every epoch. The weights achieving the highest validation accuracy were saved for final evaluation on the independent test set using standard binary classification metrics, ensuring that the reported performance reflects the model's ability to generalize to unseen data. The Adam optimizer was used for all experiments due to its adaptive learning rate. To ensure a fair comparison, hyperparameters were systematically tuned; the final settings are listed in Table 2.
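A minimal training-loop sketch consistent with this setup is shown below (Adam, 100 epochs, checkpointing on best validation accuracy). The learning rate, batch size, and dummy data loaders are placeholders rather than the tuned values of Table 2, and `model` refers to a classifier such as the hybrid sketched earlier.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy loaders for illustration; in practice these yield the preprocessed
# (1, 128, 128, 64) volumes and labels of the splits in Table 1.
train_loader = DataLoader(TensorDataset(torch.randn(8, 1, 128, 128, 64),
                                        torch.randint(0, 2, (8,))), batch_size=2)
val_loader = DataLoader(TensorDataset(torch.randn(4, 1, 128, 128, 64),
                                      torch.randint(0, 2, (4,))), batch_size=2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is a placeholder
criterion = torch.nn.CrossEntropyLoss()
best_val_acc = 0.0

for epoch in range(100):
    model.train()
    for volumes, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(volumes), labels)
        loss.backward()
        optimizer.step()

    # Validate after every epoch and keep the best-performing weights
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for volumes, labels in val_loader:
            preds = model(volumes).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    val_acc = correct / total
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "best_model.pt")
```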
The diagnostic performance of the six evaluated 3D deep learning architectures was assessed on the test set using multiple evaluation metrics: Accuracy, AUC, Precision, Recall, and F1-Score. Each model was trained and tested in five independent runs, and the average performance was reported.
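For reference, these metrics can be computed per run with scikit-learn as sketched below; the label and score arrays are toy placeholders, not the study's test-set outputs.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_score, recall_score, f1_score)

# Toy placeholders for one run's ground truth and model outputs
y_true = np.array([0, 1, 1, 0, 1, 0])               # 0 = non-PAS, 1 = PAS
y_score = np.array([0.2, 0.9, 0.6, 0.3, 0.8, 0.4])  # predicted PAS probability
y_pred = (y_score >= 0.5).astype(int)               # thresholded predictions

metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "AUC": roc_auc_score(y_true, y_score),
    "Precision": precision_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred),
    "F1-Score": f1_score(y_true, y_pred),
}
print(metrics)
```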
The results are summarized in Table 3, showing a clear performance hierarchy among the models and highlighting the superiority of the proposed hybrid 3D DenseNet121-ViT architecture across all evaluation metrics. The hybrid 3D DenseNet121-ViT model yielded the best overall performance, achieving a five-run average accuracy of 84.3±1.3% and an average AUC of 0.842±0.012. It also demonstrated balanced performance between precision (0.790±0.013) and recall (0.842±0.013), resulting in the highest F1-Score of 0.808±0.014. The low standard deviation (SD) between runs indicates strong stability and robustness, a critical attribute for potential clinical translation. For its best-performing run, the model reached 98.6% training and 91.2% validation accuracy at peak validation performance over 100 epochs (Fig. 6A). The corresponding test accuracy was 85.0% with an AUC of 0.862. The confusion matrix indicated that it correctly identified 144 out of 171 normal cases (0) and 49 out of 56 PAS cases (1) in the test set (Fig. 6B). The second-best performance came from the pure 3D DenseNet121 (accuracy 79.5±2.0%), closely followed by the 3D ResNet18 (accuracy 79.3±1.3%). The remaining architectures performed comparatively worse, with accuracies ranging from 63% to 70%. Fig. 7 presents the receiver operating characteristic (ROC) curves corresponding to the best-performing runs of each MRI-based model. To assess the statistical significance of performance differences among the six models, a repeated-measures analysis of variance (ANOVA) was conducted, followed by post-hoc paired t-tests. The resulting p-values were corrected for multiple comparisons using the Benjamini-Hochberg procedure to control the false discovery rate (FDR). The analysis confirmed that the DenseNet121-ViT model outperformed the other architectures across all metrics (p < 0.05). Similarly, DenseNet121 and ResNet18 performed significantly better than EfficientNet-B0, ResNet18-Swin, and Swin-Transformer. No significant differences were observed between DenseNet121 and ResNet18, or between ResNet18-Swin and Swin-Transformer (p > 0.05). Significant pairwise differences based on accuracy are summarized in Table 4.
Table 4: Pairwise t-test results (accuracy) between the models (α = 0.05). Each cell shows the FDR-corrected p-value and significance (✓ = significant, ✗ = not significant).

The proposed model's success can be attributed to its ability to harness the complementary strengths of its two components. The DenseNet121 component, through dense connectivity and feature reuse, may be effective at identifying local fine-grained textures such as T2-dark intraplacental bands (Fig. 1A). Meanwhile, the ViT's global self-attention is well suited to modeling long-range spatial relationships, such as the overall shape and integrity of the myometrial border across the uterine volume (Fig. 1B). In essence, this dual-scale processing imitates an expert radiologist's approach, explaining the model's superior performance compared to single-scale CNN or Transformer models. Notably, the ResNet18-Swin hybrid performed worse than the standalone ResNet18, suggesting that simply combining a CNN and a Transformer is not a guaranteed formula for success; the effectiveness of a hybrid model depends critically on the specific pairing and fusion strategy. This failure may be due to a mismatch between ResNet18 and Swin-Transformer feature representations, or to a parameter count too large for the dataset, causing overfitting. This underscores that the DenseNet121-ViT combination proved particularly effective for this task.
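The pairwise testing procedure described above (post-hoc paired t-tests with Benjamini-Hochberg FDR correction) can be sketched as follows; the per-run accuracy values are illustrative placeholders, not those reported in Table 3.

```python
from itertools import combinations
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

runs = {  # five-run test accuracies per model (toy numbers)
    "DenseNet121-ViT": [0.850, 0.843, 0.838, 0.829, 0.855],
    "DenseNet121":     [0.795, 0.810, 0.772, 0.801, 0.797],
    "ResNet18":        [0.793, 0.801, 0.780, 0.799, 0.792],
}

pairs, pvals = [], []
for a, b in combinations(runs, 2):
    _, p = ttest_rel(runs[a], runs[b])  # paired t-test across runs
    pairs.append((a, b))
    pvals.append(p)

# Benjamini-Hochberg correction to control the false discovery rate
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for (a, b), p, sig in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: FDR-corrected p = {p:.4f} "
          f"({'significant' if sig else 'not significant'})")
```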
Compared with prior MRI-based PAS studies, the proposed model's results are competitive (average AUC 0.842, best-run AUC 0.862). Other deep learning studies for PAS diagnosis have reported external validation AUCs from 0.849 to 0.897; however, these often relied on handcrafted features, 2D analysis, or limited datasets (fewer than 550 cases). This study extends the literature by demonstrating the effectiveness of an end-to-end volumetric hybrid model developed on a relatively larger dataset of more than 1,100 MRI volumes, allowing robust learning and generalization while avoiding biases from manual feature engineering. The fully automated design and 3D approach thus represent a more clinically scalable solution.
This study presented a 3D DenseNet121-ViT model that effectively classifies PAS on MRI data, achieving robust accuracy and AUC. The proposed model offers a key advantage over 2D slice-based methods by performing end-to-end volumetric analysis without losing important global context. The combination of local feature extraction and global context allows the hybrid DenseNet121-ViT framework to outperform standalone 3D CNNs, representing an important step toward improved automated prenatal image analysis. As a computer-aided diagnosis instrument, the proposed model demonstrates strong performance, indicating its clinical relevance: it can assist radiologists by providing objective risk assessments and improving diagnostic consistency while reducing the observer variability that currently challenges PAS diagnosis. In practice, such a tool could be deployed by integrating it with the hospital's radiology information system, where it would automatically analyze MRI volumes to screen prenatal high-risk patients for PAS. The radiologist would then be alerted to a positive or high-risk classification, making the model an effective diagnostic aid supporting the final diagnosis and multidisciplinary surgical planning. Some limitations should be noted. The retrospective, single-institution dataset may introduce selection bias and limit generalizability across scanners and populations; future research should focus on multi-center validation and integration of multimodal data. Finally, the "black box" nature of the model could be a barrier to clinical adoption. Future work will therefore involve interpretability methods, including visualizations that show which parts of the image most influenced the model's decision, which is essential for building clinical trust in such an automated system. By addressing these challenges, hybrid deep learning approaches hold promise for advancing non-invasive, accurate prenatal PAS diagnosis and reliable automated clinical decision support.