Reliable interpretation of cardiac ultrasound images is essential for accurate clinical diagnosis and assessment. Self-supervised learning has shown promise in medical imaging by leveraging large unlabelled datasets to learn meaningful representations. In this study, we evaluate and compare two self-supervised learning frameworks, USF-MAE, developed by our team, and MoCo v3, on the recently introduced CACTUS dataset (37,736 images) for automated classification of simulated cardiac views (A4C, PL, PSAV, PSMV, Random, and SC). Both models are evaluated with 5-fold cross-validation, enabling robust assessment of generalization performance across multiple random splits. The CACTUS dataset provides expert-annotated cardiac ultrasound images spanning diverse views. We adopt an identical training protocol for both models to ensure a fair comparison; both are configured with a learning rate of 0.0001 and a weight decay of 0.01. For each fold, we record performance metrics including ROC-AUC, accuracy, F1-score, and recall. Our results indicate that USF-MAE consistently outperforms MoCo v3 across all metrics. The average testing AUC for USF-MAE is 99.99% (+/-0.01%, 95% CI), compared with 99.97% (+/-0.01%) for MoCo v3. USF-MAE achieves a mean testing accuracy of 99.33% (+/-0.18%), higher than the 98.99% (+/-0.28%) reported for MoCo v3. Similar trends are observed for the F1-score and recall, with improvements that are statistically significant across folds (paired t-test, p=0.0048 < 0.01). This proof-of-concept analysis suggests that USF-MAE learns more discriminative features for cardiac view classification than MoCo v3 when applied to this dataset. The improved performance across multiple metrics highlights the potential of USF-MAE for automated cardiac ultrasound classification.
Cardiac ultrasound (echocardiography) is a cornerstone imaging modality in both adult and fetal cardiology because it provides real-time assessment of cardiac anatomy and function, involves no ionizing radiation, and is widely available in clinical practice. However, reliable interpretation of echocardiographic images requires extensive training and experience, and manual annotation of cardiac views is both time-consuming and subject to inter-observer variability [1]. The complexity of cardiac ultrasound interpretation is further amplified in fetal imaging, where small cardiac structures and variable fetal positions increase the difficulty of consistent view identification.
Deep learning has demonstrated substantial promise for automating medical image analysis, achieving expert-level performance on diverse tasks across modalities such as radiography, ultrasound [2][3][4][5][6][7][8], and magnetic resonance imaging [9]. Traditional supervised learning approaches rely heavily on large quantities of labelled data, which are often scarce and costly to obtain in clinical settings. In contrast, self-supervised learning (SSL) enables models to leverage vast collections of unlabelled medical images to learn useful representations, subsequently fine-tuning on smaller annotated datasets with improved performance and annotation efficiency [9,10].
Recent trends in SSL include both contrastive methods and generative pretext tasks. Methods such as momentum contrast (MoCo) [11] exploit a contrastive learning objective to cluster semantically similar examples in embedding space without labels [13]. Masked autoencoder (MAE) approaches [14], originally introduced for natural images, learn to reconstruct masked portions of inputs, effectively encouraging models to capture rich, global structure in image representations. These pre-training strategies have been adapted to medical imaging tasks, and empirical evidence suggests that in-domain SSL pre-training can yield better downstream performance than models initialized on natural images alone [2,15].
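To make the masked-autoencoding pretext task concrete, the toy sketch below reconstructs randomly masked image patches with a small encoder and decoder and computes the loss on the masked positions only. The MLP encoder/decoder, 16-pixel patches, and 75% mask ratio are illustrative assumptions for exposition and do not correspond to the actual MAE or USF-MAE architectures, which use Vision Transformers with positional embeddings.

# Toy sketch of the masked-autoencoding pretext task (illustrative only; the
# MLP encoder/decoder, 16-pixel patches, and 75% mask ratio are assumptions
# and do not reflect the actual USF-MAE architecture).
import torch
import torch.nn as nn

def patchify(imgs, p=16):
    """Split (B, C, H, W) images into flattened patches of shape (B, N, p*p*C)."""
    B, C, H, W = imgs.shape
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)

class ToyMAE(nn.Module):
    """Encode only the visible patches, decode all positions, score masked ones."""
    def __init__(self, patch_dim, dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.Sequential(nn.Linear(patch_dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, patch_dim))

    def forward(self, patches):
        B, N, D = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))
        ids = torch.rand(B, N).argsort(dim=1)              # random patch order per sample
        ids_keep, ids_mask = ids[:, :n_keep], ids[:, n_keep:]
        visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)                     # encoder sees visible patches only
        # Re-assemble the full sequence, placing a learned mask token at masked
        # positions (a real MAE also adds positional embeddings here).
        tokens = self.mask_token.expand(B, N, -1).scatter(
            1, ids_keep.unsqueeze(-1).expand(-1, -1, latent.size(-1)), latent)
        recon = self.decoder(tokens)
        # Reconstruction loss is computed on the masked patches only.
        target = torch.gather(patches, 1, ids_mask.unsqueeze(-1).expand(-1, -1, D))
        pred = torch.gather(recon, 1, ids_mask.unsqueeze(-1).expand(-1, -1, D))
        return ((pred - target) ** 2).mean()

imgs = torch.randn(4, 1, 224, 224)          # stands in for unlabelled ultrasound frames
loss = ToyMAE(patch_dim=16 * 16)(patchify(imgs))
loss.backward()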
In the domain of cardiac view classification, prior work has applied contrastive SSL to learn discriminative features from echocardiograms, improving classification accuracy relative to purely supervised baselines [1]. More recently, the CACTUS dataset was introduced as the first open, graded large-scale dataset of cardiac ultrasound views, encompassing multiple standard views and random images to support both classification and quality assessment tasks [16]. The availability of CACTUS enables rigorous evaluation of self-supervised models tailored specifically for cardiac ultrasound representations.
Despite this progress, there remains a need to systematically benchmark SSL frameworks on CACTUS to identify the most effective pre-training paradigms for cardiac ultrasound view classification. Furthermore, foundation models trained on large, diverse unlabelled ultrasound corpora may offer improved transferability compared with SSL models pre-trained on smaller or natural image datasets. In this study, we evaluate and compare our recently published ultrasound self-supervised foundation model with masked autoencoding (USF-MAE) [2] with MoCo v3 [12] on the CACTUS dataset [16]. This comparison serves as a proof-of-concept (POC) to assess whether large-scale, domain-specific self-supervised pre-training yields more discriminative features than contrastive learning for cardiac view classification, an essential first step towards downstream applications such as congenital heart defect (CHD) detection.
We conducted all experiments using the publicly available CACTUS dataset [16], which contains 37,736 expert-annotated cardiac ultrasound images generated by a phantom across six classes: apical four-chamber (A4C), parasternal long-axis (PL), parasternal short-axis aortic valve (PSAV), parasternal short-axis mitral valve (PSMV), Random, and subcostal four-chamber (SC) views (Table 1). The Random views class includes non-standard or non-diagnostic frames, increasing classification difficulty and better simulating real-world variability.
To ensure robust evaluation, we adopted a stratified 5-fold cross-validation protocol such that each of the six classes was equally represented in every fold. In each fold, four splits were used for training, and the remaining split was used for testing.
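A minimal sketch of this cross-validation protocol is given below using scikit-learn's StratifiedKFold; the synthetic label array and the commented-out fine_tune/evaluate helpers are placeholders rather than the actual training code.

# Sketch of the stratified 5-fold protocol (placeholder labels; the real splits
# are generated from the 37,736 expert annotations in CACTUS).
import numpy as np
from sklearn.model_selection import StratifiedKFold

CLASSES = ["A4C", "PL", "PSAV", "PSMV", "Random", "SC"]

rng = np.random.default_rng(0)
labels = rng.integers(0, len(CLASSES), size=600)    # stand-in for the dataset labels
indices = np.arange(labels.size)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(indices, labels)):
    # Four splits are used for training and the held-out split for testing,
    # with the six classes represented in consistent proportions in every fold.
    train_counts = np.bincount(labels[train_idx], minlength=len(CLASSES))
    test_counts = np.bincount(labels[test_idx], minlength=len(CLASSES))
    print(f"fold {fold}: train {train_counts.tolist()}, test {test_counts.tolist()}")
    # model = fine_tune(train_idx)          # hypothetical training helper
    # metrics = evaluate(model, test_idx)   # ROC-AUC, accuracy, F1-score, recall per fold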
Ultrasound images often contain sector-shaped acquisition regions and colour annotations or markers, as shown in Fig. 1A. To standardize the input and reduce irrelevant visual artifacts, we applied a three-stage preprocessing pipeline.
Sector masking and cropping. Each image was first converted to RGB, and a sector mask was applied to isolate the ultrasound field of view. The mask was defined using a pie-slice geometry centred at the bottom midpoint of the image with angular limits of 210° to 330° and radius equal to 90% of the image height. Pixels outside this region were set to zero. The masked image was then cropped using f
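The sketch below illustrates this masking step with PIL under the stated geometry (pie slice with its apex at the bottom midpoint, angular limits of 210° to 330°, radius equal to 90% of the image height); cropping to the mask's bounding box is an assumption, since the original description of the cropping step is truncated here.

# Sketch of the sector-masking step (geometry taken from the description above;
# the PIL-based implementation and the bounding-box crop are assumptions).
import numpy as np
from PIL import Image, ImageDraw

def apply_sector_mask(img):
    """Zero out pixels outside the pie-slice field of view, then crop."""
    img = img.convert("RGB")
    w, h = img.size
    cx, cy = w / 2.0, float(h)              # sector apex at the bottom midpoint
    r = 0.9 * h                             # radius = 90% of the image height

    mask = Image.new("L", (w, h), 0)
    draw = ImageDraw.Draw(mask)
    # PIL measures angles from 3 o'clock, increasing clockwise, so 210-330
    # degrees sweeps the fan opening upwards from the bottom midpoint.
    draw.pieslice([int(cx - r), int(cy - r), int(cx + r), int(cy + r)],
                  start=210, end=330, fill=255)

    arr = np.array(img)
    arr[np.array(mask) == 0] = 0            # pixels outside the sector set to zero
    bbox = mask.getbbox()                   # assumed crop: bounding box of the sector
    return Image.fromarray(arr).crop(bbox)

# Example usage (hypothetical file name):
# masked = apply_sector_mask(Image.open("cactus_frame.png"))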