Dual Agreement Consistency Learning with Foundation Models for Semi-Supervised Fetal Heart Ultrasound Segmentation and Diagnosis
Congenital heart disease (CHD) screening from fetal echocardiography requires accurate analysis of multiple standard cardiac views, yet developing reliable artificial intelligence models remains challenging due to limited annotations and variable image quality. In this work, we propose FM-DACL, a semi-supervised Dual Agreement Consistency Learning framework for the FETUS 2026 challenge on fetal heart ultrasound segmentation and diagnosis. The method combines a pretrained ultrasound foundation model (EchoCare) with a convolutional network through heterogeneous co-training and an exponential moving average teacher to better exploit unlabeled data. Experiments on the multi-center challenge dataset show that FM-DACL achieves a Dice score of 59.66 and an NSD of 42.82 using heterogeneous backbones, demonstrating the feasibility of the proposed semi-supervised framework. These results suggest that FM-DACL provides a flexible approach for leveraging heterogeneous models in low-annotation fetal cardiac ultrasound analysis. The code is available at https://github.com/13204942/FM-DACL.
💡 Research Summary
The paper addresses the challenging problem of automated fetal cardiac ultrasound analysis for congenital heart disease (CHD) screening, where only a small fraction of the data is annotated due to the high cost of expert labeling and the variability of ultrasound image quality. To tackle this, the authors propose FM‑DACL (Foundation Model Dual Agreement Consistency Learning), a semi‑supervised framework designed for the FETUS 2026 challenge, which requires simultaneous segmentation of cardiac structures and multi‑label classification of CHD from a single B‑mode image.
FM‑DACL combines two heterogeneous backbones: (1) EchoCare, a large‑scale, transformer‑based ultrasound foundation model pretrained on extensive echocardiography data, and (2) a lightweight U‑Net. EchoCare provides strong global feature representations, while U‑Net preserves fine‑grained spatial detail. The two networks are co‑trained on unlabeled data through a cross‑supervision mechanism: each network generates pseudo‑labels (one‑hot encoded segmentation masks and CHD label vectors) for the other, and these pseudo‑labels are used to compute a semi‑supervised loss L_cps consisting of cross‑entropy, Dice, and binary‑cross‑entropy terms. Gradients are not back‑propagated through the pseudo‑label generation step, preventing collapse.
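The cross-supervision scheme above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the helper names (`one_hot_pseudo_label`, `cross_entropy`, `dice_loss`) and the toy probability maps are assumptions for demonstration, and the classification BCE term is omitted for brevity.

```python
import numpy as np

def one_hot_pseudo_label(probs):
    """Hard pseudo-label: argmax over the class axis, one-hot encoded.
    During training this step is detached, so no gradients flow through it."""
    n_classes = probs.shape[-1]
    return np.eye(n_classes)[probs.argmax(axis=-1)]

def cross_entropy(probs, targets, eps=1e-8):
    """Pixel-wise cross-entropy of predictions against (pseudo-)labels."""
    return -(targets * np.log(probs + eps)).sum(axis=-1).mean()

def dice_loss(probs, targets, eps=1e-8):
    """Soft Dice loss, averaged over classes."""
    inter = (probs * targets).sum(axis=(0, 1))
    denom = probs.sum(axis=(0, 1)) + targets.sum(axis=(0, 1))
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

# Toy H x W x C probability maps standing in for the two branches.
p = np.full((4, 4, 3), 1 / 3)            # "EchoCare" branch, uncertain
q = np.zeros((4, 4, 3)); q[..., 0] = 1.0 # "U-Net" branch, confident class 0

# Each branch is supervised by the partner's hard pseudo-labels.
l_cps = (cross_entropy(p, one_hot_pseudo_label(q))
         + cross_entropy(q, one_hot_pseudo_label(p))
         + dice_loss(p, one_hot_pseudo_label(q))
         + dice_loss(q, one_hot_pseudo_label(p)))
```

Because the pseudo-labels are detached, each network receives the other's hard predictions purely as targets, which is what prevents the two branches from trivially collapsing onto each other.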
In addition, the authors adopt a mean‑teacher paradigm for the U‑Net. A teacher model (EMA of the student’s weights) produces predictions for two randomly selected unlabeled images; these predictions are mixed using the Mixup operation (σ = 0.5) to create an interpolated sample. The student’s prediction on the mixed image is forced to match the mixed teacher predictions via an L2 interpolation consistency loss L_ict, encouraging smooth behavior in the input space and improving robustness to noise and artifacts.
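The EMA teacher update and the interpolation consistency loss can be summarized as follows. This is a hedged sketch, not the paper's code: the linear toy "networks", the decay value, and the Beta-distributed Mixup coefficient are illustrative assumptions (the summary only states a mixing parameter of 0.5).

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Mean-teacher update: teacher <- decay * teacher + (1 - decay) * student."""
    return {k: decay * teacher[k] + (1 - decay) * student[k] for k in teacher}

def ict_loss(student_fn, teacher_fn, u1, u2, lam):
    """Interpolation consistency: the student's prediction on the mixed input
    should match the Mixup of the teacher's predictions on the raw inputs."""
    mixed = lam * u1 + (1 - lam) * u2
    target = lam * teacher_fn(u1) + (1 - lam) * teacher_fn(u2)
    return np.mean((student_fn(mixed) - target) ** 2)  # L2 consistency

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))      # toy shared weights standing in for U-Net
student_fn = lambda x: x @ w
teacher_fn = lambda x: x @ w     # identical linear maps -> zero ICT loss

u1, u2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
lam = rng.beta(0.5, 0.5)         # assumed Beta-sampled Mixup coefficient
loss = ict_loss(student_fn, teacher_fn, u1, u2, lam)
teacher = ema_update({"w": 0.0}, {"w": 1.0}, decay=0.99)
```

For a linear map the student's prediction on the mixed input equals the mixed predictions exactly, so the loss vanishes; for a real nonlinear network the loss penalizes curvature around unlabeled points, which is the smoothness the paragraph describes.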
The most novel component is the Dual‑Agreement Consistency loss L_dac, which enforces agreement between the two backbones at the pixel level. It comprises (i) a KL‑divergence alignment term L_align that minimizes the distributional distance between the categorical probability maps p(u) from EchoCare and q(u) from U‑Net, and (ii) an entropy‑based confidence term L_conf that penalizes pixel‑wise disagreement between the prediction uncertainties of the two networks.
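A minimal sketch of the dual-agreement loss is given below. The KL alignment term follows directly from the description; the exact form of L_conf is not specified in the summary, so the absolute entropy difference used here is an assumption that matches the stated intent of penalizing disagreement in uncertainty. The weighting `w_conf` is likewise hypothetical.

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    """Pixel-wise KL divergence KL(p || q), averaged over pixels."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1).mean()

def entropy(p, eps=1e-8):
    """Pixel-wise Shannon entropy of a categorical probability map."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def dual_agreement_loss(p, q, w_conf=1.0):
    """L_dac = L_align + w_conf * L_conf (assumed combination)."""
    l_align = kl_div(p, q)                            # distribution alignment
    l_conf = np.abs(entropy(p) - entropy(q)).mean()   # assumed entropy term
    return l_align + w_conf * l_conf

# Toy H x W x C maps: one uncertain branch, one confident branch.
p = np.full((4, 4, 3), 1 / 3)
q = np.zeros((4, 4, 3)); q[..., 0] = 1.0
l_dac = dual_agreement_loss(p, q)
```

The loss is zero when both branches emit identical distributions and grows both when the distributions diverge and when one branch is confident where the other is not.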