Scale-aware Adaptive Supervised Network with Limited Medical Annotations
Authors: Zihan Li, Dandan Shan, Yunxiang Li, Paul E. Kinahan, Qingqi Hong
Graphical Abstract

Scale-aware Adaptive Supervised Network with Limited Medical Annotations
Zihan Li, Dandan Shan, Yunxiang Li, Paul E. Kinahan, Qingqi Hong

[Graphical abstract: an overview of SASNet (shared encoder, low-level and high-level decoders, SDM algorithm, and frequency-domain view variance enhancement via FT/IFT), a diagram of SDM generation, and a diagram of the Scale-aware Adaptive Reweight (SAR) strategy contrasting ensemble confidence matrices with and without SAR.]

Highlights

Scale-aware Adaptive Supervised Network with Limited Medical Annotations
Zihan Li, Dandan Shan, Yunxiang Li, Paul E. Kinahan, Qingqi Hong

• We propose SASNet, a dual-branch semi-supervised segmentation network that adaptively fuses multi-scale features to improve performance under limited annotation.
• A scale-aware adaptive reweight strategy is introduced to generate more reliable ensemble predictions by selectively fusing pixel-wise results.
• A view variance enhancement mechanism simulates annotation differences across views and scales, improving robustness and segmentation accuracy.

Scale-aware Adaptive Supervised Network with Limited Medical Annotations

Zihan Li (a,b,1), Dandan Shan (a,1), Yunxiang Li (c), Paul E. Kinahan (b), Qingqi Hong (a,*)

a Xiamen University, Xiamen, 361005, China
b University of Washington, Seattle, WA 98195, USA
c Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, TX 75235, USA

Abstract

Medical image segmentation faces critical challenges in semi-supervised learning scenarios: severe annotation scarcity, since labeling requires expert radiological knowledge; significant inter-annotator variability across different viewpoints and expertise levels; and inadequate multi-scale feature integration for precise boundary delineation in complex anatomical structures. Existing semi-supervised methods show substantial performance degradation compared to fully supervised approaches, particularly in small-target segmentation and boundary refinement tasks.
To address these fundamental challenges, we propose SASNet (Scale-aware Adaptive Supervised Network), a dual-branch architecture that leverages both low-level and high-level feature representations through a novel scale-aware adaptive reweight mechanism. Our approach introduces three key methodological innovations: the Scale-aware Adaptive Reweight strategy, which dynamically weights pixel-wise predictions using temporal confidence accumulation; the View Variance Enhancement mechanism, which employs 3D Fourier-domain transformations to simulate annotation variability; and segmentation-regression consistency learning through signed distance map algorithms for enhanced boundary precision. These innovations collectively address the core limitations of existing semi-supervised approaches by integrating spatial, temporal, and geometric consistency principles within a unified optimization framework. Comprehensive evaluation on the LA, Pancreas-CT, and BraTS datasets demonstrates that SASNet achieves superior performance with limited labeled data, surpassing state-of-the-art semi-supervised methods while approaching fully supervised performance levels. The source code for SASNet is available at https://github.com/HUANGLIZI/SASNet.

Keywords: Semi-supervised learning, Medical image segmentation, Scale-aware learning

1. Introduction

In medical image segmentation, semi-supervised learning is crucial because high-quality dense annotations are both expensive and limited. At the same time, differences in annotation level may introduce a domain offset between annotations, leaving less label information that can be used [1]. Therefore, more and more researchers have begun to combine semi-supervised learning with medical image segmentation in recent years.

* Corresponding author. Email address: hongqq@xmu.edu.cn (Qingqi Hong)
1 Zihan Li and Dandan Shan contributed equally to this work.

Figure 1: Comparison of segmentation results between different branch networks and multi-view inputs in the dual-branch network. The dual-branch structure performs better in some details and is closer to the ground-truth values than the low-level-branch and high-level-branch approaches. Additionally, our model exhibits distinct prediction styles across different views, akin to the natural variability among annotations.

Among these efforts, Chaitanya et al. [2] designed a local contrastive loss to help the model learn target features and generate better pseudo-labels. In addition, many researchers have begun to study introducing regularization terms into the loss function to improve semi-supervised performance. Luo et al. [3] constructed a joint framework that uses CNN and Transformer structures to learn different features of images and uses their prediction results for mutual supervision. You et al. [4] used a teacher network and a student network and computed a contrastive loss between the two networks' predictions. At the same time, researchers are attempting to use multi-scale learning to improve medical image segmentation performance. Liu et al. [5] directly fed data of different scales into the encoder and output prediction results of different scales in the decoder. Wang et al.
[6] developed a multi-scale fusion module that fuses spatial information at different scales through route convolution. However, researchers often overlook the role of multi-scale learning in semi-supervised learning [7-9], so we design a new multi-scale learning paradigm, Scale-aware Adaptive Learning, and innovatively introduce it into semi-supervised learning.

As mentioned above, many issues in semi-supervised medical image segmentation remain unresolved. First, the challenge of missing labels remains significant, as semi-supervised methods continue to underperform fully supervised approaches; in particular, the segmentation of small targets and boundaries remains suboptimal. We therefore further exploit multi-scale information in the data to compensate for the missing labels: incorporating feature information at multiple scales enhances the model's ability to identify small targets and refine segmentation boundaries. Second, annotation variance poses a unique challenge in medical imaging, given the varying focus of annotators. Differences between annotations exacerbate the challenge of limited labeled data, especially when boundary predictions need to be refined.

To address these issues, we propose an adaptive supervised hierarchical network based on scale invariance. Specifically, to strengthen the model's ability to learn multi-scale information, we design two branches, one focusing on low-level features and the other on high-level features. Unlike previous multi-scale learning networks, we do not feed multi-scale data into different encoders; instead, we feed multi-scale encoding features into different decoders, which we believe yields high-quality encoding features. Moreover, directly using encoding features at different scales exploits the information at each scale more directly and reveals their common regions of interest, as shown in Fig. 1, a property we call Scale Invariance. The figure shows that when the low-level and high-level features are used separately, the predicted connectivity information is ignored; using them simultaneously achieves precise localization of the target and enriches the model with more semantic information. As demonstrated in Fig. 1, our model exhibits distinct prediction styles across different views, akin to the natural variability among predictions. This benefits learning from limited labeled data and enhances the model's robustness by incorporating view variance. A natural way to exploit both branches is to integrate their results into a more comprehensive and accurate segmentation. However, simply adding the two results may introduce undesirable noise and compromise overall performance. We therefore introduce the Scale-aware Adaptive Reweight (SAR) strategy, which weights the predictions of both branches pixel-wise during training, based on the confidence acquired in preceding epochs. This adaptive mechanism lets the network selectively favor more reliable results, mitigating errors introduced by direct addition. The SAR strategy refines and controls the fusion process, yielding a notable improvement in overall segmentation performance.
To address the challenge of annotation variance, we introduce a view variance enhancement approach to simulate annotation differences. As shown in Fig. 1, by changing the view of the input, we enable the model to output segmentation results from different views of the same sample, approximating annotation differences. Furthermore, we treat the segmentation results of branches at different scales as annotation differences. This view variance enhancement strategy introduces diversity into the annotations, capturing the nuanced perspectives of varying viewpoints. By integrating complementary information from different scales and views, our approach comprehensively accounts for the inherent variations in annotations. Overall, our study proposes a novel semi-supervised learning approach that incorporates adaptive learning and cross-supervised learning and addresses annotation variance via view variance enhancement. The main contributions are as follows:

• We propose a new segmentation network, SASNet, which innovatively adopts a scale-aware adaptive reweight strategy to optimize pixel-wise results from different branches, generating more reliable ensemble predictions.
• We propose an innovative view variance enhancement mechanism, synergistically merged with the multi-scale branches, that effectively emulates annotation variance. This enhancement improves the resilience of semi-supervised learning and thereby the model's segmentation performance.
• We evaluate SASNet against other SOTA methods on three public datasets: the LA dataset, the Pancreas-CT dataset, and the BraTS dataset. The results show that SASNet outperforms existing semi-supervised methods and achieves performance comparable to fully supervised methods.

Figure 2: Overview of SASNet. SASNet consists of three key components: the Dual-branch Architectural Network, the View Variance Enhancement Mechanism, and the Scale-Aware Adaptive Reweight Strategy. FT and IFT denote the 3D Fourier Transform and the 3D Inverse Fourier Transform, respectively. The training of SASNet is supervised by the SEG loss L_SEG(P_lseg, P_hseg, GT), the SRC loss L_SRC(P_lreg, P_hreg, P_sdm), and the PLC loss L_PLC(P_lseg, P_hseg).

2. Related work

2.1. Semi-supervised Learning

Traditional semi-supervised learning methods can be divided into self-training and consistency regularization [10, 11]. Self-training improves the performance of a semi-supervised model by using a model trained on labeled data to predict high-confidence pseudo-labels for unlabeled data. Chaitanya et al. [2] designed a local contrastive loss to learn features favorable for segmentation from the pseudo-labels generated during self-training. Adiga et al. [12] proposed an anatomically-aware framework that leverages unlabeled data for medical image segmentation. Ma et al. [13] introduced a mixed-domain strategy that constructs intermediate domains to bridge the gap between labeled and unlabeled data from heterogeneous sources. Qi et al.
[11] developed a gradient-aware framework targeting class-imbalance problems by adaptively modifying gradient updates based on class frequencies. Recent advances in semi-supervised medical image segmentation have explored collaboration mechanisms and versatile paradigms for leveraging unlabeled data. Zeng et al. [14] introduced reciprocal collaboration frameworks that enable bidirectional knowledge exchange between network components, while their subsequent work [15] proposed versatile paradigms that adaptively segment diverse anatomical structures through unified architectural designs. The PICK framework [16] employs prediction-masking strategies to selectively utilize confident pseudo-labels during training. While these approaches demonstrate promising results through collaborative learning and adaptive masking, they differ fundamentally from our methodology in several critical aspects. Unlike reciprocal collaboration mechanisms that rely on symmetric knowledge exchange, our Scale-aware Adaptive Reweight (SAR) strategy introduces asymmetric, confidence-driven weighting that dynamically adjusts pixel-wise predictions using historical performance matrices from preceding epochs. Furthermore, our approach uniquely integrates 3D Fourier-domain transformations for view variance enhancement, simulating annotation variability through frequency-domain manipulations rather than conventional data augmentation.

2.2. Consistency Learning

In the consistency-regularization framework, existing methods enhance the generalization ability of models by enforcing consistency in model predictions under different data augmentations [17]. Yu et al. [18] proposed an uncertainty-guided Mean Teacher framework that combines transformation consistency to improve performance. Luo et al. [3] introduced a pyramid consistency regularization framework with uncertainty correction, extending the basic segmentation network to generate pyramid predictions at different scales and supervising unlabeled images with a multi-scale consistency loss that enforces consistent predictions across scales for the same input. Bai et al. [19] proposed a bidirectional copy-paste method that combines labeled and unlabeled data within a simple Mean Teacher framework; it effectively reduces the gap between empirical distributions by encouraging unlabeled data to learn integrated common semantics from labeled data. There have also been significant advances in uncertainty-driven consistency mechanisms and multi-constraint optimization frameworks. The uncertainty-participation context consistency learning approach [20] introduces sophisticated uncertainty quantification methods that adaptively weight consistency losses based on prediction confidence, demonstrating how uncertainty estimation can enhance pseudo-label reliability. Similarly, recent developments in stereo vision have explored decoupling-coupling paradigms [21] that separate certainty estimation from primary task learning while maintaining architectural cohesion through carefully designed coupling mechanisms. Multi-constraint consistency learning frameworks [22] further advance this paradigm by incorporating diverse consistency objectives that operate across multiple semantic and spatial scales, creating robust optimization landscapes for semi-supervised training. These advances establish important precedents for uncertainty-aware learning and multi-objective consistency optimization.
However, our proposed Scale-aware Adaptive Reweight (SAR) strategy introduces distinct innovations that extend beyond conventional uncertainty-participation frameworks. Unlike static uncertainty-weighting schemes, SAR employs temporal confidence accumulation across training epochs, creating dynamic pixel-wise weighting matrices that capture long-term prediction reliability patterns rather than instantaneous uncertainty estimates.

2.3. Multi-scale Medical Image Segmentation

Owing to the abundant noise and intricate morphology of medical images, combining multi-scale information encourages models to capture lesion features across various scales [23]. Existing multi-scale medical image segmentation methods can be divided into multi-scale methods between different layers and multi-scale methods within the same layer [24]. Multi-scale methods between different layers usually learn the contextual information shared across layers [25]. Luo et al. [3] constructed a pyramid prediction network that learns from unlabeled data by minimizing the difference between segmentation predictions at different scales and their mean values. Liu et al. [5] fed samples of different scales into each layer of the encoder and output features of different scales in the decoder for deep supervision. Multi-scale methods on the same layer usually use dilated convolution or pyramid pooling to learn multi-scale features within one layer. Wang et al. [6] inserted a multi-scale background fusion module into UNet to fuse spatial information at different scales. However, the methods mentioned above only perform multi-scale fusion in the encoder. In our work, we use adaptive learning to perform a scale-aware adaptive reweight ensemble over predictions at different scales.

Figure 3: Details of the Scale-Aware Adaptive Reweight (SAR) strategy at the i-th epoch.

3. Method

The proposed SASNet, as illustrated in Fig. 2, consists of three key components: the Dual-branch Architectural Network, the Scale-Aware Adaptive Re-weighting Strategy, and the View Variance Enhancement Mechanism. For the semi-supervised training of SASNet, we formulate three types of objectives: traditional segmentation objectives using the limited label information; consistency-learning objectives between the individual regression predictions and the ensemble regression prediction obtained with the SDM algorithm; and pseudo-label cross-supervised learning objectives between the predictions at different scales. To implement these objectives, the training pipeline of SASNet proceeds as follows. The input 3D medical images are first processed by the View Variance Enhancement Mechanism to simulate annotation variability and then fed into the Dual-branch Architectural Network, where high-level and low-level features generate corresponding segmentation and regression outputs. The segmentation results from both branches are integrated via the Scale-Aware Adaptive Re-weighting Strategy, combined with temporal confidence accumulation, to produce dynamically weighted pixel-level predictions. These weighted predictions, after being processed by the SDM algorithm, are further used to supervise the regression branches, enhancing boundary precision and improving predictive capability under semi-supervised conditions.
3.1. Dual-branch Hierarchical Network

To address the challenges of multi-scale and semi-supervised learning, we design a Dual-branch Hierarchical Network comprising a shared encoder and two decoders that decode low-level and high-level features, respectively. The low-level decoder is responsible for decoding small-scale features, and the high-level decoder for large-scale features. Before the data enter the network, view variance enhancement is applied to create multiple view volumes, from which we randomly select one for processing. The shared encoder is built from interleaved ConvBlock and downsampling layers: the input image is first convolved to generate image features, which are then downsampled, and the downsampled output serves as input to the next convolutional layer. The features from layers 1 to 4 are passed to the low-level decoder, and the features from layers 1 to 5 are passed to the high-level decoder; the outputs of the fourth layer and of the final layer serve as the decoding inputs of the low-level and high-level decoders, respectively. The UpConvBlock that constitutes each decoder includes a convolutional layer for decoding and a deconvolution layer for upsampling. Additionally, we retain residual connections to improve feature flow between the encoder and decoder:

$x'_i = \mathrm{ConvBlock}_{i-1}(x_{i-1})$  (1)

$x_i = \mathrm{DownSampling}_{i-1}(x'_i)$  (2)

$y_{i-1} = \mathrm{UpConvBlock}_{i-1}(y_i) + x_{i-1}$  (3)

where $x'_i$ and $x_i$ denote the encoding feature after the (i-1)-th convolution layer and the output feature of the (i-1)-th downsampling layer, respectively, and $y_{i-1}$ denotes the decoding feature of the (i-1)-th UpConvBlock, whose inputs are $y_i$ and $x_{i-1}$.
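To make the branch wiring of Eqs. (1)-(3) concrete, the following is a minimal PyTorch sketch of the dual-branch layout, not the released implementation: the channel widths, pooling-based downsampling, and head placement are illustrative assumptions, and the regression heads used later for SDM consistency are omitted for brevity.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Encoding convolution, Eq. (1)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class UpConvBlock(nn.Module):
    """Decoding convolution plus deconvolution upsampling, Eq. (3)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = ConvBlock(in_ch, in_ch)
        self.up = nn.ConvTranspose3d(in_ch, out_ch, 2, stride=2)

    def forward(self, y):
        return self.up(self.conv(y))

class DualBranchNet(nn.Module):
    """Shared encoder; the low-level decoder consumes encoder layers 1-4,
    the high-level decoder consumes layers 1-5 (assumed widths)."""
    def __init__(self, in_ch=1, n_cls=2, widths=(16, 32, 64, 128, 256)):
        super().__init__()
        chans = (in_ch,) + widths
        self.enc = nn.ModuleList(
            [ConvBlock(chans[i], chans[i + 1]) for i in range(5)])
        self.down = nn.ModuleList([nn.MaxPool3d(2) for _ in range(4)])  # Eq. (2)
        self.dec_low = nn.ModuleList(
            [UpConvBlock(widths[i], widths[i - 1]) for i in (3, 2, 1)])
        self.dec_high = nn.ModuleList(
            [UpConvBlock(widths[i], widths[i - 1]) for i in (4, 3, 2, 1)])
        self.head_low = nn.Conv3d(widths[0], n_cls, 1)
        self.head_high = nn.Conv3d(widths[0], n_cls, 1)

    @staticmethod
    def _decode(decoder, feats):
        y = feats[-1]
        for block, skip in zip(decoder, reversed(feats[:-1])):
            y = block(y) + skip  # residual connection, Eq. (3)
        return y

    def forward(self, x):
        feats = []
        for i, conv in enumerate(self.enc):
            x = conv(x)              # Eq. (1)
            feats.append(x)
            if i < 4:
                x = self.down[i](x)  # Eq. (2)
        seg_low = self.head_low(self._decode(self.dec_low, feats[:4]))
        seg_high = self.head_high(self._decode(self.dec_high, feats[:5]))
        return seg_low, seg_high  # per-branch segmentation logits
```

With a cubic 96×96×96 crop, the low-level branch decodes from the 1/8-resolution feature while the high-level branch decodes from the 1/16-resolution feature, matching the "layers 1-4 versus layers 1-5" description above.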
3.2. Scale-aware Adaptive Reweight

As shown in Fig. 3, we employ a Scale-aware Adaptive Reweight (SAR) strategy during training to generate a more reliable ensemble prediction. Specifically, within each training epoch, for the labeled data we compute the average confidence scores from the previous two epochs (the (i-1)-th and (i-2)-th epochs) of the low-level and high-level branches. After applying a residual calculation, we individually weight the current predicted probabilities of the two branches. After summing the weighted probabilities, we pass them through a softmax function to obtain an ensemble prediction, denoted P_ensemble. By pixel-wise weighting of the predictions from the two branches and adjusting the weights intelligently, the model adaptively learns features across scales. Simultaneously, we multiply the current predicted probabilities of the two branches by the Ground Truth (GT) to derive the confidence score C_i for the i-th epoch, which is employed in the next round of calculations to further optimize the training process. For the unlabeled data, we average the current predicted probabilities of the two branches, P_lseg and P_hseg, to form the pseudo-label. The reweighting strategy for labeled data is:

$P'_{lseg} = \Big(1 + \alpha \frac{C^l_{i-1} + C^l_{i-2}}{2}\Big) \odot P_{lseg}$  (4)

$P'_{hseg} = \Big(1 + \alpha \frac{C^h_{i-1} + C^h_{i-2}}{2}\Big) \odot P_{hseg}$  (5)

$P_{ensemble} = \mathrm{softmax}(P'_{lseg} + P'_{hseg})$  (6)

P_lseg is the output predicted by the low-level decoder and P_hseg the output predicted by the high-level decoder. $C^l_{i-1}$ and $C^l_{i-2}$ denote the confidence matrices of the low-level decoder for the (i-1)-th and (i-2)-th epochs, and $C^h_{i-1}$ and $C^h_{i-2}$ those of the high-level decoder. α is a residual coefficient in the range 0 to 1; the strategy above empirically applies α = 0.5 to each branch. The confidence maps $C_i$ for the i-th epoch are computed as:

$C^l_i = P^0_{lseg} \odot [GT == 0] + P^1_{lseg} \odot [GT == 1]$  (7)

$C^h_i = P^0_{hseg} \odot [GT == 0] + P^1_{hseg} \odot [GT == 1]$  (8)

where $P^0_{lseg}$ and $P^1_{lseg}$ are the values of the 0th and 1st channels of the low-level decoder output, and $P^0_{hseg}$ and $P^1_{hseg}$ those of the high-level decoder output. $C^l_i$ and $C^h_i$ are the confidence maps of the low-level and high-level decoders obtained in the i-th epoch.

Figure 4: Diagram of Scale-aware Adaptive Reweight (SAR): comparison of output results with and without SAR, assuming the predicted probabilities Prob_l from the low-level branch and Prob_h from the high-level branch remain unchanged. $C^l_1$ denotes the confidence matrix of the low-level branch in the first epoch and α the residual coefficient. Weighted Prob_l and Weighted Prob_h are the probabilities after weighting for the low-level and high-level branches. $C^l_{ch1}$ and $C^l_{ch2}$ denote the first and second channels of $C^l_1$; $C^h_{ch1}$ and $C^h_{ch2}$ those of $C^h_1$.

Fig. 4 illustrates a comparison between the scenarios with and without SAR. The figure shows that the SAR strategy drives the predicted probabilities to converge towards the GT.
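Before walking through the Fig. 4 example in detail, here is a minimal PyTorch sketch of Eqs. (4)-(8) for the binary (two-channel) case. The per-voxel confidence map of Eqs. (7)-(8) is broadcast over both channels when weighting; how the released code handles channels and caches the per-epoch maps may differ.

```python
import torch
import torch.nn.functional as F

def confidence_map(prob, gt):
    """Eqs. (7)-(8): per-voxel probability assigned to the true class.
    prob: (B, 2, D, H, W) softmax output; gt: (B, D, H, W) in {0, 1}."""
    return prob[:, 0] * (gt == 0).float() + prob[:, 1] * (gt == 1).float()

def sar_ensemble(p_low, p_high, conf_low, conf_high, alpha=0.5):
    """Eqs. (4)-(6): weight each branch by the average confidence of the
    previous two epochs, then fuse the weighted probabilities with softmax.
    conf_low / conf_high: pairs (C_{i-1}, C_{i-2}) of cached confidence maps."""
    w_low = 1 + alpha * (conf_low[0] + conf_low[1]) / 2    # Eq. (4) weight
    w_high = 1 + alpha * (conf_high[0] + conf_high[1]) / 2  # Eq. (5) weight
    fused = w_low.unsqueeze(1) * p_low + w_high.unsqueeze(1) * p_high
    return F.softmax(fused, dim=1)  # Eq. (6): P_ensemble
```

For unlabeled voxels, where no GT exists to form C_i, the two branch probabilities are simply averaged to build the pseudo-label, as described above.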
Specifically, after adaptive reweighting, the values at the first and second positions (highlighted in red) in the second row of Pred_ensemble have transitioned from 0 to 1, aligning precisely with the GT. After applying SAR, although the second and third positions (highlighted in yellow) of the first row of Pred_ensemble still do not match the GT, the channel1/channel2 values at the second position have changed noticeably, from 1.5/0.5 to 1.685/0.565, indicating that P_ensemble tends to increase the predicted probability for channel 2 (foreground) and push the prediction towards 1. The same logic applies to the third position in the first row, where the channel1/channel2 values have increased from 0.7/1.3 to 0.825/1.525, indicating that P_ensemble tends to increase the predicted probability for channel 1 (background) and steer the expected result towards 0.

3.3. View Variance Enhancement Mechanism

To simulate annotation differences and enhance the robustness of the model, we devise a view variance enhancement mechanism. We first employ the 3D Fourier Transform (FT) to map the input 3D images into the frequency domain. A series of data augmentation operations, such as rotation, are then performed in the frequency domain to generate a sequence of images with different viewpoints. These operations expose the model to a richer variety of viewpoint changes during training, improving its robustness to input images from different viewpoints. We then use the Inverse Fourier Transform (IFT) to map the transformed images back to the spatial domain. In our approach, we sequentially select three viewpoints as inputs. We hypothesize that the prediction results from different viewpoints reflect the differences between annotations of the same view; exposing the model to such differences during training therefore enhances its robustness and adaptability to complex annotations and diverse inputs.

This method not only simulates annotation differences in the real world but also enhances the model's generalization to unseen viewpoints, making it more reliable and effective in practice. Unlike conventional spatial-domain augmentation or standard consistency regularization, this mechanism generates image transformations that reflect clinical annotation differences, so the training process specifically targets the inherent uncertainty in real annotations rather than merely increasing data diversity. Operations in the frequency domain affect local and global image features simultaneously, enabling comprehensive and coherent transformations that preserve critical anatomical structures and naturally maintain the fundamental harmonic relationships in medical images, thereby avoiding the interpolation artifacts commonly introduced by spatial rotations. In addition, combined with the dual-branch architecture, this mechanism generates controlled prediction variations that serve as proxies for inter-annotator disagreement, allowing the model to learn robust features despite annotation uncertainty. The specific operations are:

$I_f = \iiint I \cdot e^{-2\pi i (k_x x + k_y y + k_z z)}\,dx\,dy\,dz$  (9)

$I'_f = \mathrm{rotate}(I_f, \theta)$  (10)

$I' = \iiint I'_f \cdot e^{2\pi i (k_x x + k_y y + k_z z)}\,dk_x\,dk_y\,dk_z$  (11)

where I is an input 3D image, with x, y, z the three spatial coordinates and $k_x$, $k_y$, $k_z$ the frequency-domain coordinates. $I_f$ is the image mapped to the frequency domain by the 3D Fourier Transform, and $I'_f$ is the image obtained by rotating in the frequency domain, with θ the rotation angle.
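As a minimal sketch of Eqs. (9)-(11) in PyTorch: torch.fft.fftn maps the volume to the frequency domain, the centered spectrum is rotated, and torch.fft.ifftn maps it back. The paper does not specify the exact rotation operator; here we assume cubic crops and 90-degree steps (torch.rot90), which, by the Fourier rotation theorem, rotate the reconstructed image without interpolation (up to a possible one-voxel circular shift on even-sized grids).

```python
import torch

def view_variance_enhance(vol, k=1, dims=(-2, -1)):
    """Eq. (9): 3D FFT; Eq. (10): rotate the centered spectrum;
    Eq. (11): inverse 3D FFT back to the spatial domain.
    vol: real-valued volume whose last three dims are (D, H, W)."""
    spec = torch.fft.fftshift(torch.fft.fftn(vol, dim=(-3, -2, -1)))
    spec = torch.rot90(spec, k=k, dims=dims)           # 90-degree rotation
    out = torch.fft.ifftn(torch.fft.ifftshift(spec), dim=(-3, -2, -1))
    return out.real  # input is real; drop the numerical imaginary residue

# build three candidate views of one crop and pick one at random
vol = torch.randn(96, 96, 96)  # e.g. a Pancreas-CT / BraTS crop
views = [view_variance_enhance(vol, k=k) for k in (0, 1, 2)]
enhanced = views[int(torch.randint(len(views), (1,)))]
```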
3.4. Semi-supervised Training Objectives

To achieve effective semi-supervised training of SASNet, we design multiple training objectives to supervise the training process. The first is Limited Annotation Learning: we use the limited label information GT to supervise the training of the dual branches through L_seg. Next, in Pseudo Label Cross-supervised Learning, the segmentation prediction of each branch is supervised through L_plc, using the other branch's segmentation prediction as its pseudo-label. Finally, Segmentation-Regression Consistency Learning uses the integrated prediction P_ensemble and the SDM algorithm to form P_sdm, which supervises the regression predictions of the two branches through L_src. The total loss function is $L_{Semi} = \beta \cdot L_{seg} + \gamma \cdot (L_{plc} + L_{src})$, where the weight coefficient β is 0.5. During training, the consistency coefficient γ is increased by a sigmoid ramp-up function [26] as the training epochs increase, reaching its final value of 1 after 40 training epochs.

3.4.1. Limited Annotation Learning

In the supervised part, we use the limited label information for supervision. As shown in Fig. 2, GT and L_seg supervise the prediction outputs of the low-level and high-level branches, respectively. The SEG loss combines Dice loss and cross-entropy loss:

$L_{seg} = L_{Dice} + L_{CE}$  (12)

$L_{Dice} = 1 - \frac{1}{N}\sum_{i=1}^{N} \frac{2|p_i \cap y_i|}{|p_i| + |y_i|}$  (13)

$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N} y_i \log(p_i)$  (14)

where $p_i$ and $y_i$ are the prediction and GT for the i-th voxel, and N is the total number of voxels.

3.4.2. Pseudo Label Cross-supervised Learning

Unlike previous pseudo-label learning, in which an initial model is trained with labeled data and its predictions are then used as pseudo-labels for unlabeled data, we use the prediction results of the two branches as each other's pseudo-labels for cross-supervision. The PLC loss is:

$L_{plc} = \frac{1}{N}\sum_{i=1}^{N} (p_{lseg} - p_{hseg})^2$  (15)

where $p_{lseg}$ and $p_{hseg}$ are the segmentation predictions of the i-th voxel by the low-level and high-level branches.
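The following is a compact sketch of the objectives defined so far, Eqs. (12)-(15), together with the sigmoid ramp-up [26] used for γ; the exact ramp-up constant (here the common exp(-5(1-t)²) form) is an assumption.

```python
import math
import torch
import torch.nn.functional as F

def dice_loss(prob_fg, gt, eps=1e-5):
    """Eq. (13): soft Dice over all N voxels (foreground channel)."""
    inter = (prob_fg * gt).sum()
    return 1.0 - (2.0 * inter + eps) / (prob_fg.sum() + gt.sum() + eps)

def seg_loss(logits, gt):
    """Eq. (12): L_seg = L_Dice + L_CE on labeled voxels.
    logits: (B, 2, D, H, W); gt: (B, D, H, W) long tensor."""
    prob = F.softmax(logits, dim=1)
    return dice_loss(prob[:, 1], gt.float()) + F.cross_entropy(logits, gt)

def plc_loss(p_low, p_high):
    """Eq. (15): mean squared error between the two branches' softmax
    predictions, i.e. mutual pseudo-label cross-supervision."""
    return F.mse_loss(p_low, p_high)

def gamma_ramp_up(epoch, ramp_epochs=40):
    """Sigmoid ramp-up [26]: gamma rises from ~0 to 1 over 40 epochs."""
    t = min(epoch / ramp_epochs, 1.0)
    return math.exp(-5.0 * (1.0 - t) ** 2)

# total objective (L_src is defined in Section 3.4.3 below):
# loss = 0.5 * L_seg + gamma_ramp_up(epoch) * (L_plc + L_src)
```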
3.4.3. Segmentation-Regression Consistency Learning

We also introduce consistency learning for the training of SASNet. We first form the integrated segmentation prediction P_ensemble through the SAR strategy. P_ensemble is then processed through SDM to form P_sdm, and P_sdm is applied to the regression predictions P_lreg and P_hreg for segmentation-regression consistency learning. The loss function L_SRC is:

$L_{src} = \frac{1}{N}\sum_{i=1}^{N}\big[(P_{sdm} - P_{lreg})^2 + (P_{sdm} - P_{hreg})^2\big]$  (16)

where $P_{lreg}$ and $P_{hreg}$ are the regression predictions of the i-th voxel by the low-level and high-level branches, respectively, and $P_{sdm}$ denotes the output of P_ensemble after SDM processing. The SDM procedure is illustrated in Fig. 5: Posdis is formed by the distance_transform_edt (DT) function from the segmentation prediction P_seg, and the result of the anti-DT (~DT) is Negdis. Meanwhile, the Border is obtained through the find_boundaries (FB) function. The SDM is then obtained from Posdis, Negdis, and Border, and finally forms P_SDM through a normalization operation.

Figure 5: Diagram of the SDM (Signed Distance Map) generation pipeline. The workflow includes boundary extraction via find_boundaries (FB) and parallel distance transforms (DT and ~DT) producing positive and negative distance maps, followed by normalization.
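Below is a sketch of the Fig. 5 pipeline using SciPy's distance_transform_edt and scikit-image's find_boundaries, the two functions named above. The sign convention (positive outside, negative inside) and the per-volume max normalization are common SDM choices and are assumptions here, not confirmed details of the released code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.segmentation import find_boundaries

def signed_distance_map(seg):
    """SDM generation as in Fig. 5: DT / ~DT give Posdis / Negdis,
    FB extracts the Border, and the result is normalized to roughly [-1, 1].
    seg: binary 3D numpy mask (the thresholded segmentation prediction)."""
    seg = seg.astype(bool)
    if not seg.any() or seg.all():               # degenerate mask: flat SDM
        return np.zeros(seg.shape, dtype=np.float32)
    posdis = distance_transform_edt(~seg)        # distance outside the object
    negdis = distance_transform_edt(seg)         # distance inside the object
    border = find_boundaries(seg, mode='inner')  # FB step
    sdm = posdis / posdis.max() - negdis / negdis.max()
    sdm[border] = 0.0  # boundary voxels sit on the zero level set
    return sdm.astype(np.float32)
```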
4. Experiments

4.1. Datasets

LA [27]: The left atrial dataset comprises 100 gadolinium-enhanced MR imaging scans with a uniform resolution of 0.625 × 0.625 × 0.625 mm³. For consistency with prior work [28], we use the same 80 of these images for training and 20 for validation. Before training, we preprocess the images by expanding the bounds by randomly selected values between 10-20, 10-20, and 5-10 voxels along the three axes, respectively.

Pancreas-CT [29]: The pancreas dataset was collected by the National Institutes of Health Clinical Center (NIH) and contains 82 3D abdominal CT scans annotated by experts. We randomly select 62 of them for training and the remaining 20 for validation. We first resample the spacing of the data to 1.0 × 1.0 × 1.0 mm³, then clip the voxels to Hounsfield Unit values from -125 to 275 and expand the edges by 25, 25, and 0 voxels, respectively.

BraTS [30]: The Brain Tumor Segmentation 2019 dataset comprises preoperative MRI scans from 335 glioma patients across multiple institutions. Each patient's MRI includes four modalities: T1, T1Gd, T2, and T2-FLAIR. We specifically use the T2-FLAIR modality for whole-tumor segmentation due to its superior ability to highlight malignant tumors [31]. All images are resampled to an isotropic resolution of 1.0 × 1.0 × 1.0 mm³. For our experiments, we use 250 samples for training, 25 for validation, and the remaining 60 for testing.

4.2. Implementation Details

All experiments are conducted with PyTorch 1.12 and CUDA 11.3 on an NVIDIA A100 GPU. SASNet employs V-Net as the backbone. In the training phase, we randomly crop to a size of 112 × 112 × 80 for the LA dataset and 96 × 96 × 96 for the Pancreas-CT and BraTS datasets. The SGD optimizer is used with an initial learning rate of 0.01. The batch size is 4, and the temperature in the sharpening function is 0.1. In the testing phase, we use the Dice coefficient (Dice), Jaccard coefficient (Jaccard), average surface distance (ASD), and 95% Hausdorff distance (HD95) to compare performance against other methods.

Figure 6: Qualitative results of SASNet and other methods (UA-MT, SASSNet, DTC, URPC, MCNet) on the LA dataset and Pancreas-CT dataset with 20% of labeled data.

Figure 7: Qualitative results of SASNet and other methods (EM, UA-MT, DTC, URPC, AUSS) on the BraTS dataset with 20% of labeled data.

Table 1: Comparison with other SOTA methods on the LA dataset with 10% and 20% of the labeled data. (In the original, bold marks the best and red the second-best performance.)

Method | Labeled | Unlabeled | Dice (%) ↑ | Jaccard (%) ↑ | HD95 (voxel) ↓ | ASD (voxel) ↓
V-Net [32] (3DV 2016) | 8 (10%) | 0 | 79.99 | 68.12 | 21.11 | 5.48
V-Net [32] (3DV 2016) | 16 (20%) | 0 | 86.03 | 76.06 | 14.26 | 3.51
V-Net [32] (3DV 2016) | 80 (100%) | 0 | 91.14 | 83.82 | 5.75 | 1.52
UA-MT [18] (MICCAI 2019) | 8 (10%) | 72 | 86.28 | 76.11 | 18.71 | 4.63
SASSNet [33] (MICCAI 2020) | 8 (10%) | 72 | 85.22 | 75.09 | 11.18 | 2.89
DTC [28] (AAAI 2021) | 8 (10%) | 72 | 87.51 | 78.17 | 8.23 | 2.36
URPC [3] (MICCAI 2021) | 8 (10%) | 72 | 85.01 | 74.36 | 15.37 | 3.96
MRNet [34] (CVPR 2021) | 8 (10%) | 72 | 86.07 | 75.86 | 19.24 | 5.51
SCC [35] (CMIG 2022) | 8 (10%) | 72 | 86.51 | 76.54 | 10.51 | 2.56
ICT [36] (Neural Networks 2022) | 8 (10%) | 72 | 85.39 | 74.84 | 17.45 | 2.88
MC-Net+ [37] (MedIA 2022) | 8 (10%) | 72 | 87.50 | 77.98 | 11.28 | 2.30
CauSSL [38] (ICCV 2023) | 8 (10%) | 72 | 87.49 | 77.95 | 18.85 | 5.11
PLGCL [39] (ICCV 2023) | 8 (10%) | 72 | 87.28 | 77.54 | 18.98 | 5.32
AUSS [12] (MedIA 2024) | 8 (10%) | 72 | 87.57 | 78.32 | 8.17 | 2.22
MiDSS [13] (CVPR 2024) | 8 (10%) | 72 | 88.03 | 78.69 | 8.04 | 2.16
GALoss [11] (ECCV 2024) | 8 (10%) | 72 | 87.86 | 78.55 | 8.01 | 2.07
SASNet | 8 (10%) | 72 | 89.62 | 81.33 | 6.59 | 1.89
UA-MT [18] (MICCAI 2019) | 16 (20%) | 64 | 88.74 | 79.94 | 8.39 | 2.32
SASSNet [33] (MICCAI 2020) | 16 (20%) | 64 | 89.16 | 80.60 | 8.95 | 2.26
DTC [28] (AAAI 2021) | 16 (20%) | 64 | 89.52 | 81.22 | 7.07 | 1.96
URPC [3] (MICCAI 2021) | 16 (20%) | 64 | 88.74 | 79.93 | 12.73 | 3.66
MRNet [34] (CVPR 2021) | 16 (20%) | 64 | 88.62 | 80.94 | 8.83 | 2.48
SCC [35] (CMIG 2022) | 16 (20%) | 64 | 89.81 | 81.64 | 7.15 | 1.82
ICT [36] (Neural Networks 2022) | 16 (20%) | 64 | 89.02 | 80.34 | 10.38 | 1.97
MC-Net+ [37] (MedIA 2022) | 16 (20%) | 64 | 90.12 | 82.12 | 8.07 | 1.99
CauSSL [38] (ICCV 2023) | 16 (20%) | 64 | 90.16 | 82.17 | 6.11 | 1.97
PLGCL [39] (ICCV 2023) | 16 (20%) | 64 | 90.01 | 82.04 | 6.21 | 2.15
AUSS [12] (MedIA 2024) | 16 (20%) | 64 | 90.79 | 82.91 | 6.13 | 1.93
MiDSS [13] (CVPR 2024) | 16 (20%) | 64 | 90.47 | 82.55 | 6.17 | 1.80
GALoss [11] (ECCV 2024) | 16 (20%) | 64 | 90.39 | 82.46 | 6.35 | 2.01
SASNet | 16 (20%) | 64 | 91.82 | 84.93 | 4.63 | 1.42

4.3. Performance Comparison with Other Methods

4.3.1. Performance on LA Dataset

Table 1 compares our method with other state-of-the-art approaches on the LA dataset. We evaluate our method on V-Net using 10% and 20% labeled data as well as fully supervised data. With only 10% labeled data, our model exhibits noteworthy improvements over DTC [28]: specifically, gains of 2.11% in Dice and 3.16% in Jaccard.
When 20% labeled data is available, our method achieves a 1.70% improvement in Dice and a 2.81% improvement in Jaccard compared to MC-Net+ [37]. Notably, our approach even outperforms the fully supervised result obtained with V-Net. In the visualizations in Fig. 6, the first and second rows show that with only 20% labeled data, other baseline models exhibit fragmentation or missing portions in the challenging areas, whereas our method achieves comprehensive segmentation in these regions.

4.3.2. Performance on Pancreas-CT Dataset

Table 2 compares our method with other state-of-the-art techniques on the Pancreas-CT dataset. When the labeled data is limited to only 10%, SASNet achieves significant improvements over other methods.

Table 2: Comparison with other SOTA methods on the Pancreas-CT dataset with 10% and 20% of the labeled data. (In the original, bold marks the best and red the second-best performance.)

Method | Labeled | Unlabeled | Dice (%) ↑ | Jaccard (%) ↑ | HD95 (voxel) ↓ | ASD (voxel) ↓
V-Net [32] (3DV 2016) | 6 (10%) | 0 | 54.94 | 40.87 | 47.48 | 17.43
V-Net [32] (3DV 2016) | 12 (20%) | 0 | 71.52 | 57.68 | 18.12 | 5.41
V-Net [32] (3DV 2016) | 62 (100%) | 0 | 82.60 | 70.81 | 5.61 | 1.33
UA-MT [18] (MICCAI 2019) | 6 (10%) | 56 | 66.44 | 52.02 | 17.04 | 3.03
SASSNet [33] (MICCAI 2020) | 6 (10%) | 56 | 68.97 | 54.29 | 18.83 | 1.96
DTC [28] (AAAI 2021) | 6 (10%) | 56 | 66.27 | 52.07 | 15.00 | 4.44
URPC [3] (MICCAI 2021) | 6 (10%) | 56 | 73.53 | 59.44 | 22.57 | 7.85
MRNet [34] (CVPR 2021) | 6 (10%) | 56 | 72.47 | 56.92 | 15.08 | 5.14
MC-Net+ [37] (MedIA 2022) | 6 (10%) | 56 | 68.94 | 54.74 | 16.28 | 3.16
PLGCL [39] (CVPR 2023) | 6 (10%) | 56 | 73.15 | 58.83 | 14.36 | 4.51
BCP [19] (CVPR 2023) | 6 (10%) | 56 | 74.25 | 60.03 | 14.23 | 4.42
AUSS [12] (MedIA 2024) | 6 (10%) | 56 | 73.82 | 59.33 | 14.68 | 4.95
MiDSS [13] (CVPR 2024) | 6 (10%) | 56 | 74.12 | 59.84 | 13.93 | 3.97
GALoss [11] (ECCV 2024) | 6 (10%) | 56 | 73.94 | 59.51 | 14.52 | 4.71
SASNet | 6 (10%) | 56 | 76.38 | 62.84 | 13.47 | 1.82
UA-MT [18] (MICCAI 2019) | 12 (20%) | 50 | 76.10 | 62.62 | 10.84 | 2.43
SASSNet [33] (MICCAI 2020) | 12 (20%) | 50 | 76.39 | 63.17 | 11.06 | 1.42
DTC [28] (AAAI 2021) | 12 (20%) | 50 | 78.27 | 64.75 | 8.36 | 2.25
URPC [3] (MICCAI 2021) | 12 (20%) | 50 | 80.02 | 67.30 | 8.51 | 1.98
MRNet [34] (CVPR 2021) | 12 (20%) | 50 | 77.82 | 63.85 | 13.86 | 4.53
MC-Net+ [37] (MedIA 2022) | 12 (20%) | 50 | 79.05 | 65.83 | 10.29 | 2.72
PLGCL [39] (CVPR 2023) | 12 (20%) | 50 | 78.41 | 65.17 | 14.13 | 4.68
BCP [19] (CVPR 2023) | 12 (20%) | 50 | 80.37 | 67.81 | 11.53 | 2.06
AUSS [12] (MedIA 2024) | 12 (20%) | 50 | 80.02 | 66.73 | 12.38 | 3.14
MiDSS [13] (CVPR 2024) | 12 (20%) | 50 | 79.74 | 66.56 | 12.65 | 3.47
GALoss [11] (ECCV 2024) | 12 (20%) | 50 | 80.21 | 66.92 | 11.78 | 2.83
SASNet | 12 (20%) | 50 | 81.60 | 69.39 | 11.25 | 1.81
Specifically, our method outperforms URPC [3] in Dice by 2.85% and in Jaccard by 3.40%. As the percentage of labeled data increases to 20%, our method outperforms URPC [3] in Dice by 1.58% and in Jaccard by 2.09%. The third and fourth rows of Fig. 6 illustrate the visual results on the Pancreas-CT dataset with 20% labeled data: other methods tend to misclassify background as foreground, whereas our method effectively avoids this issue and accurately segments the target.

Table 3: Comparison with other SOTA methods on the BraTS dataset with 10% and 20% of the labeled data. (In the original, bold marks the best and red the second-best performance.)

Method | Labeled | Unlabeled | Dice (%) ↑ | Jaccard (%) ↑ | HD95 (voxel) ↓ | ASD (voxel) ↓
V-Net [32] (3DV 2016) | 25 (10%) | 0 | 73.00 | 59.96 | 42.64 | 2.96
V-Net [32] (3DV 2016) | 50 (20%) | 0 | 76.14 | 64.15 | 36.01 | 2.70
V-Net [32] (3DV 2016) | 250 (100%) | 0 | 84.81 | 75.53 | 8.08 | 1.94
EM [40] (ECCV 2019) | 25 (10%) | 225 | 81.74 | 71.42 | 13.71 | 2.37
UA-MT [18] (MICCAI 2019) | 25 (10%) | 225 | 80.85 | 70.32 | 14.61 | 2.57
DTC [28] (AAAI 2021) | 25 (10%) | 225 | 81.96 | 71.84 | 12.08 | 2.43
URPC [3] (MICCAI 2021) | 25 (10%) | 225 | 81.80 | 71.63 | 11.50 | 2.48
AUSS [12] (MedIA 2024) | 25 (10%) | 225 | 81.99 | 71.92 | 11.47 | 2.44
MiDSS [13] (CVPR 2024) | 25 (10%) | 225 | 81.91 | 71.03 | 11.38 | 2.37
GALoss [11] (ECCV 2024) | 25 (10%) | 225 | 82.13 | 72.11 | 11.25 | 2.39
SASNet | 25 (10%) | 225 | 82.84 | 73.00 | 10.91 | 2.31
EM [40] (ECCV 2019) | 50 (20%) | 200 | 82.37 | 72.28 | 15.83 | 2.30
UA-MT [18] (MICCAI 2019) | 50 (20%) | 200 | 81.87 | 71.42 | 13.98 | 2.49
DTC [28] (AAAI 2021) | 50 (20%) | 200 | 82.78 | 72.47 | 13.43 | 2.20
URPC [3] (MICCAI 2021) | 50 (20%) | 200 | 82.80 | 72.72 | 12.48 | 2.72
AUSS [12] (MedIA 2024) | 50 (20%) | 200 | 83.21 | 73.55 | 11.39 | 2.03
MiDSS [13] (CVPR 2024) | 50 (20%) | 200 | 83.74 | 73.93 | 11.06 | 2.29
GALoss [11] (ECCV 2024) | 50 (20%) | 200 | 83.49 | 73.72 | 10.61 | 2.18
SASNet | 50 (20%) | 200 | 85.84 | 76.79 | 7.52 | 1.62

4.3.3. Performance on BraTS Dataset

Table 3 compares our method with other state-of-the-art approaches on the BraTS dataset. Notably, our method achieves superior performance with only 20% of the data labeled, even outperforming the fully supervised approach. Compared to other semi-supervised methods with the same amount of labeling, such as URPC [3], our method surpasses it by 3.04% in Dice and 4.07% in Jaccard. As shown in Fig. 7, in the first example, other methods misclassify background as foreground in areas where the lesion boundaries are difficult to distinguish. In the second example, while other methods erroneously predict a non-lesion region within the lesion area as part of the lesion, our model avoids this misclassification, demonstrating superior segmentation of fine details.

4.4. Ablation Experiments

4.4.1. Effects of different branch numbers

In the absence of SAR, we evaluate the proposed model with one or two branches and compare performance under three different loss functions: L_seg, L_src, and L_plc. In Table 4, we observe that the two-branch model outperforms the one-branch (high-level) model in Dice when only L_seg is used or when L_seg and L_src are used together. For instance, on the LA dataset, the Dice of the two-branch model is 0.51% higher than that of the one-branch MC-Net+ [37] when only L_seg is used.
This suggests that the additional low-level branch helps improve segmentation performance by leveraging the complementary information captured by the different loss functions.

Table 4: Results of different branch numbers and training objectives on the LA dataset.

Method | Branches | L_seg | L_src | L_plc | Dice (%) ↑ | Jaccard (%) ↑ | HD95 (voxel) ↓ | ASD (voxel) ↓
MC-Net+ | 1 | ✓ | | | 90.12 | 82.12 | 8.07 | 1.99
SASNet | 1 | ✓ | ✓ | | 90.23 | 82.37 | 8.90 | 2.44
SASNet | 2 | ✓ | | | 90.63 | 82.99 | 6.50 | 2.13
SASNet | 2 | ✓ | ✓ | | 90.94 | 83.85 | 6.01 | 1.93
SASNet | 2 | ✓ | ✓ | ✓ | 91.52 | 84.42 | 4.98 | 1.48

Table 5: Results of the view variance enhancement mechanism and training objectives without SAR on the LA dataset.

Method | View Variance | L_seg | L_src | L_plc | Dice (%) ↑ | Jaccard (%) ↑ | HD95 (voxel) ↓ | ASD (voxel) ↓
SASNet | ✓ | ✓ | | | 90.63 | 82.99 | 6.50 | 2.13
SASNet | ✓ | ✓ | ✓ | | 90.94 | 83.53 | 6.01 | 1.93
SASNet | ✓ | ✓ | | ✓ | 91.25 | 84.00 | 5.06 | 1.56
SASNet | | ✓ | ✓ | ✓ | 91.16 | 83.84 | 5.80 | 1.65
SASNet | ✓ | ✓ | ✓ | ✓ | 91.52 | 84.42 | 4.98 | 1.48

4.4.2. Effects of the enhancement mechanism and training objectives

Table 5 shows ablation results for the augmentation mechanism and the training objectives without SAR on the LA dataset. The table shows that including either L_src or L_plc yields performance improvements in the presence of the augmentation mechanism, surpassing the performance achieved with L_seg alone. This is compelling evidence of the efficacy of our designed loss functions, and the model gains further momentum when all three loss functions are employed. Additionally, a comparison between the fourth and fifth rows vividly illustrates the effectiveness of the proposed view variance mechanism.

4.4.3. Ablation studies of SAR / SDM / low-level on the datasets

We conducted comprehensive ablation experiments across all three evaluated datasets to systematically assess the individual contributions of the proposed Scale-aware Adaptive Reweight mechanism, Signed Distance Map processing, and low-level feature integration. As shown in Tables 6-8, incorporating SAR, SDM, and the low-level branch significantly improves performance. The Scale-aware Adaptive Reweight strategy yields substantial improvements across all datasets, with Dice gains of 0.30% on LA, 0.74% on Pancreas-CT, and 0.27% on BraTS, accompanied by boundary-precision improvements on LA and BraTS, reflected in reduced HD95 distances. These findings substantiate the effectiveness of our confidence-based pixel-wise weighting in generating more reliable ensemble predictions by emphasizing high-confidence regions while mitigating errors from unreliable predictions. The Signed Distance Map component demonstrates robust performance gains, with Dice improvements of 0.92% on LA, 0.93% on Pancreas-CT, and 0.52% on BraTS, facilitating effective segmentation-regression consistency learning that bridges geometric understanding with probabilistic predictions. The integration of low-level features consistently enhances segmentation across all evaluated datasets, with notable improvements in small-structure detection and boundary refinement, confirming that the complementary use of multi-scale feature representations enables a more comprehensive anatomical understanding.
The consistent performance patterns across heterogeneous datasets spanning cardiac, abdominal, and neurological imaging modalities demonstrate the broad applicability and methodological robustness of the proposed components, establishing their importance for semi-supervised medical image segmentation under limited annotation.

Table 6: Ablation studies of SAR / SDM / low-level on the LA dataset.

Method | Dice (%) ↑ | Jaccard (%) ↑ | HD95 (voxel) ↓ | ASD (voxel) ↓
w/o SAR | 91.52 | 84.42 | 4.98 | 1.48
with SAR | 91.82 (0.30% ↑) | 84.93 (0.51% ↑) | 4.63 (0.35 ↓) | 1.42 (0.06 ↓)
w/o SDM | 90.90 | 83.39 | 5.13 | 1.57
with SDM | 91.82 (0.92% ↑) | 84.93 (1.54% ↑) | 4.63 (0.50 ↓) | 1.42 (0.15 ↓)
w/o low-level | 90.23 | 82.37 | 8.90 | 2.44
with low-level | 90.94 (0.71% ↑) | 83.85 (1.48% ↑) | 6.01 (2.89 ↓) | 1.93 (0.51 ↓)

Table 7: Ablation studies of SAR / SDM / low-level on the Pancreas-CT dataset.

Method | Dice (%) ↑ | Jaccard (%) ↑ | HD95 (voxel) ↓ | ASD (voxel) ↓
w/o SAR | 80.86 | 68.49 | 8.55 | 1.31
with SAR | 81.60 (0.74% ↑) | 69.39 (0.90% ↑) | 11.25 (2.70 ↑) | 1.81 (0.50 ↑)
w/o SDM | 80.67 | 68.13 | 12.72 | 1.77
with SDM | 81.60 (0.93% ↑) | 69.39 (1.26% ↑) | 11.25 (1.47 ↓) | 1.81 (0.04 ↑)
w/o low-level | 79.56 | 66.84 | 10.41 | 1.62
with low-level | 80.22 (0.66% ↑) | 67.59 (0.75% ↑) | 10.20 (0.21 ↓) | 1.92 (0.30 ↑)

Table 8: Ablation studies of SAR / SDM / low-level on the BraTS dataset.

Method | Dice (%) ↑ | Jaccard (%) ↑ | HD95 (voxel) ↓ | ASD (voxel) ↓
w/o SAR | 85.57 | 76.33 | 7.73 | 1.81
with SAR | 85.84 (0.27% ↑) | 76.79 (0.46% ↑) | 7.52 (0.21 ↓) | 1.62 (0.19 ↓)
w/o SDM | 85.32 | 76.16 | 7.89 | 1.90
with SDM | 85.84 (0.52% ↑) | 76.79 (0.63% ↑) | 7.52 (0.37 ↓) | 1.62 (0.28 ↓)
w/o low-level | 84.39 | 75.22 | 8.67 | 2.12
with low-level | 85.03 (0.64% ↑) | 75.94 (0.72% ↑) | 8.25 (0.42 ↓) | 1.96 (0.16 ↓)

4.4.4. Effects of different layer counts in the low-level decoder

Table 9 presents the ablation results for different numbers of layers in the low-level decoder. The 4-layer decoder consistently outperforms the 3-layer configuration across all evaluation metrics, indicating that a 4-layer low-level decoder more effectively captures fine-grained local features, which is particularly important for tasks requiring precise boundary delineation. The concurrent improvement in both region-based and boundary-based metrics suggests that increasing the depth of the low-level decoder not only preserves high-level semantic information but also enhances the representation of local spatial details. By comparison, the 3-layer decoder, while still effective, does not fully exploit the rich local features provided by the encoder. These observations underscore the suitability of the 4-layer configuration as the low-level decoder.

Table 9: Results of employing different numbers of layers in the low-level decoder on the LA dataset.

Method | Dice (%) ↑ | Jaccard (%) ↑ | HD95 (voxel) ↓ | ASD (voxel) ↓
SASNet (3 layers) | 91.20 | 83.90 | 5.38 | 1.80
SASNet (4 layers) | 91.82 (0.62% ↑) | 84.93 (1.03% ↑) | 4.63 (0.75 ↓) | 1.42 (0.38 ↓)

4.4.5. Ablation studies of different hyperparameters

To establish robust methodological validation and enhance reproducibility, we conducted ablation experiments investigating the impact of the hyperparameters β and γ on segmentation performance across five configurations on the LA dataset with 20% labeled data, as shown in Table 10. The analysis yields several insights that validate our design choices.
Applying γ = 1 directly, without ramp-up scheduling, yields suboptimal performance (Dice 91.36%), demonstrating the necessity of progressively escalating the consistency weight early in training to prevent pseudo-label propagation instabilities that can destabilize early learning dynamics. The optimal configuration, β = 0.5 with γ = 1 under sigmoid ramp-up scheduling, achieves superior performance across all evaluation metrics: a Dice coefficient of 91.82%, a Jaccard coefficient of 84.93%, an HD95 of 4.63 voxels, and an average surface distance of 1.42 voxels. This configuration establishes the effectiveness of balanced supervised-consistency learning, equalizing the influence of labeled supervision and semi-supervised regularization. Sensitivity analysis across γ values from 0.5 to 1 under the optimal β shows that stronger consistency regularization enhances segmentation accuracy, while modifying β beyond the balanced configuration exhibits diminishing returns.

Table 10: Ablation studies of the hyperparameters β and γ on the LA dataset.

Hyperparameters | Dice (%) ↑ | Jaccard (%) ↑ | HD95 (voxel) ↓ | ASD (voxel) ↓
β = 0.5, γ = 1.0 (w/o ramp-up) | 91.36 | 84.24 | 5.02 | 1.75
β = 0.5, γ = 0.5 (with ramp-up) | 91.02 | 83.87 | 5.12 | 1.98
β = 0.5, γ = 1.0 (with ramp-up) | 91.82 | 84.93 | 4.63 | 1.42
β = 1.0, γ = 0.5 (with ramp-up) | 91.15 | 84.03 | 5.06 | 1.86
β = 1.0, γ = 1.0 (with ramp-up) | 91.64 | 84.55 | 4.91 | 1.47

4.5. Interpretability Analysis

Fig. 8 presents feature maps from the convolutional blocks of the decoders' middle layers. Contrasting the feature maps of the high-level and low-level branches, we notice that the low-level branch pays greater attention to local features, which aids in distinguishing foreground from background, whereas the high-level branch tends to capture more global and abstract features that provide a holistic understanding of the image. We also note that the high-level branch is equipped with more residual connections than the low-level branch, which allows it to integrate fine-grained details from the lower layers and refine the boundaries between objects and background, resulting in more accurate and coherent segmentation maps. Overall, the multi-scale branches achieve a good balance between local and global information and produce feature maps with high-quality boundary cues.

Figure 8: Visual comparison between low-level and high-level feature maps of the same case (slices 27, 49, 57, and 65).

4.6. Model Complexity Analysis

As shown in Table 11, SASNet has a relatively large number of parameters, while its FLOPs are slightly lower than URPC [3]. The increased complexity mainly stems from SASNet integrating the predictions of two branches to improve the final prediction. Although this introduces additional computational overhead, the enhanced model capacity enables superior segmentation performance, achieving a reasonable balance between accuracy and complexity.

Table 11: Comparison of inference computational complexity between SASNet and other methods on the LA dataset.

Complexity | V-Net | UA-MT | SASSNet | DTC | URPC | MC-Net+ | SASNet
Params (M) | 9.18 | 9.18 | 9.44 | 9.44 | 5.85 | 9.44 | 11.26
FLOPs (G) | 46.85 | 46.85 | 46.88 | 46.88 | 69.36 | 46.88 | 67.31
5. Discussion

Although SASNet achieves competitive results, it is important to consider certain limitations. The current evaluation is restricted to three public datasets (LA, Pancreas-CT, and BraTS), which mainly cover cardiac, abdominal, and brain regions. This leaves open the question of whether the method can generalize to multi-organ segmentation tasks or more heterogeneous clinical cohorts. Moreover, SASNet has been designed and validated primarily for 3D volumetric data; it has not been evaluated on 2D slice-based data or other imaging modalities (e.g., ultrasound, X-ray), so its applicability in these scenarios remains uncertain and warrants further investigation.

In light of these limitations, future work could extend SASNet to a broader range of segmentation tasks and systematically assess its generalization across different imaging modalities, scanner types, and multi-center clinical datasets. Moreover, integrating SASNet with other advanced techniques, such as graph-based embeddings or Transformer architectures, may further enhance its robustness and practical applicability.

6. Conclusion

In this paper, we propose SASNet, a novel semi-supervised segmentation network based on scale invariance. SASNet incorporates a new multi-scale learning scheme, scale-aware adaptive learning, into semi-supervised learning, combined with the SDM algorithm to achieve consistent learning between ensemble and individual regression predictions. Additionally, we propose a view variance enhancement mechanism, combined with multi-scale branches, that emulates annotation variations; this augmentation enhances the robustness of semi-supervised learning and thereby elevates the segmentation performance of the model. Evaluation on three public datasets, the left atrium (LA) dataset, the Pancreas-CT dataset, and the BraTS dataset, shows that SASNet outperforms existing semi-supervised methods and achieves performance comparable to fully supervised methods. Future studies can investigate applying SASNet to various segmentation tasks and examine the feasibility of integrating it with other techniques.

References

[1] Z. Li, D. Song, Z. Yang, D. Wang, F. Li, X. Zhang, P. E. Kinahan, Y. Qiao, VisionUnite: A vision-language foundation model for ophthalmology enhanced with clinical knowledge, IEEE Transactions on Pattern Analysis and Machine Intelligence (2025).

[2] K. Chaitanya, E. Erdil, N. Karani, E. Konukoglu, Local contrastive loss with pseudo-label based self-training for semi-supervised medical image segmentation, Medical Image Analysis (2023) 102792.

[3] X. Luo, W. Liao, J. Chen, T. Song, Y. Chen, S. Zhang, N. Chen, G. Wang, S. Zhang, Efficient semi-supervised gross target volume of nasopharyngeal carcinoma segmentation via uncertainty rectified pyramid consistency, in: Medical Image Computing and Computer Assisted Intervention (MICCAI 2021), Part II, Springer, 2021, pp. 318–329.

[4] C. You, Y. Zhou, R. Zhao, L. Staib, J. S. Duncan, SimCVD: Simple contrastive voxel-wise representation distillation for semi-supervised medical image segmentation, IEEE Transactions on Medical Imaging 41 (9) (2022) 2228–2237.
[5] X. Liu, L. Yang, J. Chen, S. Yu, K. Li, Region-to-boundary deep learning model with multi-scale feature fusion for medical image segmentation, Biomedical Signal Processing and Control 71 (2022) 103165.

[6] X. Wang, Z. Li, Y. Huang, Y. Jiao, Multimodal medical image segmentation using multi-scale context-aware network, Neurocomputing 486 (2022) 135–146.

[7] A. Tarvainen, H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, Advances in Neural Information Processing Systems 30 (2017).

[8] Y. Zhou, et al., Semi-supervised 3D abdominal multi-organ segmentation via deep multi-planar co-training, in: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2019, pp. 121–140.

[9] Z. Li, Y. Li, Q. Li, P. Wang, D. Guo, L. Lu, D. Jin, Y. Zhang, Q. Hong, LViT: Language meets vision transformer in medical image segmentation, IEEE Transactions on Medical Imaging (2023).

[10] Z. Zhang, Z. Li, D. Shan, Y. Qiu, Q. Hong, Q. Wu, An intra- and cross-frame topological consistency scheme for semi-supervised atherosclerotic coronary plaque segmentation, in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2025, pp. 1–5.

[11] W. Qi, J. Wu, S. Chan, Gradient-aware for class-imbalanced semi-supervised medical image segmentation, in: European Conference on Computer Vision, Springer, 2024, pp. 473–490.

[12] S. Adiga, J. Dolz, H. Lombaert, Anatomically-aware uncertainty for semi-supervised image segmentation, Medical Image Analysis 91 (2024) 103011.

[13] Q. Ma, J. Zhang, L. Qi, Q. Yu, Y. Shi, Y. Gao, Constructing and exploring intermediate domains in mixed domain semi-supervised medical image segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11642–11651.

[14] Q. Zeng, Z. Lu, Y. Xie, M. Lu, X. Ma, Y. Xia, Reciprocal collaboration for semi-supervised medical image classification, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2024, pp. 522–532.

[15] Q. Zeng, Y. Xie, Z. Lu, M. Lu, Y. Wu, Y. Xia, Segment together: A versatile paradigm for semi-supervised medical image segmentation, IEEE Transactions on Medical Imaging (2025).

[16] Q. Zeng, Z. Lu, Y. Xie, Y. Xia, PICK: Predict and mask for semi-supervised medical image segmentation, International Journal of Computer Vision (2025) 1–16.

[17] D. Shan, Z. Li, Y. Li, Q. Li, J. Tian, Q. Hong, STPNet: Scale-aware text prompt network for medical image segmentation, IEEE Transactions on Image Processing (2025).

[18] L. Yu, S. Wang, X. Li, C.-W. Fu, P.-A. Heng, Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation, in: Medical Image Computing and Computer Assisted Intervention (MICCAI 2019), Part II, Springer, 2019, pp. 605–613.

[19] Y. Bai, D. Chen, Q. Li, W. Shen, Y. Wang, Bidirectional copy-paste for semi-supervised medical image segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11514–11524.

[20] J. Yin, Y. Chen, Z. Zheng, J. Zhou, Y. Gu, Uncertainty-participation context consistency learning for semi-supervised semantic segmentation, in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2025, pp. 1–5.
[21] B. Wei, H. Liu, C. Qian, Y. Jia, W. Wu, Z. Li, Decoupling while coupling: Towards more accurate stereo image sand removal beyond certainty, in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2025, pp. 1–5.

[22] J. Yin, T. Chen, G. Pei, H. Liu, Y. Yao, L. Nie, X. Hua, Semi-supervised semantic segmentation with multi-constraint consistency learning, IEEE Transactions on Multimedia (2025).

[23] Z. Li, D. Li, C. Xu, W. Wang, Q. Hong, Q. Li, J. Tian, TFCNs: A CNN-Transformer hybrid network for medical image segmentation, in: International Conference on Artificial Neural Networks, Springer, 2022, pp. 781–792.

[24] D. Shan, Z. Li, W. Chen, Q. Li, J. Tian, Q. Hong, Coarse-to-fine COVID-19 segmentation via vision-language alignment, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.

[25] Z. Li, P. Kinahan, AI-driven harmonization of radiology study descriptions using MFFNet, in: AAPM 66th Annual Meeting & Exhibition, AAPM, 2024.

[26] S. Laine, T. Aila, Temporal ensembling for semi-supervised learning, in: International Conference on Learning Representations, 2017. URL: https://openreview.net/forum?id=BJ6oOfqge

[27] Z. Xiong, Q. Xia, Z. Hu, N. Huang, C. Bian, Y. Zheng, S. Vesal, N. Ravikumar, A. Maier, X. Yang, et al., A global benchmark of algorithms for segmenting the left atrium from late gadolinium-enhanced cardiac magnetic resonance imaging, Medical Image Analysis 67 (2021) 101832.

[28] X. Luo, J. Chen, T. Song, G. Wang, Semi-supervised medical image segmentation through dual-task consistency, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 8801–8809.

[29] K. Clark, B. Vendt, K. Smith, J. Freymann, J. Kirby, P. Koppel, S. Moore, S. Phillips, D. Maffitt, M. Pringle, et al., The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository, Journal of Digital Imaging 26 (2013) 1045–1057.

[30] S. S. Bakas, BraTS MICCAI brain tumor dataset (2020). doi:10.21227/hdtd-5j88. URL: https://dx.doi.org/10.21227/hdtd-5j88

[31] R. A. Zeineldin, M. E. Karar, J. Coburger, C. R. Wirtz, O. Burgert, DeepSeg: Deep neural network framework for automatic brain tumor segmentation using magnetic resonance FLAIR images, International Journal of Computer Assisted Radiology and Surgery 15 (2020) 909–920.

[32] F. Milletari, N. Navab, S.-A. Ahmadi, V-Net: Fully convolutional neural networks for volumetric medical image segmentation, in: 2016 Fourth International Conference on 3D Vision (3DV), IEEE, 2016, pp. 565–571.

[33] S. Li, C. Zhang, X. He, Shape-aware semi-supervised 3D semantic segmentation for medical images, in: Medical Image Computing and Computer Assisted Intervention (MICCAI 2020), Part I, Springer, 2020, pp. 552–561.

[34] W. Ji, et al., Learning calibrated medical image segmentation via multi-rater agreement modeling, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12341–12351.

[35] Y. Liu, W. Wang, G. Luo, K. Wang, S. Li, A contrastive consistency semi-supervised left atrium segmentation model, Computerized Medical Imaging and Graphics 99 (2022) 102092.

[36] V. Verma, K. Kawaguchi, A. Lamb, J. Kannala, A. Solin, Y. Bengio, D. Lopez-Paz, Interpolation consistency training for semi-supervised learning, Neural Networks 145 (2022) 90–106.
[37] Y. Wu, Z. Ge, D. Zhang, M. Xu, L. Zhang, Y. Xia, J. Cai, Mutual consistency learning for semi-supervised medical image segmentation, Medical Image Analysis 81 (2022) 102530.

[38] J. Miao, C. Chen, F. Liu, H. Wei, P.-A. Heng, CauSSL: Causality-inspired semi-supervised learning for medical image segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21426–21437.

[39] H. Basak, Z. Yin, Pseudo-label guided contrastive learning for semi-supervised medical image segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19786–19797.

[40] T.-H. Vu, H. Jain, M. Bucher, M. Cord, P. Pérez, ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2517–2526.