Ultrasound Image Segmentation of Thyroid Nodule via Latent Semantic Feature Co-Registration
Xuewei Li a,b,c,d†, Yaqiao Zhu b,c,d†, Jie Gao a,b,c, Xi Wei e, Ruixuan Zhang a,b,c, Yuan Tian a,b,c, and Zhiqiang Liu a,b,c∗

a College of Intelligence and Computing, Tianjin University
b Tianjin Key Laboratory of Cognitive Computing and Application
c Tianjin Key Laboratory of Advanced Networking
d School of Future Technology, Tianjin University
e Department of Diagnostic and Therapeutic Ultrasonography, Tianjin Medical University Cancer Institute and Hospital
Tianjin, China
Email: {lixuewei, zhuyaqiao, gaojie}@tju.edu.cn, weixi@tmu.edu.cn, {zrx6566, tiany, tjubeisong}@tju.edu.cn

∗ Corresponding author. † These authors contributed equally to this work.

Abstract—Segmentation of nodules in thyroid ultrasound imaging plays a crucial role in the detection and treatment of thyroid cancer. However, owing to the diversity of scanner vendors and imaging protocols in different hospitals, the automatic segmentation model, which has already demonstrated expert-level accuracy in the field of ultrasound imaging segmentation, finds its accuracy reduced as a result of its weak generalization performance when applied in clinically realistic environments. To address this issue, the present paper proposes ASTN, a novel framework for accurate and generalizable segmentation of thyroid nodules. ASTN incorporates a unique co-registration module that concentrates on the lesion area and enhances segmentation accuracy by using anatomical structural information from various datasets. In particular, prior to co-registration, an encoder is used to extract the latent semantic features of the lesion area in both the target image and the atlas (a template set composed of multiple real anatomical structures); the features are then registered after being combined with one another. For label fusion, we estimate the similarity of results at different stages and adaptively allocate weights to each, a strategy that improves the model's fault tolerance and segmentation accuracy across data domains. Additionally, this paper provides an atlas selection algorithm to retain more prior information and alleviate the challenges associated with co-registration. As shown by the evaluation results collected from the datasets of different devices, the proposed method greatly improves model generalization while maintaining a high level of segmentation accuracy.

Index Terms—Thyroid nodule, Ultrasound image, Deep learning, Registration and segmentation

I. INTRODUCTION

Over the past decade, the incidence rate of malignant thyroid nodules has escalated by 65%. Ultrasound has become the preferred method for evaluating thyroid nodules in diagnosis [1], and the precise segmentation of nodules in thyroid ultrasound imaging has become a vital step in the detection and treatment of thyroid cancer. Considering the heavy reliance on the experience and expertise of clinical practitioners in diagnosing thyroid nodules through ultrasound imaging [2], a universal and accurate computer-aided diagnostic system is essential to prevent misdiagnosis.

Fig. 1. Comparison of Thyroid Nodule with Other Body Parts
One of the key challenges faced in the development of such systems is the variability of training datasets, influenced by differences in scanner vendors and imaging protocols. This variability often hampers the performance of segmentation models, particularly when they are applied to data from domains different from those they were trained on. As evidenced in studies led by Z. Su [3] and J. De Fauw [4], models achieving high accuracy in segmenting images from a single device often experience more than a twofold increase in error rate when applied to images from multiple devices. This poor generalization capacity has emerged as a fundamental hurdle in applying deep-learning models in clinical settings [5].

In response to the challenge of improving generalization capacity, several research teams have explored domain adaptation or domain generalization strategies, utilizing data manipulation and adversarial training to address domain shifts [6]–[8]. However, these methods often come with a significant investment of human resources and time, both for data collection and for model training. Prior to the emergence of powerful convolutional networks for pixel-level segmentation, the primary approach to segmenting biomedical images with known structure was the conventional co-registration-based method [9], [10], which aligns known anatomical topological structures with target images to achieve segmentation, and thus requires fewer computational resources than domain generalization approaches. With the proposal of Spatial Transformer Networks (STN) [11], the traditional co-registration process can be realized with deep networks. Compared with networks that perform pixel-level segmentation [12], networks that employ topological structure co-registration for segmentation are believed to exhibit superior generalization, particularly when handling data from various scanner vendors and imaging protocols [13]. This assertion forms the basis of our study, which aims to leverage the strengths of topological co-registration within a deep learning context to address the challenges of domain variability in ultrasound image segmentation.

In recent years, there have been constant endeavors to improve co-registration-based medical image segmentation. For instance, Bin Huang [14] developed a lightweight neural network that improved the accuracy of segmenting head and neck cancer lesions in organ-at-risk delineation through co-registration techniques. Similarly, Jiazhen Zhang's work [15], which utilized an anatomical atlas for post-processing segmentation results, demonstrated the potential for enhancing image segmentation accuracy by dynamically combining atlas images with semantic segmentation outcomes. To mitigate the false pixel predictions caused by noise during cardiac segmentation, Atlas-ISTN [9] directly co-registered the constructed atlas labels with the segmented results after obtaining an initial cardiac segmentation mask. However, these advances encounter unique challenges when applied to thyroid nodule segmentation in ultrasound images. Unlike the body parts mentioned above, thyroid nodules exhibit minimal textural contrast with surrounding tissues, as illustrated in Fig. 1. This similarity in texture has led to lower accuracy in previous co-registration attempts for thyroid nodules.
This specific challenge underscores the need for a more refined approach to ultrasound image segmentation for thyroid nodules, which is the focus of our study.

TABLE I
Comparison of the DSC results of co-registration on different data using the SyN method, including thyroid nodules (TN), left ventricular myocardium (LVM), left atrial cavity (LAC), ascending aorta (AA), liver, optic cross (OC), lumbar vertebrae (LV), and left ventricular cavity (LVC).

| Data  | TN    | LVM  | LAC   | AA    |
|-------|-------|------|-------|-------|
| DSC ↑ | 27.3  | 52.6 | 64.49 | 45.55 |

| Data  | Liver | OC   | LV    | LVC   |
|-------|-------|------|-------|-------|
| DSC ↑ | 75.13 | 55.7 | 74    | 70.79 |

In this section, we conducted tests on thyroid nodule co-registration using symmetric image normalization (SyN) [16], which is regarded as one of the top-performing co-registration methods. Surprisingly, the co-registration Dice Similarity Coefficient (DSC) for thyroid nodules was found to be only 27.3%, significantly lower than for the other body parts shown in TABLE I [13], [14], [17], [18]. This stark contrast underscores the unique challenges posed by thyroid nodules, primarily due to their subtle textural differences from surrounding tissues and their unpredictable growth positions within ultrasound images. Current methods [19], [20], while effective in aligning the overall image shape, often overlook the critical semantic details of the nodules, leading to a loss of original anatomical integrity. Furthermore, existing atlas selection techniques [21], [22] fall short in ensuring spatial proximity between the lesion regions in the atlas and the target images, thus affecting the accuracy of co-registration and subsequent segmentation [23].
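For readers who wish to reproduce a baseline of this kind, a SyN test along the lines of TABLE I can be run with ANTsPy, the Python wrapper around ANTs [32], whose SyN implementation follows [16]. The sketch below is not the paper's evaluation code; the file names and the single-pair setup are assumptions for illustration.

```python
# Hedged sketch of a SyN co-registration baseline with ANTsPy; file names
# and preprocessing are illustrative, not the paper's actual pipeline.
import ants
import numpy as np

fixed = ants.image_read("target_us.nii.gz")        # target ultrasound image
moving = ants.image_read("atlas_us.nii.gz")        # atlas ultrasound image
moving_label = ants.image_read("atlas_label.nii.gz")

# SyN registration of the atlas image onto the target image
reg = ants.registration(fixed=fixed, moving=moving, type_of_transform="SyN")

# Warp the atlas label with the resulting transform; nearest-neighbour
# interpolation keeps the label binary
warped_label = ants.apply_transforms(
    fixed=fixed, moving=moving_label,
    transformlist=reg["fwdtransforms"], interpolator="nearestNeighbor")

# Dice similarity coefficient between warped atlas label and expert label
gt = ants.image_read("target_label.nii.gz").numpy() > 0
wl = warped_label.numpy() > 0
dsc = 2 * np.logical_and(gt, wl).sum() / (gt.sum() + wl.sum())
print(f"co-registration DSC: {dsc:.4f}")
```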
To address these challenges, we propose the co-registration and segmentation framework ASTN, specifically tailored for thyroid nodules in ultrasound images. Our approach introduces two innovative components: a novel atlas selection algorithm and a dictionary system. The atlas selection algorithm, guided by the Regional Correlation Score (RCS), constructs dictionary elements by selecting regionally representative images and labels. This approach bridges the spatial and textural gaps between the atlas and the target lesion areas, thereby simplifying the co-registration process. The dictionary system, comprising a newly developed co-registration network termed "half-STN" and an advanced label fusion method, leverages the semantic information of the nodules to determine the spatial relationship between the dictionary elements and the nodules in the target images. This system marks a significant shift in co-registration strategy, moving from traditional alignment of superficial image features to a deeper alignment of latent semantic features. By doing so, it generates a Displacement Field (DF) that maintains the integrity of the nodule areas, thus reducing the segmentation discrepancies caused by variations in imaging devices and enhancing the generalization of the model. In the present study, the experiments were conducted on the Thyroid Ultrasound Image (TUI) dataset collected by two different devices. The key contributions of this research are summarized as follows:

• We introduce a pioneering co-registration-based segmentation method tailored for thyroid nodules in ultrasound images. This method uniquely extracts and utilizes semantic information during the co-registration process, ensuring the preservation of critical anatomical structures for more accurate segmentation masks. Additionally, we refine a method for calculating weights during the label fusion of the multi-stage warped atlas.

• We propose an atlas selection algorithm designed to ensure that the atlas's data distribution comprehensively encompasses the variability observed in the target images, thereby significantly boosting the co-registration's effectiveness.

• We evaluate the performance of our methodology on the TUI dataset, achieving remarkable results. The DSC for co-registration has been elevated to 88.59%, and the IoU of the segmentation results increased by 1.34% and 6.524% in the known and unseen domains, respectively, compared to existing methods.

II. METHODOLOGY

In this section, we detail the implementation of ASTN, illustrated in Fig. 4. The framework comprises two primary components: a novel atlas selection algorithm and a dictionary system. The atlas selection algorithm is integral to constructing the dictionary atlas, which forms the foundation of our targeted segmentation approach. Meanwhile, the dictionary system plays a pivotal role in the segmentation process and consists of two specialized components. The first, the Semantic Extraction part (highlighted in green in the figure), is dedicated to accurately localizing nodule features within the ultrasound images. The second, the Deformation Fusion part (highlighted in red), is responsible for warping and integrating the atlas labels based on those features.

Fig. 2. Regional Correlation Score. The figure illustrates the RCS computation process when M is 9. The red dots represent the centroids C_m of region m, and the black dot represents the centroid C' of the entire nodule. The number of nodule pixels in the current region is N_w, and the number of background pixels is N_b.

A. Regional Correlation Score

Addressing the complexities inherent in co-registering thyroid nodules in ultrasound imaging, our methodology confronts the challenges posed by the variability of imaging positions and the sparse distribution of nodules. To mitigate the impact of these challenges, our atlas selection algorithm strategically utilizes the RCS to assemble an atlas that effectively represents thyroid ultrasound images with a uniform distribution of nodules.

The atlas selection algorithm calculates the RCS for each candidate image. To do this, we first dissect each candidate image into a grid of regions, denoted as u rows by v columns, resulting in u × v = M total regions, as illustrated in Fig. 2. For each region in a given candidate (labelled as a), the RCS computation is carried out as follows.

Standard Deviation of Proportion (Std(P_m)): this is based on

$$P_m = \frac{N_w}{N_w + N_b} \tag{1}$$

where N_w represents the number of nodule pixels within region m, and N_b denotes the number of background pixels in the same region. P_m thus signifies the proportion of the nodule area within that specific region.

Standard Deviation of Centroid Distance (Std(D_m)): here,

$$D_m = \sqrt{(C'_x - C_{mx})^2 + (C'_y - C_{my})^2} \tag{2}$$

which is the Euclidean distance between the centroid of the nodule area and the centroid of region m. The coordinates of these centroids are denoted as (C_{mx}, C_{my}) for the region and (C'_x, C'_y) for the nodule area. This distance metric helps in understanding how centrally located the nodule is within each region.
The RCS for each region, s_m, is then defined as

$$s_m = \mathrm{Std}(P_m) - \mathrm{Std}(D_m) \tag{3}$$

which captures both the spatial distribution and the centrality of the nodules within each region, providing a comprehensive score. Higher scores indicate a stronger correlation between the candidate image and the target region, suggesting a more accurate alignment potential with nodules in that region.

Once the RCS for all candidates and regions has been computed, we aggregate these scores to form the atlas dictionary. The atlas is then assembled by selecting the candidate with the highest RCS for each region, yielding the final atlas as

$$\text{Atlas} = \{\arg\max(S_m),\ m = 1, 2, \cdots, M\} \tag{4}$$
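The selection procedure of Eqs. (1)-(4) can be sketched in a few lines of NumPy. Eq. (3) leaves implicit the population over which Std(·) is taken; the sketch below reads it as the per-region deviation of each candidate across the candidate pool, which is one plausible interpretation rather than the authors' exact code, and all function names are illustrative.

```python
# Illustrative NumPy sketch of RCS-based atlas selection (Eqs. 1-4).
import numpy as np

def regional_stats(label, u=3, v=3):
    """P_m (Eq. 1) and D_m (Eq. 2) for one binary nodule mask of shape (H, W)."""
    H, W = label.shape
    ys, xs = np.nonzero(label)
    cy, cx = (ys.mean(), xs.mean()) if len(ys) else (H / 2, W / 2)  # nodule centroid
    P, D = [], []
    for i in range(u):
        for j in range(v):
            region = label[i * H // u:(i + 1) * H // u,
                           j * W // v:(j + 1) * W // v]
            Nw = region.sum()                      # nodule pixels in region m
            Nb = region.size - Nw                  # background pixels in region m
            P.append(Nw / (Nw + Nb))               # Eq. (1)
            my, mx = (i + 0.5) * H / u, (j + 0.5) * W / v   # region centroid
            D.append(np.hypot(cy - my, cx - mx))   # Eq. (2)
    return np.array(P), np.array(D)

def select_atlas(candidate_labels, u=3, v=3):
    """For each of the M = u*v regions, return the index of the candidate with
    the highest RCS s_m (Eqs. 3-4), with Std(.) read as deviation over the pool."""
    P = np.stack([regional_stats(l, u, v)[0] for l in candidate_labels])
    D = np.stack([regional_stats(l, u, v)[1] for l in candidate_labels])
    # a large nodule share and a small centroid distance both raise the score
    s = (P - P.mean(0)) / (P.std(0) + 1e-8) \
        - (D - D.mean(0)) / (D.std(0) + 1e-8)
    return [int(np.argmax(s[:, m])) for m in range(u * v)]    # Eq. (4)
```

With u = v = 3, `select_atlas` returns one candidate index per region, i.e., an atlas of M = 9 elements, the size found optimal in Section III-D.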
B. Dictionary System

a) Half-STN Model: To maintain anatomical integrity during co-registration, we propose the Half-STN (HS) model, which utilizes semantic information of ultrasound images derived from a segmentation network to generate the DF. Compared to a DF obtained from the entire image using STN, in-depth semantic information avoids the blind pixel-level alignment that can compromise topological structure, resulting in more faithful deformation outcomes. The representative segmentation network Unet is used as the backbone in this section to facilitate the description.

Fig. 3. Dictionary System. Half-STN is denoted as HS; WA stands for warped atlas.

Fig. 4. Overview of our ASTN. ASTN encompasses the components of Atlas Selection, Semantic Extraction, and Deformation Fusion. During training, a fixed selected atlas is fed into the network along with the target ultrasound image (target US). The Deformation Fusion module leverages the semantic features and the initial segmentation result provided by Semantic Extraction to generate the final segmentation. The Feature Combination and Weighted Label Fusion are depicted in Fig. 5 and Fig. 6, respectively.

Fig. 5. Feature Combination. The upper blue encoder receives I_A, generating features of dimensions M × N. The lower green encoder receives I_T and obtains features of dimensions 1 × N. After combining the two into an M × N dimensional feature (denoted as V in Fig. 3), it is used as an input to the red half-STN.

Since an initial segmentation result is needed in the subsequent label fusion stage to assign an appropriate weight to each warped atlas, the Unet is first trained for segmentation independently before building the whole framework. This process enables the encoder of Unet to extract semantic features of nodules from ultrasound images, which serve as the information source for generating the DF in the HS.

When performing co-registration, it is typically necessary to input both the fixed and the moving image, which correspond to the target feature Q and the atlas feature K in our HS method. The semantic features are obtained as

$$Q = U_{enc}(I_T), \quad K_A = U_{enc}(I_A) \tag{5}$$

where U_enc denotes the encoder of Unet and I_T denotes the target image; there are M images in the atlas I_A, whose features are defined as K_A = {f_a, a = 1, 2, ..., M}.

In the subsequent training, it is essential to maintain the feature extraction capacity of U_enc and ensure the rationality of the initial segmentation result, so the following loss function serves as a term of the overall loss of ASTN, enabling Unet to actively engage in the holistic optimization of the model:

$$\mathcal{L}_{sim}(Seg_{initial}, L_T) = \left(\left\| Seg_{initial} - L_T \right\|_2\right)^2, \quad Seg_{initial} = U_{dec}(Q) \tag{6}$$

where U_dec denotes the decoder of Unet, L_T denotes the target label, and the mean square error (MSE) is used to compute the segmentation loss L_sim of Unet.

For the half-STN to capture the correspondence between the fixed and the moving image, Q and K are combined by broadcasting the dimensions and concatenating, as depicted in Fig. 5. Additionally, to preserve the high-dimensional information, the information propagated through each skip connection is concatenated individually.

To discern the spatial positional relationship between the implicit features of nodules extracted from Q and K_A, we devise the Half-STN network, which captures the DF from these features. When constructing the Half-STN, we employ a network architecture similar to the decoder structure in Semantic Extraction, with the aim of enhancing the universality of the ASTN framework while enabling comparison of co-registration performance across different structural decoders. The co-registration process is

$$\tilde{L}_A = L_A \circ (Id + DF_A), \quad DF_A = HS(Q, K_A) \tag{7}$$

where DF_A is the DF of I_A towards the target, whose size matches that of the original image and encompasses the coordinates of each pixel of the original image after the deformation. $\tilde{L}_A$ denotes the warped atlas labels, while Id represents the identity transform [24]. At this stage, the framework incorporates the following loss function to optimize the Half-STN:

$$\mathcal{L}_{reg}(\tilde{L}_A, L_T, DF) = \mathcal{L}_{sim}(\tilde{L}_A, L_T) + \lambda_1 \mathcal{L}_{sot}(DF) \tag{8}$$

where L_sim is equivalent to the loss function used in segmentation. When $\tilde{L}_A$ and L_T become excessively similar, the DF employed for co-registration loses its smoothness, which is implausible in biological anatomy. Hence, we restrict the DF to maintain its smoothness by applying the diffusion regularizer L_sot to the planar gradient of the displacement field; the coefficient λ1 is the smoothness loss factor. Controlling the effect of the segmentation loss with λ2, the final loss L for the whole network is

$$\mathcal{L} = \mathcal{L}_{reg}(\tilde{L}_A, L_T, DF) + \lambda_2 \mathcal{L}_{sim}(Seg_{initial}, L_T) \tag{9}$$

The overall training procedure of the dictionary system is summarized in Algorithm 1.

Algorithm 1: The Algorithm of Dictionary System
Input: Training dataset D_T; Atlas A; Batch size B
Output: Optimal Enc, Dec, HS
 1: Θ_Enc, Θ_Dec, Θ_HS ← initialize network parameters randomly
 2: for iter in range(SegEpoch) do
 3:     Sample a mini-batch (X, Y) = {(x_i, y_i)} of size |B| from D_T
 4:     F ← Enc(X)                              // features of X
 5:     ỹ_0 ← Dec(F)                            // initial segmentation result
 6:     L_seg ← (‖ỹ_0 − Y‖_2)^2                 // segmentation loss
 7:     Θ_Enc ← Θ_Enc − ∇_{Θ_Enc} L_seg         // update segmentation network
 8:     Θ_Dec ← Θ_Dec − ∇_{Θ_Dec} L_seg
 9: end for
10: for iter in range(RegEpoch) do
11:     Sample a mini-batch (X, Y) from D_T
12:     Get atlas images and labels (X_A, Y_A) from A
13:     F, F_A ← Enc(X), Enc(X_A)
14:     ỹ_0 ← Dec(F)
15:     L_seg ← (‖ỹ_0 − Y‖_2)^2                 // segmentation loss
16:     W ← HS(F, F_A)                          // warped labels
17:     Compute the weights V using Eq. (10)
18:     ỹ ← Σ V · W                             // final prediction
19:     Compute L_reg using Eq. (8)
20:     Θ_Enc ← Θ_Enc − ∇_{Θ_Enc}(L_seg + λ L_reg)   // update all networks
21:     Θ_Dec ← Θ_Dec − ∇_{Θ_Dec} L_seg
22:     Θ_HS ← Θ_HS − ∇_{Θ_HS}(λ L_seg + L_reg)
23: end for
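To complement the pseudocode, the registration path of Eqs. (5)-(9) can be sketched in PyTorch as follows. The function names are placeholders (the paper's half-STN mirrors the segmentation decoder, which is not reproduced here), the grid-sample warping is one standard way to realize L_A ∘ (Id + DF), the concatenation yields an M × 2N feature whereas Fig. 5 reports M × N (so the exact combination operator is an assumption), and the λ defaults are illustrative.

```python
# Illustrative PyTorch sketch of the co-registration path (Eqs. 5-9).
import torch
import torch.nn.functional as F

def combine_features(q, k_atlas):
    """Broadcast the target feature Q (1 x N) over the M atlas features
    K_A (M x N) and concatenate them (cf. Fig. 5)."""
    return torch.cat([k_atlas, q.expand(k_atlas.shape[0], -1)], dim=1)

def warp_labels(atlas_labels, df):
    """One standard realization of L_A o (Id + DF) (Eq. 7) via bilinear
    resampling. atlas_labels: (M, 1, H, W); df: (M, 2, H, W) in pixels."""
    M, _, H, W = atlas_labels.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    identity = torch.stack([xs, ys], dim=-1).expand(M, H, W, 2)  # Id grid
    disp = torch.stack([df[:, 0] * 2 / (W - 1),                  # x shift
                        df[:, 1] * 2 / (H - 1)], dim=-1)         # y shift
    return F.grid_sample(atlas_labels, identity + disp, align_corners=True)

def diffusion_regularizer(df):
    """L_sot: penalize the planar gradient of the displacement field."""
    dy = (df[:, :, 1:, :] - df[:, :, :-1, :]) ** 2
    dx = (df[:, :, :, 1:] - df[:, :, :, :-1]) ** 2
    return dy.mean() + dx.mean()

def total_loss(warped, seg_initial, target_label, df, lam1=0.01, lam2=1.0):
    """L = L_reg + lam2 * L_sim (Eqs. 6, 8, 9); MSE stands in for the
    squared L2 norm up to a constant scale factor."""
    l_sim = F.mse_loss(seg_initial, target_label)
    l_reg = F.mse_loss(warped, target_label.expand_as(warped)) \
            + lam1 * diffusion_regularizer(df)
    return l_reg + lam2 * l_sim
```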
b) Warped label fusion: To effectively address the challenges posed by the dispersed distribution of nodules in the atlas and the resultant inconsistencies in the precision of warped labels obtained through co-registration, we have developed an innovative label fusion strategy. This approach is designed to enhance the fault tolerance of the co-registration process, ensuring that the final segmentation mask remains accurate even in the presence of co-registration errors in individual atlas elements. The strategy, including its use of the initial segmentation as a reference, is detailed in Fig. 6.

Fig. 6. Warped label fusion. The initial segmentation is evaluated against each warped label using DSC, yielding respective weights upon normalization, as indicated by the red arrow. The fusion process, depicted by the grey arrow, involves the weighted summation of warped labels to obtain the final segmentation outcome.

In assigning weights to each warped label, we diverge from traditional methods that rely heavily on the information itself, as done by S. K. Warfield [25], or on the creation of a subnetwork, as suggested by Long Xie [13]. Instead, we adopt a more streamlined method that maintains accuracy while minimizing the number of required parameters. This is achieved by leveraging the initial segmentation (Seg_initial) obtained from the Semantic Extraction process. The DSC is calculated between each warped label in $\tilde{L}_A = \{\tilde{l}_a,\ a = 1, 2, \cdots, M\}$ and Seg_initial, and after normalization we obtain the weight v_a for the current warped label. In practice, to fully exploit the target image information, the fusion incorporates Seg_initial in the final weighted summation, with a weight of v_0 = 1/(M+1) assigned to it. The expression for v_a is

$$v_a = \frac{D_a}{v_0 + \sum_{a=1}^{M} D_a}, \quad D_a = DSC(\tilde{l}_a, Seg_{initial}) \tag{10}$$

where D_a denotes the DSC between the current warped label $\tilde{l}_a$ and the initial segmentation. After obtaining all the weights V_A, the obtained segmentation and co-registrations $W_A = \{Seg_{initial}, \tilde{L}_A\}$ are fused. The fused result, which is the output of the entire model, is

$$output = \sum_{i=0}^{M} v_i \cdot w_i; \quad v_i \in V_A,\ w_i \in W_A \tag{11}$$

where index 0 refers to Seg_initial.
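Eqs. (10)-(11) amount to only a few lines. The minimal sketch below assumes binary masks stored as float tensors and an illustrative dsc() helper; it is a reading of the fusion rule, not the authors' code.

```python
# Minimal sketch of the weighted label fusion (Eqs. 10-11).
import torch

def dsc(a, b, eps=1e-8):
    """Dice similarity coefficient between two binary masks."""
    inter = (a * b).sum()
    return (2 * inter / (a.sum() + b.sum() + eps)).item()

def fuse_labels(warped_labels, seg_initial):
    """warped_labels: list of M (H, W) masks; seg_initial: (H, W) mask."""
    M = len(warped_labels)
    v0 = 1.0 / (M + 1)                                # fixed weight of Seg_initial
    d = [dsc(l, seg_initial) for l in warped_labels]  # D_a = DSC(l_a, Seg_initial)
    denom = v0 + sum(d)
    v = [da / denom for da in d]                      # Eq. (10)
    out = v0 * seg_initial                            # Eq. (11), index 0 term
    for va, la in zip(v, warped_labels):
        out = out + va * la
    return out
```

Because the weights scale with each warped label's agreement with the initial segmentation, a single badly registered atlas element is automatically down-weighted, which is exactly the fault tolerance the fusion strategy targets.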
III. EXPERIMENTS AND RESULTS

A. Datasets

We conducted experiments using the TUI dataset collected from a collaborating hospital, comprising 11,360 images from the P1 device and 800 images from the M3 device. The P1 dataset was utilized to assess the co-registration and segmentation performance of the model, encompassing 4,796 benign nodule images and 6,564 malignant nodule images, all expertly annotated. To maintain class balance, we randomly selected 4,500 images each from the benign and malignant nodule images as the P1 training set, while the remaining 2,360 images served as the P1 test set. Furthermore, all images were resized to (224, 224) using bilinear interpolation. The M3 dataset consisted of 400 benign nodule images and 400 malignant nodule images, serving as the evaluation set for the generalization of ASTN; it underwent the same preprocessing steps as the P1 dataset.

B. Model Settings and Metrics

The PyTorch framework was used for all model development. All code was executed on an NVIDIA RTX 3090 (24 GB memory). We trained for 120 epochs, using RMSprop as the optimizer. The learning rate of the Semantic Extraction network was 0.00001, while the Deformation Fusion network had a learning rate of 0.0001; both decayed by a factor of 0.1 every 40 epochs. The batch size was kept constant at 6. The dimension of the latent-space feature of each ultrasound image was 1 × 1024, and the dimension of the DF was twice the size of the input image, i.e., 2 × 224 × 224.

We validated the model's effectiveness through comparative and ablation experiments, which verify the model's segmentation and registration performance from different angles. For comparing segmentation quality, we used common segmentation evaluation metrics. The Dice Similarity Coefficient (DSC, higher is better) measures the similarity between two images by calculating the ratio of the intersection size to the total pixels of the predicted and ground-truth segmentations. The Intersection over Union (IoU, higher is better), also known as the Jaccard coefficient, is similar to DSC but is less affected by changes in segmentation category size. The Hausdorff Distance (HD, lower is better) calculates the maximum distance between the predicted and ground-truth segmentations, capturing mismatched areas and providing a more detailed accuracy assessment. The Average Symmetric Surface Distance (ASSD, lower is better) calculates the average distance between the predicted and ground-truth segmentations; it is similar to HD but quantifies symmetric errors, which may be more informative for certain application scenarios. These metrics are defined as

$$DSC = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$
$$IoU = \frac{TP}{TP + FP + FN}$$
$$HD(P, Q) = \max\big(hd(P, Q),\ hd(Q, P)\big)$$
$$ASSD = \frac{1}{2}\big(ASD(P, Q) + ASD(Q, P)\big) \tag{12}$$

where TP is the number of true positive pixels (predicted as positive and actually positive), FP is the number of false positive pixels (predicted as positive but actually negative), and FN is the number of false negative pixels (predicted as negative but actually positive). P and Q respectively denote the predicted and the ground-truth segmentation image. hd(P, Q) is the directed distance built from the nearest-neighbor distance of each pixel in P to Q, and ASD(P, Q) is the average of these nearest-neighbor distances.
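The four metrics of Eq. (12) can be implemented along the following lines. The HD/ASSD variants here run on all mask pixels via SciPy's Euclidean distance transform rather than on extracted boundary surfaces, which is a simplification of common practice and may differ in detail from the evaluation code used for the paper's numbers.

```python
# Hedged reference implementations of the metrics in Eq. (12) on binary masks.
import numpy as np
from scipy.ndimage import distance_transform_edt

def dsc_iou(pred, gt):
    """DSC and IoU from pixel-wise TP/FP/FN counts; pred, gt: bool arrays."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return 2 * tp / (2 * tp + fp + fn), tp / (tp + fp + fn)

def nn_distances(p, q):
    """Nearest-neighbor distance from each pixel of mask p to mask q."""
    dt_q = distance_transform_edt(~q)   # distance to q, zero inside q
    return dt_q[p]

def hd_assd(pred, gt):
    """Hausdorff distance and average symmetric surface distance."""
    d_pq = nn_distances(pred, gt)
    d_qp = nn_distances(gt, pred)
    hd = max(d_pq.max(), d_qp.max())               # HD(P, Q), Eq. (12)
    assd = 0.5 * (d_pq.mean() + d_qp.mean())       # ASSD, Eq. (12)
    return hd, assd
```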
TABLE II
Comparison of segmentation results with state-of-the-art segmentation networks for thyroid nodules on different datasets, including five comparison methods and ASTN with each method as the backbone. Results for each metric are given as mean ± standard deviation.

| Method | DSC ↑ (P1) | IoU ↑ (P1) | HD ↓ (P1) | ASSD ↓ (P1) | DSC ↑ (M3) | IoU ↑ (M3) | HD ↓ (M3) | ASSD ↓ (M3) |
|---|---|---|---|---|---|---|---|---|
| FDNSCL (2023) | 91.09±4.1 | 83.63±2.1 | 4.958±2.2 | 0.3823±0.22 | 76.17±6.5 | 61.51±3.4 | 14.78±2.6 | 0.8194±0.62 |
| ASTN-FDNSCL | 91.88±2.6 | 84.97±1.3 | 4.846±1.5 | 0.1873±0.26 | 79.52±5.9 | 66.00±1.4 | 11.31±2.0 | 0.6564±0.41 |
| Transdeeplab (2022) | 89.89±2.5 | 81.63±1.2 | 6.487±3.1 | 0.3238±0.28 | 75.19±6.6 | 60.24±3.4 | 11.68±8.5 | 0.8319±0.74 |
| ASTN-Transdeeplab | 90.76±7.3 | 83.08±3.8 | 6.357±2.1 | 0.4630±0.44 | 83.15±3.3 | 71.16±1.7 | 10.58±2.1 | 0.7236±0.68 |
| SegNet (2017) | 93.22±3.5 | 87.30±1.8 | 4.849±2.5 | 0.1953±0.18 | 74.74±6.3 | 59.67±3.3 | 12.56±1.5 | 1.059±0.83 |
| ASTN-SegNet | 92.15±3.2 | 85.44±1.6 | 5.194±3.7 | 0.1713±0.12 | 80.39±6.6 | 67.21±3.4 | 11.29±9.0 | 0.8682±0.58 |
| DeconvNet (2015) | 88.57±4.7 | 79.48±2.4 | 9.860±3.2 | 0.3627±0.29 | 68.46±3.4 | 52.05±1.7 | 25.62±8.6 | 1.679±0.84 |
| ASTN-DeconvNet | 91.73±3.2 | 84.72±1.6 | 10.83±3.9 | 0.2176±0.28 | 72.21±4.2 | 56.51±2.1 | 16.06±6.2 | 1.068±0.57 |
| Unet (2015) | 92.27±4.9 | 85.65±2.5 | 6.893±2.0 | 0.1493±0.15 | 72.78±5.1 | 57.21±2.6 | 36.41±9.3 | 2.457±1.7 |
| ASTN-Unet | 92.57±3.6 | 86.18±1.8 | 6.026±2.8 | 0.1388±0.17 | 76.86±4.8 | 62.42±2.5 | 24.53±7.5 | 1.843±0.39 |

Fig. 7. Comparison Experiment. The first column shows the target ultrasound images, while the last column shows the annotated labels made by experts. The remaining columns exhibit the prediction outcomes of different models. Nodule areas are depicted in white. The co-registration process of the red target is illustrated in Fig. 8.

C. Comparison Experiment

a) Comparison with Segmentation Models: To evaluate the precision and generality of ASTN, we employed the common segmentation networks FC-DenseNet+SA+C-LSTM (FDNSCL) [26], Transdeeplab [27], SegNet [28], DeconvNet [29], and Unet [30] as backbones, respectively, and compared the segmentation results with these networks on the P1 and M3 datasets, as summarized in TABLE II. For a comprehensive evaluation, we trained each model on the P1 training set and then tested it on both the P1 test set and the M3 dataset, evaluating from various perspectives with the metrics described above. When tested on the P1 dataset, which shares the same domain as the training data, the improvements in segmentation performance were marginal across all compared methods; ASTN showed a slight advantage in the DSC metric, especially when integrated with DeconvNet, Unet, FDNSCL, or Transdeeplab as the backbone. However, on the unseen domain M3, as indicated on the right side of TABLE II, our method significantly outperforms existing methods in all metrics, with DSC and IoU exceeding the comparative methods by 5% and 6.524%, respectively. This demonstrates the superiority of our approach to thyroid nodule segmentation over all the alternative methods, particularly in the case of an unseen domain, as shown in Fig. 7.
Moreover, the robust performance of ASTN across various encoder-decoder architectures underscores its versatility and adaptability within different segmentation model frameworks.

TABLE III
Comparison of DSC results (%) of different methods in the co-registration stage and the fusion stage.

| Stage | Method | P1 | M3 |
|---|---|---|---|
| I (co-registration) | SyNCC | 29.01 | 36.57 |
| | VoxelMorph | 54.92 | 52.06 |
| | Hu | 40.18 | 40.68 |
| | U-ReSNet | 63.73 | 56.15 |
| | ASTN w/o WLF (ours) | 88.59 | 71.38 |
| II (fusion) | MV | 89.8 | 73.95 |
| | STAPLE | 90.48 | 73.13 |
| | ASTN (ours) | 92.57 | 76.86 |

b) Comparison with Co-Registration Models: In this segment of our experimental evaluation, we focused on assessing ASTN's co-registration performance in segmenting thyroid nodules. To isolate the impact of co-registration, we temporarily deactivated the Weighted Label Fusion (WLF) component within ASTN. This allowed a direct comparison between the warped labels generated by ASTN's Spatial Transform component and the annotated segmentation labels provided by medical professionals. Additionally, we benchmarked ASTN against well-established co-registration methods, including VoxelMorph [31], SyNCC [32], Hu [33], and U-ResNet [34]. As TABLE III Stage I shows, prior to fusion the co-registration DSC of ASTN already reaches 88.6%, significantly outperforming the existing methods by an average DSC margin of over 15%.

Fig. 8. Co-Registration. To study the effectiveness of co-registration, the process of co-registering the atlas to one of the target images is visualized. The atlas is shown on the right, the warped atlas on the left, and the corresponding deformation field (DF) for each element of the atlas is shown in the middle. The nodule area of the warped atlas is similar to the target, substantiating the efficacy and rationality of our model.

To visually demonstrate the co-registration efficacy, Fig. 8 gives a detailed representation of a specific segmentation process, illustrating the atlas images and labels both before and after co-registration and showcasing the precise alignment of nodule areas from the atlas labels to the target. Notably, the ultrasound images reveal a marked closeness of nodule areas between the atlas and target post-co-registration, affirming the validity of our warped atlas labels and enhancing the interpretability of our method.

At the same time, this experiment sheds light on the effectiveness of the WLF component in ASTN. The comparison of the network's performance with and without WLF showed average DSC values of 92.57% and 88.6%, respectively, as recorded in TABLE III, demonstrating that WLF contributes roughly a 4% improvement to the final segmentation.

c) Comparison with Label Fusion Methods: In this section, we compare two common label voting algorithms. Majority voting (MV) [35] assumes equal fusion weights for all atlases, while STAPLE [25] utilizes the warped atlas labels to estimate the true probabilities of segmentation. For this comparison, we employ the Spatial Transform component output as the input to the voting algorithms. The results are presented in TABLE III Stage II. On datasets P1 and M3, ASTN achieves better DSC and HD results than the MV and STAPLE methods. These findings demonstrate the efficacy of our warped-label weight calculation method.
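For contrast with the weighted scheme of Eq. (10), the MV baseline of [35] can be written in a few lines; it assigns every warped atlas label the same weight, so a pixel is labeled foreground when more than half of the warped labels agree. The function name and threshold handling below are illustrative.

```python
# Sketch of the majority-voting (MV) label fusion baseline [35].
import numpy as np

def majority_vote(warped_labels, thresh=0.5):
    """warped_labels: (M, H, W) binary masks; equal weight per atlas label."""
    return (np.mean(warped_labels, axis=0) > thresh).astype(np.uint8)
```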
D. Hyperparameter and Ablation Study

a) Impact of Atlas Size: We explored the impact of varying the atlas size within the RCS atlas selection algorithm on P1. Experiments were performed with the label fusion algorithms MV, STAPLE, and ours, with M values of {2, 4, 6, 8, 9, 12}, as shown in Fig. 9. Compared to the other two methods, the label fusion approach in ASTN is more sensitive to the number of atlas elements and achieves the optimal result when M is 9.

Fig. 9. The impact of varying atlas size M. The horizontal axis represents the size of the atlas, while the vertical axis represents the final segmentation DSC. Different colored curves correspond to distinct fusion methods.

b) Impact of RCS: To validate the efficacy of RCS, we conducted a comparative analysis of the influence of different atlases on co-registration and segmentation performance. Specifically, we kept the value of M constant at 9 and computed the final segmentation DSC using two distinct atlases: one formed using the RCS and the other selected randomly. The results, shown in TABLE IV, reveal that the atlas chosen based on RCS exhibits an average DSC of 92.57%, whereas that of the randomly selected atlas is a mere 71.6%. A subset of the atlas selected using the RCS on P1 is visually represented in Fig. 8. Both the visual observations and the quantitative results provide evidence that the atlas selected by our method exhibits a uniform dispersion of nodules, indicating a high level of tolerance to nodule positioning when co-registering to target images.

TABLE IV
Ablation study on ASTN (DSC ↑, %).

| Method | P1 | M3 |
|---|---|---|
| ASTN w/o RCS | 71.63 | 42.83 |
| ASTN w/o DecOfSeg* | 61.4 | 51.09 |
| ASTN | 92.57 | 76.86 |

*DecOfSeg: decoder of the segmentation network.

c) Impact of the Segmentation Decoder: To validate the contribution of the segmentation network decoder, we trained ASTN without it. This modification led to performance on par with mainstream co-registration methods, yielding an average DSC of 61.4%. Since the original segmentation decoder is removed, joint training of segmentation and co-registration is not possible, and the absence of the segmentation loss function L_sim in network optimization means that the encoder is not constrained to focus on the semantic information of the nodule area. Consequently, the accuracy of the model without the decoder is on par with standard co-registration methods, but markedly inferior to our full ASTN framework.

IV. CONCLUSION

In this paper, we propose an end-to-end deep neural network, ASTN, and a new atlas selection algorithm to address the problem of nodule segmentation in thyroid ultrasound images. ASTN is distinguished by its ability to confine the registration network's operation to the lesion area by extracting latent space features, and it completes segmentation while better preserving the biological anatomical structure by referring to the prior information in the atlas. Furthermore, our innovative atlas selection algorithm, the Regional Correlation Score, optimizes the distribution of elements within the atlas, thereby enhancing registration accuracy. The new label fusion strategy combines information from the target at multiple scales, improving the tolerance for registration and enhancing segmentation quality.
Our comparative studies with recent methodologies and classical segmentation networks reveal significant improvements in the DSC, IoU, HD, and ASSD metrics of 4.958%, 6.524%, 5.456%, and 3.374%, respectively, across ultrasound image datasets from different vendors. These advancements are primarily a result of separately optimizing the feature extraction and spatial transformation processes with distinct loss functions, achieving localization of the nodule area before deformation, and filtering out redundant information from other body parts in the image. Once registration focuses more on the lesion area, the overall accuracy of the registration and segmentation results improves significantly.

Despite these promising results, the clinical applicability of ASTN still needs to be validated on more diverse datasets, especially datasets of different imaging modalities such as X-ray and CT. This future direction calls for extended collaborations with medical institutions or access to diverse public datasets. Additionally, while our atlas selection algorithm performs well on single-nodule thyroid ultrasound datasets, the presence of multiple nodules introduces new complexities. In future work, we will focus on new atlas construction methods to address this problem, such as replacing the fixed atlas with an atlas that can be updated with the model, which could be another promising direction for ultrasound image segmentation.

DECLARATION OF COMPETING INTEREST

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

DATA AVAILABILITY

The authors do not have permission to share data.

ACKNOWLEDGMENTS

The authors would like to thank Dr. Xi Wei for providing data support from the Tianjin Medical University Cancer Institute and Hospital for the project research.

REFERENCES

[1] G. Li, R. Chen, J. Zhang, K. Liu, C. Geng, and L. Lyu, "Fusing enhanced transformer and large kernel CNN for malignant thyroid nodule segmentation," Biomedical Signal Processing and Control, vol. 83, p. 104636, 2023.
[2] Y. Yuan, C. Li, L. Xu, S. Zhu, Y. Hua, and J. Zhang, "CSM-Net: Automatic joint segmentation of intima-media complex and lumen in carotid artery ultrasound images," Computers in Biology and Medicine, vol. 150, p. 106119, 2022.
[3] Z. Su, K. Yao, X. Yang, K. Huang, Q. Wang, and J. Sun, "Rethinking data augmentation for single-source domain generalization in medical image segmentation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 2366–2374, 2023.
[4] J. De Fauw, J. R. Ledsam, B. Romera-Paredes, S. Nikolov, N. Tomasev, S. Blackwell, H. Askham, X. Glorot, B. O'Donoghue, D. Visentin, et al., "Clinically applicable deep learning for diagnosis and referral in retinal disease," Nature Medicine, vol. 24, no. 9, pp. 1342–1350, 2018.
[5] L. Zhang, X. Wang, D. Yang, T. Sanford, S. Harmon, B. Turkbey, B. J. Wood, H. Roth, A. Myronenko, D. Xu, et al., "Generalizing deep learning for medical image segmentation to unseen domains via deep stacked transformation," IEEE Transactions on Medical Imaging, vol. 39, no. 7, pp. 2531–2540, 2020.
[6] J. Wang, C. Lan, C. Liu, Y. Ouyang, T. Qin, W. Lu, Y. Chen, W. Zeng, and P. Yu, "Generalizing to unseen domains: A survey on domain generalization," IEEE Transactions on Knowledge and Data Engineering, 2022.
[7] C. Chen, Q. Dou, H. Chen, J. Qin, and P. A. Heng, "Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation," IEEE Transactions on Medical Imaging, vol. 39, no. 7, pp. 2494–2505, 2020.
[8] C. Chen, Z. Chen, B. Jiang, and X. Jin, "Joint domain alignment and discriminative feature learning for unsupervised deep domain adaptation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3296–3303, 2019.
[9] M. Sinclair, A. Schuh, K. Hahn, K. Petersen, Y. Bai, J. Batten, M. Schaap, and B. Glocker, "Atlas-ISTN: Joint segmentation, registration and atlas construction with image-and-spatial transformer networks," Medical Image Analysis, vol. 78, p. 102383, 2022.
[10] I. Isgum, M. Staring, A. Rutten, M. Prokop, M. A. Viergever, and B. Van Ginneken, "Multi-atlas-based segmentation with local decision fusion—application to cardiac and aortic segmentation in CT scans," IEEE Transactions on Medical Imaging, vol. 28, no. 7, pp. 1000–1010, 2009.
[11] M. Jaderberg, K. Simonyan, A. Zisserman, et al., "Spatial transformer networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
[12] Y. Chen, G. Lin, S. Li, O. Bourahla, Y. Wu, F. Wang, J. Feng, M. Xu, and X. Li, "BANet: Bidirectional aggregation network with occlusion handling for panoptic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3793–3802, 2020.
[13] L. Xie, L. E. Wisse, J. Wang, S. Ravikumar, P. Khandelwal, T. Glenn, A. Luther, S. Lim, D. A. Wolk, and P. A. Yushkevich, "Deep label fusion: A generalizable hybrid multi-atlas and deep convolutional neural network for medical image segmentation," Medical Image Analysis, vol. 83, p. 102683, 2023.
[14] B. Huang, Y. Ye, Z. Xu, Z. Cai, Y. He, Z. Zhong, L. Liu, X. Chen, H. Chen, and B. Huang, "3D lightweight network for simultaneous registration and segmentation of organs-at-risk in CT images of head and neck cancer," IEEE Transactions on Medical Imaging, vol. 41, no. 4, pp. 951–964, 2021.
[15] J. Zhang, R. Venkataraman, L. H. Staib, and J. A. Onofrey, "Atlas-based semantic segmentation of prostate zones," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 570–579, Springer, 2022.
[16] B. B. Avants, C. L. Epstein, M. Grossman, and J. C. Gee, "Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain," Medical Image Analysis, vol. 12, no. 1, pp. 26–41, 2008.
[17] W. Ding, L. Li, X. Zhuang, and L. Huang, "Cross-modality multi-atlas segmentation via deep registration and label fusion," IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 7, pp. 3104–3115, 2022.
[18] W. Ding, L. Li, X. Zhuang, and L. Huang, "Cross-modality multi-atlas segmentation using deep neural networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 233–242, Springer, 2020.
[19] J. Wu, Q. Yang, and S. Zhou, "Latent shape image learning via disentangled representation for cross-sequence image registration and segmentation," International Journal of Computer Assisted Radiology and Surgery, vol. 18, no. 4, pp. 621–628, 2023.
[20] L. Qiu and H. Ren, "RSegNet: A joint learning framework for deformable registration and segmentation," IEEE Transactions on Automation Science and Engineering, vol. 19, no. 3, pp. 2499–2513, 2021.
[21] J. Van Houtte, E. Audenaert, G. Zheng, and J. Sijbers, "Deep learning-based 2D/3D registration of an atlas to biplanar X-ray images," International Journal of Computer Assisted Radiology and Surgery, vol. 17, no. 7, pp. 1333–1342, 2022.
[22] Y. Zhang, J. Wu, Y. Liu, Y. Chen, W. Chen, E. X. Wu, C. Li, and X. Tang, "A deep learning framework for pancreas segmentation with multi-atlas registration and 3D level-set," Medical Image Analysis, vol. 68, p. 101884, 2021.
[23] Z. Ding and M. Niethammer, "Aladdin: Joint atlas building and diffeomorphic registration learning with pairwise alignment," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20784–20793, 2022.
[24] R. Bajcsy and S. Kovačič, "Multiresolution elastic matching," Computer Vision, Graphics, and Image Processing, vol. 46, no. 1, pp. 1–21, 1989.
[25] S. K. Warfield, K. H. Zou, and W. M. Wells, "Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation," IEEE Transactions on Medical Imaging, vol. 23, no. 7, pp. 903–921, 2004.
[26] A. Rondinella, E. Crispino, F. Guarnera, O. Giudice, A. Ortis, G. Russo, C. Di Lorenzo, D. Maimone, F. Pappalardo, and S. Battiato, "Boosting multiple sclerosis lesion segmentation through attention mechanism," Computers in Biology and Medicine, vol. 161, p. 107021, 2023.
[27] R. Azad, M. Heidari, M. Shariatnia, E. K. Aghdam, S. Karimijafarbigloo, E. Adeli, and D. Merhof, "TransDeepLab: Convolution-free transformer-based DeepLab v3+ for medical image segmentation," in International Workshop on PRedictive Intelligence In MEdicine, pp. 91–102, Springer, 2022.
[28] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[29] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528, 2015.
[30] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III, pp. 234–241, Springer, 2015.
[31] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, "VoxelMorph: A learning framework for deformable medical image registration," IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1788–1800, 2019.
[32] B. B. Avants, N. Tustison, G. Song, et al., "Advanced normalization tools (ANTs)," Insight Journal, vol. 2, no. 365, pp. 1–35, 2009.
[33] Y. Hu, M. Modat, E. Gibson, W. Li, N. Ghavami, E. Bonmati, G. Wang, S. Bandula, C. M. Moore, M. Emberton, et al., "Weakly-supervised convolutional neural networks for multimodal image registration," Medical Image Analysis, vol. 49, pp. 1–13, 2018.
[34] T. Estienne, M. Vakalopoulou, S. Christodoulidis, E. Battistela, M. Lerousseau, A. Carre, G. Klausner, R. Sun, C. Robert, S. Mougiakakou, et al., "U-ResNet: Ultimate coupling of registration and segmentation with deep nets," in Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III, pp. 310–319, Springer, 2019.
[35] R. A. Heckemann, J. V. Hajnal, P. Aljabar, D. Rueckert, and A. Hammers, "Automatic anatomical brain MRI segmentation combining label propagation and decision fusion," NeuroImage, vol. 33, no. 1, pp. 115–126, 2006.