Anatomy-Guided Representation Learning Using a Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images

Reading time: 4 minutes

📝 Original Info

  • Title: Anatomy-Guided Representation Learning Using a Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images
  • ArXiv ID: 2512.12662
  • Date: 2025-12-14
  • Authors: Muhammad Umar Farooq, Abd Ur Rehman, Azka Rehman, Muhammad Usman, Dong-Kyu Chae, Junaid Qadir

📝 Abstract

Accurate thyroid nodule segmentation in ultrasound images is critical for diagnosis and treatment planning. However, ambiguous boundaries between nodules and surrounding tissues, size variations, and the scarcity of annotated ultrasound data pose significant challenges for automated segmentation. Existing deep learning models struggle to incorporate contextual information from the thyroid gland and generalize effectively across diverse cases. To address these challenges, we propose SSMT-Net, a Semi-Supervised Multi-Task Transformer-based Network that leverages unlabeled data to enhance Transformer-centric encoder feature extraction capability in an initial unsupervised phase. In the supervised phase, the model jointly optimizes nodule segmentation, gland segmentation, and nodule size estimation, integrating both local and global contextual features. Extensive evaluations on the TN3K and DDTI datasets demonstrate that SSMT-Net outperforms state-of-the-art methods, with higher accuracy and robustness, indicating its potential for real-world clinical applications.
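The supervised phase jointly optimizes three tasks: nodule segmentation, gland segmentation, and nodule size estimation. A minimal sketch of what such a joint objective could look like is given below; the specific loss functions (soft Dice for the two segmentation tasks, squared error for size) and the task weights are illustrative assumptions, not the authors' published formulation.

```python
# Hypothetical sketch of SSMT-Net's supervised multi-task objective.
# The loss choices and weights below are assumptions for illustration.

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over flattened probability maps (0 = perfect overlap)."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def size_loss(pred_size, true_size):
    """Squared error on the estimated nodule size (e.g., pixel area)."""
    return (pred_size - true_size) ** 2

def multitask_loss(nodule_pred, nodule_gt, gland_pred, gland_gt,
                   size_pred, size_gt,
                   w_nodule=1.0, w_gland=0.5, w_size=0.1):
    """Weighted sum of the three task losses; weights are illustrative."""
    return (w_nodule * dice_loss(nodule_pred, nodule_gt)
            + w_gland * dice_loss(gland_pred, gland_gt)
            + w_size * size_loss(size_pred, size_gt))
```

The auxiliary gland and size terms act as regularizers: they supply the anatomical context (gland extent, nodule scale) that single-task nodule segmentation would otherwise have to infer implicitly.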

💡 Deep Analysis

Figure 1

📄 Full Content

Anatomy-Guided Representation Learning Using a Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images

Muhammad Umar Farooq1, Abd Ur Rehman2, Azka Rehman3, Muhammad Usman4, Dong-Kyu Chae5, Junaid Qadir6

1 Department of Computer Science, Hanyang University, Seoul, 04762, South Korea
2 Department of Computer Science, The University of Alabama, Tuscaloosa, AL, USA
3 Department of Biomedical Sciences, Seoul National University, Seoul, 08826, South Korea (azkarehman@snu.ac.kr)
4 Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University, CA 94305, USA (usmanm@stanford.edu)
5 Department of Computer Science, Hanyang University, Seoul, 04762, South Korea (dongkyu@hanyang.ac.kr)
6 Department of Computer Engineering, Qatar University, Doha, Qatar (jqadir@qu.edu.qa)
Keywords: Multi-task learning, semi-supervised learning, thyroid nodule segmentation, transformer, ultrasound images.

arXiv:2512.12662v1 [cs.CV] 14 Dec 2025

1. Introduction

Automated thyroid nodule segmentation in ultrasound imaging plays a pivotal role in supporting radiologists by improving diagnostic accuracy and reducing inter-observer variability [1]. Despite its importance, the task remains challenging due to heterogeneous echogenic patterns, ambiguous boundaries, and strong acoustic shadows commonly present in thyroid ultrasound scans [2]. While deep learning-based segmentation frameworks have demonstrated promising performance [3], their ability to capture long-range dependencies, critical for accurate nodule delineation, remains limited [4, 5]. These limitations become more pronounced in cases with substantial variations in nodule morphology. Existing approaches (e.g., [6, 7]) predominantly rely on single-task learning, often overlooking complementary contextual cues such as thyroid gland structure, gland–nodule spatial interactions, and morphological priors. At the same time, the field faces a persistent shortage of large-scale annotated datasets, restricting model generalizability in real-world conditions. In contrast, recent advances in medical imaging demonstrate the benefits of incorporating auxiliary tasks and attention-driven architectures across several domains, such as mandibular canal delineation [8], lung nodule segmentation with adaptive ROI selection [9, 10, 11], brain tumor segmentation [12, 13], diabetic retinopathy segmentation [14], and broader biomedical detection challenges [15, 16, 17]. These works consistently highlight how multi-scale attention, ROI adaptation, and multi-encoder feature fusion contribute to improved lesion localization and robustness.
Transformer-based architectures have further accelerated progress by enabling superior long-range context modeling, benefiting not only medical image reconstruction [18, 19, 20, 21, 22] but also segmentation and detection tasks across MRI, CBCT, and CT modalities. Their ability to integrate global and local representations has proven effective in diverse clinical workflows, such as cardiomegaly assessment [23], multimodal neuroimaging fusion [24, 25], phonocardiographic signal analysis [26], and cross-lingual feature representation learning [27]. Collectively, these studies demonstrate that multi-task learning (MTL), multi-encoder architectures, and attention-rich designs consistently outperform strictly single-task CNN systems, especially in complex and low-contrast settings. MTL in particular has shown strong potential in enhancing primary task performance by leveraging the inductive bias of complementary auxiliary tasks [28, 29]. In medical imaging, MTL-driven designs have yielded notable improvements in lung nodule detection [24], metaverse-based brain-age estimation [20], COVID-19 classification [16, 17], and other clinical prediction tasks [30, 31]. Semi-supervised strategies also continue to play an important role in add
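The initial unsupervised phase is described only as leveraging unlabeled ultrasound images to strengthen the Transformer encoder's feature extraction; the excerpt does not specify the pretext task. A common choice for Transformer encoders is masked-patch reconstruction, sketched here under that assumption (the masking ratio and helper names are hypothetical, not from the paper).

```python
# Hypothetical masked-patch pretext for the unsupervised phase: hide most
# image patches from the encoder and score reconstruction of the hidden ones.
import random

def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Randomly select patch indices to hide; return visible patches and
    the sorted indices of the masked ones."""
    rng = random.Random(seed)
    n_mask = int(len(patches) * mask_ratio)
    masked_idx = set(rng.sample(range(len(patches)), n_mask))
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    return visible, sorted(masked_idx)

def reconstruction_loss(pred_patches, true_patches):
    """Mean squared error over the reconstructed (masked) patches."""
    total, count = 0.0, 0
    for pred, true in zip(pred_patches, true_patches):
        for p, t in zip(pred, true):
            total += (p - t) ** 2
            count += 1
    return total / count
```

Pretraining on such a reconstruction objective needs no annotations, which is how unlabeled ultrasound data can improve the encoder before the supervised multi-task phase begins.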

Reference

This content is AI-processed based on open access ArXiv data.
