Title: Anatomy-Guided Representation Learning Using a Transformer-Based Network for Thyroid Nodule Segmentation in Ultrasound Images
ArXiv ID: 2512.12662
Date: 2025-12-14
Authors: Muhammad Umar Farooq, Abd Ur Rehman, Azka Rehman, Muhammad Usman, Dong-Kyu Chae, Junaid Qadir
📝 Abstract
Accurate thyroid nodule segmentation in ultrasound images is critical for diagnosis and treatment planning. However, ambiguous boundaries between nodules and surrounding tissues, size variations, and the scarcity of annotated ultrasound data pose significant challenges for automated segmentation. Existing deep learning models struggle to incorporate contextual information from the thyroid gland and generalize effectively across diverse cases. To address these challenges, we propose SSMT-Net, a Semi-Supervised Multi-Task Transformer-based Network that leverages unlabeled data in an initial unsupervised phase to enhance the feature-extraction capability of its Transformer-centric encoder. In the supervised phase, the model jointly optimizes nodule segmentation, gland segmentation, and nodule size estimation, integrating both local and global contextual features. Extensive evaluations on the TN3K and DDTI datasets demonstrate that SSMT-Net outperforms state-of-the-art methods, with higher accuracy and robustness, indicating its potential for real-world clinical applications.
📄 Full Content
Muhammad Umar Farooq¹, Abd Ur Rehman², Azka Rehman³, Muhammad Usman⁴, Dong-Kyu Chae⁵, Junaid Qadir⁶
¹ Department of Computer Science, Hanyang University, Seoul, 04762, South Korea
² Department of Computer Science, The University of Alabama, Seoul, 04762, South Korea
³ Department of Biomedical Sciences, Seoul National University, Seoul, 08826, South Korea (azkarehman@snu.ac.kr)
⁴ Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University, CA 94305, USA (usmanm@stanford.edu)
⁵ Department of Computer Science, Hanyang University, Seoul, 04762, South Korea (dongkyu@hanyang.ac.kr)
⁶ Department of Computer Engineering, Qatar University, Doha, Qatar (jqadir@qu.edu.qa)
Keywords: Multi-task learning, semi-supervised learning, thyroid nodule segmentation, transformer, ultrasound images.
1. Introduction
Automated thyroid nodule segmentation in ultrasound imaging plays a pivotal role in supporting radiologists by improving diagnostic accuracy and reducing inter-observer variability [1]. Despite its importance, the task remains challenging due to heterogeneous echogenic patterns, ambiguous boundaries, and strong acoustic shadows commonly present in thyroid ultrasound scans [2]. While deep learning-based segmentation frameworks have demonstrated promising performance [3], their ability to capture long-range dependencies, which is critical for accurate nodule delineation, remains limited [4, 5]. These limitations become more pronounced in cases with substantial variations in nodule morphology.
Existing approaches (e.g., [6, 7]) predominantly rely on single-task learning, often overlooking complementary contextual cues such as thyroid gland structure, gland–nodule spatial interactions, and morphological priors. At the same time, the field faces a persistent shortage of large-scale annotated datasets, restricting model generalizability in real-world conditions. In contrast, recent advances in medical imaging demonstrate the benefits of incorporating auxiliary tasks and attention-driven architectures across several domains, such as mandibular canal delineation [8], lung nodule segmentation with adaptive ROI selection [9, 10, 11], brain tumor segmentation [12, 13], diabetic retinopathy segmentation [14], and broader biomedical detection challenges [15, 16, 17]. These works consistently highlight how multi-scale attention, ROI adaptation, and multi-encoder feature fusion contribute to improved lesion localization and robustness.
Transformer-based architectures have further accelerated progress by enabling superior long-range context modeling, benefiting not only medical image reconstruction [18, 19, 20, 21, 22] but also segmentation and detection tasks across MRI, CBCT, and CT modalities. Their ability to integrate global and local representations has proven effective in diverse clinical workflows, such as cardiomegaly assessment [23], multimodal neuroimaging fusion [24, 25], phonocardiographic signal analysis [26], and cross-lingual feature representation learning [27]. Collectively, these studies demonstrate that multi-task learning (MTL), multi-encoder architectures, and attention-rich designs consistently outperform strictly single-task CNN systems, especially in complex and low-contrast settings.
MTL in particular has shown strong potential in enhancing primary-task performance by leveraging the inductive bias of complementary auxiliary tasks [28, 29]. In medical imaging, MTL-driven designs have yielded notable improvements in lung nodule detection [24], metaverse-based brain-age estimation [20], COVID-19 classification [16, 17], and other clinical prediction tasks [30, 31].
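As a concrete illustration of the multi-task idea (a minimal sketch, not the paper's actual implementation), a common formulation combines a primary segmentation loss with auxiliary losses as a weighted sum. The Dice-based loss and the specific weights below are assumptions for illustration only:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between a predicted and a ground-truth mask (0..1 arrays)."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def multi_task_loss(nodule_pred, nodule_gt,
                    gland_pred, gland_gt,
                    size_pred, size_gt,
                    weights=(1.0, 0.5, 0.25)):
    """Weighted sum of the primary (nodule) loss and two auxiliary losses
    (gland segmentation, nodule size regression). Weights are hypothetical."""
    w_nodule, w_gland, w_size = weights
    l_nodule = dice_loss(nodule_pred, nodule_gt)
    l_gland = dice_loss(gland_pred, gland_gt)
    l_size = abs(size_pred - size_gt)  # simple L1 loss on a scalar size estimate
    return w_nodule * l_nodule + w_gland * l_gland + w_size * l_size
```

The auxiliary terms act as regularizers: gradients from the gland and size tasks push the shared encoder toward anatomy-aware features that also benefit the primary nodule segmentation.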
Semi-supervised strategies also continue to play an important role in add