FreqDINO: Frequency-Guided Adaptation for Generalized Boundary-Aware Ultrasound Image Segmentation
Ultrasound image segmentation is pivotal for clinical diagnosis, yet challenged by speckle noise and imaging artifacts. Recently, DINOv3 has shown remarkable promise in medical image segmentation with its powerful representation capabilities. However, DINOv3, pre-trained on natural images, lacks sensitivity to ultrasound-specific boundary degradation. To address this limitation, we propose FreqDINO, a frequency-guided segmentation framework that enhances boundary perception and structural consistency. Specifically, we devise a Multi-scale Frequency Extraction and Alignment (MFEA) strategy to separate low-frequency structures and multi-scale high-frequency boundary details, and align them via learnable attention. We also introduce a Frequency-Guided Boundary Refinement (FGBR) module that extracts boundary prototypes from high-frequency components and refines spatial features. Furthermore, we design a Multi-task Boundary-Guided Decoder (MBGD) to ensure spatial coherence between boundary and semantic predictions. Extensive experiments demonstrate that FreqDINO surpasses state-of-the-art methods and achieves remarkable generalization capability. The code is at https://github.com/MingLang-FD/FreqDINO.
💡 Research Summary
The paper introduces FreqDINO, a novel framework that adapts the self‑supervised vision transformer DINOv3 for the challenging task of ultrasound image segmentation. Ultrasound images suffer from speckle noise, low contrast, and blurred boundaries, which are poorly captured by models pre‑trained on natural images. The authors argue that ultrasound boundaries are fundamentally encoded in the high‑frequency components of the image, while low‑frequency components represent smooth anatomical structures. To exploit this property, they design three synergistic modules:
- Multi‑scale Frequency Extraction and Alignment (MFEA) – The frozen DINOv3‑Large encoder produces spatial features that are decomposed by a Haar wavelet transform at two scales (original resolution and a down‑sampled version). This yields a low‑frequency structure map and three high‑frequency sub‑bands (horizontal, vertical, diagonal). After 1×1 convolutions for channel reduction, learnable boundary‑attention and structure‑attention maps are generated by lightweight networks. These attention maps are combined with learnable scalar weights (α, β, λ) and added back to the original features via a residual modulation, effectively aligning high‑frequency boundary cues with low‑frequency structural context.
- Frequency‑Guided Boundary Refinement (FGBR) – The high‑frequency outputs from MFEA are concatenated and further compressed into a 64‑dimensional “boundary prototype”. This prototype serves as the key‑value pair in a cross‑modal multi‑head attention (8 heads, 128‑dim per head) where the query comes from the attention‑enhanced spatial features. The attention output is fused back with a learnable weight ω, producing refined features that carry both global semantic information from DINOv3 and precise boundary details extracted from the frequency domain.
- Multi‑task Boundary‑Guided Decoder (MBGD) – Refined features are up‑sampled through four transposed‑convolution blocks to obtain a shared feature map. A 1×1 convolution first predicts a boundary map, which is passed through a sigmoid and a 3×3 convolution to generate boundary features. These are concatenated with the shared map and fed to a final 1×1 convolution to produce the segmentation mask. The decoder is trained with a combined loss: binary cross‑entropy for the mask plus a weighted binary cross‑entropy for the boundary (weighted by λ_b). This boundary‑first strategy enforces spatial coherence between the two predictions.
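A minimal PyTorch sketch of the MFEA idea described above: a single-level Haar decomposition of a feature map, followed by lightweight attention heads and residual modulation with learnable scalars (α, β, λ). The class name, channel counts, and attention-head design here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarMFEA(nn.Module):
    """Sketch of Haar-based frequency extraction with attention modulation."""

    def __init__(self, channels: int):
        super().__init__()
        h = torch.tensor([0.5, 0.5])   # Haar low-pass filter
        g = torch.tensor([0.5, -0.5])  # Haar high-pass filter
        # 2x2 separable filters for the LL, LH, HL, HH sub-bands
        filt = torch.stack([
            torch.outer(h, h), torch.outer(h, g),
            torch.outer(g, h), torch.outer(g, g),
        ]).unsqueeze(1)                # (4, 1, 2, 2)
        self.register_buffer("filt", filt)
        # lightweight 1x1-conv attention heads for boundary / structure cues
        self.boundary_attn = nn.Sequential(nn.Conv2d(3 * channels, channels, 1), nn.Sigmoid())
        self.structure_attn = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        # learnable scalar weights (alpha, beta, lambda in the paper's notation)
        self.alpha = nn.Parameter(torch.zeros(1))
        self.beta = nn.Parameter(torch.zeros(1))
        self.lam = nn.Parameter(torch.zeros(1))

    def dwt(self, x):
        # apply the four Haar filters depthwise with stride 2
        b, c, hgt, wid = x.shape
        y = F.conv2d(x.reshape(b * c, 1, hgt, wid), self.filt, stride=2)
        y = y.reshape(b, c, 4, hgt // 2, wid // 2)
        ll, high = y[:, :, 0], y[:, :, 1:]  # low-freq map / 3 high-freq sub-bands
        return ll, high.reshape(b, 3 * c, hgt // 2, wid // 2)

    def forward(self, x):
        ll, high = self.dwt(x)
        bnd = F.interpolate(self.boundary_attn(high), size=x.shape[-2:],
                            mode="bilinear", align_corners=False)
        struct = F.interpolate(self.structure_attn(ll), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        # residual modulation: align high-freq boundary cues with structure
        out = x + self.lam * (self.alpha * bnd * x + self.beta * struct * x)
        return out, high
```

With the scalars initialized to zero, the module starts as an identity residual and learns how strongly to inject frequency cues during training.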
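The FGBR mechanism can likewise be sketched as cross-attention where a pooled boundary prototype acts as key and value. The defaults below follow the summary's figures (64-dim prototype, 8 heads, 1024-dim features), but the pooling and projection layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FGBR(nn.Module):
    """Sketch of frequency-guided boundary refinement via cross-attention."""

    def __init__(self, dim: int = 1024, proto_dim: int = 64, heads: int = 8):
        super().__init__()
        self.compress = nn.Linear(dim, proto_dim)  # to the 64-d prototype space
        self.expand = nn.Linear(proto_dim, dim)    # back to feature width
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.omega = nn.Parameter(torch.zeros(1))  # learnable fusion weight

    def forward(self, feats, high_freq):
        # feats: (B, N, dim) spatial tokens; high_freq: (B, M, dim) boundary tokens
        proto = self.compress(high_freq).mean(dim=1, keepdim=True)  # (B, 1, 64)
        proto = self.expand(proto)                                   # (B, 1, dim)
        # prototype serves as key and value; spatial features are the query
        refined, _ = self.attn(query=feats, key=proto, value=proto)
        return feats + self.omega * refined
```

Because the prototype has length one, every spatial token attends to the same global boundary summary, which matches the summary's claim that boundary information is propagated across the whole feature map.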
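The boundary-first head of MBGD and its combined loss can be sketched as follows; the upsampling blocks are omitted, and the layer arrangement is a plausible reading of the description rather than the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MBGDHead(nn.Module):
    """Sketch of the boundary-first prediction head with a combined BCE loss."""

    def __init__(self, channels: int, lambda_b: float = 1.0):
        super().__init__()
        self.boundary_pred = nn.Conv2d(channels, 1, kernel_size=1)
        self.boundary_feat = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.mask_pred = nn.Conv2d(2 * channels, 1, kernel_size=1)
        self.lambda_b = lambda_b  # boundary loss weight (lambda_b in the paper)

    def forward(self, shared):
        b_logits = self.boundary_pred(shared)                 # boundary map first
        b_feat = self.boundary_feat(torch.sigmoid(b_logits))  # boundary features
        m_logits = self.mask_pred(torch.cat([shared, b_feat], dim=1))
        return m_logits, b_logits

    def loss(self, m_logits, b_logits, mask_gt, boundary_gt):
        # combined objective: BCE on the mask + weighted BCE on the boundary
        return (F.binary_cross_entropy_with_logits(m_logits, mask_gt)
                + self.lambda_b
                * F.binary_cross_entropy_with_logits(b_logits, boundary_gt))
```

Predicting the boundary before the mask lets the mask head condition on explicit edge evidence, which is how the decoder enforces spatial coherence between the two outputs.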
Experimental Setup – The authors evaluate on two public ultrasound datasets: BUSI (breast) and TN3K (thyroid). Images are resized to 512×512, and training uses Adam (lr = 1e‑4, decay = 0.98) for 300 epochs with batch size 16 on an NVIDIA A5000 GPU. The DINOv3 encoder remains frozen; only lightweight adapters, the frequency modules, and the decoder are trained.
Results – On BUSI, FreqDINO achieves Dice = 86.52 %, mIoU = 78.49 %, and Hausdorff Distance = 39.63 mm, surpassing the strongest baseline nnU‑Net (Dice = 84.80 %). On the unseen TN3K dataset, the model is evaluated zero‑shot (no fine‑tuning) and still outperforms other foundation‑model‑based methods, attaining Dice = 62.09 % and HD = 108.01 mm. An ablation study shows that MFEA alone yields a 2.21 % Dice gain and a 3 mm HD reduction; adding FGBR brings further improvements, and adding MBGD completes the full model's performance gains.
Analysis – The work demonstrates that explicit frequency decomposition can bridge the domain gap between natural‑image pre‑training and ultrasound imaging. By separating and then re‑integrating high‑frequency boundary cues, the model becomes robust to speckle noise while preserving fine edge details. The cross‑modal attention in FGBR effectively propagates boundary information across the whole feature map, and the boundary‑first decoder ensures that the final mask respects the refined edges. The use of a frozen backbone with adapters keeps the parameter count low, making the approach computationally efficient.
Limitations & Future Directions – The study relies on Haar wavelets; exploring alternative transforms (e.g., DCT, Laplacian pyramids) could reveal more effective frequency representations. Boundary ground‑truth is generated automatically from masks, which may introduce noise; acquiring manually annotated boundaries would provide a cleaner supervision signal. The current pipeline handles only 2D static images; extending to 3D volumes or real‑time video streams would broaden clinical applicability. Finally, hyper‑parameters such as λ_b are fixed in the paper, and the learnable weights α, β, and λ are initialized to zero; systematic tuning could further improve results.
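One common recipe for the automatic boundary generation mentioned above is a morphological gradient (dilation minus erosion) of the binary mask. This is an illustrative example of the general technique, not necessarily the authors' exact procedure.

```python
import numpy as np
from scipy import ndimage

def mask_to_boundary(mask: np.ndarray, thickness: int = 1) -> np.ndarray:
    """Derive a binary boundary map of the given thickness from a binary mask."""
    structure = np.ones((2 * thickness + 1,) * 2, dtype=bool)
    dilated = ndimage.binary_dilation(mask.astype(bool), structure)
    eroded = ndimage.binary_erosion(mask.astype(bool), structure)
    # morphological gradient: pixels gained by dilation but lost by erosion
    return (dilated & ~eroded).astype(np.uint8)

# toy example: a 4x4 square object inside an 8x8 mask
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:6, 2:6] = 1
boundary = mask_to_boundary(mask)
```

Because the gradient straddles the true contour on both sides, such labels are inherently a few pixels thick, which is one source of the supervision noise the authors note.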
Conclusion – FreqDINO offers a compelling solution for ultrasound segmentation by marrying self‑supervised transformer representations with frequency‑guided boundary enhancement. Its three‑module architecture—MFEA, FGBR, and MBGD—delivers state‑of‑the‑art accuracy and superior generalization, especially in zero‑shot scenarios. The methodology paves the way for frequency‑aware adaptation of foundation models across diverse medical imaging modalities.