NullBUS: Multimodal Mixed-Supervision for Breast Ultrasound Segmentation via Nullable Global-Local Prompts

Breast ultrasound (BUS) segmentation provides lesion boundaries essential for computer-aided diagnosis and treatment planning. While promptable methods can improve segmentation performance and tumor delineation when text or spatial prompts are available, many public BUS datasets lack reliable metadata or reports, constraining training to small multimodal subsets and reducing robustness. We propose NullBUS, a multimodal mixed-supervision framework that learns from images with and without prompts in a single model. To handle missing text, we introduce nullable prompts, implemented as learnable null embeddings with presence masks, enabling fallback to image-only evidence when metadata are absent and the use of text when present. Evaluated on a unified pool of three public BUS datasets, NullBUS achieves a mean IoU of 0.8568 and a mean Dice of 0.9103, demonstrating state-of-the-art performance under mixed prompt availability.
Breast cancer is the most commonly diagnosed cancer in U.S. women and a major cause of death, with an estimated 316,950 new invasive cases and 42,680 deaths in 2025 [1]. Early detection and accurate lesion localization are essential for improving diagnostic outcomes and guiding treatment. Mammography remains the main screening test [2], while breast ultrasound (BUS) is widely used as a complementary exam because it is safe, real-time, portable, and relatively low-cost. Unlike mammography, BUS uses no ionizing radiation and is particularly helpful in dense breasts [3]. Despite these advantages, BUS interpretation is challenging: speckle noise, acoustic shadowing, operator dependence, and low contrast often obscure lesion boundaries, making manual assessment variable and time-intensive. In clinical workflows, segmentation delineates lesion extent so that size, shape, and margins can be measured consistently and passed to downstream computer-aided diagnosis (CAD) systems. Automatic BUS segmentation has become a central task in computer-aided analysis, and several public benchmarks now support systematic evaluation on challenging lesions [4].
Over the past decade, encoder-decoder architectures (e.g., U-Net variants) have become the standard backbone for BUS segmentation [5,6]. Refinements such as attention, multi-scale aggregation and feature pyramids, dilated convolutions and larger receptive fields, residual and dense connections, and stronger decoders further improve the base design [7,8,9]. Beyond such generic upgrades, task-specific BUS methods target small or low-contrast lesions, speckle noise, and spiculated margins, showing gains on difficult cases [10]. Nevertheless, generalization across scanners, institutions, and patient demographics remains challenging when training data are narrow in scope. In parallel, Transformer-based and hybrid CNN-Transformer segmenters have matured, yet they still degrade under distribution shift without broader, multi-source supervision [11].
Prompt-based segmentation offers a complementary direction. Point or box prompts can steer masks with minimal interaction, and text-conditioned variants enable semantic guidance [12,13]. In BUS, text-guided approaches have begun to show promise by conditioning on short descriptors or standardized terms, improving delineation when such information is available [14]. A practical limitation, however, is that many BUS datasets lack reliable prompts or metadata (e.g., BI-RADS descriptors) or provide them inconsistently. As a result, training is often restricted to the few multimodal datasets [14,4] or to small single-center cohorts, leaving larger image-only resources underutilized. This motivates methods that exploit prompts when present and remain robust when absent, so that mixed sources can be used without discarding data.
We address this constraint with a mixed-supervision framework that treats “no text” as a first-class state and integrates prompts at two complementary levels. Concretely, our model couples (i) a global pathway that leverages image-level context and (ii) a local pathway that conditions mid- and low-level features. When text is present, short descriptors guide both pathways; when text is missing, nullable prompts (learnable null embeddings with presence masks) let the network fall back to image-only evidence without unstable imputations. This design enables training across heterogeneous datasets that differ in prompt availability while preserving strong image-only performance. The most related work is ARSeg [15], which tackles incomplete textual prompts in medical referring segmentation via attribute-specific cross-modal interactions and dual attribute losses. In contrast, we address absent prompts at the dataset level and train across mixed sources using learnable nullable embeddings, allowing effective use of large image-only cohorts.
Our contributions are as follows. (i) We propose NullBUS, a dual-path prompt-aware BUS segmentation network that combines global image-level and local feature-level conditioning. (ii) We introduce nullable prompts with learnable null embeddings and presence masks, enabling joint training on images with and without text. (iii) We establish a mixed-dataset evaluation on three public BUS datasets and show that NullBUS improves IoU/Dice and reduces FNR compared with strong image-only and prompt-based baselines.
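The nullable-prompt mechanism from contribution (ii) can be sketched as a small PyTorch module: when a sample has no text prompt, a learnable null embedding is substituted in place of the text features, gated by a per-sample presence mask. This is a minimal illustration under our own naming conventions (the module and tensor names are hypothetical, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class NullablePromptEmbedding(nn.Module):
    """Sketch of nullable prompts: substitute a learnable null embedding
    for samples whose text prompt is absent, selected by a presence mask.
    Names are illustrative, not the paper's implementation."""

    def __init__(self, dim: int):
        super().__init__()
        # Learnable fallback vector used whenever no text prompt exists.
        self.null_embedding = nn.Parameter(torch.zeros(dim))

    def forward(self, text_emb: torch.Tensor, present: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, D) text features; rows without prompts may hold zeros.
        # present:  (B,)   bool mask, True where a real prompt exists.
        mask = present.float().unsqueeze(-1)           # (B, 1)
        null = self.null_embedding.expand_as(text_emb)  # broadcast null vector
        # Per-sample selection: real embedding if present, null otherwise.
        return mask * text_emb + (1.0 - mask) * null
```

Because the null embedding is a trained parameter rather than an imputed or zeroed vector, gradients from image-only samples shape a meaningful "no text" representation instead of introducing unstable placeholder statistics.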
We propose NullBUS, a prompt-optional segmentation framework for breast ultrasound. The model comprises two visual paths and a U-shaped decoder (Fig. 1). The global path provides image-level, text-conditioned context using a frozen CLIP ViT, implemented as a Global Prompt Encoder (GPE), followed by a Global Feature Projector (GFP) that produces a 2048-channel feature map. The local path extracts spatial detail with a ResNet-50 encoder; at its deepest stages, Text-Conditioned Modulation (TCM) injects prompt information and an ASPP bottleneck enlarges the receptive field. When text is
…(Full text truncated)…
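The Text-Conditioned Modulation (TCM) described above injects prompt information into deep encoder features. One plausible realization, shown here purely as an assumed sketch (the paper's exact TCM design is not reproduced), is a FiLM-style modulation in which the prompt embedding is projected to per-channel scale and shift parameters:

```python
import torch
import torch.nn as nn

class TextConditionedModulation(nn.Module):
    """Illustrative FiLM-style conditioning: a prompt embedding yields
    per-channel scale and shift applied to a feature map. A hypothetical
    sketch of TCM, not the paper's exact implementation."""

    def __init__(self, prompt_dim: int, channels: int):
        super().__init__()
        self.to_scale = nn.Linear(prompt_dim, channels)
        self.to_shift = nn.Linear(prompt_dim, channels)

    def forward(self, feats: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # feats:  (B, C, H, W) encoder features at a deep stage.
        # prompt: (B, D) text (or null) embedding.
        scale = self.to_scale(prompt).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        shift = self.to_shift(prompt).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        # Residual formulation (1 + scale) keeps the identity map reachable.
        return feats * (1 + scale) + shift
```

Under this formulation, a nullable prompt integrates naturally: the learnable null embedding simply produces its own learned scale and shift, so image-only samples pass through the same conditioning path without special-case branches.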