Uncertainty-Aware Vision-Language Segmentation for Medical Imaging
We introduce a novel uncertainty-aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion and long-range dependency modelling. To guide learning under ambiguity, we propose the Spectral-Entropic Uncertainty (SEU) Loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty in a unified objective. This formulation improves model reliability in complex clinical settings with poor image quality. Extensive experiments on three publicly available medical datasets, QATA-COVID19, MosMed++, and Kvasir-SEG, demonstrate that our method achieves superior segmentation performance while being significantly more computationally efficient than existing state-of-the-art (SoTA) approaches. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks. Code: https://github.com/arya-domain/UA-VLS
💡 Research Summary
The paper presents a novel uncertainty‑aware multimodal segmentation framework that jointly processes radiological images and associated clinical text. The core of the architecture is the Modality Decoding Attention Block (MoDAB), which first applies multi‑head self‑attention to visual feature maps extracted by a frozen ConvNeXt‑Tiny encoder, capturing intra‑modal dependencies. It then performs multi‑head cross‑attention where the visual queries attend to text keys and values. Textual embeddings, obtained from a frozen BioViL‑CXR‑BERT, are first projected and passed through a lightweight State Space Mixer (SSMix). SSMix combines depthwise 1‑D convolutions with a selective state‑space model (SSM) to capture both local and long‑range temporal dynamics in the token sequence, while keeping computational complexity linear in sequence length. The cross‑attention output is scaled by a learnable factor and added residually to the visual features, ensuring balanced fusion.
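The fusion step described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the embedding width, head count, and text dimension are hypothetical, and the selective SSM inside SSMix is replaced here by a simple gated elementwise recurrence that merely preserves the stated property of linear cost in sequence length.

```python
import torch
import torch.nn as nn

class SSMix(nn.Module):
    """Toy stand-in for the State Space Mixer: a depthwise 1-D convolution for
    local context plus a gated linear recurrence as a placeholder for the
    selective SSM (both are linear in sequence length)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.dwconv = nn.Conv1d(d_model, d_model, kernel_size=3,
                                padding=1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.decay = nn.Parameter(torch.full((d_model,), 0.9))

    def forward(self, x):                       # x: (B, L, D)
        local = self.dwconv(x.transpose(1, 2)).transpose(1, 2)
        # Elementwise recurrence h_t = a * h_{t-1} + u_t (placeholder for the SSM).
        h, outs = torch.zeros_like(x[:, 0]), []
        for t in range(x.size(1)):
            h = self.decay * h + local[:, t]
            outs.append(h)
        state = torch.stack(outs, dim=1)
        return x + torch.sigmoid(self.gate(x)) * state

class MoDAB(nn.Module):
    """Self-attention over visual tokens, then cross-attention against
    SSMix-processed text tokens, fused through a learnable residual scale."""
    def __init__(self, d_model=256, n_heads=8, d_text=768):  # dims are illustrative
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_proj = nn.Linear(d_text, d_model)   # project BERT embeddings
        self.ssmix = SSMix(d_model)
        self.scale = nn.Parameter(torch.zeros(1))     # learnable fusion factor

    def forward(self, vis, txt):                # vis: (B, N, D), txt: (B, L, d_text)
        vis = vis + self.self_attn(vis, vis, vis, need_weights=False)[0]
        txt = self.ssmix(self.text_proj(txt))
        fused = self.cross_attn(vis, txt, txt, need_weights=False)[0]
        return vis + self.scale * fused         # scaled residual fusion
```

Initialising the fusion scale at zero lets the block start as a pure visual pathway and learn how much textual evidence to admit.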
The decoder follows a U‑Net‑style upsampling pipeline: transposed convolutions double the spatial resolution at each stage, the upsampled features are concatenated with the corresponding encoder features and refined by convolutional blocks, and the final map is reconstructed by a sub‑pixel upsampling network with pixel shuffle. An average‑pooling layer and a 1×1 convolution then produce the final segmentation mask.
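One decoder stage and the sub-pixel head can be sketched as follows. Channel counts and kernel sizes are illustrative assumptions, and the average-pooling layer mentioned above is omitted for brevity.

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """A transposed conv doubles resolution; the result is concatenated with
    the matching encoder feature map and refined by a conv block."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                          # (B, out_ch, 2H, 2W)
        return self.refine(torch.cat([x, skip], dim=1))

class SubPixelHead(nn.Module):
    """Sub-pixel upsampling: a conv expands channels by r^2, PixelShuffle
    rearranges them into an r-times larger map, and a 1x1 conv emits the mask."""
    def __init__(self, in_ch, r=2):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, in_ch * r * r, 3, padding=1)
        self.shuffle = nn.PixelShuffle(r)
        self.head = nn.Conv2d(in_ch, 1, kernel_size=1)  # final 1x1 conv

    def forward(self, x):
        return self.head(self.shuffle(self.expand(x)))
```

PixelShuffle trades channel depth for spatial resolution without the checkerboard artifacts that large-stride transposed convolutions can introduce, which is a common reason to prefer it for the final reconstruction.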
Training is guided by the Spectral‑Entropic Uncertainty (SEU) loss, a unified objective that blends three terms: (1) a Dice‑like overlap loss for spatial accuracy, (2) a spectral consistency term that regularizes the Fourier spectrum of the predicted mask, and (3) an entropy‑based uncertainty term that penalizes over‑confident predictions on ambiguous pixels. The three components are weighted by hyper‑parameters (λ₁, λ₂, λ₃), empirically set to 1.0, 0.1, and 0.5 respectively.
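A sketch of the SEU objective under stated assumptions: the spectral term is taken as an L1 distance between Fourier magnitude spectra, and the uncertainty term as a confidence penalty (negative entropy of the foreground probability), matching the description that over-confident predictions are penalized. The paper's exact formulations may differ.

```python
import torch

def seu_loss(logits, target, lam1=1.0, lam2=0.1, lam3=0.5, eps=1e-6):
    """logits, target: (B, 1, H, W); target is a binary mask.
    Weights follow the reported settings lambda1=1.0, lambda2=0.1, lambda3=0.5."""
    p = torch.sigmoid(logits)

    # (1) Dice-like overlap term for spatial accuracy.
    inter = (p * target).sum(dim=(1, 2, 3))
    union = p.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)

    # (2) Spectral consistency: match Fourier magnitude spectra (assumed form).
    spec = (torch.fft.fft2(p).abs() - torch.fft.fft2(target).abs()).abs()
    spectral = spec.mean(dim=(1, 2, 3))

    # (3) Confidence penalty: subtracting entropy discourages over-confident
    # predictions on ambiguous pixels (assumed sign convention).
    entropy = -(p * (p + eps).log() + (1 - p) * (1 - p + eps).log())
    ent = entropy.mean(dim=(1, 2, 3))

    return (lam1 * dice + lam2 * spectral - lam3 * ent).mean()
```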
Experiments on three public benchmarks—QATA‑COVID19 (chest X‑ray), MosMed++ (CT), and Kvasir‑SEG (endoscopy)—show that the proposed method outperforms state‑of‑the‑art models such as nnU‑Net, TransUNet, and CMIRNet. It achieves higher Dice scores (average improvement of ~2.3 percentage points), better IoU, reduced Hausdorff distance, and lower Expected Calibration Error, indicating more reliable uncertainty estimates. Moreover, the model requires only about 1.2 GFLOPs and ~12 M parameters, roughly 30‑45 % less computation than comparable transformer‑based approaches, thanks to the efficient SSMix and the use of frozen encoders.
The authors acknowledge limitations: the frozen text encoder may not capture domain‑specific terminology fully; SSMix hyper‑parameters are dataset‑sensitive; and the weighting of the uncertainty term requires careful tuning. Future work is suggested on fine‑tuning the language model, automated hyper‑parameter search, and integrating active learning based on uncertainty to further reduce annotation costs.
Overall, the paper introduces a compelling combination of efficient cross‑modal attention, state‑space sequence modelling, and uncertainty‑aware loss design, advancing the reliability and practicality of vision‑language segmentation in medical imaging.