Text Embedded Swin-UMamba for DeepLesion Segmentation
Segmentation of lesions on CT enables automatic measurement for clinical assessment of chronic diseases (e.g., lymphoma). Integrating large language models (LLMs) into the lesion segmentation workflow has the potential to combine imaging features with descriptions of lesion characteristics from the radiology reports. In this study, we investigate the feasibility of integrating text into the Swin-UMamba architecture for the task of lesion segmentation. The publicly available ULS23 DeepLesion dataset was used along with short-form descriptions of the findings from the reports. On the test dataset, our method achieved a high Dice score of 82.64 and a low Hausdorff distance of 6.34 pixels for lesion segmentation. The proposed Text-Swin-UMamba model outperformed prior approaches: a 37.79% improvement over the LLM-driven LanGuideMedSeg model (p < 0.001), and gains of 2.58% and 1.01% over the purely image-based xLSTM-UNet and nnUNet models, respectively. The dataset and code can be accessed at https://github.com/ruida/LLM-Swin-UMamba
💡 Research Summary
This paper investigates the feasibility of incorporating short-form radiology report text into a state‑of‑the‑art image segmentation backbone for CT lesion segmentation. The authors extend the Swin‑UMamba architecture—a hybrid of Swin‑Transformer windowed self‑attention and the Mamba linear‑complexity state‑space module—by adding a “Text Tower” that encodes clinical sentences using a pre‑trained BioLord model. Token embeddings are aggregated with max‑pooling to produce a fixed‑size 768‑dimensional vector, which is linearly projected and fused with image features at five decoder stages via a dedicated Lang Fusion layer. The resulting Text‑Swin‑UMamba model thus receives multimodal cues at multiple spatial scales.
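The summary does not spell out the exact Lang Fusion equations, but the described data flow (variable-length token embeddings max-pooled to a fixed 768-d vector, then linearly projected into each decoder stage's channel width and fused with the image feature map) can be sketched as follows. All function names are hypothetical, and the additive broadcast fusion is one simple choice, not necessarily the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_pool_tokens(token_embeddings):
    """Aggregate variable-length token embeddings (T, 768) into a fixed 768-d vector."""
    return token_embeddings.max(axis=0)

def lang_fusion(image_feat, text_vec, W, b):
    """Project the text vector to the image channel dim and add it to every
    spatial location of the (C, H, W) feature map (one simple fusion choice)."""
    proj = W @ text_vec + b                  # (C,)
    return image_feat + proj[:, None, None]  # broadcast over H and W

# Toy example: 12 text-encoder token embeddings, 5 decoder stages.
tokens = rng.normal(size=(12, 768))
text_vec = max_pool_tokens(tokens)           # (768,)
for C in (512, 256, 128, 64, 32):            # hypothetical per-stage channel widths
    W = rng.normal(size=(C, 768)) * 0.01     # stage-specific projection
    b = np.zeros(C)
    feat = rng.normal(size=(C, 16, 16))      # stand-in decoder feature map
    fused = lang_fusion(feat, text_vec, W, b)
    assert fused.shape == feat.shape
```

Because the same pooled text vector is injected at every stage, the text acts as a global conditioning signal rather than a spatially localized one.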
The study uses the ULS23 DeepLesion dataset, which contains 32,120 2‑D CT slices from 4,427 patients, each paired with a one‑sentence description of the finding. The data are split at the patient level into training (15,040 slices), validation (3,760), and test (1,807) sets, and a 5‑fold cross‑validation scheme is employed. Training uses a combined Dice‑plus‑cross‑entropy loss, AdamW optimizer (initial LR = 5e‑3) with cosine annealing, and runs for 1500 epochs on a single NVIDIA V100 GPU.
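The combined Dice-plus-cross-entropy objective mentioned above is a standard segmentation loss; a minimal numpy version (eps smoothing value chosen for illustration) might look like:

```python
import numpy as np

def dice_ce_loss(prob, target, eps=1e-6):
    """Combined soft-Dice + binary cross-entropy loss over per-pixel
    foreground probabilities; lower is better."""
    prob, target = prob.ravel(), target.ravel()
    inter = (prob * target).sum()
    dice = (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)
    ce = -np.mean(target * np.log(prob + eps)
                  + (1 - target) * np.log(1 - prob + eps))
    return (1.0 - dice) + ce

# Sanity check: a near-perfect prediction scores far below a uniform one.
target = np.zeros((8, 8)); target[2:6, 2:6] = 1.0
perfect = np.clip(target, 1e-4, 1 - 1e-4)
uniform = np.full((8, 8), 0.5)
assert dice_ce_loss(perfect, target) < dice_ce_loss(uniform, target)
```

In the actual training setup this loss would be minimized with AdamW (initial LR 5e-3) under cosine annealing, as stated above.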
Performance is compared against three baselines: LanGuideMedSeg (an LLM‑driven approach), xLSTM‑UNet, and a 2‑D nnUNet. Text‑Swin‑UMamba achieves a mean Dice of 82.64 % ± 17.36, Jaccard of 73.49 % ± 20.64, Hausdorff distance of 6.34 ± 10.48 px, sensitivity of 84.60 % ± 15.40, and specificity of 99.82 % ± 0.18. This represents a 37.79 % absolute improvement over LanGuideMedSeg, 2.58 % over xLSTM‑UNet, and 1.01 % over nnUNet (all p < 0.001). Hausdorff distance is also reduced by 0.52 px compared with nnUNet.
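The evaluation metrics reported above (Dice, Jaccard, Hausdorff distance in pixels) have standard definitions on binary masks; a small numpy sketch, using a naive all-pairs Hausdorff that is adequate for cropped 2-D slices:

```python
import numpy as np

def dice_jaccard(pred, gt):
    """Dice and Jaccard overlap between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum())
    return dice, inter / union

def hausdorff_px(pred, gt):
    """Symmetric Hausdorff distance in pixels between two binary masks
    (naive O(N*M) pairwise version; fine for small crops)."""
    a, b = np.argwhere(pred), np.argwhere(gt)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Toy example: ground-truth square vs. a prediction shifted by one pixel.
gt = np.zeros((32, 32), bool); gt[8:20, 8:20] = True
pred = np.zeros((32, 32), bool); pred[9:21, 9:21] = True
dice, jac = dice_jaccard(pred, gt)
hd = hausdorff_px(pred, gt)   # sqrt(2): the corners disagree diagonally
```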
Ablation studies explore different pooling strategies for the Text Tower (mean, weighted‑average, max) and the injection depth of text features. Max‑pooling yields the highest Dice (82.64 %). Injecting text at every stage of the network (full injection) improves Dice by only 0.11 % over decoder‑only injection, indicating that the short sentences provide limited token diversity. Removing the text module altogether drops Dice to 81.53 %, confirming that even modest textual cues can enhance segmentation.
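The three pooling strategies compared in the ablation reduce a (T, d) matrix of token embeddings to a single d-vector. A minimal numpy sketch (the weights in the weighted-average variant are hypothetical stand-ins for whatever scores the model learns):

```python
import numpy as np

def mean_pool(tok):
    return tok.mean(axis=0)

def max_pool(tok):
    """Per-dimension maximum; keeps the strongest activation of any token."""
    return tok.max(axis=0)

def weighted_pool(tok, w):
    """Weighted average with weights normalized to sum to 1."""
    w = np.asarray(w, float)
    w = w / w.sum()
    return (w[:, None] * tok).sum(axis=0)

tok = np.array([[1.0, -2.0],
                [3.0,  0.0],
                [2.0,  4.0]])          # 3 tokens, embedding dim 2
mean_pool(tok)                          # → [2.0, 0.6667]
max_pool(tok)                           # → [3.0, 4.0]
weighted_pool(tok, [1, 1, 2])           # → [2.0, 1.5]
```

Max-pooling's edge in the ablation is plausible for very short sentences: with few tokens, keeping the strongest per-dimension signal avoids diluting rare but informative words.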
The discussion highlights that short radiology sentences convey lesion type, anatomical location, and qualitative attributes (e.g., hypoattenuation, enhancement), which can disambiguate visually challenging cases, especially for small lesions or poorly defined boundaries. However, the authors acknowledge several limitations: the text is extremely concise (single sentence), the dataset uses cropped 2‑D slices (limiting 3‑D contextual information), and the current system lacks a text decoder for report generation or grounding of anatomical references. Future work aims to incorporate long‑form report embeddings, develop a bidirectional text‑image attention mechanism, and integrate lesion detection and automatic report generation into a unified framework.
In summary, the study demonstrates that integrating LLM‑derived textual embeddings into a modern vision transformer backbone yields statistically significant improvements in CT lesion segmentation, even when the textual input is brief. The modest gains suggest that text provides complementary global context that image‑only models cannot capture, and the proposed multimodal fusion strategy offers a promising direction for building more clinically aware AI systems that combine imaging and narrative data.