When Mamba Meets xLSTM: An Efficient and Precise Method with the xLSTM-VMUNet Model for Skin Lesion Segmentation
Automatic melanoma segmentation is essential for early skin cancer detection, yet it is challenged by the heterogeneity of melanoma as well as by interfering factors such as blurred boundaries, low contrast, and imaging artifacts. Although numerous algorithms have been developed to address these issues, previous approaches have often overlooked the need to jointly capture spatial and sequential features within dermatological images. This limitation hampers segmentation accuracy, especially in cases with indistinct borders or structurally similar lesions. In addition, previous models lacked both a global receptive field and high computational efficiency. In this work, we present the xLSTM-VMUNet model, which successfully captures spatial and sequential features within dermatological images jointly. xLSTM-VMUNet not only specializes in extracting spatial features from images, focusing on the structural characteristics of skin lesions, but also enhances contextual understanding, allowing more effective handling of complex medical image structures. Experimental results demonstrate that xLSTM-VMUNet outperforms VMUNet by 4.85% in DSC and 6.41% in IoU on the ISIC2017 dataset, and by 1.25% in DSC and 2.07% in IoU on the ISIC2018 dataset, with faster convergence and consistently high segmentation performance. Our code is available at https://github.com/FangZhuoyi/XLSTM-VMUNet.
💡 Research Summary
This paper introduces xLSTM-VMUNet, a novel deep learning model designed to address the challenges of automatic skin lesion segmentation, which is crucial for early melanoma detection. The core innovation lies in the synergistic integration of two advanced architectures: a Visual State Space Model (VSSM, based on Mamba) for spatial feature extraction and an extended Long Short-Term Memory (xLSTM) network for enhanced sequential and contextual modeling.
The authors begin by outlining the difficulties in skin lesion segmentation, such as blurred boundaries, low contrast, and heterogeneous lesion appearances. They critique existing methods, including U-Net variants and Vision Transformers, for often failing to jointly capture both spatial details and sequential/structural relationships within the image, while also struggling to balance a global receptive field with computational efficiency.
The proposed xLSTM-VMUNet model is structured as a dual-path architecture. First, the VSSM module acts as the primary encoder-decoder. It hierarchically extracts multi-scale spatial features from the input image. The VSSM leverages the selective scanning mechanism of Mamba, which provides linear computational complexity and effective long-range dependency modeling, making it efficient and powerful for capturing visual patterns. The decoder then upsamples these features, using skip connections to recover spatial details for precise boundary reconstruction.
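The selective scanning mechanism described above can be pictured with a minimal, single-channel sketch of the underlying linear recurrence. This is an illustrative toy, not the authors' implementation: the function name `selective_scan`, the diagonal state matrix, and the zero-order-hold discretization via `exp(delta * A)` are simplifying assumptions, and the real VSSM scans multiple channels over several spatial directions.

```python
import numpy as np

def selective_scan(x, A, B, C, delta):
    """Toy 1-D selective scan: a linear state-space recurrence whose
    B, C, and step size delta vary per input step (the "selective" part).
    h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * x_t ;  y_t = C_t . h_t
    """
    L = x.shape[0]          # sequence length (single channel for clarity)
    N = A.shape[0]          # hidden state size (A assumed diagonal)
    h = np.zeros(N)
    y = np.empty(L)
    for t in range(L):
        A_bar = np.exp(delta[t] * A)              # zero-order-hold discretization
        h = A_bar * h + delta[t] * B[t] * x[t]    # elementwise state update
        y[t] = C[t] @ h                           # readout
    return y
```

Because the scan is a single pass with a fixed-size state, the cost is linear in sequence length, which is the efficiency property the summary attributes to Mamba.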
Second, the high-dimensional feature maps produced by the VSSM are reshaped into sequential data and fed into the xLSTM module. The xLSTM component incorporates two specialized variants: the scalar LSTM (sLSTM) and the matrix LSTM (mLSTM). The sLSTM uses a block-diagonal recurrent weight matrix for efficient gating and local pattern emphasis, while the mLSTM employs a matrix memory with an attention-like key-value update to establish global dependencies across the feature sequence. This allows the model to refine the initially extracted spatial features by understanding their complex interrelationships and sequential context, which is particularly valuable for analyzing intricate medical image structures.
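The reshape-then-scan step can be sketched as follows: an (H, W, D) feature map is flattened into an (H*W, D) sequence and passed through a minimal matrix-memory recurrence in the spirit of mLSTM. Everything here (the name `mlstm_scan`, single-head gating, shared weight shapes) is a hypothetical simplification for illustration, not the paper's code.

```python
import numpy as np

def mlstm_scan(x, Wq, Wk, Wv, wi, wf):
    """Toy matrix-LSTM pass over a flattened feature sequence.
    Covariance-style memory update: C_t = f_t * C_{t-1} + i_t * v_t k_t^T,
    with an attention-like readout h_t = C_t q_t."""
    L, D = x.shape
    C = np.zeros((D, D))                         # matrix memory
    out = np.empty_like(x)
    for t in range(L):
        q, k, v = Wq @ x[t], Wk @ x[t], Wv @ x[t]
        i = np.exp(wi @ x[t])                    # exponential input gate (scalar)
        f = 1.0 / (1.0 + np.exp(-(wf @ x[t])))   # sigmoid forget gate (scalar)
        C = f * C + i * np.outer(v, k)           # key-value memory update
        out[t] = C @ q                           # readout
    return out

# Flatten an (H, W, D) VSSM feature map into an (H*W, D) sequence first.
feat = np.random.default_rng(0).normal(size=(8, 8, 16))
seq = feat.reshape(-1, 16)
```

The outer-product update accumulates key-value associations across the whole sequence, which is how a recurrence of this form can mediate global dependencies between distant spatial positions.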
Finally, a multi-level feature integration mechanism combines the outputs from both the VSSM and xLSTM pathways through concatenation and weighted fusion. This ensures that detailed spatial information and enriched contextual understanding are jointly utilized to generate the final, accurate segmentation mask.
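The concatenate-then-weight fusion can be sketched as a channel concatenation followed by a learned mixing matrix acting as a 1x1 convolution. The function `fuse` and its weight layout are illustrative assumptions about the general mechanism, not the authors' exact fusion module.

```python
import numpy as np

def fuse(spatial, context, w):
    """Hypothetical weighted fusion of VSSM (spatial) and xLSTM (context)
    feature maps: concatenate along channels, then mix with a learned
    (2D, D) matrix, equivalent to a 1x1 convolution over the stacked map."""
    cat = np.concatenate([spatial, context], axis=-1)  # (H, W, 2D)
    return cat @ w                                     # (H, W, D)
```

With `w` initialized as two stacked identity blocks, the fusion reduces to a plain sum of the two pathways; training would instead learn per-channel weights for each branch.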
The model is rigorously evaluated on the public ISIC2017 and ISIC2018 skin lesion segmentation datasets. Experimental results demonstrate that xLSTM-VMUNet outperforms the strong baseline VMUNet, achieving improvements of 4.00% in Dice Similarity Coefficient (DSC) and 6.93% in Intersection over Union (IoU) on ISIC2017, and 1.25% in DSC and 2.07% in IoU on ISIC2018. Furthermore, the model exhibits faster convergence during training and maintains consistently high performance across various lesion types. The paper concludes that the effective fusion of Mamba’s efficient spatial modeling with xLSTM’s advanced sequential processing capability offers a promising direction for accurate and efficient medical image segmentation. The source code is publicly released to facilitate further research and reproducibility.