Accurate and computationally efficient 3D medical image segmentation remains a critical challenge in clinical workflows. Transformer-based architectures often demonstrate superior global contextual modeling, but at the expense of excessive parameter counts and memory demands, restricting their clinical deployment. We propose RefineFormer3D, a lightweight hierarchical transformer architecture that balances segmentation accuracy and computational efficiency for volumetric medical imaging. The architecture integrates three key components: (i) GhostConv3D-based patch embedding for efficient feature extraction with minimal redundancy, (ii) a MixFFN3D module with low-rank projections and depthwise convolutions for parameter-efficient feature transformation, and (iii) a cross-attention fusion decoder enabling adaptive multi-scale skip-connection integration. RefineFormer3D contains only 2.94M parameters, substantially fewer than contemporary transformer-based methods. Extensive experiments on the ACDC and BraTS benchmarks demonstrate that RefineFormer3D achieves 93.44\% and 85.9\% average Dice scores, respectively, outperforming or matching state-of-the-art methods while requiring significantly fewer parameters. Furthermore, the model achieves fast inference (8.35 ms per volume on GPU) with low memory requirements, supporting deployment in resource-constrained clinical environments. These results establish RefineFormer3D as an effective and scalable solution for practical 3D medical image segmentation.
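As a rough illustration of the GhostConv3D idea referenced above, the sketch below shows one plausible PyTorch realization of a 3D ghost convolution, in which a primary convolution produces a reduced set of intrinsic feature maps and a cheap depthwise branch generates the remaining "ghost" maps. The ratio, kernel sizes, and normalization choices are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GhostConv3D(nn.Module):
    """Illustrative 3D ghost convolution: a primary conv yields intrinsic maps,
    a cheap depthwise conv generates the remaining 'ghost' maps (assumed design)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, ratio=2):
        super().__init__()
        assert out_ch % ratio == 0, "out_ch assumed divisible by ratio"
        init_ch = out_ch // ratio                  # channels from the primary convolution
        ghost_ch = out_ch - init_ch                # channels from the cheap depthwise branch
        self.primary = nn.Sequential(
            nn.Conv3d(in_ch, init_ch, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm3d(init_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            nn.Conv3d(init_ch, ghost_ch, 3, 1, 1, groups=init_ch, bias=False),  # depthwise
            nn.BatchNorm3d(ghost_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                          # x: (B, in_ch, D, H, W)
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)              # nearly-free additional feature maps
        return torch.cat([intrinsic, ghost], dim=1)

# Example: patch-embedding-style use on a single-channel volume (hypothetical shapes).
x = torch.randn(1, 1, 32, 64, 64)
features = GhostConv3D(in_ch=1, out_ch=32, stride=2)(x)   # -> (1, 32, 16, 32, 32)
```

Because a share of the output channels comes from a depthwise branch, such a block needs far fewer parameters than a full 3D convolution of the same width, which is consistent with the small parameter budget reported above.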
Deep learning has fundamentally reshaped medical image analysis by enabling the automatic extraction of complex patterns from clinical data at unprecedented scale. Among these advances, 3D medical image segmentation stands as a foundational task, crucial for applications ranging from organ localization and tumor delineation to treatment planning. Traditional encoder-decoder networks such as U-Net [32] and its derivatives [16,44] have long been the backbone of volumetric segmentation, effectively compressing input data into latent representations and reconstructing dense voxel-wise predictions. However, the limited receptive field of convolutional operators and their inherent locality bias restrict their capacity to model global anatomical context, particularly in cases involving large inter-patient variation in scale, texture, and shape.
To address these limitations, transformer-based architectures have emerged as a powerful alternative: they leverage global self-attention mechanisms [15] to capture long-range dependencies and semantic coherence across medical volumes. Pioneering contributions such as TransUNet [5] and UNETR [12] have demonstrated the effectiveness of integrating Vision Transformers with convolutional decoders, leading to substantial improvements over earlier purely convolutional approaches. SWIN-Unet [4] has further explored hierarchical, pure-transformer U-shaped architectures, affirming the value of transformer backbones for contextual representation learning in segmentation tasks.
However, these gains come at a cost: full self-attention incurs heavy memory overhead and computational burden, limiting applicability in real-world clinical scenarios where efficiency and reliability are paramount. Moreover, prevailing skip-fusion strategies typically rely on static concatenation or convolutional operations and may inadequately integrate multi-scale features, undermining segmentation performance in anatomically complex or ambiguous regions. Conventional concatenation-based skip connections treat all encoder features uniformly, failing to selectively aggregate semantically relevant information for the decoder's current reconstruction state. This naive fusion not only introduces redundant features but also forgoes adaptive, query-driven selection mechanisms that could align encoder context with decoder-specific requirements. The limitation becomes acute when segmenting heterogeneous anatomical structures with variable appearance.
Recent state-of-the-art research, including nnFormer [43] and SegFormer3D [30], has attempted to reconcile this trade-off between performance and efficiency by introducing hybrid or windowed attention schemes and lighter multilayer perceptron (MLP) based decoders. These architectures have made strides in reducing computational cost, and models such as LeViT-UNet [39] have specifically targeted inference efficiency by employing fast transformer encoders. Despite these advances, many contemporary transformer models still retain excessive parameter counts, especially within skip-fusion and decoding modules, or sacrifice global feature integration to achieve efficiency. Furthermore, the repetitive and compressible nature of volumetric medical data, a characteristic ideally suited for lightweight, context-aware modeling, remains underexploited in existing segmentation frameworks.
In response to these persistent challenges, we propose RefineFormer3D, a hierarchical multiscale transformer architecture engineered for both parametric efficiency and robust contextual reasoning in 3D medical image segmentation. The core of our method is the cross-attention fusion decoder block, which uses GhostConv3D for efficient 3D feature processing and enhances it with channel-wise attention based on Squeeze-and-Excitation mechanisms, adaptively and dynamically aggregating multi-scale features across the decoder. This attention-aware fusion strategy refines semantic integration throughout the network while minimizing computational overhead. The encoder employs hierarchical windowed self-attention and MixFFN3D modules with low-rank projections and depthwise 3D convolutions, capturing both global dependencies and fine anatomical detail while effectively balancing accuracy and computational efficiency.
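To make the fusion mechanism concrete, the following is a minimal sketch of an attention-aware skip fusion combined with Squeeze-and-Excitation channel reweighting. It assumes the decoder and encoder skip features share the same channel count and spatial resolution and that the channel dimension is divisible by the head count; it illustrates the general query-driven fusion idea described here rather than the exact RefineFormer3D block.

```python
import torch
import torch.nn as nn

class SqueezeExcite3D(nn.Module):
    """Channel-wise attention via global pooling and a small bottleneck MLP."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (B, C, D, H, W)
        w = x.mean(dim=(2, 3, 4))                    # squeeze: global average pool -> (B, C)
        w = self.fc(w).view(x.size(0), -1, 1, 1, 1)  # excite: per-channel gates
        return x * w

class CrossAttentionFusion3D(nn.Module):
    """Decoder features query encoder skip features, then SE reweights the result.
    Assumes matching channel count and spatial size between the two inputs."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(channels)
        self.norm_kv = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.se = SqueezeExcite3D(channels)

    def forward(self, dec, skip):                    # both: (B, C, D, H, W)
        B, C, D, H, W = dec.shape
        q = dec.flatten(2).transpose(1, 2)           # (B, N, C) decoder queries
        kv = skip.flatten(2).transpose(1, 2)         # (B, N, C) encoder keys/values
        fused, _ = self.attn(self.norm_q(q), self.norm_kv(kv), self.norm_kv(kv))
        fused = (q + fused).transpose(1, 2).reshape(B, C, D, H, W)
        return self.se(fused)                        # channel-refined fused features
```

Because the decoder features act as queries over the encoder skip tokens, the fusion is query-driven rather than a uniform concatenation; in practice, attention over full-resolution 3D token grids is memory-heavy, which is one reason the encoder pairs attention with windowing and lightweight convolutions.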
Figure 1 illustrates the performance-efficiency trade-off on the ACDC dataset [2]. RefineFormer3D achieves superior performance with only 2.94 million parameters, outperforming state-of-the-art methods. Extensive experiments on widely used medical segmentation benchmarks, including ACDC [2] and BraTS [27], demonstrate that RefineFormer3D not only outperforms state-of-the-art models such as nnFormer, SegFormer3D, and UNETR in segmentation accuracy but also achieves notable reductions in parameter count and inference time. These results position RefineFormer3D as an effective and scalable solution for practical 3D medical image segmentation.