Accurate and computationally efficient 3D medical image segmentation remains a critical challenge in clinical workflows. Transformer-based architectures often demonstrate superior global contextual modeling, but at the expense of excessive parameter counts and memory demands, restricting their clinical deployment. We propose RefineFormer3D, a lightweight hierarchical transformer architecture that balances segmentation accuracy and computational efficiency for volumetric medical imaging. The architecture integrates three key components: (i) GhostConv3D-based patch embedding for efficient feature extraction with minimal redundancy, (ii) a MixFFN3D module with low-rank projections and depthwise convolutions for parameter-efficient feature transformation, and (iii) a cross-attention fusion decoder enabling adaptive multi-scale skip-connection integration. RefineFormer3D contains only 2.94M parameters, substantially fewer than contemporary transformer-based methods. Extensive experiments on the ACDC and BraTS benchmarks demonstrate that RefineFormer3D achieves 93.44\% and 85.9\% average Dice scores, respectively, outperforming or matching state-of-the-art methods while requiring significantly fewer parameters. Furthermore, the model achieves fast inference (8.35 ms per volume on GPU) with low memory requirements, supporting deployment in resource-constrained clinical environments. These results establish RefineFormer3D as an effective and scalable solution for practical 3D medical image segmentation.
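As a rough illustration of the GhostConv3D idea referenced above, the sketch below shows one plausible PyTorch realization of a 3D ghost convolution, in which a primary convolution produces a reduced set of intrinsic feature maps and a cheap depthwise branch generates the remaining "ghost" maps. The ratio, kernel sizes, and normalization choices are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GhostConv3D(nn.Module):
    """Illustrative 3D ghost convolution: a primary conv yields intrinsic maps,
    a cheap depthwise conv generates the remaining 'ghost' maps (assumed design)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, ratio=2):
        super().__init__()
        assert out_ch % ratio == 0, "out_ch assumed divisible by ratio"
        init_ch = out_ch // ratio                  # channels from the primary convolution
        ghost_ch = out_ch - init_ch                # channels from the cheap depthwise branch
        self.primary = nn.Sequential(
            nn.Conv3d(in_ch, init_ch, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm3d(init_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            nn.Conv3d(init_ch, ghost_ch, 3, 1, 1, groups=init_ch, bias=False),  # depthwise
            nn.BatchNorm3d(ghost_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                          # x: (B, in_ch, D, H, W)
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)              # nearly-free additional feature maps
        return torch.cat([intrinsic, ghost], dim=1)

# Example: patch-embedding-style use on a single-channel volume (hypothetical shapes).
x = torch.randn(1, 1, 32, 64, 64)
features = GhostConv3D(in_ch=1, out_ch=32, stride=2)(x)   # -> (1, 32, 16, 32, 32)
```

Because a share of the output channels comes from a depthwise branch, such a block needs far fewer parameters than a full 3D convolution of the same width, which is consistent with the small parameter budget reported above.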
Deep learning has fundamentally reshaped medical image analysis by enabling the automatic extraction of complex patterns from clinical data at unprecedented scale. Among these advances, 3D medical image segmentation stands as a foundational task, crucial for applications ranging from organ localization and tumor delineation to treatment planning. Traditional encoder-decoder networks such as U-Net [32] and its derivatives [16,44] have long been the backbone of volumetric segmentation, effectively compressing input data into latent representations and reconstructing dense voxel-wise predictions. However, the limited receptive field of convolutional operators and their inherent locality bias restrict their capacity to model global anatomical context, particularly in cases involving large inter-patient variation in scale, texture, and shape.
To address these limitations, transformer-based architectures have emerged as a powerful alternative: they leverage global self-attention mechanisms [15] to capture long-range dependencies and semantic coherence across medical volumes. Pioneering contributions such as TransUNet [5] and UNETR [12] have demonstrated the effectiveness of integrating Vision Transformers with convolutional decoders, leading to substantial improvements over earlier purely convolutional approaches. SWIN-Unet [4] has further explored hierarchical, pure-transformer U-shaped architectures, affirming the value of transformer backbones for contextual representation learning in segmentation tasks.
However, these gains come at a cost: full self-attention incurs heavy memory overhead and computational burden, limiting applicability in real-world clinical scenarios where efficiency and reliability are paramount. Moreover, prevailing skip-fusion strategies typically rely on static concatenation or convolutional operations and may inadequately integrate multi-scale features, undermining segmentation performance in anatomically complex or ambiguous regions. Conventional concatenation-based skip connections treat all encoder features uniformly, failing to selectively aggregate semantically relevant information for the decoder's current reconstruction state. This naive fusion not only introduces redundant features but also forgoes adaptive, query-driven selection mechanisms that could align encoder context with decoder-specific requirements. The limitation becomes acute when segmenting heterogeneous anatomical structures with variable appearance.
Recent state-of-the-art research, including nnFormer [43] and SegFormer3D [30], has attempted to reconcile this trade-off between performance and efficiency by introducing hybrid or windowed attention schemes and lighter multilayer perceptron (MLP) based decoders. These architectures have made strides in reducing computational cost, and models such as LeViT-UNet [39] have specifically targeted inference efficiency by employing fast transformer encoders. Despite these advances, many contemporary transformer models still retain excessive parameter counts, especially within skip-fusion and decoding modules, or sacrifice global feature integration to achieve efficiency. Furthermore, the repetitive and compressible nature of volumetric medical data, a characteristic ideally suited for lightweight, context-aware modeling, remains underexploited in existing segmentation frameworks.
In response to these persistent challenges, we propose RefineFormer3D, a hierarchical multiscale transformer architecture engineered for both parametric efficiency and robust contextual reasoning in 3D medical image segmentation. The core of our method is the cross-attention fusion decoder block, which uses GhostConv3D for efficient 3D feature processing and enhances it with channel-wise attention based on Squeeze-and-Excitation mechanisms, adaptively and dynamically aggregating multi-scale features across the decoder. This attention-aware fusion strategy refines semantic integration throughout the network while minimizing computational overhead. The encoder employs hierarchical windowed self-attention and MixFFN3D modules with low-rank projections and depthwise 3D convolutions, capturing both global dependencies and fine anatomical detail while effectively balancing accuracy and computational efficiency.
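To make the fusion mechanism concrete, the following is a minimal sketch of an attention-aware skip fusion combined with Squeeze-and-Excitation channel reweighting. It assumes the decoder and encoder skip features share the same channel count and spatial resolution and that the channel dimension is divisible by the head count; it illustrates the general query-driven fusion idea described here rather than the exact RefineFormer3D block.

```python
import torch
import torch.nn as nn

class SqueezeExcite3D(nn.Module):
    """Channel-wise attention via global pooling and a small bottleneck MLP."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (B, C, D, H, W)
        w = x.mean(dim=(2, 3, 4))                    # squeeze: global average pool -> (B, C)
        w = self.fc(w).view(x.size(0), -1, 1, 1, 1)  # excite: per-channel gates
        return x * w

class CrossAttentionFusion3D(nn.Module):
    """Decoder features query encoder skip features, then SE reweights the result.
    Assumes matching channel count and spatial size between the two inputs."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(channels)
        self.norm_kv = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.se = SqueezeExcite3D(channels)

    def forward(self, dec, skip):                    # both: (B, C, D, H, W)
        B, C, D, H, W = dec.shape
        q = dec.flatten(2).transpose(1, 2)           # (B, N, C) decoder queries
        kv = skip.flatten(2).transpose(1, 2)         # (B, N, C) encoder keys/values
        fused, _ = self.attn(self.norm_q(q), self.norm_kv(kv), self.norm_kv(kv))
        fused = (q + fused).transpose(1, 2).reshape(B, C, D, H, W)
        return self.se(fused)                        # channel-refined fused features
```

Because the decoder features act as queries over the encoder skip tokens, the fusion is query-driven rather than a uniform concatenation; in practice, attention over full-resolution 3D token grids is memory-heavy, which is one reason the encoder pairs attention with windowing and lightweight convolutions.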
Figure 1 illustrates the performance-efficiency trade-off on the ACDC dataset [2]. RefineFormer3D achieves superior performance with only 2.94 million parameters, outperforming state-of-the-art methods. Extensive experiments on widely used medical segmentation benchmarks, including ACDC [2] and BraTS [27], demonstrate that RefineFormer3D not only outperforms state-of-the-art models such as nnFormer, SegFormer3D, and UNETR in segmentation accuracy but also achieves notable reductions in parameter count and inference time. These results position RefineFormer3D as an effective and scalable solution for practical 3D medical image segmentation.