Fast and Accurate 3D Medical Image Segmentation with Data-swapping Method

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please refer to the original arXiv source.

Deep neural network models used for medical image segmentation are large because they are trained on high-resolution three-dimensional (3D) images. Graphics processing units (GPUs) are widely used to accelerate training, but GPU memory is often too small to hold these models. A popular way to tackle this problem is the patch-based method, which divides a large image into small patches and trains the model on them; however, it degrades segmentation quality when a target object spans multiple patches. In this paper, we propose a novel approach to 3D medical image segmentation that uses data-swapping, which swaps intermediate data out of GPU memory into CPU memory to enlarge the effective GPU memory size, allowing high-resolution 3D medical images to be trained without patching. We carefully tuned the parameters of the data-swapping method to obtain the best training performance for 3D U-Net, a widely used deep neural network model for medical image segmentation, and applied this tuning to train 3D U-Net on full-size images of 192 x 192 x 192 voxels from a brain tumor dataset. As a result, communication overhead, the most important issue, was reduced by 17.1%. Compared with the patch-based method using patches of 128 x 128 x 128 voxels, our training on full-size images improved the mean Dice score by 4.48% and 5.32% for the whole-tumor and tumor-core sub-regions, respectively. Total training time was reduced from 164 hours to 47 hours, an acceleration of 3.53 times.


💡 Research Summary

This paper addresses the critical memory limitation that hampers training of high‑resolution 3‑dimensional (3D) medical image segmentation networks on modern GPUs. Conventional approaches mitigate the problem by dividing large volumes into small patches (e.g., 128 × 128 × 128 voxels) before feeding them to a 3D U‑Net. While effective for fitting into GPU memory, patch‑based training degrades segmentation quality when lesions span multiple patches, and it also increases overall training time due to redundant processing of overlapping regions.
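The tiling arithmetic behind the patch-based baseline is straightforward: cover each 192-voxel axis with 128-voxel windows, letting neighbouring patches overlap where the sizes do not divide evenly. A minimal sketch (function names are illustrative, not from the paper):

```python
def axis_offsets(length, patch):
    """Start offsets that cover an axis of `length` voxels with
    windows of size `patch`; the last window is shifted back so it
    stays inside the volume, overlapping its neighbour if needed."""
    offsets = list(range(0, max(length - patch, 0) + 1, patch))
    if offsets[-1] + patch < length:
        offsets.append(length - patch)  # final patch flush with the border
    return offsets

def patch_origins(shape, patch):
    """All 3D patch origins needed to cover a volume of `shape`."""
    per_axis = [axis_offsets(s, patch) for s in shape]
    return [(x, y, z) for x in per_axis[0]
                      for y in per_axis[1]
                      for z in per_axis[2]]

# A 192^3 volume needs 2 windows per axis -> 8 patches of 128^3,
# each overlapping its neighbour by 64 voxels along every axis.
origins = patch_origins((192, 192, 192), 128)
```

The 64-voxel overlap is exactly the redundant processing the summary refers to: roughly a third of each axis is computed twice.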

The authors propose to eliminate patching altogether by employing a data‑swapping technique, which moves intermediate feature maps from GPU memory to CPU memory after the forward pass and brings them back just before they are needed in the backward pass. This method dramatically expands the effective GPU memory capacity, allowing the full‑size brain MRI volumes (192 × 192 × 192 voxels) from the BraTS‑2017 dataset to be processed in a single forward‑backward iteration.
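Conceptually, the schedule stashes each layer's activation in host (CPU) memory right after it is produced in the forward pass and restores it just before its gradient is needed in the backward pass. A toy sketch of that schedule (names and structure are illustrative; real TFLMS achieves this by rewriting the TensorFlow graph, not with explicit Python calls):

```python
class SwapStore:
    """Toy stand-in for CPU memory holding swapped-out activations."""
    def __init__(self):
        self.host = {}  # layer name -> activation, "in CPU memory"

    def swap_out(self, name, activation):
        self.host[name] = activation      # device-to-host copy in the real system

    def swap_in(self, name):
        return self.host.pop(name)        # host-to-device copy in the real system

def train_step(layers, x, store):
    """Forward with swap-out, then backward with swap-in, in reverse order."""
    for name, fwd, _ in layers:
        x = fwd(x)
        store.swap_out(name, x)           # evict activation from GPU memory
    grad = x
    for name, _, bwd in reversed(layers):
        act = store.swap_in(name)         # restore just before the gradient step
        grad = bwd(act, grad)
    return grad
```

Only the layer currently computing needs its activation on the GPU, which is why the effective memory capacity grows to roughly the CPU memory size.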

Implementation relies on TensorFlow Large Model Support (TFLMS), an open‑source extension that automatically inserts swap‑out and swap‑in nodes into the computational graph. The key to making data‑swapping practical is minimizing the communication overhead between CPU and GPU. The authors systematically explore four configurations of TFLMS parameters:

  1. Config 1 – swap all 843 feature maps (n_tensors = -1). Memory usage is minimal, but communication stalls dominate, causing long idle periods between forward and backward passes.
  2. Config 2 – swap only the first N maps (n_tensors = N). Although fewer swaps reduce overhead, the breadth‑first search of TFLMS still selects large synthesis‑path maps, leaving significant latency.
  3. Config 3 – exclude the synthesis (decoder) path from swapping (excl_scopes). This removes the most time‑consuming swaps during the backward pass, cutting the idle gap.
  4. Config 4 – combine exclusion of the synthesis path with a large “lb” value, which pre‑fetches swapped‑in data earlier, allowing overlap of communication with computation.
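The effect of these settings on *which* feature maps get swapped can be sketched as a simple filter over the graph's tensors (the scope names and selection logic below are illustrative only; TFLMS applies the real `n_tensors` and `excl_scopes` settings when it rewrites the graph, and `lb` controls how early each swap-in is scheduled rather than what is swapped):

```python
def select_swaps(tensors, n_tensors=-1, excl_scopes=()):
    """Pick which feature maps to swap out.

    tensors:     (name, scope) pairs in the breadth-first order TFLMS visits.
    n_tensors:   -1 swaps everything (Config 1); N swaps the first N (Config 2).
    excl_scopes: scopes never swapped, e.g. the synthesis/decoder path
                 (Configs 3 and 4).
    """
    kept = [t for t in tensors
            if not any(t[1].startswith(s) for s in excl_scopes)]
    return kept if n_tensors < 0 else kept[:n_tensors]

# Hypothetical 3D U-Net tensors: analysis (encoder) and synthesis (decoder).
feature_maps = [("conv1", "analysis"), ("conv2", "analysis"),
                ("up1", "synthesis"), ("up2", "synthesis")]

config1 = select_swaps(feature_maps)                              # swap all
config3 = select_swaps(feature_maps, excl_scopes=("synthesis",))  # encoder only
```

Excluding the synthesis path matters because its feature maps are needed early in the backward pass, so swapping them buys little memory headroom at a high latency cost.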

Empirical results show that Config 4 reduces total communication overhead by 17.1 % compared with Config 1, while keeping peak GPU memory within the 16 GB limit of a Tesla P100.

The experimental setup uses an IBM Power S822LC server (2 × POWER8 CPUs, 512 GB RAM) and a single NVIDIA Tesla P100 GPU connected via NVLink 1.0 (80 GB/s bidirectional bandwidth). Training employs the Adam optimizer (initial LR = 5e-4, decay factor 0.5, patience 10) and Dice loss, with 5‑fold cross‑validation and on‑the‑fly data augmentation (random flips and axis permutations).
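The Dice score used for both the loss and the evaluation is the overlap ratio 2|A∩B| / (|A| + |B|) between the predicted and ground-truth voxel sets. A minimal implementation over flattened binary masks (the epsilon smoothing is a common convention, not a detail stated in the paper):

```python
def dice_score(pred, truth, eps=1e-7):
    """Dice coefficient for two equal-length binary voxel masks
    (flattened 3D volumes); eps guards the empty-mask case."""
    assert len(pred) == len(truth)
    intersection = sum(p and t for p, t in zip(pred, truth))
    return (2.0 * intersection + eps) / (sum(pred) + sum(truth) + eps)
```

The Dice loss minimized during training is then simply `1 - dice_score(pred, truth)`, computed on soft predictions instead of hard masks.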

Performance comparison:

  • Segmentation quality: Full‑volume training improved the mean Dice score by 4.48 % for the whole‑tumor region and 5.32 % for the tumor‑core region relative to the patch‑based baseline (128³ patches). This confirms that preserving global context improves lesion delineation.
  • Training time: One epoch with Config 4 required 17.1 % less time than Config 1. Overall, the full‑volume approach reduced total training time from 164 hours (patch‑based) to 47 hours, a 3.53× speed‑up.
  • Comparison with recomputation: A checkpoint‑based recomputation strategy (discarding intermediate activations and recomputing them during back‑propagation) was also evaluated. Data‑swapping outperformed recomputation by 14.4 % per epoch, because recomputation incurs extra FLOPs while data‑swapping only adds communication latency, which can be overlapped with computation.
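The gap between recomputation and swapping follows from a simple critical-path argument: recomputed forward passes add FLOPs serially to each step, whereas with sufficient prefetching (the `lb` setting) swap transfers run concurrently with computation. A toy per-step cost model (all timings hypothetical, for illustration only):

```python
def step_time_recompute(compute_ms, recompute_ms):
    """Recomputation: the extra forward FLOPs add directly to the step."""
    return compute_ms + recompute_ms

def step_time_swap(compute_ms, transfer_ms):
    """Data-swapping with prefetch: communication overlaps computation,
    so only the slower of the two sits on the critical path."""
    return max(compute_ms, transfer_ms)

# Hypothetical per-step timings (ms): heavy compute, moderate transfer.
compute, recompute, transfer = 100.0, 40.0, 80.0
recompute_step = step_time_recompute(compute, recompute)  # 140.0 ms
swap_step = step_time_swap(compute, transfer)             # 100.0 ms, transfer hidden
```

When transfer time exceeds compute time the swap step becomes communication-bound instead, which is why the tuning in Configs 3 and 4 focuses on shrinking and prefetching the transfers.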

Significance: The study demonstrates that data‑swapping, when carefully tuned, enables training of large 3D segmentation networks on commodity GPUs without sacrificing accuracy or incurring prohibitive training times. It provides a concrete tuning guide (n_tensors, lb, excl_scopes) that can be adapted to other volumetric models such as V‑Net or nnU‑Net.

Limitations and future work: Experiments were limited to a single‑GPU environment; scaling to multi‑GPU or distributed settings will require additional coordination of swap operations and may be constrained by inter‑node bandwidth. The approach also depends on sufficient CPU memory and fast CPU‑GPU interconnects; systems with slower PCIe links may see reduced benefits. Future research will explore hybrid strategies that combine selective recomputation with swapping, automated hyper‑parameter search for optimal TFLMS settings, and application to other imaging modalities (e.g., lung CT, cardiac MRI).

In summary, the paper provides a practical, well‑validated solution to the GPU memory bottleneck in high‑resolution 3D medical image segmentation, achieving both higher segmentation accuracy and substantially faster training compared with traditional patch‑based pipelines.

