PyRadiomics-cuda: 3D features extraction from medical images for HPC using GPU acceleration

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

PyRadiomics-cuda is a GPU-accelerated extension of the PyRadiomics library, designed to address the computational challenges of extracting three-dimensional shape features from medical images. By offloading key geometric computations to GPU hardware, it dramatically reduces processing times for large volumetric datasets. The system maintains full compatibility with the original PyRadiomics API, enabling seamless integration into existing AI workflows without code modifications. This transparent acceleration facilitates efficient, scalable radiomics analysis, supporting the rapid feature extraction essential for high-throughput AI pipelines. Tests performed on a typical computational cluster as well as on budget and consumer-grade devices demonstrate its usefulness across all of these scenarios.


💡 Research Summary

PyRadiomics‑cuda is a GPU‑accelerated extension of the widely used PyRadiomics library, targeting the most time‑consuming part of radiomics pipelines: three‑dimensional shape feature extraction. The authors designed the extension to be completely transparent to existing code, preserving the original PyRadiomics API while automatically detecting CUDA‑capable hardware at runtime and falling back to the CPU implementation when no GPU is available.

The core of the acceleration focuses on two computationally intensive steps. First, the Marching Cubes algorithm, which builds a triangular mesh of the region‑of‑interest (ROI) and simultaneously accumulates volume and surface area, is parallelized by assigning each voxel to an independent CUDA thread. When a thread discovers that the isosurface passes through its voxel, it generates the corresponding triangle and updates global data structures using atomic operations and block‑level reductions. Second, the calculation of the maximum 3‑D diameter (and planar diameters) requires evaluating pairwise distances between all mesh vertices, an O(m²) operation. The authors implemented a parallel reduction where each thread processes a subset of vertex pairs, stores a local maximum, and then participates in a hierarchical reduction to obtain the global maximum distance.
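The pairwise-distance strategy described above can be sketched on the CPU. The following is a minimal illustration, not the paper's actual CUDA implementation: each "thread" (here, a chunk of vertices) scans its subset of vertex pairs and records a local maximum, and the local maxima are then reduced to the global maximum distance. The function name and chunk size are illustrative choices.

```python
import numpy as np

def max_diameter(vertices: np.ndarray, chunk: int = 256) -> float:
    """Maximum pairwise Euclidean distance between mesh vertices.

    Mimics the described GPU strategy: each chunk computes a local
    maximum over its vertex pairs; local maxima are then reduced
    hierarchically to the global maximum.
    """
    m = len(vertices)
    local_maxima = []
    for start in range(0, m, chunk):
        block = vertices[start:start + chunk]          # one "thread"'s subset
        # squared distances from this block to all vertices (vectorized)
        diffs = block[:, None, :] - vertices[None, :, :]
        d2 = np.einsum("ijk,ijk->ij", diffs, diffs)
        local_maxima.append(d2.max())                  # per-thread local maximum
    return float(np.sqrt(max(local_maxima)))           # final reduction
```

On a GPU the same structure maps to per-thread local maxima followed by block-level and grid-level reductions, avoiding a single globally contended atomic for every pair.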

Five optimization strategies were explored: (1) load‑balancing across threads, (2) block‑based atomic reductions, (3) shared‑memory tiling to reduce global memory traffic, (4) per‑thread local accumulators to minimize atomic contention, and (5) simplifying data structures from 2‑D to 1‑D arrays for faster indexing. Performance varied with GPU architecture: modern NVIDIA H100 GPUs benefited from fast atomic operations and thus favored the local‑accumulator approach, while older T4 GPUs performed best with block‑based reductions. The RTX 4070 showed the best overall speedup using a combination of shared memory and local accumulators.

Benchmarking was performed on 20 cases from the public KITS19 kidney‑tumor segmentation challenge, covering a wide range of image sizes (50 KB–9 MB) and mesh vertex counts (≈2.7 k–236 k). Three hardware configurations were tested: (1) a high‑end AI cluster equipped with an NVIDIA H100, (2) a consumer‑grade desktop with an RTX 4070, and (3) a budget server with an NVIDIA T4. CPU baselines were run on an AMD EPYC and an Intel Xeon. Results show that diameter computation dominates the original CPU runtime (95 %–99.9 % of total time). With GPU acceleration, overall pipeline execution time (including file I/O) was reduced by up to 2 000× for the largest meshes on the H100, while the RTX 4070 and the T4 each delivered 8–24× improvements over their CPU baselines. For small files, I/O overhead dominates and the speedup diminishes, highlighting the need for further I/O optimization (e.g., direct memory access, asynchronous streaming).

Compatibility is achieved through a setuptools‑based build process that detects the NVIDIA CUDA compiler (nvcc) and compiles the C/CUDA sources. At runtime a dispatcher checks for a functional CUDA device; if successful, it routes shape‑feature calls to the optimized kernels, otherwise it gracefully reverts to the original CPU functions. Consequently, existing PyRadiomics scripts require no modification: a simple from radiomics import featureextractor followed by ext.execute(scan, mask) works identically, with the added benefit of GPU acceleration when available.

Limitations include the current focus solely on shape features; first‑order statistics and texture descriptors (GLCM, GLRLM, etc.) remain CPU‑bound. Additionally, GPU memory constraints may become a bottleneck for extremely high‑resolution volumes, and the data transfer cost can offset gains for very small datasets. The authors propose extending CUDA support to texture calculations, employing memory‑efficient streaming techniques, and integrating asynchronous I/O to overlap data movement with computation.

In conclusion, PyRadiomics‑cuda delivers a practical, drop‑in GPU acceleration path for 3D shape radiomics, achieving dramatic speedups across a spectrum of hardware from consumer GPUs to state‑of‑the‑art AI accelerators. By maintaining full API compatibility and providing automatic fallback, it enables researchers and clinicians to scale radiomics analyses without rewriting pipelines, paving the way for large‑scale, high‑throughput imaging studies in precision medicine.

