Multi-GPU implementation of a VMAT treatment plan optimization algorithm

Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, their limited memory cannot accommodate cases with a large dose-deposition coefficient (DDC) matrix. This paper reports an implementation of our column-generation-based VMAT algorithm on a multi-GPU platform to solve the memory limitation problem. The column-generation approach generates apertures sequentially by iteratively solving a pricing problem (PP) and a master problem (MP). The DDC matrix is split into four sub-matrices according to beam angles, which are stored on four GPUs in compressed sparse row format. The beamlet prices are computed on multiple GPUs, while the remaining steps of the PP and MP are implemented on a single GPU due to their modest computational loads. A head-and-neck (H&N) patient case was used to validate our method. We compare our multi-GPU implementation with three single-GPU implementation strategies: truncating the DDC matrix (S1), repeatedly transferring the DDC matrix between CPU and GPU (S2), and porting computations involving the DDC matrix to the CPU (S3). Two more H&N patient cases and three prostate cases were also used to demonstrate the advantages of our method. Our multi-GPU implementation can finish the optimization within ~1 minute for the H&N patient case. S1 leads to inferior plan quality, although its total time is 10 seconds shorter than that of the multi-GPU implementation. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ~4 minutes and ~6 minutes, respectively. High computational efficiency was consistently achieved for the other five cases. The results demonstrate that the multi-GPU implementation can handle the large-scale VMAT optimization problem efficiently without sacrificing plan quality.

💡 Research Summary

The paper addresses the computational bottleneck of volumetric modulated arc therapy (VMAT) treatment‑plan optimization, which stems from the massive size of the dose‑deposition coefficient (DDC) matrix, the high number of degrees of freedom, and numerous hardware constraints. While graphics processing units (GPUs) have previously been employed to accelerate VMAT calculations, their limited on‑board memory prevents the direct handling of large DDC matrices that are typical for complex cases such as head‑and‑neck (H&N) treatments.

To overcome this limitation, the authors implement their previously developed column‑generation VMAT algorithm on a multi‑GPU platform. The DDC matrix is partitioned into four sub‑matrices according to beam angles and stored in compressed sparse row (CSR) format on four separate GPUs. The most computationally intensive step—calculating the beamlet price in the pricing problem (PP)—is performed concurrently across all GPUs. After each GPU computes its local price contributions, the results are aggregated, and the subsequent steps of the PP as well as the master problem (MP) are solved on a single GPU because their computational load is modest. This design minimizes inter‑GPU communication while fully exploiting data parallelism for the dominant operation.
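The partitioning described above can be illustrated with a small CPU-only sketch. This is not the authors' CUDA code. It assumes a layout the summary does not spell out: the matrix is stored with one row per beamlet and one column per voxel, so that a beam-angle block is a contiguous range of CSR rows and the beamlet price is a sparse matrix-vector product with a per-voxel objective gradient. The names `CSRBlock`, `split_by_rows`, `spmv`, and `beamlet_prices` are hypothetical, and each loop iteration in `beamlet_prices` stands in for the work one GPU would do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CSRBlock:
    """One device's share of the matrix (assumed: rows = beamlets, cols = voxels)."""
    indptr: List[int]    # row pointers, rebased so the block starts at 0
    indices: List[int]   # voxel (column) index of each stored value
    data: List[float]    # dose-deposition coefficients


def split_by_rows(indptr: List[int], indices: List[int], data: List[float],
                  n_parts: int) -> List[CSRBlock]:
    """Cut a CSR matrix into n_parts contiguous row ranges (one per device)."""
    n_rows = len(indptr) - 1
    bounds = [n_rows * p // n_parts for p in range(n_parts + 1)]
    blocks = []
    for lo, hi in zip(bounds, bounds[1:]):
        start, end = indptr[lo], indptr[hi]
        local_ptr = [p - start for p in indptr[lo:hi + 1]]
        blocks.append(CSRBlock(local_ptr, indices[start:end], data[start:end]))
    return blocks


def spmv(block: CSRBlock, x: List[float]) -> List[float]:
    """CSR sparse matrix-vector product for one block."""
    out = []
    for r in range(len(block.indptr) - 1):
        acc = 0.0
        for k in range(block.indptr[r], block.indptr[r + 1]):
            acc += block.data[k] * x[block.indices[k]]
        out.append(acc)
    return out


def beamlet_prices(blocks: List[CSRBlock], voxel_gradient: List[float]) -> List[float]:
    """Compute per-block prices independently, then concatenate them.

    In a multi-GPU setting each block's spmv would run on its own device;
    only the short price vectors need to be gathered afterwards.
    """
    prices: List[float] = []
    for block in blocks:
        prices.extend(spmv(block, voxel_gradient))
    return prices
```

Under this assumed layout, the only data exchanged per iteration are the voxel gradient sent to each device and the per-block price vectors gathered back, which is why the large matrix never has to leave the device that holds it.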

The authors evaluate the approach on a representative H&N patient case and compare it with three single‑GPU strategies: (S1) truncating the DDC matrix to fit GPU memory, (S2) repeatedly transferring the DDC matrix between CPU and GPU, and (S3) off‑loading DDC‑related calculations to the CPU. S1 finishes 10 seconds sooner than the multi‑GPU solution but yields a plan of inferior quality. S2 and S3 achieve the same plan quality as the multi‑GPU method but require roughly 4 minutes and 6 minutes, respectively; S2 pays for the repeated CPU‑GPU transfers, and S3 for running the DDC‑related computations on the CPU. In contrast, the multi‑GPU implementation completes the optimization in approximately one minute without compromising plan quality.

To demonstrate robustness, the authors also test two additional H&N cases and three prostate cases, and report that high computational efficiency is consistently achieved for these five cases. Taken together with the first H&N case, the results indicate that distributing the DDC matrix across multiple GPUs removes the single‑GPU memory constraint while preserving the speed advantage of GPU acceleration and without sacrificing plan quality.

The study contributes a practical solution for large‑scale VMAT optimization: by coupling a column‑generation framework with a straightforward data‑parallel multi‑GPU scheme, it resolves the memory‑time trade‑off that has limited previous GPU‑based implementations. The authors suggest that further gains could be realized by leveraging high‑speed GPU interconnects (e.g., NVLink) or scaling to larger GPU clusters, opening the door to real‑time adaptive radiotherapy and batch processing of extensive patient cohorts.

📜 Original Paper Content