Partial wave analysis at BES III harnessing the power of GPUs
Partial wave analysis is a core tool in hadron spectroscopy. With the high statistics data available at facilities such as the Beijing Spectrometer III, this procedure becomes computationally very expensive. We have successfully implemented a framework for performing partial wave analysis on graphics processors. We discuss the implementation, the parallel computing frameworks employed and the performance achieved, with a focus on the recent transition to the OpenCL framework.
💡 Research Summary
The paper presents GPUPWA, a GPU‑accelerated framework for performing partial‑wave analyses (PWA) of high‑statistics data collected at the BES III experiment. PWA requires the evaluation of complex production and decay amplitudes for every event in each iteration of a fit, leading to a computational cost that scales with the number of events, the number of partial waves, and the number of fit iterations. Because each event’s amplitude calculation is independent, the problem is ideally suited to SIMD‑type parallelism offered by modern graphics processing units.
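The per-event independence can be made concrete with a minimal NumPy sketch (not code from the paper; the amplitude values and shapes below are hypothetical). Each event's intensity is the modulus squared of a coherent sum of partial-wave amplitudes weighted by complex fit couplings, and the log-likelihood is a reduction over events; the event axis is exactly what maps onto GPU threads.

```python
import numpy as np

rng = np.random.default_rng(0)

n_events, n_waves = 50_000, 4  # e.g. 50k events, 4 partial waves (illustrative)

# Hypothetical precomputed decay amplitudes A_w(event): one complex number per
# (event, wave). In a real PWA these come from a tensor formalism and depend
# only on the event kinematics, so they can be cached between fit iterations.
amps = rng.normal(size=(n_events, n_waves)) + 1j * rng.normal(size=(n_events, n_waves))

# Complex couplings are the fit parameters varied by the minimiser.
couplings = rng.normal(size=n_waves) + 1j * rng.normal(size=n_waves)

# Every event is independent: the coherent sum and modulus squared map
# one-to-one onto GPU work items (here, vectorized over the event axis).
intensity = np.abs(amps @ couplings) ** 2          # shape (n_events,)
neg_log_likelihood = -np.sum(np.log(intensity))    # reduction over events
```

The fit loop only changes `couplings`, so the expensive per-event amplitude table is computed once and the cheap coherent sum is re-evaluated at every iteration.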
The authors first implemented the framework using the AMD‑specific Brook+ library, which provides a high‑level C++ abstraction over GPU kernels. All heavy calculations (tensor‑based amplitude construction, likelihood evaluation, and Monte Carlo integration) are executed on the GPU in single‑precision floating point; only the final reduction (the sum over all events) is performed in double precision to preserve numerical accuracy. The CPU handles the minimisation loop via the ROOT‑based Minuit2/FUMILI interface, while data I/O and histogramming remain on the host.
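The mixed-precision strategy can be illustrated with a short sketch (again illustrative, not the paper's code): per-event terms are produced in single precision, as a GPU kernel would, and only the final sum over events is promoted to double precision.

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-event log-intensity terms as they might come back from a
# single-precision GPU kernel (hypothetical values for illustration).
terms = rng.uniform(0.1, 10.0, size=1_000_000).astype(np.float32)
log_terms = np.log(terms)  # still float32, as on the GPU

# Accumulating a million float32 values in float32 risks rounding error;
# promoting to float64 for the final reduction keeps the likelihood stable
# enough for the minimiser to compute reliable gradients.
naive_sum = np.sum(log_terms, dtype=np.float32)
stable_sum = np.sum(log_terms.astype(np.float64))
```

In practice the difference between the two sums is small per call, but a minimiser compares likelihood values across iterations, so consistent double-precision accumulation matters.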
Subsequently the code was ported to the vendor‑independent OpenCL standard. This transition removed several memory‑allocation restrictions present in Brook+, allowed the use of a more efficient OpenCL compiler, and yielded a further speed‑up of roughly 35 % on the same hardware. Performance tests were carried out on an Intel Core 2 Quad 2.4 GHz workstation equipped with 2 GB RAM and an AMD Radeon 4870 GPU (512 MB). On the benchmark channel J/ψ → γ K⁺K⁻, the GPU implementation was more than 150 times faster than the original single‑threaded Fortran code.
Despite the dramatic acceleration of the core numerical kernels, the overall wall‑clock time of a typical fit is still dominated by data loading (≈0.75 s for 50 k events) and plot generation (≈1.5 s), indicating that further optimisation should target the surrounding workflow rather than the GPU kernels themselves. The authors also discuss practical portability issues: although OpenCL promises vendor‑neutral code, differences in CPU‑side vector type definitions caused difficulties when attempting to run the framework on macOS (Snow Leopard), highlighting the need for a more robust abstraction layer.
A significant methodological limitation identified is the handling of complex fit parameters. Current minimisers (FUMILI, Minuit2) operate on real‑valued Cartesian parameters, which can lead to strong correlations and poor convergence when the physical parameters are naturally expressed in polar form. The paper calls for the development of a complex‑aware optimiser that correctly treats derivatives of complex amplitudes.
In conclusion, the work demonstrates that massive parallelism via GPUs can overcome the computational bottlenecks of PWA, delivering order‑of‑magnitude speed‑ups over multi‑threaded CPU implementations and two‑order‑of‑magnitude gains over legacy single‑threaded codes. By adopting OpenCL, the GPUPWA framework becomes portable across major GPU vendors, paving the way for broader adoption in high‑energy physics analyses and for future extensions such as multi‑GPU scaling, advanced caching strategies, and integration of more sophisticated physics models.