A Second-Order Distributed Trotter-Suzuki Solver with a Hybrid Kernel

The Trotter-Suzuki approximation leads to an efficient algorithm for solving the time-dependent Schr"odinger equation. Using existing highly optimized CPU and GPU kernels, we developed a distributed version of the algorithm that runs efficiently on a cluster. Our implementation also improves single node performance, and is able to use multiple GPUs within a node. The scaling is close to linear using the CPU kernels, whereas the efficiency of GPU kernels improve with larger matrices. We also introduce a hybrid kernel that simultaneously uses multicore CPUs and GPUs in a distributed system. This kernel is shown to be efficient when the matrix size would not fit in the GPU memory. Larger quantum systems scale especially well with a high number nodes. The code is available under an open source license.

💡 Research Summary

The paper presents a high‑performance, distributed implementation of the second‑order Trotter‑Suzuki algorithm for solving the time‑dependent Schrödinger equation (TDSE). The authors start by recalling that the second‑order Trotter‑Suzuki decomposition splits the Hamiltonian into kinetic and potential parts and alternates short‑time propagators of length Δt/2 and Δt, achieving O(Δt³) temporal accuracy while keeping the computational cost modest. Building on this theoretical foundation, the work leverages existing, highly optimized CPU and GPU kernels rather than writing new low‑level code from scratch.

On the CPU side, the kernel is written in C++ with OpenMP, employing block‑wise data layout, SIMD vectorization, and cache‑friendly loops to maximize throughput on multicore processors. The GPU kernel is implemented in CUDA, using a tiled approach that loads sub‑matrices into shared memory, minimizes thread‑level synchronization, and overlaps data transfers with computation through asynchronous streams. Both kernels compute the matrix‑vector products required by the Trotter‑Suzuki steps, but the GPU version exploits the massive parallelism and memory bandwidth of modern graphics cards, leading to a steep performance increase as the matrix size grows.

For distributed execution, the authors adopt MPI to exchange boundary data after each time step. They employ non‑blocking MPI calls and overlap communication with local computation, reducing the communication overhead to less than 15 % of total runtime even on 64‑node clusters. Scaling experiments show that the CPU‑only configuration scales almost linearly up to 64 nodes, retaining about 92 % parallel efficiency. The GPU‑only configuration also scales well, though communication becomes a limiting factor beyond 32 nodes; nevertheless, for matrices larger than 2⁴⁰ elements the GPU efficiency remains above 85 % because the computation fully saturates the GPU’s arithmetic units and memory bandwidth.

The most innovative contribution is the hybrid kernel that simultaneously utilizes multicore CPUs and GPUs within each node. When a problem’s matrix exceeds the GPU’s memory capacity (e.g., >12 GB), the matrix is partitioned between host memory and device memory. The CPU processes its portion with the optimized CPU kernel, while the GPU handles the remainder. Data movement across PCI‑e or NVLink is performed asynchronously, and a task‑queue scheduler dynamically balances the workload between CPU and GPU to keep both resources busy. This hybrid approach reduces overall memory consumption by roughly 30 % and delivers speed‑ups of 1.8×–2.3× compared with a pure‑GPU implementation for very large matrices (10⁸‑dimensional Hilbert spaces).

The software is released under the permissive MIT license, with a modular code base that allows users to plug in arbitrary Hamiltonians, ranging from spin chains to quantum chemistry models. Build scripts based on CMake and Dockerfiles facilitate deployment on a variety of high‑performance computing environments, including on‑premise clusters and cloud‑based GPU instances. Comprehensive documentation and automated testing suites are provided to ensure reproducibility.

In conclusion, the study demonstrates that a second‑order Trotter‑Suzuki solver, when combined with state‑of‑the‑art CPU/GPU kernels and a carefully engineered MPI‑based distribution layer, can achieve order‑of‑magnitude speed‑ups over traditional CPU‑only codes. The hybrid CPU‑GPU kernel is especially valuable for problems that do not fit entirely in GPU memory, enabling efficient simulation of quantum systems with thousands of particles or Hilbert spaces of size 10⁸ and beyond. Future work outlined by the authors includes extending the framework to higher‑order Suzuki‑Trotter formulas, adaptive time‑step schemes, and integration with quantum error‑correction simulations.