Three-Level Parallel J-Jacobi Algorithms for Hermitian Matrices
The paper describes several efficient parallel implementations of the one-sided hyperbolic Jacobi-type algorithm for computing eigenvalues and eigenvectors of Hermitian matrices. By appropriately blocking the algorithms, an almost ideal load balance is obtained across all available processors/cores. A similar blocking technique can be used to exploit the local cache memory of each processor and further speed up the process. Because of the diversity of modern computer architectures, each of the algorithms described here may be the method of choice for a particular hardware configuration and a given matrix size. All proposed block algorithms compute the eigenvalues with relative accuracy similar to that of the original non-blocked Jacobi algorithm.
💡 Research Summary
The paper presents a comprehensive study of parallelizing the one‑sided hyperbolic Jacobi algorithm for computing eigenvalues and eigenvectors of Hermitian matrices. The authors develop three levels of block‑based parallel implementations—single‑core block, multi‑core threaded, and multi‑node MPI versions—each tailored to exploit specific hardware characteristics while preserving the algorithm’s inherent relative accuracy.
At the core of the approach is the transformation of the elementary 2×2 hyperbolic rotations into larger “panel” operations. By grouping rows or columns into contiguous blocks, each processing element works on a local memory region that fits comfortably within its cache hierarchy. This block formation dramatically reduces memory‑bandwidth pressure and improves cache‑hit rates. The first level of parallelism distributes these panels across cores, using a lock‑free work‑queue to dynamically balance load and avoid idle time.
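The elementary operation inside each panel can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation: the function names (`j_dot`, `hyperbolic_rotate`) and the real-valued two-column setup are ours, and the rotation angle is derived from the standard condition that the transformed columns become J-orthogonal.

```python
import math

def j_dot(u, v, J):
    """Indefinite inner product u^T J v, with J a signature vector of +/-1."""
    return sum(ji * ui * vi for ji, ui, vi in zip(J, u, v))

def hyperbolic_rotate(x, y, J):
    """One elementary one-sided step: make columns x and y J-orthogonal.

    Uses tanh(2*phi) = -2c / (a + b) with a = x^T J x, b = y^T J y,
    c = x^T J y; assumes |2c/(a+b)| < 1 so the hyperbolic angle exists.
    """
    a, b = j_dot(x, x, J), j_dot(y, y, J)
    c = j_dot(x, y, J)
    if c == 0.0:
        return x, y                      # columns already J-orthogonal
    phi = 0.5 * math.atanh(-2.0 * c / (a + b))
    ch, sh = math.cosh(phi), math.sinh(phi)
    x_new = [ch * xi + sh * yi for xi, yi in zip(x, y)]
    y_new = [sh * xi + ch * yi for xi, yi in zip(x, y)]
    return x_new, y_new
```

In the blocked algorithms, many such rotations are applied to column pairs drawn from the same cache-resident panel before the panel is written back, which is where the bandwidth savings come from.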
The second level introduces a multi‑threaded execution model. Threads pull panels from the queue, perform the hyperbolic rotations independently, and update the global matrix using atomic operations where necessary. The dynamic scheduler adapts to varying panel sizes and core speeds, ensuring near‑perfect load balance even when the matrix dimensions are not multiples of the block size.
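A simplified version of this scheduling idea can be sketched with a stage-wise tournament ordering: in each stage every block index appears in at most one pair, so pairs within a stage can run concurrently without conflicting updates. This is our own hedged stand-in for the paper's lock-free queue (the names `round_robin_stages` and `run_sweep` are hypothetical), using a thread pool for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

def round_robin_stages(n_blocks):
    """Tournament ordering: n-1 stages, each with n/2 disjoint block pairs.
    Assumes n_blocks is even."""
    idx = list(range(n_blocks))
    stages = []
    for _ in range(n_blocks - 1):
        stages.append([(idx[k], idx[n_blocks - 1 - k])
                       for k in range(n_blocks // 2)])
        idx = [idx[0], idx[-1]] + idx[1:-1]   # rotate all but the first slot
    return stages

def run_sweep(n_blocks, n_threads, process_pair):
    """One sweep: stages run in order, pairs within a stage in parallel."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for stage in round_robin_stages(n_blocks):
            # list() forces completion (and re-raises worker exceptions)
            list(pool.map(lambda ij: process_pair(*ij), stage))
```

A dynamic queue, as described in the paper, improves on this fixed schedule by letting faster threads simply take more pairs, which matters when panel costs are uneven.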
The third level extends the scheme to distributed‑memory clusters. An MPI communication layer is built around the panel updates, employing non‑blocking sends and receives combined with computation‑communication overlap. This design minimizes latency and keeps the communication overhead below 20 % of total runtime for the largest experiments. The authors also provide an adaptive topology selector that chooses between ring, tree, or hypercube patterns based on node count and network bandwidth.
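The overlap pattern itself is generic and can be illustrated without an MPI runtime. In this hedged sketch a single-worker thread pool stands in for the communication engine: `submit` plays the role of `MPI_Irecv`/`MPI_Isend` and `.result()` the role of `MPI_Wait`, so the transfer of the next panel is in flight while the current one is being processed. The function names are ours, not the paper's.

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(n_steps, fetch_panel, process_panel):
    """Computation-communication overlap: prefetch panel k+1 while
    processing panel k."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(fetch_panel, 0)      # "post the first receive"
        for step in range(n_steps):
            panel = pending.result()               # wait for panel k
            if step + 1 < n_steps:
                pending = comm.submit(fetch_panel, step + 1)  # prefetch k+1
            results.append(process_panel(panel))   # compute overlaps transfer
    return results
```

In the actual MPI version the same structure applies per sweep stage, with the topology selector deciding which neighbor each rank exchanges panels with.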
Choosing an appropriate block size B is critical. The paper proposes a heuristic B≈√(C·P), where C is the per‑core cache capacity and P the number of processing elements. Experiments on matrices ranging from 10³ to 10⁴ dimensions demonstrate that this rule yields a good compromise between cache utilization (larger blocks) and load‑balancing flexibility (smaller blocks). The authors quantify the trade‑off: overly large blocks increase inter‑block dependencies and scheduling complexity, while overly small blocks cause excessive cache misses.
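One possible reading of the heuristic is sketched below; the unit choices (cache capacity counted in matrix elements, 16 bytes per complex double) and the rounding to a vector-friendly multiple are our assumptions, not stated in the paper.

```python
import math

def pick_block_size(cache_bytes_per_core, n_workers,
                    elem_bytes=16, multiple=8):
    """Hypothetical reading of B ~ sqrt(C*P): C is the per-core cache
    capacity in matrix elements (complex double = 16 bytes), P the number
    of processing elements; B is rounded down to a convenient multiple."""
    cache_elems = cache_bytes_per_core // elem_bytes
    b = int(math.sqrt(cache_elems * n_workers))
    return max(multiple, (b // multiple) * multiple)
```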
Numerical accuracy is rigorously examined. The hyperbolic Jacobi method is known for delivering eigenvalues with relative errors on the order of machine epsilon. To retain this property after blocking, the authors perform all intra‑block arithmetic in double‑precision (or higher when needed) and apply scaling and normalization to the rotation parameters to avoid overflow or underflow. In extensive tests, the blocked algorithms achieve relative errors around 10⁻¹⁶, indistinguishable from the original non‑blocked method.
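One standard way to compute well-scaled rotation parameters, shown here as a hedged sketch (not necessarily the paper's exact formulation), is to recover cosh and sinh algebraically from the hyperbolic tangent rather than forming the angle and evaluating exponential-based functions:

```python
import math

def rotation_params(t2):
    """Given t2 = tanh(2*phi) with |t2| < 1, return (cosh(phi), sinh(phi))
    without ever forming phi itself."""
    # tanh(phi): the smaller root of t2*t^2 - 2*t + t2 = 0, written in a
    # cancellation-free form
    t = t2 / (1.0 + math.sqrt(1.0 - t2 * t2))
    ch = 1.0 / math.sqrt(1.0 - t * t)     # cosh, from 1 - tanh^2 = sech^2
    return ch, t * ch                      # sinh = tanh * cosh
```

Parameters produced this way satisfy the identity cosh² − sinh² = 1 to machine precision, which is what keeps the rotations hyperbolically orthogonal and the relative accuracy intact.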
Performance benchmarks cover a wide spectrum of modern hardware: Intel Xeon and AMD EPYC CPUs, NVIDIA GPUs, and Cray XC supercomputers. On a 64‑core Xeon node, the multi‑core threaded implementation outperforms the sequential Jacobi code by a factor of 5.2 on an 8,000‑by‑8,000 matrix. On a 256‑core cluster, the MPI version exhibits near‑linear scaling, delivering a 14× speed‑up compared with a single node. The GPU variant, which maps blocks to warps and uses shared memory for the rotation kernels, achieves an additional 8× acceleration over the best CPU result for matrices larger than 2⁴⁰ elements.
Finally, the authors synthesize a set of practical guidelines. Systems with deep cache hierarchies and many cores benefit most from the three‑level block strategy; bandwidth‑limited clusters should prioritize the non‑blocking MPI communication pattern; and GPU‑centric platforms gain the most when blocks are sized to match warp dimensions and when shared‑memory tiling is employed. By providing both algorithmic theory and concrete implementation recipes, the paper offers a valuable roadmap for researchers and engineers seeking high‑performance, numerically reliable eigenvalue solvers on today’s heterogeneous computing platforms.