Solving Dense Generalized Eigenproblems on Multi-threaded Architectures
We compare two approaches to computing a portion of the spectrum of dense symmetric definite generalized eigenproblems: one based on reduction to tridiagonal form, and the other on Krylov-subspace iteration. Two large-scale applications, arising in molecular dynamics and materials science, are employed to investigate how the application, the architecture, and the parallelism of the method each contribute to the performance of the solvers. The experimental results on a state-of-the-art 8-core platform, equipped with a graphics processing unit (GPU), reveal that, in real applications, iterative Krylov-subspace methods can be competitive even for the solution of dense problems.
💡 Research Summary
The paper investigates two fundamentally different strategies for solving dense symmetric‑definite generalized eigenvalue problems (GSYEIG) of the form AX = BXΛ, where both A and B are dense n × n matrices and B is symmetric positive definite. After a standard reduction of the generalized problem to a standard eigenproblem CY = YΛ via a Cholesky factorization of B (B = UᵀU, with U upper triangular) and the similarity transformation C = U⁻ᵀAU⁻¹, the authors compare (i) a tridiagonal‑reduction approach and (ii) a Krylov‑subspace (Lanczos) iteration.
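This reduction can be sketched in a few lines of NumPy/SciPy. This is an illustrative sketch under the conventions above, not the paper's implementation; the function and variable names are ours:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def reduce_to_standard(A, B):
    """Map A x = lam B x to C y = lam y, with B = U^T U and C = U^{-T} A U^{-1}."""
    U = cholesky(B)                                # upper triangular: B = U^T U
    tmp = solve_triangular(U, A.T, trans='T').T    # tmp = A U^{-1}
    C = solve_triangular(U, tmp, trans='T')        # C = U^{-T} A U^{-1}
    return C, U

# Eigenvalues of (A, B) equal those of C; eigenvectors map back as x = U^{-1} y.
rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M + M.T                                        # dense symmetric A
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)                        # dense SPD B
C, U = reduce_to_standard(A, B)
w, Y = np.linalg.eigh(C)
X = solve_triangular(U, Y)                         # back-transform: X = U^{-1} Y
```

Note that C is formed with two triangular solves rather than explicit inverses, which is both cheaper and numerically safer.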
The tridiagonal‑reduction route is examined in two variants. The “direct” variant (td) reduces C to a symmetric tridiagonal matrix T using Householder reflectors, solves the resulting tridiagonal eigenproblem with the MR³ algorithm, and finally back‑transforms the eigenvectors. Its computational cost is dominated by the O(n³) reduction (≈ 4n³/3 FLOPs) plus O(ns) for extracting s eigenpairs; the memory overhead is minimal because the orthogonal matrix Q is never formed explicitly. The “two‑stage” variant (tt) first compresses C to a banded matrix W (≈ 4n³/3 FLOPs) and then reduces W to tridiagonal form using Level‑3 BLAS kernels. Although this improves the arithmetic intensity, it requires the explicit construction of the first orthogonal factor Q₁ (≈ 4n³/3 FLOPs) and additional O(n²) storage, raising the overall cost to roughly 8n³/3 FLOPs plus the eigenpair extraction.
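The direct (td) path can be mimicked with SciPy's LAPACK wrappers; for a symmetric matrix the Hessenberg form is tridiagonal, and the tridiagonal stage can be handed to eigh_tridiagonal (the paper's MR³ algorithm is available in LAPACK as the stemr driver). A hedged sketch, with names of our own choosing:

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

def td_smallest_eigenpairs(C, s):
    """Smallest s eigenpairs of symmetric C via reduction to tridiagonal form."""
    T, Q = hessenberg(C, calc_q=True)     # C = Q T Q^T; T is tridiagonal here
    d, e = np.diag(T), np.diag(T, -1)     # main and sub-diagonal of T
    # Tridiagonal eigensolver, extracting only eigenpair indices 0..s-1
    w, V = eigh_tridiagonal(d, e, select='i', select_range=(0, s - 1))
    return w, Q @ V                       # back-transform the eigenvectors

rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
C = M + M.T
w, V = td_smallest_eigenpairs(C, 3)       # three smallest eigenpairs
```

Extracting only s ≪ n eigenpairs keeps the tridiagonal stage and the back-transform at O(ns) and O(n²s), respectively, while the O(n³) Householder reduction dominates.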
The Krylov‑subspace approach builds an orthogonal basis of the Krylov subspace via the Lanczos three‑term recurrence. Two implementation paths are considered. In variant ke, the matrix C is explicitly formed once; each Lanczos step then requires a dense matrix‑vector product (2 n² FLOPs). Orthogonalization is performed twice per step to mitigate loss of orthogonality, incurring up to O(m n) extra work, where m is the Krylov dimension (typically m ≪ n). Restarts involve QR iterations on the small tridiagonal matrix, costing O(n m²) FLOPs per restart. In variant ki, C is never built; each iteration computes zₖ₊₁ = U⁻ᵀ A U⁻¹ wₖ by solving two triangular systems and a matrix‑vector product, doubling the per‑step cost to 4 n² FLOPs but eliminating the O(n³) upfront cost of forming C.
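A minimal sketch of the ki variant (our own illustration, not the authors' code): C is applied implicitly via two triangular solves plus one dense matrix-vector product per Lanczos step, and orthogonalization against the basis is performed twice, standing in for the paper's reorthogonalization scheme (restarts are omitted for brevity):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, eigh_tridiagonal

def lanczos_ki(A, B, m, seed=None):
    """m-step Lanczos for A x = lam B x without ever forming C explicitly."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    U = cholesky(B)                               # B = U^T U, U upper triangular

    def apply_C(w):                               # z = U^{-T} A U^{-1} w:
        t = solve_triangular(U, w)                # one triangular solve,
        return solve_triangular(U, A @ t, trans='T')  # matvec, second solve

    V = np.zeros((n, m + 1))                      # orthonormal Lanczos basis
    alpha, beta = np.zeros(m), np.zeros(m)
    v = rng.standard_normal(n)
    V[:, 0] = v / np.linalg.norm(v)
    for k in range(m):
        z = apply_C(V[:, k])                      # 4n^2 FLOPs per step
        alpha[k] = V[:, k] @ z
        z -= V[:, :k + 1] @ (V[:, :k + 1].T @ z)  # orthogonalize ...
        z -= V[:, :k + 1] @ (V[:, :k + 1].T @ z)  # ... twice
        beta[k] = np.linalg.norm(z)
        V[:, k + 1] = z / beta[k]
    # Ritz values from the small m x m tridiagonal matrix T_m
    theta = eigh_tridiagonal(alpha, beta[:m - 1], eigvals_only=True)
    return theta, V[:, :m], U

# Toy usage: with m = n, the Ritz values reproduce the full generalized
# spectrum (illustration only; in practice m << n and only extreme
# eigenvalues are converged).
rng = np.random.default_rng(2)
n = 8
Ma = rng.standard_normal((n, n))
A = Ma + Ma.T
Mb = rng.standard_normal((n, n))
B = Mb @ Mb.T + n * np.eye(n)
theta, _, _ = lanczos_ki(A, B, n, seed=3)
```

Swapping `apply_C` for a precomputed dense C and a single `C @ w` gives the ke variant: half the per-step cost, at the price of the O(n³) upfront construction of C.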
The experimental study uses two real‑world applications that generate dense GSYEIGs. The first is a normal‑mode analysis (NMA) of a biomolecular system in internal coordinates, with n ≈ 10 000 and only about 1 % of the smallest eigenpairs required. The second is an ab‑initio density‑functional‑theory (DFT) simulation, where n ranges from 30 000 to 100 000 and a few low‑energy eigenpairs are needed for self‑consistent field iterations. Both problems feature dense A and B, and B is SPD.
All tests run on a hybrid workstation equipped with an 8‑core Intel Xeon CPU and an NVIDIA “Fermi” GPU. The authors employ a range of linear‑algebra libraries: classic LAPACK/BLAS, multi‑threaded PLASMA and libflame, and GPU‑aware MAGMA.
Results show that on the CPU‑only configuration, the direct tridiagonal‑reduction (td) is competitive because its Level‑3 BLAS kernels are well‑optimized. However, when the GPU is engaged, the tridiagonal reduction suffers from substantial data‑movement overhead and from the fact that a large fraction of its work (the reduction to tridiagonal form) still relies on Level‑2 BLAS operations, limiting speed‑up. In contrast, the Krylov‑subspace methods map naturally onto the GPU: the dominant operation is the dense matrix‑vector product, which achieves high throughput on the device. Consequently, the Krylov variants (both ke and ki) run 2–3× faster than the tridiagonal approaches for the larger problem sizes, especially when the number of desired eigenpairs s is small. The initial O(n³) transformation cost (Cholesky factorization and, in the case of ke, explicit construction of C) remains a fixed overhead, but because s ≪ n, the iterative Krylov approach amortizes this cost more efficiently than the full reduction to tridiagonal form.
The authors conclude that, contrary to the long‑standing belief that dense eigenproblems should be solved via direct tridiagonal reduction, Krylov‑subspace iterations become the method of choice on modern multi‑threaded and GPU‑accelerated architectures for applications requiring only a modest subset of eigenpairs. The decision between the two families should consider the number of eigenpairs, the target hardware’s memory bandwidth and compute balance, and the relative cost of data movement. Future work is suggested on block‑Lanczos schemes, shift‑and‑invert spectral transformations, and more sophisticated GPU task‑scheduling to further close the performance gap and extend the applicability to larger s or to problems with tighter convergence requirements.