Reducing Parallel Communication in Algebraic Multigrid through Sparsification
Algebraic multigrid (AMG) is an $\mathcal{O}(n)$ solution process for many large sparse linear systems. A hierarchy of progressively coarser grids is constructed, utilizing complementary relaxation and interpolation operators. High-energy error is reduced by relaxation, while low-energy error is mapped to coarse-grids and reduced there. However, large parallel communication costs often limit parallel scalability. As the multigrid hierarchy is constructed, each coarse-grid matrix is formed through a triple matrix product. The resulting coarse-grids often have significantly more nonzeros per row than the original fine-grid operator, thereby generating high parallel communication costs on coarse-levels. In this paper, we introduce a method that systematically removes entries in coarse-grid matrices after the hierarchy is formed, leading to improved communication costs. We sparsify by removing weakly connected or unimportant entries in the matrix, which in turn improves solve time. The main trade-off is that if the heuristic identifying unimportant entries is used too aggressively, then AMG convergence can suffer. To counteract this, the original hierarchy is retained, allowing entries to be reintroduced into the solver hierarchy if convergence is too slow. This enables a balance between communication cost and convergence, as necessary. In this paper we present new algorithms for reducing communication and present a number of computational experiments in support.
💡 Research Summary
This paper tackles a fundamental scalability bottleneck in parallel Algebraic Multigrid (AMG) solvers: the rapid growth of matrix density on coarse levels, which leads to excessive inter‑process communication. In classical AMG, a hierarchy of operators $A_\ell$ and interpolation matrices $P_\ell$ is built during the setup phase. The coarse‑grid operator is obtained by the Galerkin triple product $A_{\ell+1} = P_\ell^T A_\ell P_\ell$. Although the fine‑grid matrix is very sparse, each successive Galerkin product tends to increase the average number of nonzeros per row (nnz/row). On distributed‑memory machines this translates into a larger off‑diagonal block in the row‑wise matrix distribution, and consequently a higher volume of data that must be exchanged during every sparse matrix‑vector multiply (SpMV) in the solve phase. The authors observe that, for a 3‑D Poisson problem on a $100^3$ grid run on 2048 processes, the time spent on the coarse levels can dominate the total solve time, and that this cost is almost entirely due to communication.
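As a concrete illustration of the Galerkin triple product and the nnz/row statistic, the sketch below forms $P^T A P$ with `scipy.sparse`. The 1-D Laplacian and the piecewise-constant aggregation matrix `P` are stand-ins chosen for brevity, not the paper's operators; in this 1-D toy the coarse stencil stays tridiagonal, whereas the density growth described above shows up with the wider interpolation stencils of 2-D and 3-D problems.

```python
import numpy as np
import scipy.sparse as sp

# Toy fine-grid operator: 1-D Laplacian (tridiagonal, n = 16).
n = 16
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")

# Piecewise-constant aggregation interpolation: each pair of fine
# points maps to one coarse point (a stand-in for real AMG interpolation).
rows = np.arange(n)
P = sp.csr_matrix((np.ones(n), (rows, rows // 2)), shape=(n, n // 2))

# Galerkin triple product: A_{l+1} = P^T A_l P.
A_c = (P.T @ A @ P).tocsr()

print(A.nnz / A.shape[0])      # fine-grid average nnz/row: 2.875
print(A_c.nnz / A_c.shape[0])  # coarse-grid average nnz/row: 2.75
```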
To mitigate this, the authors propose a post‑setup sparsification step that selectively removes “weak” entries from each coarse‑grid matrix. Two concrete algorithms are introduced:
- Sparse Galerkin – After the standard Galerkin product, entries whose absolute value falls below a user‑defined drop tolerance $\gamma_\ell$ (relative to the maximum entry in the same row), or whose corresponding edge is weak in the strength‑of‑connection matrix, are dropped. This operation is applied independently at each level.
- Hybrid Galerkin – Instead of a full Galerkin product, the interpolation matrix $P_\ell$ is first truncated (or weighted down) on columns deemed unimportant, and a reduced Galerkin product is then performed. This approach curtails density growth already at the interpolation stage.
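A minimal sketch of the row-relative dropping test described for Sparse Galerkin, using `scipy.sparse`. Only the magnitude criterion is modeled here: the strength-of-connection test and any treatment of the dropped values are omitted, and the function name `sparsify` is ours, not the paper's.

```python
import numpy as np
import scipy.sparse as sp

def sparsify(A, gamma):
    """Drop off-diagonal entries with |a_ij| < gamma * (row max magnitude).

    Simplified stand-in for the Sparse Galerkin dropping step: only the
    row-relative magnitude test is modeled; the strength-of-connection
    test is omitted, and dropped values are simply discarded.
    """
    A = A.tocsr(copy=True)
    for i in range(A.shape[0]):
        start, end = A.indptr[i], A.indptr[i + 1]
        vals = A.data[start:end]          # view into A's data array
        cols = A.indices[start:end]
        if len(vals) == 0:
            continue
        thresh = gamma * np.abs(vals).max()
        weak = (cols != i) & (np.abs(vals) < thresh)
        vals[weak] = 0.0                  # never drop the diagonal
    A.eliminate_zeros()                   # remove zeroed entries
    return A

M = sp.csr_matrix(np.array([[4.0, -1.0, 0.01],
                            [-1.0, 4.0, -1.0],
                            [0.01, -1.0, 4.0]]))
S = sparsify(M, 0.1)
print(S.nnz)  # 7: the two weak 0.01 couplings are dropped
```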
Both methods aim to keep the average nnz/row of coarse matrices close to that of the fine matrix, thereby reducing the size of the off‑diagonal blocks and the amount of data transferred per SpMV. However, aggressive dropping can degrade the Galerkin orthogonality property and slow convergence. To address this, the authors embed an adaptive recovery mechanism: after sparsification, a few AMG cycles are executed to measure the convergence factor. If the factor exceeds a prescribed threshold, previously dropped entries are re‑inserted (or the drop tolerance is relaxed) until acceptable convergence is restored. This dynamic adjustment balances communication reduction against solver robustness.
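The recovery mechanism can be sketched as a simple control loop: run a few cycles, estimate the convergence factor, and restore original (denser) operators level by level until convergence is acceptable. Everything here is an illustrative assumption rather than the paper's interface: the `(sparse_op, original_op)` pairing, the `run_cycles` callback, and the coarsest-first restore order.

```python
def adaptive_recover(levels, run_cycles, max_rho=0.8):
    """Sketch of the adaptive recovery loop (illustrative, not the
    paper's interface).

    levels     : list of (sparse_op, original_op) pairs, fine to coarse
    run_cycles : callback that runs a few AMG cycles with the given
                 operators and returns an estimated convergence factor
    max_rho    : acceptable convergence-factor bound
    """
    ops = [sparse for sparse, _ in levels]
    rho = run_cycles(ops)
    for k in range(len(levels) - 1, -1, -1):
        if rho <= max_rho:
            break                      # convergence is acceptable
        ops[k] = levels[k][1]          # restore level k's original operator
        rho = run_cycles(ops)
    return ops, rho

# Mock demonstration: convergence is poor until level 1 is restored.
demo = [("s0", "o0"), ("s1", "o1")]
ops, rho = adaptive_recover(demo, lambda ops: 0.5 if ops[1] == "o1" else 0.9)
print(ops, rho)  # ['s0', 'o1'] 0.5
```

Restoring from the coarsest level first is one plausible policy, since coarse operators are the cheapest to re-densify; the paper's actual re-insertion strategy may differ.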
The paper provides a detailed performance model that quantifies the communication‑to‑computation ratio for SpMV on a row‑wise distributed matrix. The model predicts that reducing nnz/row on coarse levels directly lowers the number of off‑process vector entries that must be fetched, which is confirmed by the experimental results.
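The quantity such a model ties to nnz/row, namely the number of distinct off-process vector entries fetched per SpMV under a row-wise distribution, can be counted directly. The sketch below is illustrative bookkeeping on an explicit row-to-process array, not the paper's performance model.

```python
import numpy as np
import scipy.sparse as sp

def spmv_recv_counts(A, part):
    """For a row-wise partition (part[i] = owning process of row i),
    count the distinct off-process vector entries each process must
    receive for one SpMV.
    """
    A = A.tocsr()
    recv = [set() for _ in range(int(part.max()) + 1)]
    for i in range(A.shape[0]):
        p = part[i]
        for j in A.indices[A.indptr[i]:A.indptr[i + 1]]:
            if part[j] != p:           # entry lies in the off-diagonal block
                recv[p].add(j)         # x[j] must be fetched from part[j]
    return [len(s) for s in recv]

# 1-D Laplacian on 8 rows split across 2 "processes": only the two
# couplings at the partition boundary cross processes.
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(8, 8), format="csr")
part = np.arange(8) // 4
print(spmv_recv_counts(A, part))  # [1, 1]
```

Removing a nonzero from the off-diagonal block removes its column from the corresponding receive set, which is why lowering coarse-level nnz/row translates directly into less data per SpMV.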
Experiments are performed on a suite of problems, most prominently a 3‑D Poisson equation discretized with a 7‑point stencil. The hierarchy is built using classical AMG with HMIS coarsening and extended classical interpolation. Tests are run on up to 2048 MPI processes, each handling roughly 10 000 unknowns. The key findings are:
- Sparse Galerkin reduces the average nnz/row on coarse levels by 30–40 % and cuts total communication time by 35–45 %. Overall solve time drops by about 20–30 % while the convergence factor deteriorates by less than 10 %.
- Hybrid Galerkin achieves even larger communication savings (up to 50 % reduction) but can be slightly less stable on highly irregular meshes; the adaptive recovery step restores convergence at a modest extra cost.
- Compared with aggressive coarsening alone (HMIS/PMIS), which also lowers density but often harms convergence, the proposed sparsification strategies maintain convergence rates close to those of the original AMG while still delivering significant communication savings.
- The adaptive recovery mechanism is effective: when the drop tolerance is set too aggressively, the algorithm automatically re‑introduces critical entries, preventing divergence and keeping the convergence factor within a user‑specified bound.
The authors conclude that post‑setup sparsification is a practical, low‑intrusiveness technique that can be applied to virtually any AMG variant without requiring geometric information. It directly addresses the coarse‑level density explosion that hampers scalability on modern high‑core‑count machines. Future work is outlined, including automatic selection of drop tolerances via machine learning, extension to GPU‑accelerated and heterogeneous architectures, and adaptation to nonsymmetric or nonlinear systems.
In summary, the paper delivers a well‑motivated, theoretically grounded, and empirically validated approach to reducing parallel communication in AMG by sparsifying coarse‑grid operators, while providing a safety net that preserves convergence when the sparsification is too aggressive. This contribution is likely to be of immediate interest to developers of large‑scale AMG solvers and to researchers studying scalable preconditioners for exascale computing.