A domain decomposing parallel sparse linear system solver
The solution of large sparse linear systems is often the most time-consuming part of many science and engineering applications. Computational fluid dynamics, circuit simulation, power network analysis
The solution of large sparse linear systems is often the most time-consuming part of many science and engineering applications. Computational fluid dynamics, circuit simulation, power network analysis, and material science are just a few examples of the application areas in which large sparse linear systems need to be solved effectively. In this paper we introduce a new parallel hybrid sparse linear system solver for distributed memory architectures that contains both direct and iterative components. We show that by using our solver one can alleviate the drawbacks of direct and iterative solvers, achieving better scalability than with direct solvers and more robustness than with classical preconditioned iterative solvers. Comparisons to well-known direct and iterative solvers on a parallel architecture are provided.
💡 Research Summary
The paper addresses the long‑standing challenge of solving very large sparse linear systems that arise in many scientific and engineering domains such as computational fluid dynamics, circuit simulation, power‑grid analysis, and materials modeling. Traditional direct solvers (e.g., LU or Cholesky factorization) provide robust, accurate solutions but suffer from prohibitive memory consumption and cubic computational complexity, which limits their scalability on distributed‑memory clusters. Conversely, iterative Krylov‑subspace methods with incomplete‑factorization preconditioners are memory‑efficient and scale well, yet their convergence can be erratic and highly dependent on the quality of the preconditioner. To combine the strengths of both approaches, the authors propose a hybrid parallel solver that integrates domain decomposition, local direct factorizations, and a global iterative scheme based on a Schur‑complement preconditioner.
The algorithm proceeds in three major phases. First, the global sparse matrix is partitioned into a set of subdomains using graph‑partitioning tools (e.g., METIS) that aim to balance the number of unknowns per subdomain while minimizing the size of the interface (boundary) set. This partitioning defines a natural data distribution for a distributed‑memory environment and reduces inter‑process communication. Second, each subdomain independently performs a direct factorization of its interior matrix (LU for nonsymmetric systems, Cholesky for symmetric positive‑definite systems). Because the subdomains are much smaller than the original problem, the factorization cost scales as O((N/p)³) rather than O(N³), and the memory footprint is reduced accordingly. The factorized subdomains also provide exact local solves that will be used later as a powerful preconditioner.
Third, the global coupling across subdomains is expressed through the Schur complement on the interface degrees of freedom. Rather than forming the full Schur matrix explicitly, the algorithm evaluates its action on a vector by invoking the already‑computed local direct solves. This matrix‑vector product is then embedded in a Krylov‑subspace iterative method such as GMRES or BiCGSTAB. The preconditioner consists of the block‑diagonal operator formed by the local direct solves, which dramatically improves the spectral properties of the Schur system and reduces the number of outer iterations.
Communication is limited to two patterns. Boundary data exchange between neighboring subdomains is performed with point‑to‑point non‑blocking messages and involves only a small amount of data. Global reductions (inner products, norms) required by the Krylov method are carried out with collective operations (MPI_Allreduce). This design keeps communication overhead low even when scaling to thousands of processes.
The authors evaluate the solver on a suite of benchmark matrices drawn from real‑world applications, comparing against well‑established parallel direct solvers (MUMPS, SuperLU_DIST) and standard preconditioned iterative solvers (PETSc with ILU, Trilinos). Strong‑scaling experiments ranging from 64 to 4096 MPI ranks demonstrate that the hybrid solver maintains near‑linear speedup, whereas pure direct solvers quickly run out of memory and exhibit sub‑linear scaling. In terms of convergence, the hybrid method achieves residuals on the order of 10⁻⁸ with 2–5 times fewer iterations than the best ILU‑preconditioned runs, translating into overall solution times that are 2–3 times faster than the iterative baseline. Memory consumption is reduced by 30–50 % compared with the direct solvers, confirming the efficiency of the domain‑decomposition strategy.
The discussion highlights several important insights. The quality of the graph partition directly influences load balance and communication volume; adaptive or multilevel partitioning may further improve performance for highly irregular problems. The choice of subdomain size is a trade‑off: larger subdomains reduce the size of the Schur system but increase local factorization cost and memory usage. The authors also note that the current implementation assumes static partitions; dynamic load‑balancing could be explored for problems with evolving sparsity patterns. Future work includes extending the framework to heterogeneous architectures (GPU‑accelerated local factorizations), integrating multilevel preconditioners, and investigating fault‑tolerant variants for exascale systems.
In conclusion, the paper presents a compelling hybrid parallel algorithm that successfully merges the numerical robustness of direct methods with the scalability of iterative solvers. By exploiting domain decomposition and a Schur‑complement preconditioner built from exact local solves, the approach overcomes the memory and scalability bottlenecks of traditional direct solvers while delivering far more reliable convergence than conventional preconditioned Krylov methods. The extensive experimental results validate the theoretical advantages and suggest that this solver could become a key component in next‑generation large‑scale scientific simulations.
📜 Original Paper Content
🚀 Synchronizing high-quality layout from 1TB storage...