Title: HYLU: Hybrid Parallel Sparse LU Factorization
ArXiv ID: 2509.07690
Date: 2025-09-09
Authors: Researchers from original ArXiv paper
📝 Abstract
This article introduces HYLU, a hybrid parallel LU factorization-based general-purpose solver designed for efficiently solving sparse linear systems (Ax=b) on multi-core shared-memory architectures. The key technical feature of HYLU is the integration of hybrid numerical kernels so that it can adapt to various sparsity patterns of coefficient matrices. Tests on 34 sparse matrices from SuiteSparse Matrix Collection reveal that HYLU outperforms Intel MKL PARDISO in the numerical factorization phase by geometric means of 2.04X (for one-time solving) and 2.58X (for repeated solving). HYLU can be downloaded from https://github.com/chenxm1986/hylu.
💡 Deep Analysis
Deep Dive into HYLU: Hybrid Parallel Sparse LU Factorization.
📄 Full Content
Solving sparse linear systems is a core task in scientific computing. Methods for solving sparse linear systems fall into two categories: direct methods and iterative methods. Direct methods are generally suitable for small- to medium-scale sparse linear systems. Large-scale sparse linear systems are mostly solved by preconditioned iterative methods, as the time and memory overheads of direct methods become prohibitive.
The high performance of traditional sparse direct linear solvers relies heavily on gathering structured nonzero elements into dense blocks and computing them with level-3 basic linear algebra subroutines (BLAS), especially the general matrix multiplication (GEMM) routine. Popular implementations include supernode- and multifrontal-based algorithms, on which software packages such as UMFPACK [1], SuperLU [2], and PARDISO [3] are built.
However, for a large portion of practical sparse linear systems that are highly sparse (e.g., linear systems from circuit simulation and power network simulation), such level-3 BLAS-based methods are inefficient. Instead, dedicated solvers have been developed for circuit simulation problems, including KLU [4], NICSLU [5], CKTSO [6], etc. They generally perform well on circuit matrices but may not efficiently solve linear systems from other areas. Designing a general-purpose sparse direct solver that can efficiently handle linear systems with a variety of sparsity patterns remains a challenge. This article introduces HYLU, a hybrid parallel LU factorization-based general-purpose solver designed for efficiently solving sparse linear systems (Ax=b) on multi-core shared-memory architectures. To efficiently solve linear systems with various sparsities, HYLU integrates multiple numerical kernels and incorporates a smart kernel selection strategy based on the matrix sparsity, such that it can adapt to various sparse linear systems. Tests on 34 sparse matrices from the SuiteSparse Matrix Collection [7] reveal that HYLU outperforms Intel MKL PARDISO in the numerical factorization phase by geometric means of 1.74X (for one-time solving) and 2.26X (for repeated solving).
Completely solving a sparse linear system by LU factorization involves three main phases: preprocessing, numerical factorization, and forward/backward substitution. This section briefly introduces these three phases in HYLU.
The preprocessing phase of HYLU includes three steps: static pivoting, ordering for fill-in reduction, and symbolic factorization. HYLU adopts the maximum weighted matching algorithm [8] for static pivoting. It permutes large elements to the diagonal and also scales the matrix values, so that the diagonal elements are 1 or -1 and the off-diagonal elements are bounded in [-1, 1]. After static pivoting, matrix reordering is performed to minimize the fill-ins generated during LU factorization. HYLU adopts the approximate minimum degree (AMD) algorithm [9], a modified implementation of AMD [10], and a modified nested dissection algorithm based on METIS [11] for reordering. Finally, symbolic factorization is performed to determine the symbolic structure of the LU factors, which does not change during numerical factorization. The number of floating-point operations is counted during symbolic factorization, and supernodes are also detected. HYLU selects the numerical kernel based on these statistics and other information.
The sparse left-looking algorithm [12] is widely adopted by sparse direct solvers. Because HYLU stores matrices in row-major order by default, it uses the transposed version of the left-looking algorithm, which can be called an up-looking algorithm. The ordinary left-looking algorithm, adopted by KLU, is only suitable for extremely sparse matrices, such as those from circuit simulation problems. The concept of a supernode [2] is widely used to exploit regularity within irregular sparse matrices during LU factorization. By gathering nonzero elements into dense sub-blocks, BLAS routines can be employed to accelerate sparse LU factorization. In HYLU, a supernode is defined as a set of consecutive rows that have identical structure in U.
To make HYLU a general-purpose sparse linear solver that can efficiently solve a wide range of sparse linear systems, HYLU integrates multiple up-looking numerical kernels: a) a row-row kernel, b) a sup-row kernel, and c) a sup-sup kernel, as illustrated in Fig. 1. The row-row kernel is the ordinary up-looking algorithm; no BLAS functions are needed, and the numerical computation is implemented in plain C code. The sup-row kernel still updates one row at a time but uses supernodes as source data, so level-2 BLAS can be called to accelerate sup-row updates. In the sup-sup kernel, HYLU uses one supernode to update another, employing level-3 BLAS to accelerate sup-sup updates. After all updates from previous supernodes and rows are completed, an internal factorization is performed to factorize the supernode itself.