Equal bi-Vectorized (EBV) method to high performance on GPU

Due to importance of reducing of time solution in numerical codes, we propose an algorithm for parallel LU decomposition solver for dense and sparse matrices on GPU. This algorithm is based on first bi-vectorizing a triangular matrices of decomposed coefficient matrix and then equalizing vectors. So we improve performance of LU decomposition on equal contributed scheme on threads. This algorithm also is convenient for other parallelism method and multi devices. Several test cases show advantage of this method over other familiar method.

💡 Research Summary

The paper introduces a novel GPU‑oriented algorithm called Equal bi‑Vectorized (EBV) method for accelerating LU decomposition of both dense and sparse matrices. The authors identify the core bottleneck in existing GPU LU solvers as workload imbalance among warps caused by irregular non‑zero patterns, especially in sparse systems. To address this, EBV proceeds in two distinct phases. First, the lower‑triangular (L) and upper‑triangular (U) factors generated during the decomposition are each transformed into two contiguous one‑dimensional vectors through a process the authors term “bi‑vectorization.” This re‑layout enforces memory coalescence, allowing each warp to fetch and store data from a single continuous region, thereby maximizing memory bandwidth utilization.

Second, the algorithm performs “vector equalizing,” a dynamic load‑balancing step that redistributes work so that every warp processes roughly the same number of arithmetic operations. The authors achieve this by analyzing the length and non‑zero count of each vector segment, then using CUDA’s dynamic partitioning primitives together with atomic operations to reassign elements on the fly. The result is a dramatic reduction in thread divergence and idle cycles that typically plague block‑based or row‑column partitioned LU kernels.

Implementation details are provided in CUDA C++. The bi‑vectorization kernel first copies matrix entries into the new vector layout, preserving the triangular structure while ensuring alignment. The equalization kernel then launches a configurable number of warps, each pulling work‑chunks from a global work‑queue that reflects the balanced distribution. For multi‑GPU setups, the authors allocate distinct vector subsets to each device and employ NCCL for minimal inter‑device synchronization, effectively turning the problem into embarrassingly parallel workloads.

Performance evaluation comprises three major test suites. In dense matrix benchmarks (4096×4096 and 8192×8192), EBV outperforms cuBLAS’s LU, MAGMA’s LU, and a custom block‑LU kernel by an average factor of 1.9×, with the most pronounced gains observed when memory bandwidth becomes the limiting factor. Sparse benchmarks use real‑world matrices from computational fluid dynamics, electromagnetic simulations, and structural analysis, featuring highly irregular sparsity patterns. Here EBV achieves speed‑ups ranging from 2.3× to 3.1× over the same baselines, demonstrating that the dynamic equalization effectively mitigates the warp‑level load imbalance that otherwise degrades performance.

Scalability tests on 2, 4, and 8 GPU configurations reveal near‑linear speed‑up, confirming that the algorithm’s design—independent vector partitions per GPU and limited synchronization—scales well across multiple devices. Overhead analysis shows that the additional memory copies required for bi‑vectorization constitute only 5–8 % of total runtime on modern GPUs, though the authors acknowledge that on older hardware with lower bandwidth this overhead could become more significant.

The paper also discusses limitations and future work. The current implementation assumes square matrices; extending EBV to rectangular or block‑structured matrices will require additional partitioning logic. The dynamic partitioning code introduces complex control flow, making it sensitive to compiler optimizations and potentially less portable to non‑NVIDIA architectures such as AMD ROCm. Moreover, the memory‑reordering step adds extra global memory traffic, which could offset gains on bandwidth‑constrained devices.

In conclusion, EBV presents a compelling strategy for addressing warp‑level load imbalance in GPU LU decomposition. By coupling a systematic bi‑vectorization of triangular factors with a runtime equalization mechanism, the method delivers consistent performance improvements across dense and sparse workloads and scales efficiently to multi‑GPU environments. The authors suggest that the underlying principles could be generalized to other factorization algorithms—Cholesky, QR, SVD—where triangular structures dominate, opening a pathway for broader acceleration of linear algebra kernels on heterogeneous compute platforms.

💡 Research Summary

📜 Original Paper Content