Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments


Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. Here we show that SpGEMM also yields efficient algorithms for general sparse-matrix indexing in distributed memory, provided that the underlying SpGEMM implementation is sufficiently flexible and scalable. We demonstrate that our parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case. This algorithm is the first to yield increasing speedup on an unbounded number of processors; our experiments show scaling up to thousands of processors in a variety of test scenarios.


💡 Research Summary

The paper addresses the challenge of scaling sparse matrix‑matrix multiplication (SpGEMM) and related sparse matrix operations to thousands of processors in distributed‑memory environments. The authors observe that traditional one‑dimensional (1D) data distributions suffer from prohibitive communication costs and memory overhead because local submatrices, stored in conventional CSC/CSR formats, become hypersparse—most rows or columns are empty—yet the data structures still allocate space for every column or row. To overcome these limitations, the authors propose two key innovations.

First, they adopt a two‑dimensional (2D) processor grid (p = p_r × p_c) and introduce a hypersparse storage format called DCSC (doubly‑compressed sparse column). DCSC stores only the non‑empty columns together with their column indices, eliminating the length‑n column‑pointer array of CSC, whose entries merely repeat across runs of empty columns. This reduces the memory footprint of a distributed matrix from O(n√p + nnz) to O(nnz) while still enabling fast column indexing via an auxiliary array built in linear time.
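To make the storage contrast concrete, here is a minimal Python sketch (not the authors' C++ implementation) contrasting CSC and DCSC for a hypersparse block. The array names CP, JC, IR, and NUM follow common compressed‑column conventions; the matrix values are illustrative.

```python
def to_csc(n, triples):
    """CSC: the column-pointer array CP has n+1 entries even if
    almost every column is empty."""
    nz = sorted(triples, key=lambda t: (t[1], t[0]))  # sort by (col, row)
    CP = [0] * (n + 1)
    IR, NUM = [], []
    for i, j, v in nz:
        CP[j + 1] += 1
        IR.append(i)
        NUM.append(v)
    for j in range(n):                 # prefix-sum the per-column counts
        CP[j + 1] += CP[j]
    return CP, IR, NUM

def to_dcsc(n, triples):
    """DCSC: store only the nzc non-empty columns (their indices in JC)
    with nzc+1 pointers in CP -- space independent of n."""
    nz = sorted(triples, key=lambda t: (t[1], t[0]))
    JC, CP, IR, NUM = [], [0], [], []
    for i, j, v in nz:
        if not JC or JC[-1] != j:      # first nonzero of a new column
            JC.append(j)
            CP.append(CP[-1])
        CP[-1] += 1
        IR.append(i)
        NUM.append(v)
    return JC, CP, IR, NUM

# A block with a million columns but only 3 nonzeros:
n = 1_000_000
triples = [(5, 2, 1.0), (7, 2, 2.0), (0, 999_999, 3.0)]
CP_csc, _, _ = to_csc(n, triples)           # len(CP_csc) == 1_000_001
JC, CP_dcsc, IR, NUM = to_dcsc(n, triples)  # JC == [2, 999_999]
```

The point of the sketch is the asymmetry in pointer storage: CSC pays O(n) per block regardless of sparsity, while DCSC pays only for the two columns that actually hold nonzeros.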

Second, they develop a parallel SpGEMM algorithm named Sparse SUMMA, which mirrors the dense SUMMA algorithm but operates on DCSC blocks. A blocking parameter b, chosen to evenly divide the locally stored pieces of the common dimension, determines the granularity of the broadcasts. In each stage, the owners of the current column block of A broadcast it along their processor row, the owners of the current row block of B broadcast it along their processor column, and every processor locally multiplies the received blocks using HyperSparseGEMM. HyperSparseGEMM is an outer‑product formulation that merges partial products on the fly with a priority queue, achieving a time complexity of O(nzc(A) + nzr(B) + flops·log ni), where ni is the number of indices with both a non‑empty column of A and a non‑empty row of B, and memory usage of O(nnz(A) + nnz(B) + nnz(C)), independent of the matrix dimensions.
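The outer‑product idea behind the local kernel can be sketched in a few lines of Python. This is a deliberate simplification, not the paper's kernel: the real HyperSparseGEMM streams one entry per active outer product through the heap, whereas this sketch pushes every partial product and merges duplicates on pop. What it does preserve is the key property that only indices k with both a non‑empty column of A and a non‑empty row of B do any work.

```python
import heapq

def outer_product_spgemm(A_cols, B_rows):
    """C = A * B via outer products merged through a heap.
    A_cols[k] = list of (i, A[i][k]) for non-empty columns k of A;
    B_rows[k] = list of (j, B[k][j]) for non-empty rows k of B.
    Cost depends on the nonzero structure, never on the dimension n."""
    heap = []
    for k in set(A_cols) & set(B_rows):       # only "matching" indices k
        for i, a in A_cols[k]:
            for j, b in B_rows[k]:
                heapq.heappush(heap, ((j, i), a * b))
    C = {}
    while heap:                               # pops in sorted (col, row) order,
        (j, i), v = heapq.heappop(heap)       # so duplicates are adjacent
        C[(i, j)] = C.get((i, j), 0.0) + v
    return C

# A = [[1,0],[2,3]], B = [[0,4],[5,0]]  =>  A*B = [[0,4],[15,8]]
A_cols = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 3.0)]}
B_rows = {0: [(1, 4.0)], 1: [(0, 5.0)]}
C = outer_product_spgemm(A_cols, B_rows)      # {(0,1): 4.0, (1,0): 15.0, (1,1): 8.0}
```

Because the loop iterates only over `set(A_cols) & set(B_rows)`, a hypersparse block with a handful of nonzeros costs a handful of operations, matching the dimension‑independent bounds quoted above.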

The authors provide a detailed asymptotic analysis. Assuming square n × n matrices with an average of d nonzeros per row/column, they derive a communication cost of T_comm = Θ(α√p + β·dn/√p) and a computation cost of T_comp = O(dn/√p + (d²n/p)·log(d²n/p)). Although parallel efficiency degrades as √p grows, the algorithm is not bounded by a fixed processor count: only the latency term grows with √p while the bandwidth and computation terms shrink, allowing near‑linear speedup for moderate p and graceful √p‑limited scaling thereafter.
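A toy evaluation makes the shape of this model visible. The sketch below assumes the bandwidth term reflects a per‑processor volume of dn/√p; the machine constants alpha and beta and the problem sizes n and d are illustrative placeholders, not measurements from the paper.

```python
import math

# Illustrative constants: alpha = per-message latency (s),
# beta = per-word inverse bandwidth (s/word). Placeholder values only.
alpha, beta = 1e-6, 1e-9
n, d = 10**7, 8          # matrix dimension and nonzeros per row (illustrative)

def t_comm(p):
    """Communication model: latency term grows with sqrt(p),
    bandwidth term shrinks with sqrt(p)."""
    sq = math.sqrt(p)
    return alpha * sq + beta * d * n / sq

for p in (64, 256, 1024, 4096):
    print(p, t_comm(p))
```

For these constants the total keeps dropping through thousands of processors (the bandwidth term dominates), but at extreme p the α√p latency term takes over, which is exactly the gradual efficiency loss described above.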

Building on this scalable SpGEMM, the paper introduces parallel algorithms for sparse matrix indexing (SpRef) and sparse matrix assignment (SpAsgn). SpRef extracts a submatrix B = A(I,J) by constructing Boolean selector matrices R_I and C_J from index vectors I and J, then computing R_I·A·C_J using two SpGEMM calls. SpAsgn performs A(I,J) = B by a similar composition, optionally adding to the existing submatrix. Both operations inherit the communication and computation bounds of the underlying SpGEMM, providing the first general‑purpose distributed algorithms for arbitrary index vectors.
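The selector‑matrix trick behind SpRef can be shown with a small dense sketch (the paper of course uses distributed DCSC matrices and SpGEMM; plain Python lists stand in here, and `matmul`/`spref` are names invented for this illustration).

```python
def matmul(X, Y):
    """Plain triple-loop matrix product for the sketch."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def spref(A, I, J):
    """B = A(I, J) computed as R_I * A * C_J.
    R_I is len(I) x n with R_I[k][I[k]] = 1  (picks rows I);
    C_J is m x len(J) with C_J[J[k]][k] = 1  (picks columns J)."""
    n, m = len(A), len(A[0])
    R_I = [[1 if c == i else 0 for c in range(n)] for i in I]
    C_J = [[1 if J[k] == r else 0 for k in range(len(J))] for r in range(m)]
    return matmul(matmul(R_I, A), C_J)   # two multiplications, as in SpRef

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
B = spref(A, [0, 2], [1, 2])   # rows {0,2}, columns {1,2} -> [[2, 3], [8, 9]]
```

Note that I and J are arbitrary index vectors: they may repeat or reorder indices, and the same two‑multiplication composition still works, which is what makes the SpGEMM formulation general‑purpose.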

Experimental evaluation on large clusters (Cray XT4, IBM Blue Gene/P) demonstrates that the 2D Sparse SUMMA implementation attains over 70% parallel efficiency up to 4,000 cores, outperforms 1D approaches by factors of 3–5, and scales better than the Trilinos SpGEMM implementation, especially on irregular graphs with highly non‑uniform sparsity patterns. SpRef and SpAsgn also show substantial speedups (2–5×) while keeping memory consumption proportional to the total number of nonzeros, thanks to DCSC.

In conclusion, the paper shows that a combination of hypersparse storage (DCSC) and a flexible 2D block‑broadcast algorithm (Sparse SUMMA) yields a robust, scalable foundation for a wide range of sparse linear algebra primitives. This work enables efficient large‑scale graph analytics, algebraic multigrid, and dynamic streaming updates on modern high‑performance computing systems.

