Computing the sparse matrix vector product using block-based kernels without zero padding on processors with AVX-512 instructions
The sparse matrix-vector product (SpMV) is a fundamental operation in many scientific applications from various fields. The High Performance Computing (HPC) community has therefore continuously invested a lot of effort to provide an efficient SpMV kernel on modern CPU architectures. Although it has been shown that block-based kernels help to achieve high performance, they are difficult to use in practice because of the zero padding they require. In the current paper, we propose new kernels using the AVX-512 instruction set, which makes it possible to use a blocking scheme without any zero padding in the matrix memory storage. We describe mask-based sparse matrix formats and their corresponding SpMV kernels highly optimized in assembly language. Considering that the optimal blocking size depends on the matrix, we also provide a method to predict the best kernel to be used utilizing a simple interpolation of results from previous executions. We compare the performance of our approach to that of the Intel MKL CSR kernel and the CSR5 open-source package on a set of standard benchmark matrices. We show that we can achieve significant improvements in many cases, both for sequential and for parallel executions. Finally, we provide the corresponding code in an open source library, called SPC5.
💡 Research Summary
The paper addresses the long‑standing performance bottleneck of the sparse matrix‑vector product (SpMV) on modern CPUs. While block‑based formats such as BCSR or UBCSR can expose SIMD parallelism, they traditionally require zero‑padding each block so that it becomes dense. This padding inflates memory traffic, wastes cache space and often nullifies the expected speed‑up. The authors propose a fundamentally different approach that leverages the mask‑driven expand‑load capabilities of the AVX‑512 instruction set (the vexpandpd/vexpandps instructions) to eliminate padding entirely.
The new storage scheme, denoted β(r,c), partitions the matrix into blocks of r rows by c columns. The matrix is described by four arrays: (1) values – the non‑zero entries, stored block by block and row‑major inside each block; (2) block_masks – an r·c‑bit mask per block indicating which positions inside the block contain a non‑zero; (3) block_colidx – the column index of each block's upper‑left element; and (4) block_rowptr – the cumulative count of blocks per r‑row interval. Because only the actual non‑zeros are stored, the memory footprint is essentially identical to CSR, regardless of the block size.
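The four arrays can be sketched as a plain C structure; the field names and types below are illustrative assumptions, not the SPC5 API. A small helper makes the footprint claim concrete: the padding‑free blocks add only per‑block metadata on top of the packed non‑zero values.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout for the beta(r,c) format described above
   (names are assumptions, not the SPC5 API). With r*c <= 32,
   a 32-bit integer holds one block occupancy mask. */
typedef struct {
    int       num_block_rows; /* ceil(rows / r)                           */
    int       num_blocks;     /* total number of stored blocks            */
    int       nnz;            /* number of non-zero entries               */
    double   *values;         /* nnz entries, block by block, row-major
                                 inside each block -- no zero padding     */
    uint32_t *block_masks;    /* one r*c-bit occupancy mask per block     */
    int      *block_colidx;   /* column of each block's upper-left corner */
    int      *block_rowptr;   /* num_block_rows+1 cumulative block counts */
} beta_matrix;

/* Storage size in bytes: only per-block metadata is added on top of
   the nnz values themselves, keeping the footprint CSR-like. */
size_t beta_bytes(int nnz, int num_blocks, int num_block_rows)
{
    return (size_t)nnz * sizeof(double)                /* values       */
         + (size_t)num_blocks * sizeof(uint32_t)       /* block_masks  */
         + (size_t)num_blocks * sizeof(int)            /* block_colidx */
         + (size_t)(num_block_rows + 1) * sizeof(int); /* block_rowptr */
}
```

For example, 1000 non‑zeros grouped into 300 blocks over 50 block rows occupy beta_bytes(1000, 300, 50) = 10604 bytes, i.e. about 2.6 KB of metadata on top of the 8000 bytes of raw values.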
The core SpMV kernel is written in hand‑tuned assembly. For each block, the mask is loaded and a single vexpand instruction reads the packed non‑zero values from memory and expands them into the masked lanes of a full‑width vector register. Fused multiply‑add operations then accumulate the block's contribution to the result vector. The authors apply several classic low‑level optimizations: loop unrolling, register reuse, 64‑byte alignment, prefetching of the next block and its mask, and OpenMP‑based thread parallelism that distributes blocks according to block_rowptr to achieve load balance.
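The per‑block logic can be emulated in portable scalar C, which shows what the mask‑driven expand‑load achieves without requiring AVX‑512 hardware. This is a sketch of the traversal only, not the paper's hand‑tuned assembly; the r = c = 2 block shape and the array layout are illustrative assumptions.

```c
#include <stdint.h>

/* Scalar emulation of the mask-based beta(r,c) SpMV traversal.
   In the real kernel the mask drives a vexpandpd that expands the
   packed values into the right SIMD lanes in one instruction; here
   each set bit of the mask consumes the next packed value.
   r = c = 2 is an illustrative choice, not the SPC5 layout. */
enum { R = 2, C = 2 };

void spmv_beta(int num_block_rows,
               const int     *block_rowptr, /* num_block_rows+1 entries */
               const int     *block_colidx, /* one per block            */
               const uint8_t *block_masks,  /* r*c bits per block       */
               const double  *values,       /* packed non-zeros         */
               const double  *x, double *y)
{
    const double *v = values;  /* values are consumed strictly in order */
    for (int br = 0; br < num_block_rows; ++br) {
        for (int b = block_rowptr[br]; b < block_rowptr[br + 1]; ++b) {
            uint8_t mask = block_masks[b];
            int     col0 = block_colidx[b];
            for (int i = 0; i < R; ++i)          /* row inside block    */
                for (int j = 0; j < C; ++j)      /* column inside block */
                    if (mask & (1u << (i * C + j)))
                        y[br * R + i] += *v++ * x[col0 + j];
        }
    }
}
```

A 4×4 matrix with rows [1 0 2 0], [0 3 0 0], [0 0 4 5], [6 0 0 7] stored this way uses block_rowptr = {0,2,4}, block_colidx = {0,2,0,2}, masks = {9,1,4,11} and values = {1,3,2,6,4,5,7}; multiplying by the all‑ones vector yields y = {3, 3, 9, 13}.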
Choosing the optimal block dimensions (r,c) is non‑trivial because the best size depends on the sparsity pattern. To automate the choice, the authors introduce a record‑based prediction technique: during a short initial profiling phase they run the SpMV with a small set of candidate block sizes, record the execution times, and fit a simple model (a polynomial for sequential runs, a linear regression for multi‑threaded runs). The model then predicts the most promising block size for subsequent executions on the same matrix, avoiding exhaustive autotuning.
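The summary does not fix the exact model, so the following is only a minimal sketch of the idea: recorded timings for each candidate kernel are interpolated at a new matrix's characteristics, and the kernel with the lowest predicted time wins. The single scalar feature (e.g. average non‑zeros per row) and piecewise‑linear interpolation are assumptions for illustration, not the paper's fitted polynomial/regression.

```c
#include <stddef.h>

/* Piecewise-linear interpolation of recorded times at feature value f.
   feat[] must be sorted ascending; values outside the range are clamped. */
double interp_time(const double *feat, const double *time, int n, double f)
{
    if (f <= feat[0]) return time[0];
    for (int i = 1; i < n; ++i)
        if (f <= feat[i]) {
            double t = (f - feat[i - 1]) / (feat[i] - feat[i - 1]);
            return time[i - 1] + t * (time[i] - time[i - 1]);
        }
    return time[n - 1];
}

/* Return the index of the candidate kernel with the lowest predicted
   time at feature value f. times[] holds num_samples recorded timings
   per kernel, stored kernel after kernel. */
int predict_kernel(int num_kernels, int num_samples,
                   const double *feat, const double *times, double f)
{
    int    best   = 0;
    double best_t = interp_time(feat, times, num_samples, f);
    for (int k = 1; k < num_kernels; ++k) {
        double t = interp_time(feat, times + (size_t)k * num_samples,
                               num_samples, f);
        if (t < best_t) { best_t = t; best = k; }
    }
    return best;
}
```

With recorded features {1, 4, 16} and timings {10, 8, 6} for kernel 0 versus {12, 7, 5} for kernel 1, the predictor picks kernel 0 for a matrix with feature 2 but kernel 1 for feature 10, reflecting the crossover in the records.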
Performance evaluation is carried out on a set of 30 representative matrices from the SuiteSparse collection, using an Intel Xeon Scalable (Cascade Lake) processor. Comparisons are made against the Intel MKL CSR implementation and the open‑source CSR5 library. In single‑core tests the new kernels achieve an average speed‑up of 1.3×–2.0× over MKL CSR, with a peak of 2.8× on matrices with irregular row lengths. In a 24‑core (2‑socket) configuration the speed‑ups range from 1.5× to 2.2×, demonstrating good scalability and reduced memory‑bandwidth pressure. Against CSR5, the mask‑based approach consistently shows higher FLOP/Byte ratios and lower bandwidth consumption, especially on matrices where the average number of non‑zeros per block is low.
All source code, including the assembly kernels, the profiling‑based predictor, and build scripts, is released as the SPC5 library (hosted on GitLab). The repository contains detailed documentation, making the method readily reproducible and usable by practitioners. By removing padding, exploiting AVX‑512 masks, and providing an automatic block‑size selector, the paper delivers a practical, high‑performance SpMV solution for current and future Intel CPUs. Future work could extend the mask‑based concept to GPUs, explore dynamic matrix reordering, and integrate the predictor into larger scientific applications.