Serpens: A High Bandwidth Memory Based Accelerator for General-Purpose Sparse Matrix-Vector Multiplication

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Sparse matrix-vector multiplication (SpMV) multiplies a sparse matrix with a dense vector. SpMV plays a crucial role in many applications, from graph analytics to deep learning. The random memory accesses of the sparse matrix make accelerator design challenging. However, high bandwidth memory (HBM) based FPGAs are a good fit for designing accelerators for SpMV. In this paper, we present Serpens, an HBM based accelerator for general-purpose SpMV, which features memory-centric processing engines and index coalescing to support the efficient processing of arbitrary SpMVs. From the evaluation of twelve large-size matrices, Serpens is $1.91 \times$ and $1.76 \times$ better in terms of geomean throughput than the latest accelerators GraphLily and Sextans, respectively. We also evaluate 2,519 SuiteSparse matrices, and Serpens achieves $2.10 \times$ higher throughput than a K80 GPU. For the energy/bandwidth efficiency, Serpens is $1.71 \times / 1.99 \times, 1.90 \times / 2.69 \times$, and $6.25 \times / 4.06 \times$ better compared with GraphLily, Sextans, and K80, respectively. After scaling up to 24 HBM channels, Serpens achieves up to $60.55 \mathrm{GFLOP} / \mathrm{s}$ (30,204 MTEPS) and up to $3.79 \times$ over GraphLily. The code is available at https://github.com/UCLA-VAST/Serpens.


💡 Research Summary

The paper introduces Serpens, a high‑bandwidth‑memory (HBM) based FPGA accelerator designed specifically for general‑purpose sparse matrix‑vector multiplication (SpMV). SpMV is a fundamental kernel in many domains such as graph analytics, scientific computing, and sparse neural networks, but its performance is traditionally limited by irregular memory accesses and low data reuse. The authors argue that HBM‑enabled FPGAs, with their massive external bandwidth, are well‑suited to address these challenges, yet existing HBM‑based accelerators (GraphLily, Sextans) either allocate resources for broader graph workloads or are optimized for sparse matrix‑matrix multiplication (SpMM), leaving a gap for a dedicated SpMV solution.

Architecture Overview
Serpens targets the Xilinx Alveo U280 platform, which provides 32 HBM channels (460 GB/s total bandwidth). The design reserves 16 channels for the sparse matrix A, plus one channel each for the dense vector x, the input vector y, and the output vector y, totaling 19 channels and 273 GB/s of usable bandwidth. Each HBM channel is paired with a 512‑bit read/write (Rd/Wr) module. For dense vectors, 16 FP32 values are packed into a single 512‑bit word; for sparse elements, the row and column indices (compressed into 32 bits) and a 32‑bit value are combined into a 64‑bit representation, and eight such elements are packed per 512‑bit transaction.
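The packing scheme can be sketched in software. The snippet below is a minimal illustration, not the accelerator's exact bit layout: the summary only states that the index pair is compressed into 32 bits, so the 14/18-bit row/column split used here (via the `row_bits` parameter) is an assumption.

```python
import struct

def pack_sparse_element(row, col, value, row_bits=14):
    """Pack one sparse element into 64 bits: a 32-bit compressed index
    pair (the row/column bit split is an assumption of this sketch)
    in the high half, and the FP32 value reinterpreted as raw bits in
    the low half."""
    col_bits = 32 - row_bits
    assert row < (1 << row_bits) and col < (1 << col_bits)
    idx = (row << col_bits) | col
    (val_bits,) = struct.unpack("<I", struct.pack("<f", value))
    return (idx << 32) | val_bits

def pack_512bit_word(elements):
    """Pack eight 64-bit sparse elements into one 512-bit HBM word,
    element 0 in the least-significant 64 bits."""
    assert len(elements) == 8
    word = 0
    for i, e in enumerate(elements):
        assert e < (1 << 64)
        word |= e << (64 * i)
    return word
```

Note that packing eight 64-bit elements per transaction matches the 512-bit width of one HBM channel interface, so each channel delivers eight nonzeros per cycle at full bandwidth.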

The core of the accelerator consists of memory‑centric processing engines (PEs). A single HBM channel streams sparse elements to eight PEs in parallel. The dense vector x is partitioned into segments of length 8,192 and stored in on‑chip BRAM; each segment reaches all PEs through a chain topology (rather than a high‑fan‑out broadcast), allowing a high operating frequency while avoiding bank conflicts. Accumulation of partial results is performed in URAMs, which act as row‑wise buffers. By keeping all random accesses (reading x entries and writing to y) on chip, Serpens eliminates the latency and bandwidth penalties associated with off‑chip random memory traffic.
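As a functional (not cycle-accurate) illustration of this dataflow, the sketch below processes x in on-chip segments and lets each PE accumulate into its own buffer; the row-interleaved PE assignment and the dictionary-based buffers are assumptions of this sketch, not the paper's hardware mapping.

```python
def serpens_spmv_sketch(elems, x, num_rows, seg_len=8192, num_pes=8):
    """Functional sketch of Serpens' dataflow: x is consumed in
    on-chip segments of length seg_len, and nonzeros (row, col, val)
    whose columns fall in the current segment are distributed to
    num_pes PEs, each accumulating partial sums in a private buffer
    (standing in for URAM). Rows are interleaved across PEs here,
    which is an assumption of this sketch."""
    y_bufs = [dict() for _ in range(num_pes)]  # per-PE accumulators
    for seg_start in range(0, len(x), seg_len):
        x_seg = x[seg_start:seg_start + seg_len]  # resident on chip
        for (r, c, v) in elems:
            if not (seg_start <= c < seg_start + seg_len):
                continue
            buf = y_bufs[r % num_pes]          # row-interleaved PE choice
            buf[r] = buf.get(r, 0.0) + v * x_seg[c - seg_start]
    # merge per-PE buffers into the dense output vector
    y = [0.0] * num_rows
    for buf in y_bufs:
        for r, partial in buf.items():
            y[r] += partial
    return y
```

The key property the sketch mirrors is that every random access (indexing into `x_seg`, updating a row's partial sum) touches only on-chip state; the off-chip streams are fully sequential.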

Index Coalescing and Reordering
To improve URAM utilization, the authors introduce index coalescing: the results of two consecutive rows are stored at the same URAM address, effectively halving the address space needed for accumulation. This requires a careful coloring and reordering step to avoid read‑after‑write (RAW) hazards and URAM bank conflicts. The paper adapts the coloring scheme used in Sextans but simplifies it because only two consecutive rows can share a color, given the two‑cycle DSP latency. The resulting schedule guarantees that no two elements mapped to the same URAM address are processed within the same two‑cycle window, preserving correctness while maximizing on‑chip storage efficiency.
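A greedy software sketch of such a hazard-free schedule might look as follows; this is an illustration of the scheduling constraint, not the paper's exact coloring algorithm, and the bubble-insertion fallback is an assumption of this sketch.

```python
def coalesced_addr(row):
    """Index coalescing: two consecutive rows share one URAM address."""
    return row // 2

def reorder_hazard_free(elems, window=2):
    """Greedy reordering sketch: emit (row, col, val) elements so that
    no two elements whose rows map to the same coalesced URAM address
    fall within `window` cycles of each other (the accumulator's
    pipeline latency). When no element is eligible, a None bubble
    (pipeline stall) is inserted instead."""
    pending = list(elems)
    schedule = []
    while pending:
        recent = {coalesced_addr(r) for (r, _, _) in
                  [e for e in schedule[-(window - 1):] if e is not None]}
        pick = next((e for e in pending
                     if coalesced_addr(e[0]) not in recent), None)
        if pick is None:
            schedule.append(None)   # stall: no conflict-free element
        else:
            pending.remove(pick)
            schedule.append(pick)
    return schedule
```

In hardware the reordering is done offline as a preprocessing pass, so the accelerator itself never stalls on dependency checks; the sketch's bubbles correspond to the padding a scheduler would insert when a conflict cannot be hidden.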

Resource and Performance Modeling
Analytical models estimate BRAM usage as 32 · HA (where HA is the number of HBM channels allocated to A) and URAM usage as 8 · HA · U (U is the number of URAMs per PE). The pipeline achieves an initiation interval (II) of 1, leading to a cycle count dominated by streaming the dense vector (K/16 cycles) and the sparse matrix (NNZ/(8·HA) cycles). Scaling the number of HBM channels from 16 to 24 yields near‑linear throughput improvements, confirming the design’s scalability.
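The formulas above can be wrapped in a small back-of-the-envelope model. The default clock frequency and per-PE URAM count below are illustrative assumptions, not figures stated in this summary.

```python
def serpens_model(K, NNZ, HA, U=4, freq_mhz=225):
    """Back-of-the-envelope model from the summary's formulas:
    BRAM = 32 * HA, URAM = 8 * HA * U, and with II = 1 the cycle
    count is approximately K/16 (streaming x) + NNZ/(8*HA)
    (streaming A). U and freq_mhz defaults are assumptions."""
    bram = 32 * HA
    uram = 8 * HA * U
    cycles = K / 16 + NNZ / (8 * HA)
    seconds = cycles / (freq_mhz * 1e6)
    gflops = 2 * NNZ / seconds / 1e9   # 2 FLOPs (mul + add) per nonzero
    return {"BRAM": bram, "URAM": uram, "cycles": cycles,
            "GFLOP/s": gflops}
```

Because the NNZ term dominates for large matrices, doubling HA roughly halves the cycle count, which is consistent with the near-linear scaling the authors observe from 16 to 24 channels.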

Experimental Evaluation
Serpens is evaluated on two fronts:

  1. Twelve large‑scale matrices (including Hollywood, nlpkkt120, etc.) – Serpens outperforms GraphLily by 1.91× and Sextans by 1.76× in geometric‑mean throughput. Energy efficiency improves by 1.71× over GraphLily and 1.90× over Sextans.

  2. 2,519 SuiteSparse matrices – Compared against an NVIDIA K80 GPU, Serpens achieves 2.10× higher throughput, 6.25× better energy efficiency, and 4.06× better bandwidth efficiency.

When the HBM channel allocation is increased to 24, the accelerator reaches a peak of 60.55 GFLOP/s (30,204 MTEPS), which is 3.79× the performance of GraphLily under the same conditions.

Conclusions and Future Work
Serpens demonstrates that a carefully crafted HBM‑centric architecture, combined with on‑chip random‑access handling and index coalescing, can deliver SpMV performance competitive with GPUs while offering superior energy and bandwidth efficiency. The design is fully configurable, allowing scaling to different FPGA resources and HBM configurations. Future directions include extending the architecture to support other sparse kernels (SpMM, SpTS), exploring mixed‑precision and integer data types, and integrating dynamic workload scheduling for multi‑tenant data‑center environments.

