ALPHA-PIM: Analysis of Linear Algebraic Processing for High-Performance Graph Applications on a Real Processing-In-Memory System
Processing large-scale graph datasets is computationally intensive and time-consuming. Processor-centric CPU and GPU architectures, commonly used for graph applications, often face bottlenecks caused by extensive data movement between the processor and memory units due to low data reuse. As a result, these applications are often memory-bound, which limits both performance and energy efficiency. Processing-In-Memory (PIM) offers a promising approach to mitigate data movement bottlenecks by integrating computation directly within or near memory. Although several previous studies have introduced custom PIM proposals for graph processing, they do not leverage real-world PIM systems. This work explores the capabilities and characteristics of common graph algorithms on a real-world PIM system to accelerate data-intensive graph workloads. To this end, we (1) implement representative graph algorithms on UPMEM’s general-purpose PIM architecture; (2) characterize their performance and identify key bottlenecks; (3) compare results against CPU and GPU baselines; and (4) derive insights to guide future PIM hardware design. Our study underscores the importance of selecting optimal data partitioning strategies across PIM cores to maximize performance. Additionally, we identify critical hardware limitations in current PIM architectures and emphasize the need for future enhancements across computation, memory, and communication subsystems. Key opportunities for improvement include increasing instruction-level parallelism, developing improved DMA engines with non-blocking capabilities, and enabling direct interconnection networks among PIM cores to reduce data transfer overheads.
💡 Research Summary
The paper presents ALPHA‑PIM, the first comprehensive framework that implements linear‑algebraic graph algorithms on a commercially available processing‑in‑memory (PIM) system, namely UPMEM. Recognizing that modern graph workloads such as BFS, SSSP, and Personalized PageRank are memory‑bound due to low arithmetic intensity and irregular access patterns, the authors re‑express these algorithms as repeated sparse matrix‑vector multiplications. While traditional CPU and GPU solutions rely on dense‑vector SpMV, the authors observe that in many iterations the input vector is highly sparse, making dense transfers wasteful. To address this, they develop a suite of sparse‑matrix‑sparse‑vector (SpMSpV) kernels, exploring three compressed matrix formats (CSR, CSC, COO) and three partitioning strategies (row‑wise 1D, column‑wise 1D, 2D tiling).
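The key advantage of SpMSpV over SpMV is that only the columns selected by the sparse input vector's nonzeros are ever touched. A minimal plain-Python sketch of this idea (not the paper's UPMEM kernel; the function name and data layout here are illustrative) uses the CSC format, since CSC gives direct access to one matrix column per input nonzero:

```python
# Minimal SpMSpV sketch: the matrix is stored in CSC form so that each
# nonzero x[j] touches only column j, skipping the columns a dense SpMV
# would scan needlessly when the input vector is highly sparse.

def spmspv_csc(col_ptr, row_idx, vals, x_idx, x_val):
    """Compute y = A @ x, with x given as parallel (indices, values) lists."""
    y = {}  # sparse accumulator: row index -> partial sum
    for j, xv in zip(x_idx, x_val):
        # Scan only the nonzeros of column j.
        for k in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[k]
            y[i] = y.get(i, 0) + vals[k] * xv
    return y

# Example: A = [[1,0,2],[0,3,0],[4,0,5]] in CSC, multiplied by the
# sparse vector x = [1,0,2] (nonzeros at columns 0 and 2).
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
vals    = [1, 4, 3, 2, 5]
y = spmspv_csc(col_ptr, row_idx, vals, [0, 2], [1, 2])
# y == {0: 5, 2: 14}, i.e. A @ x = [5, 0, 14] in sparse form
```

Column 1 of the matrix is never read because x has no nonzero there, which is exactly the work savings that makes SpMSpV attractive for sparse BFS frontiers.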
A systematic design‑space exploration across 18 kernel variants identifies the most efficient combination for each graph dataset. An empirical cost model selects between SpMV and SpMSpV at runtime based on the current vector density. Experiments on a 2 048‑DPU configuration (64 MB MRAM per DPU, 24 KB IRAM, 64 KB WRAM) show that the best partitioning can be up to 25× faster than the worst, and that ALPHA‑PIM achieves overall speedups of 2.6× (BFS), 10.4× (SSSP), and 1.7× (PPR) over a high‑end CPU baseline, with comparable or better utilization than state‑of‑the‑art GPUs.
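The runtime choice between SpMV and SpMSpV can be sketched as a simple density test; the crossover threshold below is illustrative only, standing in for the paper's empirical cost model:

```python
# Hedged sketch of density-based kernel selection. The threshold is an
# assumed placeholder, not the fitted value from the paper's cost model.
DENSITY_CROSSOVER = 0.05  # assumed: below this, sparse transfers win

def pick_kernel(x_nnz, n):
    """Select a kernel based on the input vector's nonzero density."""
    density = x_nnz / n
    return "SpMSpV" if density < DENSITY_CROSSOVER else "SpMV"

# In a BFS run, early and late frontiers are sparse, middle ones dense:
print(pick_kernel(30, 10_000))     # sparse frontier -> SpMSpV
print(pick_kernel(4_000, 10_000))  # dense frontier  -> SpMV
```

Because frontier density changes every iteration, this check would run per iteration, letting each BFS/SSSP step use whichever kernel transfers less data.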
Performance profiling reveals three dominant bottlenecks: (i) computation‑side stalls caused by the DPU’s 14‑stage “revolver” pipeline, which forces an 11‑cycle gap between successive instructions of the same thread and limits instruction‑level parallelism; (ii) memory‑side stalls due to blocking DMA transfers between MRAM and WRAM, especially when loading dense input vectors; and (iii) communication‑side overhead because DPUs cannot communicate directly, forcing all intermediate results to be merged on the host CPU. The authors argue that these limitations stem from architectural choices rather than algorithmic inefficiencies.
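The consequence of the 11-cycle issue gap can be made concrete with a back-of-envelope utilization model (a simplification assumed here, not a figure from the paper): since each tasklet can issue at most one instruction every 11 cycles, the pipeline only stays full when enough tasklets interleave.

```python
# Simplified throughput model of the revolver pipeline described above:
# one instruction may issue per cycle overall, but a given tasklet may
# issue only every ISSUE_GAP cycles, so utilization scales with the
# number of interleaved tasklets until it saturates.
ISSUE_GAP = 11  # cycles between successive instructions of one tasklet

def pipeline_utilization(n_tasklets):
    return min(n_tasklets / ISSUE_GAP, 1.0)

print(pipeline_utilization(1))   # single-tasklet code is mostly stalled
print(pipeline_utilization(11))  # 11+ tasklets keep the pipeline full
```

This is why DPU kernels are typically written with many tasklets per DPU: the hardware hides the per-thread gap by multithreading rather than by forwarding or multi-issue.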
Consequently, they propose concrete hardware enhancements: adding forwarding paths and multi‑issue capabilities to the pipeline to alleviate structural hazards; implementing non‑blocking, multi‑channel DMA engines to overlap computation with data movement; and providing a direct inter‑DPU network (e.g., a mesh or ring) to enable on‑chip reduction and eliminate host‑mediated merges. On the software side, they suggest adaptive format selection based on degree distribution, dynamic tiling, and automated tuning of partition sizes.
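The benefit of a direct inter-DPU network for merging partial results can be illustrated with a small simulation (illustrative only; the paper proposes the hardware fabric, not this code): with P DPUs, a tree reduction over a mesh or ring needs only about log2(P) combining steps instead of P host-mediated merges.

```python
# Plain-Python simulation of a tree reduction over per-DPU partial
# result vectors (sparse dicts). With a direct inter-DPU network, pairs
# of DPUs could combine locally each step, halving the count each round.

def tree_reduce(partials):
    """Merge sparse partial vectors pairwise; return (result, n_steps)."""
    steps = 0
    while len(partials) > 1:
        merged = []
        for i in range(0, len(partials), 2):
            acc = {}
            for part in partials[i:i + 2]:
                for k, v in part.items():
                    acc[k] = acc.get(k, 0) + v
            merged.append(acc)
        partials = merged
        steps += 1
    return partials[0], steps

# Four DPUs' partial sums merge in log2(4) = 2 steps:
total, steps = tree_reduce([{0: 1}, {0: 2, 1: 1}, {1: 3}, {2: 4}])
# total == {0: 3, 1: 4, 2: 4}, steps == 2
```

On current UPMEM hardware the same merge must instead round-trip every partial vector through the host CPU, which is the communication overhead the authors measure.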
In summary, ALPHA‑PIM demonstrates that real‑world PIM hardware can substantially accelerate linear‑algebraic graph processing, quantifies the current architectural bottlenecks, and offers a clear roadmap for future PIM designs that combine richer compute pipelines, smarter memory subsystems, and on‑chip communication fabrics.