
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.18158
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

All-pairs shortest paths (APSP) is a fundamental algorithm used for routing, logistics, and network analysis, but the cubic time complexity and heavy data movement of the canonical Floyd-Warshall (FW) algorithm severely limit its scalability on conventional CPUs and GPUs. In this paper, we propose PIM-FW, a novel co-designed hardware architecture and dataflow that leverages processing-in-memory and processing-near-memory to accelerate the blocked FW algorithm on an HBM3 stack. To enable fine-grained parallelism, we propose a massively parallel array of specialized bit-serial bank PEs and channel PEs designed to accelerate the core min-plus operations. Our dataflow complements this hardware, employing an interleaved mapping policy for superior load balancing and a hybrid in- and near-memory computing model for efficient computation and reduction. The novel in-bank computing approach allows all distance updates to be performed and stored within the memory bank, a key contribution that eliminates the data-movement bottleneck inherent in GPU-based approaches. We implement a full software and hardware co-design using a cycle-accurate simulator of an 8-channel, 4-Hi HBM3 PIM stack, evaluated on real road-network traces. Experimental results show that, for an 8,192 × 8,192 graph, PIM-FW achieves an 18.7× speedup in end-to-end execution and consumes 3,200× less DRAM energy than a state-of-the-art GPU-only Floyd-Warshall implementation.

Full Content

All-pairs shortest paths (APSP) finds the shortest path between every pair of vertices in a graph and is commonly encountered in routing, logistics, and network analysis problems [1][2][3]. APSP is often solved using dynamic programming via the Floyd-Warshall (FW) algorithm, but its O(N^3) complexity and massive data movement severely limit its performance on conventional CPU/GPU platforms. GPU acceleration has improved performance somewhat through blocked, or tiled, implementations [4], yet the data-movement bottleneck remains a critical inhibitor. This "memory wall" necessitates a paradigm shift toward Processing-In-Memory (PIM) architectures. The blocked FW algorithm [5] is particularly well-suited for this approach, as its tiled structure exposes massive parallelism that aligns naturally with the inherent hierarchy of DRAM, offering a promising solution to the data-movement problem.

Despite significant GPU advancements, from initial CUDA ports [4,6] to tiled and multi-GPU cluster approaches, performance remains bottlenecked by persistent data-movement overheads [7], which dominate the energy profile of conventional architectures. The energy cost of this data transfer is so pronounced that, for a large-scale graph, a PIM approach that performs computation in memory can achieve a dramatic energy reduction compared to a state-of-the-art GPU baseline. This staggering inefficiency highlights the critical need for PIM and Processing-Near-Memory (PNM) solutions.

PIM and PNM offer a compelling solution to the memory wall by integrating compute primitives directly within the memory hierarchy [8][9][10][11]. The primary distinction between these paradigms lies in the physical placement of the compute logic, which creates a fundamental trade-off between data locality and resource sharing. The PIM approach places logic directly within a memory bank, offering maximum energy efficiency and bandwidth by performing computation with minimal data movement. Conversely, the PNM approach places logic near the memory banks, allowing it to be a shared resource for multiple bank-groups [12]. While this incurs some local data movement, it provides a centralized and efficient point for tasks that require aggregating information from multiple sources.

While the structured memory access of dense graphs appears well-suited to the massive parallelism of modern GPUs, in practice performance is severely bottlenecked by the memory wall. The O(N^3) complexity of the FW algorithm translates to an overwhelming number of read-modify-write operations that saturate the memory bus. For instance, our analysis shows that this persistent data movement between a GPU's processing cores and its high-bandwidth memory is so costly that it accounts for the vast majority of the system's energy consumption, leading a state-of-the-art NVIDIA A100 GPU to require over 3,200× more energy than our PIM-based approach. This bottleneck is architectural, not merely a matter of scale; even against a projected next-generation NVIDIA H100, our PIM architecture maintains a compelling 4.2× speedup, demonstrating that simply increasing bandwidth and compute resources in a conventional design yields diminishing returns for this memory-bound problem. This highlights the need for in-memory computation. However, prior PIM/PNM graph accelerators have overwhelmingly focused on sparse graph processing [13,14]. Many of these solutions, such as GraphR [14], adopt a vertex-programming model that spreads vertices across different memory locations. This is efficient for managing the irregular, pointer-chasing memory access patterns of sparse workloads (e.g., Breadth-First Search), but when processing fully dense workloads such as APSP, the communication cost of transferring vertex data between different memory components becomes prohibitively large. This renders sparse-optimized designs ill-suited for dense APSP, creating a clear need for an architecture that addresses the unique challenges of dense, matrix-based graph algorithms.

Our work addresses this gap by proposing PIM-FW, the first DRAM-based architecture co-designed to accelerate the blocked FW algorithm. The core insight of our work is a pragmatic hardware-software co-design that divides the algorithm's tasks based on their computational characteristics. On the hardware side, we propose a hybrid architecture: for the dominant and massively parallel O(N^3) min-plus updates, we employ a pure PIM approach by embedding massive arrays of compact, bit-serial processing elements, which we term Bank PEs (BPEs), within each DRAM bank. This design performs computation in place, localizing the bulk of the operations to maximize parallelism and virtually eliminate the data-movement bottleneck inherent in GPU-based approaches. Conversely, for communication-intensive reduction tasks, we adopt a PNM model, utilizing a shared engine at the channel level, termed the Channel PE (CPE), to efficiently aggregate results from across the bank-groups. The BPE and CPE processing units share the same fundamental design; their primary distinction is their physical placement: BPEs are distributed in-bank for maximum locality, while the CPE is centralized for efficient cross-bank-group communication. On the software side, we complement this hybrid hardware with a novel interleaved data-mapping policy that ensures superior load balancing. Through cycle-accurate simulation, this co-design demonstrates remarkable results: PIM-FW achieves an 18.7× speedup and an over 3,200× energy reduction compared to a state-of-the-art NVIDIA A100 GPU.

Our architecture, shown in Fig. 1, is built upon High-Bandwidth Memory (HBM3) [15], a 3D-stacked DRAM technology that provides massive bandwidth and enables fine-grained parallelism [16]. Its key architectural features, such as a wide data interface interconnected by Through-Silicon Vias (TSVs) and a deep hierarchy of independent channels and bank-groups, are fundamental to our design. This structure allows for concurrent operations across multiple channels and bank-groups, providing the necessary foundation for our accelerator. We propose PIM-FW, which accelerates the blocked FW algorithm through a novel dataflow and a hybrid PIM architecture. We chose blocked FW for its high parallelism, as discussed in Section 2.1. The PIM-FW operation begins with a high-throughput bulk transfer of the tile-major-ordered matrix into the HBM3 stack, leveraging the wide 1024-bit TSV interface. Building on this hierarchy, we integrate a CPE at the channel level, responsible for global reduction tasks such as comparing minimum values across different bank-groups. The overall execution is orchestrated by a host CPU for initial data loading, which then delegates control to the main memory controller. This controller issues specialized commands to manage the algorithm's phased execution, and its associated latency and energy overhead are fully accounted for in our cycle-accurate simulation. The detailed hardware implementation of our PIM-FW architecture is presented in Section 2.2. The specific data-alignment and scheduling strategies designed to maximize parallelism are further detailed in Section 2.3.1.

The All-Pairs Shortest-Path (APSP) problem aims to find the shortest path distance for every pair of vertices (u, v) ∈ V, with the solution being an N × N distance matrix D. The canonical FW algorithm is a fundamental method for solving this problem [11,19].

The algorithm operates on a dense distance matrix, iteratively updating every path using the relaxation D[i,j] ← min(D[i,j], D[i,k] + D[k,j]) [17]. The fundamental computation at the heart of this process is the min-plus operation. However, the naïve implementation's O(N^3) complexity and poor data locality create a severe data-movement bottleneck on conventional CPU and GPU platforms [6,7]. This bottleneck arises because, for each min-plus operation in the innermost loop, the processor must read three values (D[i,j], D[i,k], and D[k,j]) from main memory and write one value back. For a graph of size N, this results in approximately 4 × N^3 memory accesses. When N is large, the time and energy spent waiting for data to travel between the processor and main memory far exceed that of the actual computation, making memory bandwidth, not computational power, the limiting factor for performance.
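For concreteness, here is a minimal Python sketch of the naive relaxation loop (our own illustration, not the paper's code), with a comment marking the three reads and one write behind the roughly 4 × N^3 memory accesses.

```python
import numpy as np

def floyd_warshall_naive(D: np.ndarray) -> np.ndarray:
    """Naive O(N^3) Floyd-Warshall; D is an N x N distance matrix."""
    N = D.shape[0]
    for k in range(N):
        for i in range(N):
            for j in range(N):
                # Three reads (D[i, j], D[i, k], D[k, j]) and one write per
                # iteration -> roughly 4 * N^3 memory accesses in total.
                D[i, j] = min(D[i, j], D[i, k] + D[k, j])
    return D
```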

To mitigate this, the blocked FW algorithm partitions the matrix into smaller B × B tiles. This tiling strategy enhances data locality by allowing data for multiple updates to be held in fast on-chip memory [18,20]. Furthermore, it exposes massive parallelism between independent tile updates, making it a foundational strategy for high-performance implementations and particularly well-suited for the PIM architectures we explore in this work. The procedure is detailed in Algorithm 1.

Algorithm 1: Blocked Floyd-Warshall
Require: Adjacency matrix D of size N × N, block size B
Ensure: Matrix D with all-pairs shortest paths
1: Let M = N/B and let A_ij be the (i, j)-th block of D
2: for k ← 0 to M − 1 do
3:   ⊲ Phase 1: Update pivot block
4:   A_kk ← FW(A_kk)
5:   ⊲ Phase 2: Update pivot row and column
6:   For each i ∈ {0..M−1}, i ≠ k: A_ik ← min(A_ik, A_ik ⊕ A_kk)
7:   For each j ∈ {0..M−1}, j ≠ k: A_kj ← min(A_kj, A_kk ⊕ A_kj)
8:   ⊲ Phase 3: Update remaining blocks
9:   For each i, j ∈ {0..M−1}, i ≠ k, j ≠ k: A_ij ← min(A_ij, A_ik ⊕ A_kj)
10: end for

The in-memory processing of our architecture employs a sophisticated hybrid model, strategically combining the strengths of PIM and PNM to optimize different stages of the blocked FW algorithm based on their computational characteristics.
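To make the three-phase structure of Algorithm 1 concrete, the following is a minimal NumPy sketch of blocked FW (an illustration under our reading of the algorithm, not the paper's implementation; `fw_tile` and `min_plus` are hypothetical helper names).

```python
import numpy as np

def fw_tile(T):
    """Plain Floyd-Warshall restricted to one tile (intra-tile paths)."""
    for k in range(T.shape[0]):
        T = np.minimum(T, T[:, k, None] + T[None, k, :])
    return T

def min_plus(A, B):
    """Min-plus product: C[i, j] = min_k (A[i, k] + B[k, j])."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def blocked_floyd_warshall(D, B):
    """Three-phase blocked FW on an N x N matrix, tile size B (B divides N)."""
    N = D.shape[0]
    M = N // B
    tile = lambda i, j: np.s_[i * B:(i + 1) * B, j * B:(j + 1) * B]
    for k in range(M):
        # Phase 1: pivot tile, updated in isolation.
        D[tile(k, k)] = fw_tile(D[tile(k, k)])
        # Phase 2: pivot row and pivot column, each depending only on A_kk.
        for x in range(M):
            if x == k:
                continue
            D[tile(k, x)] = np.minimum(D[tile(k, x)],
                                       min_plus(D[tile(k, k)], D[tile(k, x)]))
            D[tile(x, k)] = np.minimum(D[tile(x, k)],
                                       min_plus(D[tile(x, k)], D[tile(k, k)]))
        # Phase 3: remaining tiles, depending on the phase-2 results.
        for i in range(M):
            for j in range(M):
                if i != k and j != k:
                    D[tile(i, j)] = np.minimum(
                        D[tile(i, j)],
                        min_plus(D[tile(i, k)], D[tile(k, j)]))
    return D
```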

The bulk of the FW algorithm is dominated by O(N^3) min-plus computations, which are massively parallel and data-local. To handle these, we adopt a PIM approach: as shown in Fig. 2, each DRAM bank is equipped with an array of 16 BPEs. This design minimizes data movement for the core workload while maximizing energy efficiency and parallelism.

PNM (CPE) for communication-intensive reductions: Conversely, stages such as finding a minimum value across different bank-groups for a global reduction are inherently communication-intensive. For such tasks, a centralized, more capable unit is more efficient. We therefore employ a CPE model (Fig. 1), utilizing a shared engine at the channel level. This CPE is responsible for efficiently aggregating results from its constituent bank-groups, avoiding the significant overhead of coordinating complex cross-bank communication among thousands of simple in-bank BPEs.

This pragmatic division of labor allows us to match each stage of the algorithm to its most efficient processing paradigm, which is the key novelty of our PIM-FW architecture.

The interaction between a BPE and the DRAM components for a single min-plus operation follows a precise sequence. The process begins when the DRAM controller opens a wordline and latches the row's 8192 bits of data into the row buffer via the bitlines and sense amplifiers. Once the data is latched, each of the 16 BPEs within the bank reads its required 32-bit operand from a dedicated 512-bit slice of the row buffer's internal datalines. To support the min-plus operation, as detailed in Fig. 2, the BPE logic first uses a bit-serial adder to compute the sum (D_sum = D_ik + D_kj) and then uses a bit-serial adder and multiplexer to efficiently perform the comparison and select the minimum value. The final 32-bit result (D_new) is then written back to its corresponding slice in the row buffer, overwriting the old data.
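As an illustration of the described BPE datapath, consider the following behavioral model (not the RTL; the final comparison step is folded into Python's `min` for brevity).

```python
def bit_serial_add(a_bits, b_bits):
    """LSB-first bit-serial addition, one full-adder step per cycle."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s = a ^ b ^ carry
        carry = (a & b) | (carry & (a ^ b))
        out.append(s)
    return out  # carry-out dropped for the fixed 32-bit width

def bit_serial_min_plus(d_ij, d_ik, d_kj, width=32):
    """One BPE-style min-plus update: D_new = min(D_ij, D_ik + D_kj)."""
    to_bits = lambda v: [(v >> i) & 1 for i in range(width)]
    to_int = lambda bits: sum(b << i for i, b in enumerate(bits))
    d_sum = to_int(bit_serial_add(to_bits(d_ik), to_bits(d_kj)))
    # In hardware a second serial pass compares the two candidates and a
    # multiplexer selects the smaller one; modeled here with Python's min.
    return min(d_ij, d_sum)
```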

Conversely, for the communication-intensive reduction phase, we utilize a CPE paradigm. A global processing engine at the channel-level controller collects the local minima from its constituent bank-groups via the on-die interconnects. It uses an internal comparison tree to find the channel's overall minimum in approximately 5-10 cycles. The final global minimum is then found by a reduction across all channel-level results at the main memory controller. This hybrid approach uses the most efficient paradigm for each specific task. Fig. 3 (b) further clarifies the multi-level dependency structure of our blocked FW implementation. The execution of each pivot iteration is governed by a three-phase data dependency. Phase 1 first updates the pivot tile A_kk in isolation. In phase 2, the results from A_kk are broadcast to all tiles in the pivot row (A_kj) and column (A_ik), which can then be processed in parallel. Finally, phase 3 updates the remaining off-diagonal tiles (A_ij) in a parallel wavefront, using data broadcast from the tiles computed in phase 2, specifically rows from A_ik and columns from A_kj. This entire flow is enabled by HBM3's wide 1024-bit TSV interface, which can broadcast entire B-element pivot vectors to all target bank-groups in just a few cycles, unlocking the parallelism in both phases 2 and 3. Within each phase, all white (off-diagonal) entries across different sub-blocks are free of inter-block dependencies and can therefore be executed fully in parallel.
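A minimal behavioral sketch of this two-level reduction (our own illustration; cycle counts and interconnect behavior are not modeled):

```python
def hierarchical_min(bank_group_minima_per_channel):
    """Two-level reduction mirroring the described flow: each channel's CPE
    reduces its bank-group minima, then the memory controller reduces the
    per-channel results to a global minimum."""
    channel_minima = [min(bg_minima)                 # CPE comparison tree
                      for bg_minima in bank_group_minima_per_channel]
    return min(channel_minima)                       # controller-level step

# Example: 8 channels x 4 bank-groups of local minima (arbitrary values).
locals_per_channel = [[7, 3, 9, 5], [4, 8, 6, 2], [11, 10, 12, 13],
                      [5, 5, 4, 9], [8, 7, 6, 6], [3, 14, 2, 7],
                      [9, 9, 9, 9], [1, 15, 8, 4]]
assert hierarchical_min(locals_per_channel) == 1
```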

Fig. 3 (c) details the execution of a parallel update in our hybrid architecture. Pivot data from the source bank-groups (holding A_ik and A_kj) is broadcast to the compute bank-group holding A_ij. Inside the compute bank, BPEs perform the min-plus operations, while the CPE handles the communication-intensive reduction tasks.

Coarse- and Fine-Grained Parallelism. Our architecture and mapping policy are co-designed to exploit parallelism at two distinct granularities:

• Coarse-grained parallelism: At the highest level, our design leverages the independence of HBM3 channels. Since our mapping policy distributes tiles across all channels, multiple independent tile updates (e.g., during phase 3 wavefronts) can occur simultaneously in different channels, each with its own controller and data path. This channel-level parallelism is essential for achieving high throughput on the overall problem.

• Fine-grained parallelism: Within each bank-group, our design achieves fine-grained parallelism via the in-bank PE arrays. A single bank-group contains a total of 256 PEs (16 banks × 16 PEs/bank). This number is deliberately chosen to match our tile dimension (B = 256), allowing an entire 256-element row of a tile to be processed in a single computational pass, with each PE handling one element. This internal parallelism is what allows each tile-level update to be executed efficiently. By combining this mapping policy with a hardware design that supports both coarse- and fine-grained parallelism, we can fully exploit the dependency structure of the blocked FW algorithm to achieve significant performance gains.
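The relationship between PE count and the number of computational passes per tile row can be summarized with a small calculation (illustrative only):

```python
import math

def passes_per_tile_row(tile_width=256, pes_per_bank_group=256):
    """Passes needed to update one tile row with the in-bank PE array;
    matching the PE count to the tile width gives a single pass."""
    return math.ceil(tile_width / pes_per_bank_group)

assert passes_per_tile_row(256, 256) == 1   # baseline design point
assert passes_per_tile_row(256, 128) == 2   # halved PEs -> twice the passes
assert passes_per_tile_row(256, 512) == 1   # extra PEs sit idle, no gain
```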

To efficiently map the blocked Floyd-Warshall algorithm onto our hardware, we devised a multi-phase schedule, illustrated in Fig. 4. This schedule is a necessary adaptation to impose the algorithm's logical 8×8 tile grid onto the physical 4×8 HBM3 organization, which consists of 4 bank-groups per channel across 8 independent channels. The process begins in phase 1 with the self-contained update of the pivot block, A_kk. In phase 2, the updated pivot data is broadcast to all other tiles in its corresponding row and column. Due to the architecture, broadcasting is parallel across different channels but sequential within a single channel. This constraint means the inter-bank-group broadcast (represented by the orange arrow) requires three broadcast cycles. In contrast, the intra-bank-group broadcast (represented by the green arrow) and the inter-channel broadcast (represented by the blue arrow) each complete in a single cycle. Following the broadcast, phases 3 and 4 show the update of the tiles in the pivot column and row, respectively. Finally, phases 5 and 6 show the remaining off-diagonal tiles being updated in a wavefront pattern.
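The folding of the logical tile grid onto the physical hierarchy can be pictured with a simple modular mapping; this is a hypothetical illustration of one such interleaved policy (the actual assignment used in Fig. 4 may differ).

```python
def map_tile(i, j, num_channels=8, bank_groups_per_channel=4):
    """Hypothetical interleaved placement of logical tile (i, j) onto the
    physical 4x8 organization: column index picks the channel, row index
    picks the bank-group within that channel."""
    channel = j % num_channels
    bank_group = i % bank_groups_per_channel
    return channel, bank_group

# An 8x8 logical tile grid folds onto 8 channels x 4 bank-groups, so two
# logical tile rows share each physical bank-group within a channel.
placements = {(i, j): map_tile(i, j) for i in range(8) for j in range(8)}
assert len(set(placements.values())) == 32  # all 32 bank-groups are used
```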

To evaluate PIM-FW, we use a Ramulator2-based [28] cycle-accurate model to simulate our HBM3-based design. Our implementation targets a standard HBM3 stack with the following configuration: 8 channels per die, each channel comprising 16 banks organized into 4 bank-groups; a 4-Hi die stack with 32K rows per bank, a 1 KB row size, and 512×512-cell subarrays; and a 256-bit-wide DQ bus yielding 1024 TSV lanes and a peak bandwidth of 256 GB/s (2 Gbps/pin at 1.2 V). The DRAM timing parameters used in our simulation are t_RC = 30 ns, t_RCD = 8 ns, t_RAS = 24 ns, t_RRD = 2 ns, t_WR = 12 ns, t_CCD_S = 2 ns, and t_CCD_L = 4 ns; the detailed parameters for our simulated HBM3 stack and PIM engines are summarized in Table 1. This hardware configuration, particularly the 32 total bank-groups, defines our maximum parallelism (N_max,par) and thus limits the maximum graph size to N = 8,192 (where M ≤ 16 and B = 512), which we use as the largest dataset in our evaluation. We estimate the HBM area using CACTI-3DD [30] at the 22 nm technology node. The area and power of our PEs are obtained by synthesizing the RTL with a 65 nm library, with the results subsequently scaled to 22 nm. For the GPU baseline, we report the total energy consumed during execution, measured using the nvidia-smi utility. This provides a comprehensive measure of the entire board's power draw, ensuring a fair system-level comparison.
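A quick arithmetic check of the stated peak bandwidth from the lane count and per-pin data rate (a sanity check, not simulator configuration):

```python
# Peak-bandwidth check from the stated configuration.
tsv_lanes = 1024          # 1024-bit-wide TSV interface
pin_rate_gbps = 2.0       # 2 Gbps per pin
peak_gbps = tsv_lanes * pin_rate_gbps   # 2048 Gb/s
peak_gbs = peak_gbps / 8                # 256 GB/s, matching the text
assert peak_gbs == 256.0
```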

Datasets: We evaluate PIM-FW on three real-world graphs from the SNAP dataset [23] and OpenStreetMap [29]: ca-GrQc (N = 5,242), p2p-Gnutella08 (N = 6,301), and an OpenStreetMap road network (N = 8,192). These graphs are inherently sparse; to align with the FW algorithm's requirements, we represent non-existent edges with infinite weight. While specialized algorithms exist for sparse graphs [31], the dense FW approach is fundamental for problems where graphs are dense or edge weights change frequently, making re-computation on a dense matrix more practical.
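A minimal sketch of the dense-matrix preparation described here (illustrative preprocessing; `edges_to_dense` is a hypothetical helper, and the paper's loader and number format may differ):

```python
import numpy as np

def edges_to_dense(num_vertices, edges):
    """Build the dense FW input: non-existent edges get infinite weight and
    self-distances are zero."""
    D = np.full((num_vertices, num_vertices), np.inf, dtype=np.float32)
    np.fill_diagonal(D, 0.0)
    for u, v, w in edges:
        D[u, v] = min(D[u, v], w)   # keep the lightest parallel edge
    return D

# Toy 4-vertex example.
D = edges_to_dense(4, [(0, 1, 2.0), (1, 2, 3.0), (2, 3, 1.0)])
```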

GPU baselines: Our primary baseline is a highly optimized blocked FW implementation on an NVIDIA A100 (SXM4, 40GB) GPU. Furthermore, we provide a performance projection for the NVIDIA H100 GPU [22]. This projection is a conservative estimate based on the H100's documented architectural advantages, such as significantly higher memory bandwidth and more streaming multiprocessors.

Table 1: Parameters for the APSP accelerator in the HBM3 stack

To validate the performance of PIM-FW on practical applications, we evaluate our design on three large-scale, real-world datasets, specifically chosen to represent diverse network topologies: the ca-GrQc collaboration network (N = 5,242), the p2p-Gnutella08 peer-to-peer network (N = 6,301) [23,29], and the OpenStreetMap road network (N = 8,192). The last dataset is specifically chosen to probe the boundaries of our design, since N_max,par = 8,192 is determined by the design's parallelism constraints: our HBM3 stack provides 32 bank-groups, and since the blocked FW algorithm's wavefront loads 2M tiles simultaneously, the maximum M is limited to 16 (2M ≤ 32). Combined with a block parameter B = 512, this yields the N = 8,192 limit.
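The N = 8,192 limit follows from a short calculation restating the constraint above:

```python
# Maximum-graph-size derivation from the parallelism constraint 2M <= 32.
bank_groups = 8 * 4          # 8 channels x 4 bank-groups per channel
M_max = bank_groups // 2     # wavefront loads 2M tiles at once -> M <= 16
B = 512                      # block (tile) parameter used here
N_max = M_max * B            # 16 * 512 = 8192
assert (M_max, N_max) == (16, 8192)
```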

The selection of these distinct graph structures is intended to demonstrate the broad applicability and robustness of our architecture across varied computational workloads, proving its relevance to real-world scenarios. In our evaluation, these inherently sparse graphs are treated as complete graphs in which non-existent edges are assigned an infinite weight, to align with the requirements of a dense APSP accelerator. Performance projections, extrapolated from our measured results, are presented in Fig. 5. Notably, across all datasets, implementing our mapping policy alone yields a 3.2× speedup, and the hardware implementation yields an additional 5.8× speedup. Overall, our PIM-based architecture achieves a significant speedup of approximately 18.7× over the state-of-the-art NVIDIA A100 GPU. This consistently high performance across topologically different datasets underscores the capability of our PIM-based design to efficiently manage graph analytics workloads.

We assess our architecture's scalability by projecting performance across various graph sizes based on the Floyd-Warshall algorithm's O(N^3) complexity. Using our measured runtimes on the N = 8,192 dataset (592 s for PIM-FW vs. 11,097 s for the A100 GPU) as a baseline, our projections confirm that PIM-FW's significant performance advantage is consistently maintained across all scales. For instance, at N = 1,000, PIM-FW is projected to require just 1.08 seconds, while the A100 GPU would take 20.2 seconds, demonstrating robust scalability for real-world applications.
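The projection can be reproduced with the cubic scaling rule, anchored at the measured N = 8,192 runtimes (a back-of-the-envelope check):

```python
# Runtime projection via the O(N^3) scaling law.
def project(runtime_at_8192_s, n):
    return runtime_at_8192_s * (n / 8192) ** 3

pim_fw_s = project(592.0, 1000)      # ~1.08 s
a100_s   = project(11097.0, 1000)    # ~20.2 s
print(f"PIM-FW: {pim_fw_s:.2f} s, A100: {a100_s:.2f} s")
```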

For a large-scale graph of N = 8,192, the total execution time of our proposed architecture is 592 seconds, while the NVIDIA A100 GPU takes 11,097 seconds to complete the same task. This represents a significant speedup of 18.7× over the GPU baseline. As shown in Fig. 6, this performance advantage holds across all tested real-world datasets.

The energy efficiency of our design is even more pronounced. Across all datasets, the energy required by the GPU is several orders of magnitude higher than that of our PIM design, corresponding to a normalized energy reduction factor of over 3,200×. This exceptional efficiency stems directly from our PIM-based approach, which nearly eliminates the costly data movement between processors and memory that dominates the energy profile of conventional architectures.

While our primary baseline is the A100, we also provide a performance projection for the NVIDIA H100. This projection is an estimate based on the H100's documented architectural advantages, including significantly higher memory bandwidth and more streaming multiprocessors. Based on these factors, we conservatively estimate the H100 to be approximately 4.5× faster than the A100 [22] on this workload. Even against this projected baseline, PIM-FW would maintain a compelling 4.2× speedup. The PIM-FW architecture demonstrates exceptional efficiency in both energy consumption and silicon area overhead, underscoring the feasibility of our approach. Cycle-accurate simulations reveal an over 3,200× energy reduction against state-of-the-art GPUs, with BPEs consuming the majority of compute energy for the core O(N^3) updates. This validates PIM-FW's feasibility and superior efficiency, leveraging the architectural alignment of blocked FW with DRAM's physical hierarchy. The area-overhead analysis shows that our added PIM logic (BPE+CPE) is a negligible fraction of the total HBM3 logic die; the vast majority of the die area is consumed by standard HBM components, including channel/bank-group controllers, TSV interconnects, and other peripheral circuits for functions such as power delivery. Our proposed PIM logic constitutes less than 1% of the total die area. Within this added PIM logic, the CPEs and their interconnect logic account for ~76% of the area, while the massive array of 8,192 highly compact BPEs accounts for the remaining ~24%.
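The projected H100 comparison is simple arithmetic on the two stated factors:

```python
# Projected speedup over the H100: A100 advantage divided by the assumed
# H100-over-A100 factor.
speedup_vs_a100 = 18.7
h100_over_a100 = 4.5
speedup_vs_h100 = speedup_vs_a100 / h100_over_a100
print(f"{speedup_vs_h100:.1f}x")   # ~4.2x
```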

The architecture's coarse-grained parallelism scales near-linearly with the number of independent HBM3 channels. Since each channel operates independently, a higher channel count directly enables more concurrent tile updates and parallel data distribution via the TSV bus, mitigating serialization bottlenecks during broadcast-heavy phases. As shown in Fig. 6 (upper right), scaling from 4 to 32 channels reduces the total processing time by approximately 8×. This result underscores the design's scalability with future HBM standards that may feature more channels. Fine-grained performance is dictated by BPE-level parallelism within each bank-group. Our baseline design is optimized by matching the number of BPEs per bank-group (256) to the tile width (B = 256), which allows a 256-element tile row to be processed in a single computational pass. Halving the number of BPEs to 128 doubles the required passes, increasing execution time by ~1.45×, while quartering them to 64 increases it by ~2.35×. Conversely, increasing the BPEs to 512 provides no additional speedup, as the extra hardware would sit idle. Therefore, 256 BPEs per bank-group represents the optimal design point, maximizing computational throughput without wasting silicon resources. We could continue to increase the BPE count, which might allow processing larger tile widths, but for the overall acceleration of the current workload this would have no significant effect. We also evaluate PIM-FW against other heterogeneous APSP accelerators. For instance, recent work on an FPGA-based solution [32] for N = 8,192 graphs reported a 137× speedup, but this was measured against a CPU-only baseline, with the authors noting that their performance was near GPU levels. In sharp contrast, PIM-FW achieves its 18.7× speedup against a state-of-the-art NVIDIA A100 GPU baseline [27]. Given that the FPGA solution only approaches GPU performance, PIM-FW is therefore at least 18.7× faster than the FPGA solution [32]. Interestingly, both approaches converge on a heterogeneous solution to APSP: the FPGA work utilizes a CPU-FPGA co-design, while our PIM-FW implements a hybrid PIM/PNM architecture (BPE/CPE) to manage the computation and the communication bottleneck.

In this paper, we addressed the significant data-movement bottleneck of the blocked Floyd-Warshall algorithm by introducing PIM-FW, a novel hardware-software co-design for dense APSP graphs. The key to our approach is a pragmatic hybrid architecture: massively parallel in-bank Processing Elements (PIM) handle the core O(N^3) computations, while a more powerful channel-level engine (PNM) manages communication-intensive reduction tasks, all complemented by a static, interleaved data-mapping policy. Our Ramulator2-based [28] cycle-accurate evaluation demonstrates the efficacy of this co-design, showing that PIM-FW achieves a remarkable 18.7× speedup and an over 3,200× energy reduction compared to a highly optimized NVIDIA A100 GPU baseline, while maintaining a compelling 4.2× speedup against a projected next-generation NVIDIA H100. Furthermore, our analysis indicates that while the current BPE count is optimal for our baseline, increasing BPE parallelism could enable the processing of even larger datasets by supporting wider computational tiles (B). These results show that our co-designed PIM approach, which closely couples computation with a tailored data-placement strategy, is a powerful and viable solution for accelerating memory-intensive dynamic-programming workloads.

