EDAN: Towards Understanding Memory Parallelism and Latency Sensitivity in HPC


Resource disaggregation is a promising technique for improving the efficiency of large-scale computing systems. However, this comes at the cost of increased memory access latency due to the need to rely on the network fabric to transfer data between remote nodes. As such, it is crucial to ascertain an application’s memory latency sensitivity to minimize the overall performance impact. Existing tools for measuring memory latency sensitivity often rely on custom ad-hoc hardware or cycle-accurate simulators, which can be inflexible and time-consuming. To address this, we present EDAN (Execution DAG Analyzer), a novel performance analysis tool that leverages an application’s runtime instruction trace to generate its corresponding execution DAG. This approach allows us to estimate the latency sensitivity of sequential programs and investigate the impact of different hardware configurations. EDAN not only provides us with the capability of calculating the theoretical bounds for performance metrics, but it also helps us gain insight into the memory-level parallelism inherent to HPC applications. We apply EDAN to applications and benchmarks such as PolyBench, HPCG, and LULESH to unveil the characteristics of their intrinsic memory-level parallelism and latency sensitivity.


💡 Research Summary

The paper introduces EDAN (Execution DAG Analyzer), a software tool designed to evaluate memory‑latency sensitivity and memory‑level parallelism (MLP) of high‑performance computing (HPC) applications in the context of resource disaggregation. Disaggregated memory, while improving utilization and energy efficiency, inevitably adds network‑induced latency to every remote memory access. Quantifying how much an application suffers from this added latency is essential for both system architects (who must size interconnects, caches, and issue slots) and programmers (who must design latency‑tolerant algorithms).

Traditional approaches rely on custom hardware platforms or cycle‑accurate simulators such as gem5. Although accurate, these methods are prohibitively slow (often 100×–900× slower than native execution) and require substantial setup effort, making large‑scale parametric studies impractical. EDAN circumvents these drawbacks by exploiting a single execution trace of the target binary. The trace is collected with a QEMU user‑mode Tiny Code Generator (TCG) plugin, which translates each target instruction to host instructions on the fly, thereby achieving tracing speeds only 5×–10× slower than native execution—orders of magnitude faster than gem5.

From the instruction trace, EDAN constructs an execution Directed Acyclic Graph (eDAG). In this graph each vertex corresponds to a concrete dynamic instruction, and directed edges capture true data dependencies (read‑after‑write); false dependencies (write‑after‑read, write‑after‑write) are already resolved by the runtime register renaming implicit in the trace. Unlike static computational DAGs derived from source code, eDAGs reflect the exact runtime ordering, thus preserving the real dependency structure. The authors formalize the eDAG as G = (V, E) and define two classic quantities: total work T₁ = Σ₍ᵥ∈V₎ t(v) (the sum of per‑instruction latencies) and span T∞ (the length of the critical path). The ratio T₁/T∞ gives the average parallelism available to the program.
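The work and span of an eDAG can be computed with a single topological pass; the sketch below illustrates the idea on a toy graph (vertex names, latencies, and the `work_and_span` helper are illustrative, not part of EDAN's actual API):

```python
from collections import defaultdict, deque

def work_and_span(latency, edges):
    """Compute total work T1 and span T_inf of a DAG.

    latency: dict mapping vertex -> per-instruction latency t(v)
    edges:   list of (u, v) true-dependency edges, u executed before v
    """
    t1 = sum(latency.values())                 # T1 = sum of t(v) over V

    succs = defaultdict(list)
    indeg = {v: 0 for v in latency}
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1

    # Longest weighted path via Kahn's algorithm (topological order).
    dist = {v: latency[v] for v in latency}    # longest path ending at v
    queue = deque(v for v in latency if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        for v in succs[u]:
            dist[v] = max(dist[v], dist[u] + latency[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    t_inf = max(dist.values())                 # span = critical-path length
    return t1, t_inf

# A 3-instruction chain plus one independent instruction, unit latency:
# T1 = 4, T_inf = 3, so average parallelism T1/T_inf = 4/3.
t1, t_inf = work_and_span({'a': 1, 'b': 1, 'c': 1, 'd': 1},
                          [('a', 'b'), ('b', 'c')])
```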

The novel contribution lies in extending Brent’s lemma to memory accesses. The authors propose a memory‑cost model parameterized by the number of memory‑issue slots (pₘ) and a cache model (hit/miss latencies). The model yields a lower bound and an upper bound on the total memory time required to execute the eDAG, from which they derive two latency‑sensitivity metrics: (1) a bound on how much performance degrades per unit of added memory latency, and (2) a tolerance bound indicating how many extra cycles the program can absorb before becoming bottlenecked. These metrics are computed analytically from the eDAG topology and the cache‑hit statistics collected during tracing.

The EDAN toolchain consists of three stages: (1) program tracing (QEMU‑TCG plugin), (2) eDAG generation (Python parser that builds vertices, resolves dependencies, and simulates a simple cache hierarchy), and (3) performance analysis (application of the Brent‑derived formulas). The implementation targets RISC‑V because of its simplicity and growing ecosystem, but the modular design allows straightforward extension to other ISAs by swapping the tracer and adjusting the ISA‑specific vertex generation logic.
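The dependency-resolution step of stage (2) can be sketched with a last-writer map: each instruction's source registers are linked back to the instruction that last wrote them, yielding exactly the read-after-write edges. The trace format below (`(dest_reg, src_regs)` tuples) is a simplified, hypothetical stand-in for the RISC-V disassembly the real parser consumes:

```python
def build_edag_edges(trace):
    """Derive RAW dependency edges from a dynamic instruction trace.

    trace: list of (dest_reg, src_regs) tuples in execution order.
           Vertices are trace indices; an edge (i, j) means instruction j
           reads a register last written by instruction i.
    """
    last_writer = {}   # register name -> index of its most recent writer
    edges = []
    for i, (dest, srcs) in enumerate(trace):
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], i))   # true dependency
        if dest is not None:
            last_writer[dest] = i   # later reads of `dest` depend on i
    return edges

# x1 <- load; x2 <- f(x1); x3 <- g(x1, x2)
edges = build_edag_edges([('x1', []), ('x2', ['x1']), ('x3', ['x1', 'x2'])])
# edges == [(0, 1), (0, 2), (1, 2)]
```

Because each register read is resolved against the *most recent* writer, write-after-read and write-after-write hazards never produce edges, matching the true-dependency structure described above.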

Experimental evaluation covers a representative set of HPC kernels: the full PolyBench suite (≈30 kernels), the HPCG benchmark, and the LULESH mini‑app. For each benchmark the authors vary the number of memory‑issue slots from 1 to 8 and also explore different L1/L2 cache sizes. The results illustrate clear patterns: kernels with high intrinsic MLP (e.g., dense matrix multiplication, SYRK, Gram‑Schmidt) show a steep reduction in span as issue slots increase, indicating low latency sensitivity. Conversely, kernels dominated by sequential memory streams (e.g., vector sum, Durbin) exhibit modest span reduction, confirming higher sensitivity. Cache enlargements improve hit rates and further shrink the span for many kernels, aligning the analytical bounds with observed speed‑ups.

Key strengths of EDAN are:

  1. Speed – a single traced execution suffices to explore many hardware configurations, eliminating the need for repeated slow simulations.
  2. Granularity – the eDAG captures true data dependencies, enabling precise identification of parallelism opportunities and bottlenecks.
  3. Analytical Insight – the Brent‑based model provides closed‑form bounds, offering designers immediate intuition about the impact of issue‑slot count, cache size, and added network latency.
  4. Open‑source orientation – built on RISC‑V and QEMU, the framework can be adapted to other architectures with modest effort.

Limitations include: the current methodology assumes a sequential (single‑threaded) execution model, so it does not directly account for contention, coherence traffic, or non‑deterministic interleavings present in multi‑core or distributed runs. Moreover, the cache model processes memory accesses in the order they appear in the trace, ignoring alternative reorderings that could affect miss rates; exhaustive exploration of all topological sorts would be computationally infeasible.
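The trace-order cache replay described above amounts to feeding each access through the cache model in the sequence it appears, as in this minimal direct-mapped sketch (EDAN's actual hierarchy is more detailed; the function and parameter names here are illustrative):

```python
def hit_rate(addresses, num_lines=64, line_bytes=64):
    """Replay memory accesses in trace order through a direct-mapped cache.

    addresses: byte addresses in the order they appear in the trace.
    Returns the fraction of accesses that hit.
    """
    cache = [None] * num_lines           # tag (line number) stored per set
    hits = 0
    for addr in addresses:
        line = addr // line_bytes        # which cache line the address maps to
        idx = line % num_lines           # direct-mapped: one candidate set
        if cache[idx] == line:
            hits += 1
        else:
            cache[idx] = line            # miss: fill the line
    return hits / len(addresses)

# Three accesses within one 64-byte line: one cold miss, then two hits.
rate = hit_rate([0, 8, 16])              # -> 2/3
```

Replaying a different topological order of the same eDAG could map the same accesses to a different hit/miss pattern, which is precisely the reordering effect the authors note is left unexplored.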

Future work suggested by the authors involves extending the eDAG to multi‑threaded programs (capturing synchronization edges), refining the handling of non‑true dependencies, integrating more sophisticated cache and network latency models (including queuing effects), and validating the predictions against real disaggregated‑memory prototypes.

In summary, EDAN offers a practical, fast, and analytically grounded approach to quantify memory latency sensitivity and MLP in HPC workloads, filling a gap between heavyweight simulators and coarse‑grained empirical methods, and providing actionable data for both software developers and hardware architects planning disaggregated memory systems.

