Exploring Performance-Productivity Trade-offs in AMT Runtimes: A Task Bench Study of Itoyori, ItoyoriFBC, HPX, and MPI
Asynchronous Many-Task (AMT) runtimes offer a productive alternative to the Message Passing Interface (MPI). However, the diverse AMT landscape makes fair comparisons challenging. Task Bench, proposed by Slaughter et al., addresses this challenge through a parameterized framework for evaluating parallel programming systems. This work integrates two recent cluster AMTs, Itoyori and ItoyoriFBC, into Task Bench for comprehensive evaluation against MPI and HPX. Itoyori employs a Partitioned Global Address Space (PGAS) model with RDMA-based work stealing, while ItoyoriFBC extends it with future-based synchronization. We evaluate these systems in terms of both performance and programmer productivity. Performance is assessed across various configurations, including compute-bound kernels, weak scaling, and both imbalanced and communication-intensive patterns. Performance is quantified using application efficiency, i.e., the percentage of maximum performance achieved, and the Minimum Effective Task Granularity (METG), i.e., the smallest task duration before runtime overheads dominate. Programmer productivity is quantified using Lines of Code (LOC) and the Number of Library Constructs (NLC). Our results reveal distinct trade-offs. MPI achieves the highest efficiency for regular, communication-light workloads but requires verbose, low-level code. HPX maintains stable efficiency under load imbalance across varying node counts, yet ranks last in productivity metrics, demonstrating that AMTs do not inherently guarantee improved productivity over MPI. Itoyori achieves the highest efficiency in communication-intensive configurations while leading in programmer productivity. ItoyoriFBC exhibits slightly lower efficiency than Itoyori, though its future-based synchronization offers potential for expressing irregular workloads.
💡 Research Summary
This paper presents a comprehensive performance and productivity evaluation of four parallel programming systems—MPI, HPX, and the newly integrated cluster‑level AMT runtimes Itoyori and ItoyoriFBC—using the Task Bench benchmark suite. Task Bench generates synthetic task graphs with configurable dependency patterns (stencil, spread, all‑to‑all), task granularity, and compute intensity, allowing the isolation of runtime characteristics such as scheduling overhead and communication latency.
All four implementations were run on the Goethe‑NHR supercomputer (dual Intel Xeon Gold 6148 nodes, InfiniBand interconnect). MPI serves as the baseline static, bulk‑synchronous implementation; HPX provides a C++‑standard AMT with intra‑node threading and dynamic work stealing; Itoyori combines a PGAS memory model with RDMA‑based random work stealing (RDWS) and a nested fork‑join programming style; ItoyoriFBC replaces the fork‑join model with future‑based cooperation (FBC), enabling explicit DAG‑style dependencies.
Performance metrics include application efficiency (achieved FLOP/s divided by theoretical peak) and Minimum Effective Task Granularity (METG), defined as the smallest task duration that retains 50 % efficiency. Productivity is measured by Lines of Code (LOC) and Number of Library Constructs (NLC). Results show that MPI achieves the highest efficiency for regular, communication‑light workloads but suffers dramatically under load imbalance, dropping to about 85 % efficiency at the worst imbalance factor. HPX maintains stable efficiency under imbalance thanks to its work‑stealing scheduler, yet it records the highest LOC (224) and NLC (23), indicating low programmer productivity.
In communication‑intensive scenarios (spread and all‑to‑all graphs), Itoyori outperforms all others, reaching over 89 % efficiency even with 40 dependencies, thanks to its PGAS cache and RDMA‑driven task distribution. ItoyoriFBC attains slightly lower efficiency but offers future‑based synchronization that can naturally express irregular DAGs, suggesting potential for further gains if the imposed barrier between time steps is removed.
METG analysis reveals that Itoyori and ItoyoriFBC require coarser task granularity (≈2¹⁴ iterations) to mask the ≈3 µs work‑stealing latency, whereas MPI and HPX sustain efficiency at finer granularity due to static partitioning and shared‑memory scheduling.
Productivity measurements show Itoyori’s checkout/checkin API dramatically reduces code size to 77 LOC and 14 NLC, roughly a 50 % reduction compared with MPI’s 137 LOC and 11 NLC. ItoyoriFBC’s future‑based model results in 115 LOC but maintains a low NLC of 11. HPX’s reliance on explicit MPI calls inflates both LOC and NLC.
The study concludes that AMT runtimes do not automatically provide higher productivity than MPI; the choice of runtime must consider workload characteristics. MPI remains optimal for regular, low‑communication tasks; HPX excels under load imbalance but at a productivity cost; Itoyori offers the best balance of efficiency and programmer ease for communication‑heavy applications; and ItoyoriFBC, while slightly less efficient, provides a flexible future‑based model suited to irregular workloads. Future work should explore barrier‑free execution in ItoyoriFBC and alternative HPX parcel ports to further improve scalability and productivity.