Improving the Load Balancing Performance of Vlasiator
This whitepaper describes the load-balancing performance issues observed and tackled during the petascaling of the Vlasiator code. Vlasiator is a Vlasov-hybrid simulation code developed at the Finnish Meteorological Institute (FMI). It models the communication associated with the spatial grid as a hypergraph and partitions the grid using the parallel hypergraph partitioning scheme (PHG) of the Zoltan partitioning framework; the resulting partition determines the distribution of grid cells to processors. The partitioning phase is observed to take a substantial percentage of the overall computation time. Alternative graph-partitioning-based schemes are proposed and investigated that perform almost as well as the hypergraph partitioning scheme while requiring less preprocessing overhead and achieving better balance. Zoltan’s PHG, ParMeTiS, and PT-SCOTCH are compared in terms of their effect on running time, preprocessing overhead, and load-balancing quality. Test results on the Jülich BlueGene/P cluster are presented.
💡 Research Summary
The paper addresses a critical performance bottleneck in the Vlasiator code, a Vlasov‑hybrid plasma simulation framework developed at the Finnish Meteorological Institute. Vlasiator models the six‑dimensional phase space (three spatial dimensions plus three velocity dimensions) on a structured grid, requiring on the order of one million spatial cells and a comparable number of time steps for realistic near‑Earth space weather simulations. At petascale, the distribution of grid cells among thousands of MPI processes must be balanced both in computational load and communication volume.
Historically, Vlasiator has relied on the Parallel HyperGraph (PHG) partitioner from the Zoltan library. PHG builds a hypergraph where each hyperedge represents the set of processes that need to exchange data for a given cell, thereby capturing the true communication pattern. The partitioner then minimizes the cut weight, producing a mapping of cells to processes. While this approach yields high‑quality partitions, the authors observed that the partitioning phase consumes a substantial fraction of the total runtime—often more than 30 %—and that its cost grows sharply with the number of cores.
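The cut weight that a hypergraph partitioner minimizes can be illustrated with the standard connectivity−1 metric. The sketch below is illustrative only; the function and variable names are hypothetical, not Vlasiator or Zoltan APIs.

```python
# Sketch of the connectivity-1 cut metric that hypergraph partitioners
# such as Zoltan's PHG minimize. Names here are illustrative.

def connectivity_cut(hyperedges, part_of):
    """Sum over hyperedges of (number of parts spanned - 1).

    hyperedges: iterable of cell-id lists; each list is a set of cells
                that must exchange data together (conceptually, one
                hyperedge per cell and its neighbourhood).
    part_of:    dict mapping cell id -> assigned part (process rank).
    """
    cost = 0
    for edge in hyperedges:
        parts = {part_of[c] for c in edge}
        cost += len(parts) - 1  # an edge contained in one part costs 0
    return cost

# Tiny example: 4 cells in a row, one hyperedge per cell-plus-neighbours.
edges = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
print(connectivity_cut(edges, {0: 0, 1: 0, 2: 1, 3: 1}))  # -> 2
```

Only the two hyperedges that straddle the part boundary contribute to the cut, which is why this metric tracks actual communication volume more faithfully than a plain edge cut.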
To mitigate this overhead, the authors explore graph‑based partitioning as a cheaper alternative. Graph partitioners such as ParMeTiS and PT‑SCOTCH do not model hyperedges explicitly; instead they assign scalar weights to edges that approximate communication volume. The authors argue that because Vlasiator’s grid is regular and the communication pattern is relatively uniform, a simple weight approximation is sufficient to produce partitions of comparable quality. They integrate ParMeTiS and PT‑SCOTCH into Vlasiator via the Zoltan interface, which requires converting Zoltan’s internal data structures to the format expected by the graph libraries and back after partitioning.
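For a regular grid, the graph approximation is straightforward: each shared cell face becomes one weighted edge. The sketch below (hypothetical names, not Vlasiator code) builds such a weighted adjacency structure in 2D, the kind of input a graph partitioner like ParMeTiS or PT-SCOTCH consumes.

```python
# Illustrative sketch: approximating a regular spatial grid's
# communication pattern as a weighted graph. Not Vlasiator code.

def grid_graph(nx, ny, face_weight=1):
    """Adjacency lists with scalar edge weights for an nx*ny grid.

    Each face shared by two cells becomes one graph edge; its weight
    approximates the data volume exchanged across that face.
    """
    def cid(i, j):
        return i * ny + j

    adj = {cid(i, j): [] for i in range(nx) for j in range(ny)}
    for i in range(nx):
        for j in range(ny):
            for di, dj in ((1, 0), (0, 1)):  # right and up neighbours
                ni, nj = i + di, j + dj
                if ni < nx and nj < ny:
                    adj[cid(i, j)].append((cid(ni, nj), face_weight))
                    adj[cid(ni, nj)].append((cid(i, j), face_weight))
    return adj

g = grid_graph(3, 3)
print(len(g), sum(len(v) for v in g.values()) // 2)  # 9 vertices, 12 edges
```

Because every interior face carries the same weight on a uniform grid, the scalar weights lose little information relative to the hypergraph model, which is the authors' argument for why the cheaper formulation suffices here.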
Experiments were conducted on the Jülich BlueGene/P system, scaling from 1 K to 4 K cores. Two problem sizes were used: a “weak‑scaling” configuration with 32 × 32 × 16 cells (≈16 K cells) and a “strong‑scaling” configuration with 32 × 32 × 1 cells (≈1 K cells). For each configuration the authors measured (1) preprocessing time (the partitioning step), (2) total execution time (including preprocessing), and (3) load‑balance quality metrics (computational load variance, total communication volume, and communication load balance).
Key findings include:
- Preprocessing overhead is substantially lower for the graph‑based tools: PHG’s partitioning time is roughly 30 %–40 % higher than that of ParMeTiS or PT‑SCOTCH across all core counts.
- In weak‑scaling runs the total runtime is dominated by the simulation kernel, so differences between partitioners are modest; however, the runtime still shows a slight increase with core count regardless of the partitioner used.
- In strong‑scaling runs, where the problem size is fixed and the number of cores doubles, the overall runtime drops by about a factor of four, and PT‑SCOTCH consistently yields the greatest reduction (≈25 % faster than PHG).
- Load‑balance quality metrics are essentially identical for all three partitioners, confirming that the graph approximation does not degrade the quality of the partition in this application.
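Two of the quality metrics compared above can be sketched concretely. The helper names below are hypothetical (not from the paper's code): a max/average computational load-imbalance ratio and the total edge-cut communication volume.

```python
# Sketch of load-balance quality metrics of the kind compared in the
# study. Hypothetical helper names, illustrative only.

def load_imbalance(part_loads):
    """Max/average ratio of per-process load; 1.0 means perfect balance."""
    return max(part_loads) / (sum(part_loads) / len(part_loads))

def comm_volume(adj, part_of):
    """Total communication volume: sum of edge weights crossing parts.

    adj:     dict of vertex -> list of (neighbour, weight) pairs.
    part_of: dict of vertex -> assigned part.
    """
    vol = 0
    for u, nbrs in adj.items():
        for v, w in nbrs:
            if u < v and part_of[u] != part_of[v]:  # count each edge once
                vol += w
    return vol

loads = [100, 110, 95, 105]
print(round(load_imbalance(loads), 3))  # -> 1.073
adj = {0: [(1, 2)], 1: [(0, 2), (2, 2)], 2: [(1, 2)]}
print(comm_volume(adj, {0: 0, 1: 0, 2: 1}))  # -> 2
```

The finding that all three partitioners score essentially the same on such metrics is what justifies choosing the partitioner with the lowest preprocessing cost.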
An important observation is that the Zoltan interface itself introduces a non‑trivial conversion cost: data structures must be transformed to Zoltan’s internal format, handed to the graph library, and then transformed back. This conversion accounts for a noticeable portion of the preprocessing time. The authors suggest that embedding native graph‑partitioning support directly into Vlasiator would eliminate this overhead and further improve scalability.
In conclusion, the study demonstrates that for Vlasiator—despite its complex six‑dimensional physics—the regularity of its computational mesh makes graph‑based partitioning a viable and more efficient alternative to hypergraph partitioning. The reduced preprocessing cost translates into measurable overall speed‑ups, especially in strong‑scaling scenarios. Future work will explore newer versions of ParMeTiS, deeper integration of PT‑SCOTCH, and a native graph‑partitioning implementation within Vlasiator to fully exploit the performance benefits uncovered in this work.