Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Optimizing resource utilization in high-performance computing (HPC) clusters is essential for maximizing both system efficiency and user satisfaction. However, traditional rigid job scheduling often results in underutilized resources and increased job waiting times. This work evaluates the benefits of resource elasticity, where the job scheduler dynamically adjusts the resource allocation of malleable jobs at runtime. Using real workload traces from the Cori, Eagle, and Theta supercomputers, we simulate varying proportions (0-100%) of malleable jobs with the ElastiSim software. We evaluate five job scheduling strategies, including a novel one that maintains malleable jobs at their preferred resource allocation when possible. Results show that, compared to fully rigid workloads, malleable jobs yield significant improvements across all key metrics. Considering the best-performing scheduling strategy for each supercomputer, job turnaround times decrease by 37-67%, job makespan by 16-65%, job wait times by 73-99%, and node utilization improves by 5-52%. Although improvements vary, gains remain substantial even at 20% malleable jobs. This work highlights important correlations between workload characteristics (e.g., job runtimes and node requirements), malleability proportions, and scheduling strategies. These findings confirm the potential of malleability to address inefficiencies in current HPC practices and demonstrate that even limited adoption can provide substantial advantages, encouraging its integration into HPC resource management.


💡 Research Summary

This paper investigates the impact of introducing resource elasticity—specifically, malleable jobs that can change their allocated number of compute nodes at runtime—on the performance of high‑performance computing (HPC) clusters. Using real workload traces from three large‑scale supercomputers (Cori, Eagle, and Theta), the authors conduct a systematic simulation study with the open‑source ElastiSim framework.

The methodology begins with extensive preprocessing of the raw logs. In the case of the Cori Haswell partition, shared‑node jobs and daily split entries artificially inflate node utilization; these anomalies are merged and removed to obtain a realistic dataset that respects the actual 2,388‑node capacity. The other partitions (Cori KNL, Eagle, Theta) require minimal cleaning. Missing time limits are inferred as 12.5 % of the observed runtimes, and each job is enriched with minimum, maximum, and preferred node counts derived from a speed‑up model and efficiency thresholds taken from prior work.
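The node-range enrichment step can be sketched as follows. The summary does not give the paper's exact speed-up model or efficiency thresholds, so the Amdahl-style model, the serial fraction, and the 0.5/0.75 efficiency cut-offs below are illustrative assumptions, not the authors' actual values:

```python
def amdahl_speedup(n, serial_frac):
    """Assumed Amdahl-style speed-up on n nodes: S(n) = 1 / (f + (1 - f)/n)."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

def largest_efficient_nodes(serial_frac, eff_threshold, n_cap=4096):
    """Largest node count whose parallel efficiency S(n)/n stays at or above
    the threshold (efficiency decreases monotonically in n under this model)."""
    best = 1
    for n in range(1, n_cap + 1):
        if amdahl_speedup(n, serial_frac) / n >= eff_threshold:
            best = n
        else:
            break
    return best

def enrich_job(serial_frac, min_eff=0.5, pref_eff=0.75):
    """Return (min, preferred, max) node counts for one job.
    The min bound is fixed at 1 here for simplicity; the paper derives
    all three bounds from its own model and thresholds."""
    max_nodes = largest_efficient_nodes(serial_frac, min_eff)
    pref_nodes = largest_efficient_nodes(serial_frac, pref_eff)
    return 1, pref_nodes, max_nodes
```

For example, a job with a 5 % serial fraction would be allowed up to 21 nodes at 50 % efficiency and prefer 7 nodes at 75 % efficiency under these assumed thresholds.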

Five scheduling policies are evaluated: a baseline rigid EASY‑Backfill that treats all jobs as fixed‑size, the three malleable strategies Avg, Min, and Pref proposed in earlier studies, and a novel strategy called KeepPref. All malleable policies operate in three phases: (i) job start, where the number of nodes assigned is determined by the policy’s priority function; (ii) shrinking of running jobs when idle nodes are insufficient to start new jobs; and (iii) expanding running jobs to consume any remaining idle nodes. The priority functions differ: Avg uses the relative position within the allowed node range, Min focuses on surplus over the minimum, Pref targets the distance from a user‑specified preferred size, and KeepPref explicitly tries to keep each job at its preferred size, shrinking only jobs that exceed it.
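The four priority functions described above can be sketched as shrink-candidate scores: the scheduler shrinks the highest-scoring jobs first. The exact formulas are not given in the summary, so the definitions below are plausible readings of the prose, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class MalleableJob:
    min_nodes: int
    max_nodes: int
    pref_nodes: int
    current_nodes: int

def avg_priority(job):
    # Avg: relative position within the allowed node range; jobs sitting
    # high in their range are shrunk first.
    span = job.max_nodes - job.min_nodes
    return (job.current_nodes - job.min_nodes) / span if span else 0.0

def min_priority(job):
    # Min: surplus over the minimum allocation.
    return job.current_nodes - job.min_nodes

def pref_priority(job):
    # Pref: distance from the user-specified preferred size.
    return job.current_nodes - job.pref_nodes

def keep_pref_priority(job):
    # KeepPref: only jobs above their preferred size are shrink candidates;
    # jobs at or below it are left alone (None = not a candidate).
    surplus = job.current_nodes - job.pref_nodes
    return surplus if surplus > 0 else None
```

Under this reading, a scheduler would sort running malleable jobs by the chosen score and release nodes from the top of the list until the pending job fits.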

Experiments vary the proportion of malleable jobs from 0 % to 100 % in 10 % increments. For each proportion, ten simulation runs with different random seeds are performed, and results are reported as averages with inter‑quartile ranges (IQR) to capture the non‑normal distribution of workload metrics. A warm‑up period of 12 hours and a drain‑down period after the last submission are excluded from the analysis. Tick intervals (1 s for the small Haswell workload, 10 s for larger workloads) approximate the overhead of node addition/removal (2–4 s) without explicitly modeling reconfiguration costs.
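The experiment sweep and its IQR-based reporting can be sketched as below; the `simulate` callable stands in for an ElastiSim run and is a placeholder, not the framework's real API:

```python
import statistics

WARMUP_S = 12 * 3600  # 12-hour warm-up excluded from the analysis

def summarize(metric_per_run):
    """Average a metric across the seeded runs and report its
    inter-quartile range, as the paper does for non-normal data."""
    mean = statistics.mean(metric_per_run)
    q1, _, q3 = statistics.quantiles(metric_per_run, n=4)
    return mean, q3 - q1

def sweep(simulate, seeds=range(10)):
    """Vary the malleable-job proportion from 0% to 100% in 10% steps,
    running one simulation per (proportion, seed) pair.
    simulate(pct, seed) -> one scalar metric (e.g. mean turnaround time)."""
    results = {}
    for pct in range(0, 101, 10):
        runs = [simulate(pct, seed) for seed in seeds]
        results[pct] = summarize(runs)
    return results
```

Reporting the IQR alongside the mean is the natural choice here because workload metrics such as wait time are heavily right-skewed, so a standard deviation would be dominated by a few outlier jobs.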

The results are strikingly consistent across all three supercomputers. In the Cori Haswell workload, increasing the malleable fraction reduces average job turnaround time from 2,391 s (all rigid) to 792 s at 100 % malleability—a 66 % improvement. Makespan shrinks by up to 65 %, and average wait time drops below 10 s once the malleable share exceeds 40 %. Node utilization climbs from 72 % to 99 %, indicating near‑full packing of the system. Similar trends appear in the KNL partition, where the KeepPref policy yields the highest utilization while maintaining low wait times. Eagle and Theta, despite having different job‑size distributions, also benefit: even a modest 20 % malleable mix yields 30 %–50 % reductions in wait time and 10 %–20 % gains in utilization.

Policy‑specific analysis shows that KeepPref generates the most expansion operations (≈11 per job) because it strives to keep jobs at their preferred size, whereas Avg performs fewer, steadier expansions (≈5 per job). Shrink operations are most frequent at low malleable fractions (≈25 per job) and decline for Avg and Pref as the malleable share grows, while Min and KeepPref retain higher shrink rates, reflecting a more aggressive rebalancing approach.

The authors acknowledge that the simulation does not model detailed application‑level scaling costs, but the tick‑based overhead approximates realistic reconfiguration delays. They argue that even with additional real‑world costs, the observed performance gains would remain substantial, likely exceeding a 10 % net improvement.

Limitations are discussed: most MPI applications lack native support for dynamic node changes, and the study focuses solely on CPU nodes, excluding GPUs and storage. Future work should explore multi‑resource elasticity, develop standardized APIs for job resizing, and validate the findings on production systems.

In conclusion, the paper provides strong empirical evidence that introducing malleable jobs into HPC schedulers can dramatically reduce queue waiting times, shorten overall job execution, and raise node utilization. Importantly, the benefits appear even at low adoption levels (≈20 % malleable jobs), suggesting that incremental deployment of elasticity features could be a cost‑effective path toward more efficient supercomputing operations.

