Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems
Glen MacLachlan∗ (maclach@gwu.edu), Joseph Creech∗ (jcreech@gwu.edu), Rubeel Muhammad Iqbal (rubeel@gwu.edu), Clark Gaylord (cgaylord@gwu.edu), and Jake Messick (jake_messick@gwu.edu)
The George Washington University, Washington, DC, USA

Abstract
Migrating heterogeneous high-performance computing (HPC) systems to resource-aware scheduling introduces both technical and behavioral challenges, particularly in production environments with established user workflows. This paper presents a case study of transitioning a production academic HPC cluster from node-exclusive to consumable resource scheduling mid-lifecycle, without disrupting active workloads. We describe an operational strategy combining a time-bounded compatibility layer, observability-driven feedback, and targeted user engagement to guide adoption of explicit resource declaration. This approach protected active research workflows throughout the transition, avoiding the disruption that a direct cut-over would have imposed on the user community. Following deployment, median queue wait times fell from 277 minutes to under 3 minutes for CPU workloads and from 81 minutes to 3.4 minutes for GPU workloads. Users who adopted TRES-based submission exhibited strong long-term retention. These results demonstrate that successful scheduling transitions depend not only on system configuration, but on aligning observability, user engagement, and operational design.
CCS Concepts
• Distributed computing methodologies; • Scheduling; • Network monitoring

Keywords
HPC, Slurm, TRES, resource scheduling, cluster operations, observability

ACM Reference Format:
Glen MacLachlan, Joseph Creech, Rubeel Muhammad Iqbal, Clark Gaylord, and Jake Messick. 2026. Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems. In . ACM, New York, NY, USA, 4 pages. https://doi.org/XXXXXXX.XXXXXXX

∗ Both authors contributed equally to this research.

Conference'17, Washington, DC, USA
2026. ACM ISBN 978-x-xxxx-xxxx-x/YYYY/MM
https://doi.org/XXXXXXX.XXXXXXX

1 Introduction
Efficient resource allocation in heterogeneous high-performance computing (HPC) systems relies on scheduling policies that explicitly expose compute resources, including CPUs, memory, and GPUs, to both the scheduler and users, a challenge widely studied in parallel job scheduling [1].

This paper presents a production case study of migrating a heterogeneous HPC cluster from node-exclusive to consumable resource scheduling using Slurm Trackable Resources (TRES) [5, 8]. The emphasis is on both the scheduler configuration and an operational strategy for deploying potentially disruptive changes in a live production environment while maintaining uninterrupted service.

The George Washington University's flagship HPC system, Pegasus [2], is a production cluster supporting a large interdisciplinary research community. At the time of migration, Pegasus comprised approximately 205 compute nodes (42 GPU nodes); 8,600 CPU cores; 280 TB aggregate system memory; 2 PB of storage; and 800 active researchers.

Over time, the node-exclusive scheduling model produced several structural inefficiencies, including suboptimal node packing, increased queue wait times, and misaligned resource allocation patterns.
This exposed both technical and behavioral challenges: submission patterns created measurable inefficiency, and transitioning researchers toward explicit resource declaration required deliberate operational design. We present a production deployment strategy that combines a phased transition with observability-driven feedback to enable this transition without disrupting active workloads. Aligning scheduler configuration with researcher-facing operational practices leads to substantial improvements in queue wait times, resource utilization, and sustained adoption of explicit resource requests.

To address the technical component, the cluster migrated to Slurm's Trackable Resources model with cgroup enforcement and GPU accounting through Slurm's Generic Resource (GRES) framework [6, 8]. While the technical implementation itself is well documented, the operational challenge centered on user adoption. The remainder of the paper focuses on the transition strategy, observability mechanisms, and operational lessons learned during the migration.

2 Legacy Model Friction and Transition Strategy
Under Pegasus's original node-exclusive scheduling discipline, the only mechanism available to differentiate resource classes was partition assignment. Over time, this produced a proliferation of specialized partitions: high-memory nodes, GPU variants, debug queues, visualization nodes, and high-throughput configurations. Each additional partition incrementally exposed resource capabilities, while also increasing the perceived complexity of the submission environment for users. Under consumable resource scheduling, such distinctions are expressed through per-job resource requests rather than partition identity.
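As a concrete illustration of this shift, a submission that under the legacy model encoded its needs in a partition name can instead declare resources per job. The partition name, resource values, and application below are hypothetical, chosen for illustration rather than taken from Pegasus's actual configuration:

```
#!/bin/bash
# Legacy style: resource needs implied by partition choice, e.g.
#   #SBATCH --partition=highmem        (hypothetical partition name)
#
# Resource-aware style: needs declared explicitly per job
#SBATCH --cpus-per-task=8      # consumable CPU request
#SBATCH --mem=64G              # memory request, enforced via cgroups
#SBATCH --gres=gpu:1           # GPU request through the GRES framework
#SBATCH --time=04:00:00

srun ./my_analysis             # placeholder application
```

With requests expressed this way, the scheduler can pack multiple jobs onto a node rather than inferring requirements from partition identity.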
The resulting complexity reinforced several behaviors: • Partition selection based on inferred resource availability rather than resource-as-declaration • GP U allocation without corresponding GP U utilization • Low core utilization per node due to implicit node exclusivity assumptions Pre-migration telemetry r evealed consistent patterns of lo w CP U packing eciency , frequent allocation of GP U resources without uti- lization, and queue congestion driv en by node exclusivity . These in- eciencies are consistent with known limitations of node-exclusive scheduling, including resource fragmentation and suboptimal uti- lization. Similar ineciencies have been observed in large-scale GP U cluster environments, where mismatches between requested and actual resource usage lead to under-utilization [ 3 ]. These obser- vations motivated the nee d for a transition strategy that addressed both scheduling policy and user submission b ehavior . Slurm’s T rackable Resources (TRES) model, implemented via the select/cons_tres plugin, is enabled or disable d cluster-wide [ 5 ]. As a result, changes to the scheduling model must be applied uni- formly , requiring operational strategies to support user transition in a live pr oduction environment. A direct cutov er was therefore rejected in favor of an approach that would allow the research community to adapt without interruption to active workloads. T o reduce user friction, we implemented a 90-day transition pe- riod during which legacy submission patterns were permitted but monitored. This approach preserved continuity of service while al- lowing users to adapt to e xplicit resour ce declarations. Lightw eight wrapper scripts translated legacy job submissions into resource- aware equivalents where possible, enabling existing workows to continue functioning while guiding users towar d explicit CP U, memory , and GP U requests. 
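The paper does not reproduce the wrapper scripts themselves; the following is a minimal sketch of how such a compatibility shim might work, injecting explicit resource defaults when a legacy submission omits them. The script name and default values are illustrative assumptions, not the production implementation:

```shell
#!/usr/bin/env bash
# submit-compat.sh -- hypothetical compatibility wrapper (sketch only).
# Appends explicit resource defaults when a legacy submission omits them,
# so the job remains schedulable under select/cons_tres.

add_defaults() {
    local have_mem=0 have_cpus=0
    for arg in "$@"; do
        case "$arg" in
            --mem|--mem=*)        have_mem=1 ;;   # memory already declared
            -c|--cpus-per-task=*) have_cpus=1 ;;  # CPUs already declared
        esac
    done
    local extra=()
    [ "$have_mem"  -eq 0 ] && extra+=("--mem=4G")            # illustrative default
    [ "$have_cpus" -eq 0 ] && extra+=("--cpus-per-task=1")   # illustrative default
    echo "${extra[@]}" "$@"
}

# A real wrapper would hand the augmented argument list to Slurm:
#   exec sbatch $(add_defaults "$@")
```

Because the wrapper only adds flags the user did not supply, jobs that already declare resources pass through unchanged, which is what makes the 90-day window non-disruptive for early adopters.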
These wrappers were time-bounded by the 90-day window, giving users a structured path toward native TRES-based submission patterns. During the transition, legacy submission patterns were permitted but monitored, and users received targeted feedback on historical inefficiencies. To ease user adoption, documentation was produced that emphasized resource declaration over partition identity, and office hours and town halls contextualized the change in terms of throughput and fairness.

3 Observability as Stabilization Mechanism
System observability played a critical role in evaluating the impact of the scheduling migration. Monitoring combined Slurm [7] accounting data with Zabbix infrastructure telemetry [9]. Slurm accounting records provided job-level telemetry including requested resources, allocated resources, job duration, and queue wait time. Zabbix agents collected node-level metrics including memory usage, GPU utilization, and CPU core utilization. Operational metrics tracked included:
• Queue wait time percentiles by partition
• CPU packing efficiency
• GPU allocation alignment
• Memory utilization relative to requested memory

These metrics enabled targeted feedback to users and supported rapid behavioral adjustment. For example, real-time Zabbix visualizations of GPU under-utilization were used during town halls and one-on-one consultations to demonstrate mismatches between requested and actual resource usage. This visibility helped researchers understand their own resource usage patterns and guided them toward more effective job submissions.

Figure 1: CPU workloads running on a GPU node, under-utilizing the available GPU devices.

Figure 1 shows three separate legacy jobs running on a single GPU node over a five-day period. Despite occupying a GPU-capable node, these jobs exhibit negligible or no GPU utilization, indicating that GPU resources were allocated but not consumed.
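Queue-wait metrics of the kind listed above can be derived from Slurm accounting records: each job's wait is the gap between its Submit and Start timestamps. A minimal sketch follows, assuming GNU date; the sacct invocation in the comment uses standard accounting fields, but the date range is illustrative:

```shell
#!/usr/bin/env bash
# Compute queue wait in minutes from sacct-style Submit/Start timestamps.
wait_minutes() {
    local submit_s start_s
    submit_s=$(date -d "$1" +%s)   # GNU date; parses ISO-8601 timestamps
    start_s=$(date -d "$2" +%s)
    echo $(( (start_s - submit_s) / 60 ))
}

# On a system with Slurm installed, per-job records could be pulled with:
#   sacct -a -S 2026-02-01 -E 2026-03-24 -P \
#         -o JobID,Partition,Submit,Start,ReqTRES
# and each row's Submit/Start pair fed to wait_minutes before computing
# per-partition percentiles.
```

For example, a job submitted at 08:00 that starts at 12:37 the same day waited 277 minutes, matching the pre-migration CPU median reported later.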
This pattern reflects submission strategies that prioritized rapid job start times over appropriate resource matching, resulting in inefficient use of scarce GPU capacity. By contrast, under TRES-based scheduling, such mismatches are explicitly surfaced and discouraged, as resource requests must align with actual usage.

4 Measured Effects
Outcomes were evaluated across two phases: a pre-migration baseline and a 90-day transition window. During these phases, we tracked changes in queue wait times as well as user adoption and retention of TRES-based submission. Together, these metrics capture both system-level performance and the behavioral response to the scheduling transition.

4.1 Queue Wait Time
Table 1 shows queue wait time distributions across the transition. Job counts vary across categories due to differences in workload across each phase.

Figure 2: Daily fraction of jobs using explicit resource declarations for CPU and GPU resources during the transition period. The data show progressive but non-monotonic adoption of TRES-based submission patterns, reflecting variability in workload composition and user behavior over time.

Table 1: Queue wait time comparison before and after TRES scheduling deployment. All wait times are reported in minutes.

Partition          Jobs     P50     P75      P90
Legacy CPU (Pre)   393,188  277.18  800.69   3026.26
CPU (TRES)         31,563   2.97    106.12   797.65
Legacy CPU (Post)  49,428   294.80  513.19   642.53
Legacy GPU (Pre)   61,077   81.05   1031.55  5567.88
GPU (TRES)         14,796   3.40    137.83   461.60
Legacy GPU (Post)  6,484    344.05  590.57   748.14

For CPU workloads, TRES reduced median wait time from 277 minutes to under 3 minutes, with 90th-percentile latency decreasing from over 3,000 to under 800 minutes. GPU workloads improved similarly: median wait fell from 81 minutes to 3.4 minutes, and the 90th percentile from over 5,500 to under 500 minutes. Residual legacy GPU jobs submitted post-transition experienced substantially higher median wait times (344 minutes), indicating that continued reliance on legacy submission is associated with longer waits as the user base shifts toward TRES-based scheduling.

4.2 Post-Adoption Behavior and Retention
Figure 2 shows the progression of TRES adoption over the transition period, highlighting both overall growth in explicit resource declarations and short-term variability. Figure 3 shows a Kaplan–Meier [4] estimate of continued TRES usage. Jobs until reversion (JUR) is defined as the number of jobs submitted after initial TRES adoption until first reversion to legacy, or until the end of the observation window if no reversion occurs. The curve shows a sharp early drop, with most reversion occurring within the first few jobs, followed by a gradual decline with wide confidence intervals, reflecting stable retention among the smaller risk set of high-volume users.

Figure 3: Kaplan–Meier estimate of continued TRES usage as a function of Jobs Until Reversion to legacy (JUR) (log-scaled horizontal axis). The curve shows a sharp early decline followed by a plateau, indicating stable retention among users who persist beyond the initial jobs.

This pattern is consistent with early submissions acting as a filtering
stage: users who observe improved scheduling outcomes, including reduced median queue wait times (e.g., 3 minutes versus 277 minutes), tend to continue using TRES.

Approximately one week after deployment, a computational biologist provided unsolicited feedback: "I just wanted to pass along thanks for the implementation of the new scheduling system. It seems to have been a roaring success in terms of node utilization."

5 Conclusion
This deployment shows that introducing consumable resource scheduling in a production HPC environment is fundamentally an operational and behavioral challenge, not just a technical one; technical change alone exposes inefficiencies but does not resolve them. Observability was a driver for change, allowing users to observe mismatches between requested and utilized resources, while the 90-day compatibility window gave researchers the time needed to adapt their workflows with minimal disruption. These results indicate that successful scheduling transitions are best achieved through the coordinated use of telemetry, user engagement, and phased roll-out strategies.

Acknowledgments
The authors thank Fong Banh, Kai Leung Wong, and Kevin Weiss for their support in assisting users during the transition to resource-aware scheduling, including guiding users through updated submission practices and escalating operational issues.

References
[1] Dror G. Feitelson and Larry Rudolph. 1997. Parallel Job Scheduling: Issues and Approaches. Lecture Notes in Computer Science 1291 (1997), 1–18.
[2] George Washington University Information Technology. 2026. Pegasus HPC Cluster. https://it.gwu.edu/hpc-pegasus Accessed March 2026.
[3] Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads.
In Proceedings of the 2019 USENIX Annual Technical Conference (USENIX ATC). USENIX Association, Renton, WA, USA, 947–958.
[4] Edward L. Kaplan and Paul Meier. 1958. Nonparametric Estimation from Incomplete Observations. J. Amer. Statist. Assoc. 53, 282 (1958), 457–481.
[5] SchedMD LLC. 2026. Consumable Resources in Slurm. https://slurm.schedmd.com/cons_tres.html Accessed March 8, 2026.
[6] SchedMD LLC. 2026. Generic Resource (GRES) Scheduling. https://slurm.schedmd.com/gres.html Accessed March 8, 2026.
[7] SchedMD LLC. 2026. Slurm Workload Manager Documentation. https://slurm.schedmd.com Accessed March 2026.
[8] Andy B. Yoo, Morris A. Jette, and Mark Grondona. 2003. SLURM: Simple Linux Utility for Resource Management. In Job Scheduling Strategies for Parallel Processing (JSSPP) (Lecture Notes in Computer Science, Vol. 2862). Springer, Seattle, WA, USA, 44–60.
[9] Zabbix LLC. 2026. Zabbix Monitoring Solution. https://www.zabbix.com Accessed March 2026.