Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems



Glen MacLachlan* (maclach@gwu.edu), Joseph Creech* (jcreech@gwu.edu), Rubeel Muhammad Iqbal (rubeel@gwu.edu), Clark Gaylord (cgaylord@gwu.edu), and Jake Messick (jake_messick@gwu.edu)
The George Washington University, Washington, DC, USA

Abstract

Migrating heterogeneous high-performance computing (HPC) systems to resource-aware scheduling introduces both technical and behavioral challenges, particularly in production environments with established user workflows. This paper presents a case study of transitioning a production academic HPC cluster from node-exclusive to consumable-resource scheduling mid-lifecycle, without disrupting active workloads. We describe an operational strategy combining a time-bounded compatibility layer, observability-driven feedback, and targeted user engagement to guide adoption of explicit resource declaration. This approach protected active research workflows throughout the transition, avoiding the disruption that a direct cut-over would have imposed on the user community. Following deployment, median queue wait times fell from 277 minutes to under 3 minutes for CPU workloads and from 81 minutes to 3.4 minutes for GPU workloads. Users who adopted TRES-based submission exhibited strong long-term retention. These results demonstrate that successful scheduling transitions depend not only on system configuration, but on aligning observability, user engagement, and operational design.
CCS Concepts: • Distributed computing methodologies; • Scheduling; • Network monitoring

Keywords: HPC, Slurm, TRES, resource scheduling, cluster operations, observability

ACM Reference Format:
Glen MacLachlan, Joseph Creech, Rubeel Muhammad Iqbal, Clark Gaylord, and Jake Messick. 2026. Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems. In . ACM, New York, NY, USA, 4 pages. https://doi.org/XXXXXXX.XXXXXXX

*Both authors contributed equally to this research.

Conference'17, Washington, DC, USA
2026. ACM ISBN 978-x-xxxx-xxxx-x/YYYY/MM
https://doi.org/XXXXXXX.XXXXXXX

1 Introduction

Efficient resource allocation in heterogeneous high-performance computing (HPC) systems relies on scheduling policies that explicitly expose compute resources, including CPUs, memory, and GPUs, to both the scheduler and users, a challenge widely studied in parallel job scheduling [1].

This paper presents a production case study of migrating a heterogeneous HPC cluster from node-exclusive to consumable-resource scheduling using Slurm Trackable Resources (TRES) [5, 8]. The emphasis is on both the scheduler configuration and an operational strategy for deploying potentially disruptive changes in a live production environment while maintaining uninterrupted service.

The George Washington University's flagship HPC system, Pegasus [2], is a production cluster supporting a large interdisciplinary research community. At the time of migration, Pegasus comprised approximately 205 compute nodes (42 GPU nodes); 8,600 CPU cores; 280 TB aggregate system memory; 2 PB of storage; and 800 active researchers.

Over time, the node-exclusive scheduling model produced several structural inefficiencies, including suboptimal node packing, increased queue wait times, and misaligned resource allocation patterns.
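For concreteness, the kind of scheduler configuration this migration involves can be sketched as a slurm.conf fragment. Everything below is an illustrative assumption, not Pegasus's actual configuration: node names, counts, and plugin choices are examples only.

```ini
# slurm.conf (illustrative fragment, not Pegasus's production values)
SelectType=select/cons_tres          # consumable trackable resources
SelectTypeParameters=CR_Core_Memory  # cores and memory are consumable
GresTypes=gpu                        # track GPUs as a generic resource
ProctrackType=proctrack/cgroup       # cgroup-based process tracking
TaskPlugin=task/cgroup               # cgroup enforcement of allocations
# Example GPU node definition (hypothetical node shape):
NodeName=gpu[001-042] Gres=gpu:4 CPUs=64 RealMemory=512000
```

On each GPU node, a matching gres.conf entry (e.g., `Name=gpu File=/dev/nvidia[0-3]`) declares the device files; both files are documented in the Slurm manuals [5, 6].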
This exposed both technical and behavioral challenges: submission patterns created measurable inefficiency, and transitioning researchers toward explicit resource declaration required deliberate operational design. We present a production deployment strategy that combines a phased transition with observability-driven feedback to enable this change without disrupting active workloads. Aligning scheduler configuration with researcher-facing operational practices led to substantial improvements in queue wait times, resource utilization, and sustained adoption of explicit resource requests.

To address the technical component, the cluster migrated to Slurm's Trackable Resources model with cgroup enforcement and GPU accounting through Slurm's Generic Resource (GRES) framework [6, 8]. While the technical implementation itself is well documented, the operational challenge centered on user adoption. The remainder of the paper focuses on the transition strategy, observability mechanisms, and operational lessons learned during the migration.

2 Legacy Model Friction and Transition Strategy

Under Pegasus's original node-exclusive scheduling discipline, the only mechanism available to differentiate resource classes was partition assignment. Over time, this produced a proliferation of specialized partitions: high-memory nodes, GPU variants, debug queues, visualization nodes, and high-throughput configurations. Each additional partition incrementally exposed more resource capabilities, while also increasing the perceived complexity of the submission environment for users. Under consumable-resource scheduling, such distinctions are expressed through per-job resource requests rather than partition identity.
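A per-job resource request of this kind can be illustrated with a hypothetical batch script; the job name, values, and application are assumptions for illustration, not a prescribed template.

```bash
#!/bin/bash
#SBATCH --job-name=example      # hypothetical job
#SBATCH --cpus-per-task=8       # explicit CPU request, not implicit whole-node use
#SBATCH --mem=64G               # explicit memory request
#SBATCH --gres=gpu:1            # explicit GPU request via GRES
#SBATCH --time=04:00:00
srun ./my_app                   # placeholder application
```

The request itself, rather than the choice of partition, now tells the scheduler what the job needs.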
The resulting complexity reinforced several behaviors:

• Partition selection based on inferred resource availability rather than resource-as-declaration
• GPU allocation without corresponding GPU utilization
• Low core utilization per node due to implicit node-exclusivity assumptions

Pre-migration telemetry revealed consistent patterns of low CPU packing efficiency, frequent allocation of GPU resources without utilization, and queue congestion driven by node exclusivity. These inefficiencies are consistent with known limitations of node-exclusive scheduling, including resource fragmentation and suboptimal utilization. Similar inefficiencies have been observed in large-scale GPU cluster environments, where mismatches between requested and actual resource usage lead to under-utilization [3]. These observations motivated the need for a transition strategy that addressed both scheduling policy and user submission behavior.

Slurm's Trackable Resources (TRES) model, implemented via the select/cons_tres plugin, is enabled or disabled cluster-wide [5]. As a result, changes to the scheduling model must be applied uniformly, requiring operational strategies to support user transition in a live production environment. A direct cutover was therefore rejected in favor of an approach that would allow the research community to adapt without interruption to active workloads.

To reduce user friction, we implemented a 90-day transition period during which legacy submission patterns were permitted but monitored. This approach preserved continuity of service while allowing users to adapt to explicit resource declarations. Lightweight wrapper scripts translated legacy job submissions into resource-aware equivalents where possible, enabling existing workflows to continue functioning while guiding users toward explicit CPU, memory, and GPU requests.
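The wrapper idea can be sketched as a small argument-rewriting step. Everything here is a hypothetical illustration of the approach, not the production wrapper: the function name, the default values, and the restriction to `--flag=value` syntax are all assumptions.

```python
# Defaults that mimic what node-exclusive scheduling implicitly granted
# (assumed node shape: 64 cores, 256 GB memory; illustrative only).
LEGACY_DEFAULTS = {
    "--cpus-per-task": "64",
    "--mem": "256G",
}

def translate_legacy_args(args):
    """Return sbatch args with explicit resource flags appended if absent.

    Only the "--flag=value" form is handled, for brevity.
    """
    present = {a.split("=", 1)[0] for a in args if a.startswith("--")}
    out = list(args)
    for flag, default in LEGACY_DEFAULTS.items():
        if flag not in present:
            out.append(f"{flag}={default}")
    return out

# A legacy submission gains explicit requests:
# translate_legacy_args(["--partition=defq"])
#   → ["--partition=defq", "--cpus-per-task=64", "--mem=256G"]
```

A wrapper like this preserves existing submissions while making the implicit resource grant visible, which is exactly the nudge toward native TRES-style requests.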
These wrappers were time-bounded by the 90-day window, giving users a structured path toward native TRES-based submission patterns. During the transition, legacy submission patterns were permitted but monitored, and users received targeted feedback on historical inefficiencies. To ease adoption, documentation was produced that emphasized resource declaration over partition identity, and office hours and town halls contextualized the change in terms of throughput and fairness.

3 Observability as Stabilization Mechanism

System observability played a critical role in evaluating the impact of the scheduling migration. Monitoring combined Slurm [7] accounting data with Zabbix infrastructure telemetry [9]. Slurm accounting records provided job-level telemetry including requested resources, allocated resources, job duration, and queue wait time. Zabbix agents collected node-level metrics including memory usage, GPU utilization, and CPU core utilization.

Operational metrics tracked included:

• Queue wait time percentiles by partition
• CPU packing efficiency
• GPU allocation alignment
• Memory utilization relative to requested memory

These metrics enabled targeted feedback to users and supported rapid behavioral adjustment. For example, real-time Zabbix visualizations of GPU under-utilization were used during town halls and one-on-one consultations to demonstrate mismatches between requested and actual resource usage. This visibility helped researchers understand their own resource usage patterns and guided them toward more effective job submissions.

Figure 1: CPU workloads running on a GPU node, under-utilizing the available GPU devices.

Figure 1 shows three separate legacy jobs running on a single GPU node over a five-day period. Despite occupying a GPU-capable node, these jobs exhibit negligible or no GPU utilization, indicating that GPU resources were allocated but not consumed.
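The "GPU allocation alignment" metric tracked above can be computed from combined accounting and node telemetry. A minimal sketch follows; the record shape is an illustrative assumption, not the actual Slurm/Zabbix schema.

```python
def gpu_alignment(records):
    """records: iterable of (gpus_allocated, gpus_with_nonzero_util) per job.

    Returns the fraction of allocated GPUs that were actually exercised;
    1.0 for an empty window (nothing misaligned).
    """
    allocated = sum(a for a, _ in records)
    utilized = sum(u for _, u in records)
    return utilized / allocated if allocated else 1.0

# Three Figure-1-style jobs, each holding one idle GPU, score zero:
# gpu_alignment([(1, 0), (1, 0), (1, 0)]) → 0.0
```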
This pattern reects submission strategies that prioritized rapid job start times over appropriate resource matching, resulting in inecient use of scarce GP U capacity . By contrast, under TRES-based schedul- ing, such mismatches are explicitly surfaced and discouraged, as resource requests must align with actual usage. 4 Measured Eects Outcomes were evaluated across two phases: a pre-migration base- line and a 90-day transition window . During these phases, we tracked changes in queue wait times as well as user adoption and retention of TRES-based submission. T ogether , these metrics cap- ture both system-level performance and the behavioral response to the scheduling transition. 4.1 Queue W ait Time T able 1 shows queue wait time distributions acr oss the transition. Job counts var y across categories due to dierences in workload across each phase. For CP U workloads, TRES reduced median wait Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems Conference’17, July 2017, Washington, DC, USA day Fraction of Jobs Using Explicit Resource Requests 0.00 0.25 0.50 0.75 1.00 2026-02-01 2026-02-04 2026-02-07 2026-02-10 2026-02-13 2026-02-16 2026-02-19 2026-02-22 2026-02-25 2026-02-28 2026-03-03 2026-03-06 2026-03-09 2026-03-12 2026-03-15 2026-03-18 2026-03-21 2026-03-24 CPU via TRES GPU via TRES Transition from Legacy to TRES-Based Resource Requests Figure 2: Daily fraction of jobs using explicit resource decla- rations for CP U and GP U resources during the transition p e- riod. The data show progressive but non-monotonic adoption of TRES-based submission patterns, reecting variability in workload composition and user b ehavior over time. T able 1: Queue wait time comparison before and after TRES scheduling deployment. All wait times are reported in min- utes. Partition Jobs P50 P75 P90 Legacy CP U (Pre) 393 188 277 . 18 800 . 69 3026 . 26 CP U (TRES) 31 563 2 . 97 106 . 12 797 . 65 Legacy CP U (Post) 49 428 294 . 80 513 . 
19 642 . 53 Legacy GP U (Pre) 61 077 81 . 05 1031 . 55 5567 . 88 GP U (TRES) 14 796 3 . 40 137 . 83 461 . 60 Legacy GP U (Post) 6484 344 . 05 590 . 57 748 . 14 time from 277 minutes to under 3 minutes, with 90th-p ercentile latency decreasing from over 3,000 to under 800 minutes. GP U workloads improved similarly: median wait fell from 81 minutes to 3.4 minutes, and the 90th percentile from over 5,500 to under 500 minutes. Residual legacy GP U jobs submitted post-transition experienced substantially higher median wait times (344 minutes), indicating that continued reliance on legacy job submission is asso- ciated with substantially higher wait times as the user-base shifts toward TRES-based scheduling. 4.2 Post- Adoption Behavior and Retention Figure 2 shows the progression of TRES adoption over the transition period, highlighting b oth overall growth in explicit resource declara- tions and short-term variability . Figure 3 shows a Kaplan–Meier [ 4 ] estimate of continued TRES usage. Jobs until reversion ( JUR) is dened as the number of jobs submitte d after initial TRES adoption until rst reversion to legacy , or until the end of the observation window if no reversion occurs. The curve shows a sharp early dr op, with most reversion occurring within the rst few jobs, followed by a gradual decline with wide condence intervals, reecting stable retention among the smaller risk set of high-volume users. This pattern is consistent with early submissions acting as a ltering 0.00 0.25 0.50 0.75 1.00 1 10 100 1000 Jobs Until Rev ersion (JUR) Fraction Remaining on TRES User Retention After Initial TRES Adoption Figure 3: Kaplan–Meier estimate of continue d TRES usage as a function of Jobs Until Reversion to legacy ( JUR) ( log- scaled horizontal axis). The curve shows a sharp early decline followed by a plateau, indicating stable retention among users who p ersist beyond the initial jobs. 
stage: users who observe improved scheduling outcomes, including reduced median queue wait times (e.g., 3 minutes versus 277 minutes), tend to continue using TRES.

Approximately one week after deployment, a computational biologist provided unsolicited feedback: "I just wanted to pass along thanks for the implementation of the new scheduling system. It seems to have been a roaring success in terms of node utilization."

5 Conclusion

This deployment shows that introducing consumable-resource scheduling in a production HPC environment is fundamentally an operational and behavioral challenge, not just a technical one; technical change alone exposes inefficiencies but does not resolve them. Observability was a driver for change, allowing users to observe mismatches between requested and utilized resources, while the 90-day compatibility window gave researchers the time needed to adapt their workflows with minimal disruption. These results indicate that successful scheduling transitions are best achieved through the coordinated use of telemetry, user engagement, and phased roll-out strategies.

Acknowledgments

The authors thank Fong Banh, Kai Leung Wong, and Kevin Weiss for their support in assisting users during the transition to resource-aware scheduling, including guiding users through updated submission practices and escalating operational issues.

References

[1] Dror G. Feitelson and Larry Rudolph. 1997. Parallel Job Scheduling: Issues and Approaches. Lecture Notes in Computer Science 1291 (1997), 1–18.
[2] George Washington University Information Technology. 2026. Pegasus HPC Cluster. https://it.gwu.edu/hpc-pegasus Accessed March 2026.
[3] Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads.
In Proceedings of the 2019 USENIX Annual Technical Conference (USENIX ATC). USENIX Association, Renton, WA, USA, 947–958.
[4] Edward L. Kaplan and Paul Meier. 1958. Nonparametric Estimation from Incomplete Observations. J. Amer. Statist. Assoc. 53, 282 (1958), 457–481.
[5] SchedMD LLC. 2026. Consumable Resources in Slurm. https://slurm.schedmd.com/cons_tres.html Accessed March 8, 2026.
[6] SchedMD LLC. 2026. Generic Resource (GRES) Scheduling. https://slurm.schedmd.com/gres.html Accessed March 8, 2026.
[7] SchedMD LLC. 2026. Slurm Workload Manager Documentation. https://slurm.schedmd.com Accessed March 2026.
[8] Andy B. Yoo, Morris A. Jette, and Mark Grondona. 2003. SLURM: Simple Linux Utility for Resource Management. In Job Scheduling Strategies for Parallel Processing (JSSPP) (Lecture Notes in Computer Science, Vol. 2862). Springer, Seattle, WA, USA, 44–60.
[9] Zabbix LLC. 2026. Zabbix Monitoring Solution. https://www.zabbix.com Accessed March 2026.
