MonkeyTree: Near-Minimal Congestion for Multi-tenant Training via Migration
We present MonkeyTree, the first system to mitigate network congestion in multi-tenant GPU clusters through job-migration-based defragmentation rather than network-layer techniques. As cloud operators co-locate ML training jobs on shared, oversubscribed networks, congestion degrades training throughput for over a third of jobs. Prior approaches either rely on routing and flow scheduling, which we show have fundamental limits when traffic exceeds capacity, or require costly full-bisection-bandwidth topologies with packet spraying. MonkeyTree exploits a characteristic of ML training traffic: ring-based collectives generate exactly one cross-rack flow per rack a job spans, making congestion-free placements achievable. The sparse constraint structure admits abundant valid configurations, which can therefore be reached with few migrations. Once reached, low fragmentation is self-reinforcing, as new arrivals disturb only a few racks. MonkeyTree formulates defragmentation as an integer linear program that minimizes worker movements subject to per-rack fragmentation bounds. We prove a tight bound showing that any placement can be defragmented to at most two cross-rack fragments per ToR, and extend the formulation to hybrid parallelism with multiple rings per server. Migration is implemented via in-memory checkpoint-and-restore over RDMA, incurring only 9.02 seconds of end-to-end system overhead per worker. We evaluate MonkeyTree using a custom simulator modeling clusters of up to 2,048 H200 GPUs and a prototype on a five-node A100 testbed. MonkeyTree improves average job completion time by 14 percent over the next best baseline on a 1,024-GPU cluster with 4:1 oversubscription. At a high 16:1 oversubscription ratio with 2,048 GPUs, MonkeyTree keeps p99 job completion time within 5 percent of ideal.
💡 Research Summary
MonkeyTree introduces a migration‑driven approach to eliminate network congestion in multi‑tenant GPU clusters, shifting the focus from routing and flow‑scheduling techniques to job placement. The authors first demonstrate that existing network‑layer solutions hit a hard ceiling: when aggregate traffic exceeds the capacity of an oversubscribed datacenter fabric, flow collisions become inevitable regardless of optimal routing or sophisticated scheduling. Full‑bisection bandwidth designs avoid this problem but incur prohibitive hardware costs.
The key observation underlying MonkeyTree is that deep-learning training traffic is dominated by ring-based collective operations (data-parallel or fully-sharded data-parallel). Such collectives generate exactly one cross-rack flow per rack that a job spans, independent of the number of GPUs the job uses in that rack. Consequently, congestion can be avoided simply by ensuring that each rack's outgoing demand does not exceed its number of uplinks. This congestion-free condition is achievable, and valid configurations are abundant: the placement of workers inside a rack does not affect traffic, and each rack's constraint is independent, yielding a large space of compliant placements. Moreover, once a cluster reaches a congestion-free state, it is self-reinforcing: new arrivals disturb only a few racks, and a small number of targeted migrations can restore compliance.
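The per-rack condition above reduces to a simple feasibility check: count, in each rack, the jobs whose workers span multiple racks (each contributes exactly one ring flow through the ToR) and compare against the uplink count. A minimal sketch in Python follows; the placement representation and all names are illustrative assumptions, not the paper's code.

```python
def cluster_congestion_free(placement, uplinks):
    """Check the congestion-free condition: in every rack, the number of
    multi-rack jobs (each contributing exactly one ring flow crossing the
    ToR) must not exceed that rack's uplink count.

    placement: dict mapping rack -> set of job ids with workers there.
    uplinks:   dict mapping rack -> number of ToR uplinks.
    """
    # A job is multi-rack if its workers appear in more than one rack.
    racks_per_job = {}
    for jobs in placement.values():
        for job in jobs:
            racks_per_job[job] = racks_per_job.get(job, 0) + 1
    for rack, jobs in placement.items():
        cross_rack_flows = sum(1 for j in jobs if racks_per_job[j] > 1)
        if cross_rack_flows > uplinks[rack]:
            return False
    return True
```

Note that the number of workers a job has inside a rack never enters the check, only which racks the job touches; this is why intra-rack placement is unconstrained and valid configurations are plentiful.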
MonkeyTree's architecture consists of a centralized controller that monitors the current placement (in cooperation with the existing scheduler) and a daemon agent on each server that executes migrations. When a rack's fragmentation exceeds a threshold derived from its uplink count, the controller formulates an integer linear program (ILP) that minimizes the number of worker movements while satisfying per-rack fragmentation bounds. The authors prove a tight bound: any placement can be defragmented to at most two cross-rack fragments per top-of-rack (ToR) switch for single-ring jobs, and they extend the formulation to hybrid parallelism, where a server runs multiple rings. Because the ILP's solution time scales with the number of required moves, and roughly 80% of placements need two or fewer moves, solving it is fast in practice (0.89 s on average for a 1,024-GPU cluster, triggered roughly every 102 minutes at 80% load).
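To make the move-minimization objective concrete, the search the ILP performs can be imitated on a toy cluster with a brute-force breadth-first search over single-worker relocations: the answer is the fewest moves that bring every rack's fragment count within bound. This is a stand-in sketch under assumed data structures, not the paper's formulation, and it only scales to tiny inputs (the ILP handles the real problem).

```python
from collections import deque

def rack_fragments(placement):
    """placement: {job: {rack: workers}}. Returns {rack: number of
    multi-rack jobs present}, i.e. cross-rack fragments per ToR."""
    frags = {}
    for job, racks in placement.items():
        occupied = [r for r, n in racks.items() if n > 0]
        for r in occupied:
            frags.setdefault(r, 0)
            if len(occupied) > 1:
                frags[r] += 1
    return frags

def min_moves_to_defragment(placement, capacity, bound):
    """Fewest single-worker relocations until every rack has at most
    `bound` cross-rack fragments (toy BFS stand-in for the ILP)."""
    def freeze(p):
        return tuple(sorted((j, r, n) for j, racks in p.items()
                            for r, n in racks.items() if n > 0))

    def gpus_used(p):
        used = {r: 0 for r in capacity}
        for racks in p.values():
            for r, n in racks.items():
                used[r] += n
        return used

    seen = {freeze(placement)}
    queue = deque([(placement, 0)])
    while queue:
        p, moves = queue.popleft()
        if all(f <= bound for f in rack_fragments(p).values()):
            return moves
        used = gpus_used(p)
        for job, racks in p.items():
            for src in [r for r, n in racks.items() if n > 0]:
                for dst in capacity:
                    if dst == src or used[dst] >= capacity[dst]:
                        continue  # destination rack is full
                    # Copy the placement and move one worker src -> dst.
                    q = {j: dict(rs) for j, rs in p.items()}
                    q[job][src] -= 1
                    q[job][dst] = q[job].get(dst, 0) + 1
                    key = freeze(q)
                    if key not in seen:
                        seen.add(key)
                        queue.append((q, moves + 1))
    return None  # no feasible defragmentation under these capacities
```

For example, three jobs each split 3+1 across a ring of three racks give every rack two fragments; with one spare GPU slot per rack and a bound of one fragment, two worker moves suffice, and no single move does, since consolidating one job still leaves a rack shared by the other two.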
Migration is implemented using in‑memory checkpoint‑and‑restore over RDMA, leveraging PyTorch’s distributed checkpointing library. The end‑to‑end overhead per worker is only 9.02 seconds, a negligible cost compared to typical training iteration times.
Evaluation is performed via a custom Rust-based simulator modeling up to 2,048 H200 GPUs and a 400 Gbps network, as well as a five-node A100 testbed. Workloads include GPT-3, Llama 2, and Llama 3 variants with various parallelism strategies. In a 1,024-GPU cluster with a 4:1 oversubscription ratio, MonkeyTree improves average job completion time (JCT) by 14% over the next best baseline. Under a more extreme 16:1 oversubscription on 2,048 GPUs, p99 JCT remains within 5% of the ideal (no-congestion) case. The prototype confirms the low migration overhead and validates the ILP-driven move minimization.
The paper concludes that migration‑based defragmentation is not only practical for GPU training workloads but also highly effective, offering a new paradigm for congestion mitigation that complements, rather than replaces, network‑layer optimizations. Limitations include the focus on ring‑based collectives and the reliance on ILP scalability for highly fragmented states; future work will explore extensions to non‑ring collectives, priority‑aware migrations, and integration with SLA‑driven scheduling policies.