BOA Constrictor: Squeezing Performance out of GPUs in the Cloud via Budget-Optimal Allocation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The past decade has seen a dramatic increase in demand for GPUs to train Machine Learning (ML) models. Because it is prohibitively expensive for most organizations to build and maintain a large GPU cluster, organizations instead choose to rent GPUs from cloud providers. The customer is responsible for devising a policy for (i) deciding how many GPUs to rent at every moment in time to process a stream of ML training jobs and (ii) allocating the rented GPUs among the currently active jobs in the system. Because ML training jobs can be parallelized across different numbers of GPUs, the customer generally has many options for how many GPUs to use for each job. Allocating more GPUs to a single training job will cause the job to complete more quickly. However, the customer pays for each GPU-hour they use, and a training job receives a diminishing marginal benefit from running on additional GPUs. Hence, allocating too many GPUs to a single training job can dramatically increase the overall cost that the customer pays to the cloud provider. This gives rise to a cost-performance tradeoff that customers must balance when running training jobs in the cloud. To balance this tradeoff, we develop BOA Constrictor, a new scheduler for ML training jobs that uses a Budget-Optimal Allocation (BOA) policy to squeeze the highest level of performance out of a cloud-deployed GPU cluster given a fixed budget constraint. We explicitly formulate the problem as a budget-constrained scheduling problem and derive the BOA policy, which minimizes the average job completion time (JCT) of a stream of arriving jobs subject to the user’s budget. For a given budget level, we demonstrate that BOA Constrictor can reduce average JCT by 1.6× in small-scale implementation experiments and by 2× in detailed, large-scale simulations compared to state-of-the-art heuristic-based schedulers.


💡 Research Summary

The paper addresses the problem of efficiently renting and allocating GPU resources in the cloud for a stream of machine‑learning (ML) training jobs under a fixed monetary budget. While many organizations now rely on cloud providers (AWS, Azure, GCP) to obtain GPU instances, they must decide (i) how many GPUs to rent at any moment and (ii) how to distribute those GPUs among the active training jobs. Existing solutions either assume a static cluster size or use heuristic autoscaling policies that do not explicitly consider the cost‑performance trade‑off, leading to sub‑optimal spending.

The authors introduce a novel framework called BOA Constrictor that implements a Budget‑Optimal Allocation (BOA) policy. The BOA policy is derived from a stochastic model where each job has a sub‑linear, concave speed‑up function s(k) describing how much faster it trains when allocated k GPUs. Because s(k) is sub‑linear, allocating more GPUs reduces job completion time but increases total GPU‑hours (cost) by a factor of k/s(k). The model allows arbitrary job size distributions, inter‑arrival processes, and time‑varying speed‑up functions, making it far more general than prior work.
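As a toy numerical illustration of this trade-off (the Amdahl-style speed-up curve and the serial fraction `sigma` below are our assumptions, not the paper's model):

```python
# Toy illustration (our assumption, not the paper's model): an Amdahl-style
# concave, sub-linear speed-up curve s(k) and the resulting cost-inflation
# factor k / s(k), i.e., how many extra GPU-hours a k-GPU run consumes.
def speedup(k: int, sigma: float = 0.05) -> float:
    """Throughput with k GPUs relative to 1 GPU; concave and sub-linear."""
    return k / (1.0 + sigma * (k - 1))

def cost_factor(k: int, sigma: float = 0.05) -> float:
    """Relative GPU-hours to finish one job with k GPUs: k / s(k) >= 1."""
    return k / speedup(k, sigma)

for k in (1, 2, 8, 32):
    print(f"k={k:2d}  s(k)={speedup(k):5.2f}  cost={cost_factor(k):4.2f}")
```

With `sigma = 0.05`, for example, 32 GPUs give only a ~12.5× speed-up while consuming 2.55× the GPU-hours of a single-GPU run, which is exactly the kind of inflation the budget constraint penalizes.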

Using Lagrangian relaxation and Markov decision process (MDP) analysis, the authors prove that an optimal policy exists that minimizes the long‑run average job completion time (JCT) while satisfying a constraint on the average number of GPUs rented per unit time (the budget). The optimal policy consists of two components:

  1. BOA Allocation – given the current set of jobs and a budget parameter λ, compute for each job the number of GPUs that equalizes the marginal reduction in JCT with the marginal cost implied by λ. This yields a closed‑form allocation rule that can be computed efficiently because the objective is convex in λ.
  2. Budget‑Driven Scaling – adjust the total cluster size to match the sum of the allocations while respecting the budget. If the sum exceeds the budget, the cluster is shrunk; if it falls below, the cluster is expanded. Scaling actions respect the 1‑2 minute minimum rental granularity of cloud instances and are throttled to avoid excessive churn.
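Component 1 can be sketched as follows. This is our toy version under an assumed Amdahl-style speed-up curve; the function names, the curve, and the greedy marginal rule are illustrative, while the paper derives a closed-form rule for its own model:

```python
# Sketch of a BOA-style allocation rule (our toy version): for a price
# parameter lam, a job keeps taking GPUs while the marginal speed-up gain
# of one more GPU still exceeds lam. Concavity makes the gain decreasing,
# so this greedy rule equalizes marginal benefit with marginal cost.
def speedup(k: int, sigma: float = 0.05) -> float:
    """Amdahl-style concave speed-up; s(0) = 0, s(1) = 1."""
    return k / (1.0 + sigma * (k - 1)) if k > 0 else 0.0

def allocate(lam: float, sigma: float = 0.05, k_max: int = 64) -> int:
    """Largest k whose marginal gain s(k) - s(k-1) still exceeds lam."""
    k = 0
    while k < k_max and speedup(k + 1, sigma) - speedup(k, sigma) > lam:
        k += 1
    return k
```

A higher λ (a tighter budget) prices GPUs more dearly and shrinks every job's allocation: under this toy curve, `allocate(0.99)` yields 1 GPU while `allocate(0.5)` yields 9.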

A key theoretical contribution is that the BOA allocation function is continuous and convex in the budget, enabling fast binary‑search or gradient‑based methods to find the optimal λ online. The authors also show how the same analysis extends to heterogeneous GPU types (different compute/memory capabilities) and how the policy can be decoupled into an offline optimization phase and a lightweight online execution phase, keeping runtime overhead negligible.
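A minimal sketch of how that monotone structure enables the binary search for λ (the toy speed-up model and all names are our assumptions, not the paper's implementation):

```python
# Sketch of the binary search for the budget parameter lam (our toy model):
# the total allocation is non-increasing in lam, so bisection finds a lam
# at which the jobs' combined GPU demand fits within the budget.
def speedup(k, sigma):
    return k / (1.0 + sigma * (k - 1)) if k > 0 else 0.0

def allocate(lam, sigma, k_max=64):
    k = 0
    while k < k_max and speedup(k + 1, sigma) - speedup(k, sigma) > lam:
        k += 1
    return k

def tune_lambda(job_sigmas, budget, iters=60):
    """Return a lam whose induced total allocation fits within budget."""
    lo, hi = 0.0, 1.0  # marginal gains lie in (0, 1], so lam <= 1 suffices
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if sum(allocate(mid, s) for s in job_sigmas) > budget:
            lo = mid   # over budget: raise the GPU price
        else:
            hi = mid   # within budget: try a lower price
    return hi
```

For four identical jobs (`sigma = 0.05`) and a budget of 12 GPUs, the search settles on a λ that gives each job 3 GPUs, saturating the budget exactly.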

Implementation is built on top of the AdaptDL distributed‑training framework. The central scheduler periodically (e.g., every 5 minutes) gathers job statistics, runs the BOA optimizer, and issues two kinds of actions: (a) per‑job GPU allocation adjustments, and (b) cluster‑size scaling commands to the cloud provider. The system incorporates practical considerations such as scaling latency, pre‑emption costs (parameter synchronization when jobs are re‑allocated), and a “budget buffer” to prevent temporary budget violations.
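The periodic cycle described above can be sketched as a control loop. All function names here are our placeholders; the real system is built on AdaptDL, whose APIs are not modeled:

```python
# Hypothetical control-loop skeleton for the periodic scheduling cycle
# (our placeholder names; the actual AdaptDL-based scheduler is richer,
# handling scaling latency, preemption costs, and a budget buffer).
import time

def scheduler_loop(get_job_stats, boa_optimize, apply_allocations,
                   scale_cluster, interval_s=300, max_cycles=None):
    """Observe jobs, run the BOA optimizer, act, then sleep until the
    next cycle (e.g., every 5 minutes)."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        jobs = get_job_stats()                 # speed-up estimates, progress
        allocations, cluster_size = boa_optimize(jobs)
        apply_allocations(allocations)         # (a) per-job GPU adjustments
        scale_cluster(cluster_size)            # (b) rent/release instances
        cycles += 1
        time.sleep(interval_s)
```

Injecting the four callables keeps the skeleton testable and mirrors the separation between the central optimizer and the actuation commands sent to jobs and to the cloud provider.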

Evaluation consists of two parts:

  • Small‑scale real‑world experiments on a private cluster of up to 32 GPUs using diverse workloads (CIFAR‑10/ResNet, ImageNet, BERT). Compared to a state‑of‑the‑art heuristic called Pollux (goodput‑based autoscaling), BOA Constrictor reduces average JCT by 1.6× and the 95th‑percentile JCT by 2.3×.
  • Large‑scale simulations using a detailed ML‑training simulator that models up to 10 000 GPUs, realistic bursty arrival patterns, and time‑varying speed‑up curves. Under the same budget, BOA Constrictor achieves up to a 2× reduction in average JCT; conversely, for a target average JCT it cuts the required budget by roughly 2×. The policy remains stable under workload bursts and changing speed‑up functions, consistently respecting the budget constraint.

The authors also provide an analytical comparison showing that Pollux’s decisions are driven solely by queue length or goodput, whereas BOA explicitly leverages each job’s speed‑up curve, leading to markedly different allocation patterns that are provably optimal under the model.

Limitations are acknowledged: the current prototype focuses on data‑parallel training; extending to model‑parallel or pipeline‑parallel schemes, handling multi‑instance GPUs, and incorporating dynamic spot‑price markets are left for future work. Moreover, BOA requires knowledge (or accurate estimation) of each job’s speed‑up function; the authors suggest online learning techniques to address this.

In summary, the paper makes three major contributions: (1) a rigorous theoretical formulation of the budget‑constrained GPU rental problem for parallelizable jobs, (2) the derivation of an optimal, efficiently computable BOA policy, and (3) a practical system implementation that demonstrates up to 2× performance gains over the best existing heuristics. The work bridges a critical gap between cloud economics and systems scheduling, offering a concrete tool for organizations to “squeeze” the most performance out of their GPU budgets.

