Provisioning Spot Market Cloud Resources to Create Cost-effective Virtual Clusters
Infrastructure-as-a-Service providers are offering their unused resources in the form of variable-priced virtual machines (VMs), known as “spot instances”, at prices significantly lower than their standard fixed-priced resources. To lease spot instances, users specify a maximum price they are willing to pay per hour, and VMs will run only when the current price is lower than the user’s bid. This paper proposes a resource allocation policy that addresses the problem of running deadline-constrained compute-intensive jobs on a pool composed solely of spot instances, while exploiting variations in price and performance to run applications in a fast and economical way. Our policy relies on job runtime estimations to decide which types of VMs are best suited to run each job and when jobs should run. Several estimation methods are evaluated and compared using trace-based simulations, which take as input real price variation traces obtained from Amazon Web Services as well as an application trace from the Parallel Workload Archive. Results demonstrate the effectiveness of running computational jobs on spot instances at a fraction (up to 60% lower) of the price those jobs would normally cost on fixed-priced resources.
💡 Research Summary
The paper addresses the challenge of executing deadline‑constrained, compute‑intensive jobs using only spot instances—variable‑priced virtual machines offered by Infrastructure‑as‑a‑Service (IaaS) providers such as Amazon EC2. Spot instances are significantly cheaper than on‑demand instances but can be terminated without notice when the market price exceeds the user’s bid, making them unreliable for time‑critical workloads. The authors propose a comprehensive resource provisioning and scheduling framework that combines runtime estimation, dynamic bidding, and cost‑aware VM selection to build a virtual cluster solely from spot instances while meeting job deadlines.
System Architecture
The system consists of two logical components: a Broker and a Cloud Manager. The Broker receives job submissions (including required CPU, memory, user‑provided runtime estimate, and deadline), maintains a queue of unscheduled jobs, and makes all scheduling decisions. The Cloud Manager interacts with the cloud provider to request, extend, or terminate spot instances based on the Broker’s instructions and the current spot market price. This separation mirrors a typical client‑server model but is specialized for a fully cloud‑based cluster, unlike hybrid approaches that augment an existing on‑premise cluster.
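Under stated assumptions, the Broker/Cloud Manager split can be sketched as two small Python classes. All names, fields, and the fake VM id are illustrative, not the paper's actual interface; a real Cloud Manager would call the provider's API (e.g. EC2 `RequestSpotInstances`) where the stub below returns a placeholder:

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Job:
    job_id: str
    cpus: int
    memory_mb: int
    user_estimate_s: float  # user-provided runtime estimate, seconds
    deadline: float         # absolute deadline, epoch seconds

class CloudManager:
    """Talks to the provider: requests, extends, or terminates spot VMs."""
    def request_vm(self, instance_type: str, bid: float) -> str:
        # A real implementation would call the provider API here;
        # we return a fake VM id for illustration.
        return f"vm-{instance_type}-{bid:.3f}"

    def terminate_vm(self, vm_id: str) -> None:
        pass  # provider API call elided

class Broker:
    """Receives jobs, keeps the unscheduled queue, makes all scheduling
    decisions, and instructs the CloudManager."""
    def __init__(self, manager: CloudManager):
        self.manager = manager
        self.unscheduled = deque()  # queue of Job objects

    def submit(self, job: Job) -> None:
        self.unscheduled.append(job)
```

The separation keeps market interaction (pricing, leases) out of the scheduling logic, which is what lets the same Broker policy be simulated against historical price traces.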
Runtime Estimation
User‑provided runtime estimates are often inaccurate, leading to either over‑provisioning (wasting money) or under‑provisioning (missing deadlines). To mitigate this, the authors evaluate several automatic estimation techniques: (1) simple average of the two most recent jobs from the same user, (2) linear regression based on historical execution data, and (3) more sophisticated history‑based averages. Experiments show that even the simplest method yields substantial improvements over raw user estimates, confirming that modest prediction accuracy is sufficient for the cost‑driven scheduling problem.
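A minimal sketch of the simplest estimator described above (the mean of a user's two most recent actual runtimes, falling back to the user's own estimate when no history exists); the class and method names are illustrative:

```python
from collections import defaultdict, deque

class RecentAverageEstimator:
    """Predicts a job's runtime as the mean of that user's two most
    recent actual runtimes; falls back to the user-provided estimate."""
    def __init__(self):
        # Per-user history capped at the two most recent runtimes.
        self.history = defaultdict(lambda: deque(maxlen=2))

    def record(self, user: str, actual_runtime: float) -> None:
        self.history[user].append(actual_runtime)

    def estimate(self, user: str, user_estimate: float) -> float:
        past = self.history[user]
        if not past:
            return user_estimate  # no history yet: trust the user
        return sum(past) / len(past)
```

Despite its simplicity, an estimator of this shape already removes the systematic over-estimation typical of user-provided values, which is what the scheduler's cost calculations benefit from most.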
Provisioning and Scheduling Algorithm (Algorithm 1)
The core of the proposal is an online algorithm executed at regular intervals (10 seconds in the experiments). The steps are:
- Insert newly arrived jobs into the unscheduled list.
- For each job, compute estimated runtimes on every available instance type (e.g., c1.medium, c1.xlarge).
- Attempt to place the job on an idle VM whose remaining paid time in the current billing hour is sufficient to accommodate the job, so that no new hour is billed.
- If no immediate placement is possible, calculate the maximum wait time that still allows the job to meet its deadline. If the job can be postponed, move it to a “non‑urgent” list.
- If postponement is not viable, evaluate two alternatives: extending the lease of an already‑running VM or launching a new VM. The decision is based on the estimated cost of each alternative, which incorporates the current spot price and the job’s predicted runtime.
- After handling the current job, re‑examine any idle VMs and try to assign previously postponed jobs.
- For each job that has been assigned, schedule a correction event at the predicted completion time. If the actual runtime deviates, the correction event reinserts the job into the unscheduled queue with an updated estimate, allowing the system to adapt to estimation errors.
The algorithm explicitly balances two objectives: minimizing the total monetary cost (by reducing the number of billed hours and selecting low‑price instance types) and satisfying deadline constraints (by postponing only when safe and by extending leases when necessary).
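The per-job decision at the heart of the steps above can be sketched as follows. This is a simplified, single-job view under stated assumptions: the function name, dictionary shapes, and whole-hour billing model are illustrative, and the paper's broker tracks considerably more state (correction events, the non-urgent list, lease extensions):

```python
import math

def choose_action(deadline, now, idle_vms, spot_prices, est_runtime):
    """One simplified scheduling decision, mirroring the ordering above:
    reuse an idle VM -> postpone if safe -> lease the cheapest type.
    idle_vms:    {vm_id: (instance_type, paid_until_epoch)}
    spot_prices: {instance_type: dollars per hour}
    est_runtime: {instance_type: predicted runtime, seconds}"""
    # 1. Reuse an idle VM whose already-billed hour still covers the job.
    for vm_id, (itype, paid_until) in idle_vms.items():
        if now + est_runtime[itype] <= min(paid_until, deadline):
            return ("reuse", vm_id)
    # 2. Compute the longest safe wait on the fastest type; if positive,
    #    the job is "non-urgent" and can be postponed.
    fastest = min(est_runtime, key=est_runtime.get)
    max_wait = deadline - now - est_runtime[fastest]
    if max_wait > 0:
        return ("postpone", max_wait)
    # 3. Otherwise lease whichever deadline-feasible type costs least,
    #    billing whole hours at the current spot price.
    feasible = [t for t in est_runtime if now + est_runtime[t] <= deadline]
    if not feasible:
        return ("reject", None)
    cheapest = min(
        feasible,
        key=lambda t: math.ceil(est_runtime[t] / 3600) * spot_prices[t],
    )
    return ("launch", cheapest)
```

Note how the ordering itself encodes the two objectives: reuse and postponement avoid new billed hours, and only when the deadline forces the issue does the policy pay for a new lease.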
Exploiting Price‑Performance Ratios
Spot prices fluctuate independently for each instance type. Consequently, the price‑per‑performance ratio (cost per ECU) varies over time. The framework continuously monitors current spot prices and selects the instance type that offers the best ratio for a given job. For highly parallelizable workloads, a job that would normally require several low‑CPU instances can be migrated to a single high‑CPU instance when its spot price drops, dramatically reducing both execution time and cost.
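The type-selection step reduces to minimizing cost per ECU at the current prices. A minimal sketch, where the prices and ECU ratings are illustrative figures loosely modeled on EC2's c1 family rather than values quoted from the paper:

```python
def best_price_performance(spot_prices, ecus):
    """Pick the instance type with the lowest current cost per ECU.
    spot_prices: {instance_type: dollars per hour}
    ecus:        {instance_type: EC2 Compute Units}"""
    return min(spot_prices, key=lambda t: spot_prices[t] / ecus[t])

# Example: 0.06/5 = 0.012 $/ECU-hour for c1.medium beats
# 0.28/20 = 0.014 for c1.xlarge at these (illustrative) prices.
prices = {"c1.medium": 0.06, "c1.xlarge": 0.28}
ecus = {"c1.medium": 5, "c1.xlarge": 20}
```

Because each type's price moves independently, the argmin can flip between polling intervals, which is exactly the opportunity the framework exploits by re-evaluating this ratio continuously.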
Experimental Evaluation
The authors conduct trace‑based simulations using two real data sources: (a) historical spot price logs from Amazon EC2, and (b) a workload trace from the Parallel Workload Archive, which contains a realistic mix of HPC jobs with deadlines. The simulation runs for a 24‑hour period, comparing the proposed spot‑only policy against a baseline that uses on‑demand instances exclusively.
Key findings include:
- Cost Reduction – The spot‑only policy achieves an average cost saving of 45% relative to the on‑demand baseline, with a best‑case reduction of up to 60%. Even in worst‑case price scenarios, savings exceed 30%.
- Deadline Compliance – Over 95% of jobs meet their deadlines, demonstrating that the algorithm effectively mitigates the volatility of spot markets.
- Impact of Estimation Accuracy – Over‑estimation leads to unnecessary VM provisioning and higher costs, while under‑estimation increases deadline miss rates. The simple average‑of‑two‑previous‑jobs estimator strikes a good balance.
- Benefit of Dynamic Type Selection – Choosing the instance type with the lowest current price‑per‑performance ratio yields an additional ~10% cost improvement over a static type selection strategy.
Limitations and Future Work
The current implementation assumes a single cloud provider (AWS) and a static bidding strategy (the user’s maximum bid is fixed per job). Extending the model to multi‑provider environments could exploit inter‑provider price differentials. Incorporating predictive models for spot price trends (e.g., time‑series forecasting with LSTM networks) would enable proactive bid adjustments. Moreover, integrating fault‑tolerance mechanisms such as checkpointing and migration would further reduce the impact of abrupt instance termination.
Conclusion
The study demonstrates that a carefully designed, runtime‑aware provisioning and scheduling policy can harness the economic advantages of spot instances while still honoring strict deadline requirements. By continuously estimating job runtimes, dynamically selecting the most cost‑effective VM types, and judiciously postponing or extending leases, organizations can build fully cloud‑based virtual clusters at a fraction of the cost of traditional on‑demand resources. This work provides a practical blueprint for cost‑conscious cloud users and opens avenues for richer, predictive, and multi‑cloud extensions.