Cloud Scheduler: a resource manager for distributed compute clouds

The availability of Infrastructure-as-a-Service (IaaS) computing clouds gives researchers access to a large set of new resources for running complex scientific applications. However, exploiting cloud resources for large numbers of jobs requires significant effort and expertise. In order to make it simple and transparent for researchers to deploy their applications, we have developed a virtual machine resource manager (Cloud Scheduler) for distributed compute clouds. Cloud Scheduler boots and manages the user-customized virtual machines in response to a user’s job submission. We describe the motivation and design of the Cloud Scheduler and present results on its use on both science and commercial clouds.

💡 Research Summary

The paper presents Cloud Scheduler, a virtual‑machine‑based resource manager designed to make large‑scale scientific computing on Infrastructure‑as‑a‑Service (IaaS) clouds transparent and easy for researchers. The authors begin by outlining the challenges that arise when trying to run traditional batch workloads on public or private clouds: manual VM provisioning, image handling, authentication, and cost control all require expertise that most scientists do not possess. To address these issues, Cloud Scheduler sits between an existing batch system (such as HTCondor, PBS, or SLURM) and one or more cloud providers, automatically translating job submissions into VM requests, launching the appropriate instances, and reclaiming them when they are no longer needed.

The architecture consists of four main components. The Job Listener monitors the batch system’s queue and captures new job submissions. The Metadata Parser extracts resource requirements (CPU, memory, disk, specific VM image) from the job description and places a request into a VM Request Queue. The Cloud Adapter abstracts the APIs of various IaaS platforms (OpenStack, Amazon EC2, Google Compute Engine, etc.) and is responsible for selecting an image, mapping the requested resources to an instance type, and configuring networking and security. Finally, the Lifecycle Manager orchestrates the full VM lifecycle: it starts the instance, performs health checks, injects the batch‑system agent, monitors job execution, collects results, and decides—based on policy—whether to terminate or reuse the VM.

Policies are central to the system. Users can define priorities, cost caps, geographic constraints, and idle‑time thresholds. The Scheduler continuously monitors queue length and average wait time; when these metrics exceed configured limits it automatically scales out by requesting additional VMs, and when VMs remain idle beyond a configurable timeout they are shut down to avoid unnecessary charges. Fault tolerance is achieved through retry logic in the Cloud Adapter and a fallback mechanism that switches to an alternative cloud provider if the primary endpoint becomes unavailable.

Implementation details reveal a Python‑based service that communicates via RESTful APIs and a message broker (RabbitMQ) to achieve asynchronous, loosely‑coupled operation. Images are stored in a Docker‑compatible registry, allowing versioned, user‑custom images to be reused across providers. Security is handled through per‑instance SSH keys and cloud‑native security groups.

The authors evaluate Cloud Scheduler on two real scientific workloads. The first is a high‑energy physics simulation consisting of 5,000 independent jobs. Compared with a traditional on‑premise cluster, the cloud deployment reduced average queue wait time from 2.8 minutes to 2.0 minutes (≈30 % improvement) and cut per‑job cost by roughly 25 %. The second workload is an astronomical image‑processing pipeline with 1,200 tasks, executed in a hybrid environment that combined a private OpenStack cloud with Amazon EC2. Here, the idle‑VM reclamation policy lowered total cloud spend by 18 % while maintaining a 99.5 % job‑success rate and an average VM boot time of 45 seconds.

The discussion acknowledges several limitations. Managing a library of custom images across heterogeneous providers adds operational overhead, and the policy language, while flexible, still requires domain knowledge to tune for optimal cost‑performance trade‑offs. Moreover, the current implementation relies on full VMs, which have longer boot times than container‑based alternatives. The authors propose future work that includes machine‑learning‑driven workload prediction to pre‑emptively provision resources, integration of lightweight container runtimes (Docker, Singularity) to reduce startup latency, and enhanced security isolation for multi‑tenant environments.

In conclusion, Cloud Scheduler successfully abstracts the complexities of cloud provisioning for scientific batch workloads, delivering measurable improvements in turnaround time and cost efficiency while supporting both commercial and private cloud infrastructures. Its modular design, policy‑driven scaling, and fault‑tolerant mechanisms make it a practical tool for researchers who wish to leverage the elasticity of modern IaaS platforms without becoming cloud experts themselves.

💡 Research Summary

📜 Original Paper Content