Scientific Workflow Applications on Amazon EC2

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The proliferation of commercial cloud computing providers has generated significant interest in the scientific computing community. Much recent research has attempted to determine the benefits and drawbacks of cloud computing for scientific applications. Although clouds have many attractive features, such as virtualization, on-demand provisioning, and “pay as you go” usage-based pricing, it is not clear whether they are able to deliver the performance required for scientific applications at a reasonable price. In this paper we examine the performance and cost of clouds from the perspective of scientific workflow applications. We use three characteristic workflows to compare the performance of a commercial cloud with that of a typical HPC system, and we analyze the various costs associated with running those workflows in the cloud. We find that the performance of clouds is not unreasonable given the hardware resources provided, and that performance comparable to HPC systems can be achieved given similar resources. We also find that the cost of running workflows on a commercial cloud can be reduced by storing data in the cloud rather than transferring it from outside.


💡 Research Summary

The paper investigates whether commercial cloud computing, specifically Amazon EC2, can meet the performance and cost requirements of scientific workflow applications when compared with a traditional high‑performance computing (HPC) platform, the NCSA Abe cluster. Scientific workflows are loosely‑coupled parallel applications composed of many tasks linked by data and control dependencies; they are common in domains such as astronomy, seismology, and bioinformatics. To cover a broad spectrum of resource demands, the authors selected three representative workflows: Montage (an I/O‑intensive astronomical image‑mosaicking pipeline), Broadband (a memory‑intensive seismic‑simulation pipeline), and Epigenomics (a CPU‑intensive short‑read mapping pipeline).
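The "tasks linked by data and control dependencies" structure is what workflow managers like Pegasus and DAGMan execute as a directed acyclic graph: a task may run only once all of its predecessors have finished. As a minimal sketch (the task names are illustrative, loosely echoing a mosaicking pipeline, and not taken from the paper), the ordering constraint can be expressed with Python's standard-library topological sorter:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical mini-workflow: each task maps to the set of tasks it
# depends on. "combine" may start only after both projections finish,
# and "publish" only after "combine".
deps = {
    "project_img1": set(),
    "project_img2": set(),
    "combine": {"project_img1", "project_img2"},
    "publish": {"combine"},
}

# static_order() yields tasks so that dependencies always come first;
# a workflow engine would dispatch ready tasks to workers in this order.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Independent tasks (here, the two projections) have no ordering constraint between them, which is exactly the parallelism a workflow engine exploits when it maps ready tasks onto EC2 instances or cluster nodes.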

The experimental setup involved five EC2 instance types (m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge) spanning 0.5 to 8 virtual CPUs, 1.7 GB to 15 GB RAM, 1 Gbps Ethernet, and local disk storage, as well as two configurations of the Abe cluster: a node with a local disk (abe.local) and a node with a Lustre parallel file system (abe.lustre) connected via 10 Gbps InfiniBand. All workflows were managed by Pegasus, DAGMan, and Condor; on EC2 the required software stack was baked into standard Fedora Core images, and input data were stored on Elastic Block Store (EBS) volumes. EBS provides persistent, block‑level storage that can be detached and re‑attached across instances, eliminating repeated data transfers from external sites.
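Choosing among the five instance types amounts to matching a workflow's core and memory profile against the available configurations. A small sketch of that selection, using the approximate 2009-era specs of the instances named above (the hourly prices are illustrative assumptions, not the paper's figures, and `cheapest_fit` is a hypothetical helper):

```python
# Instance specs as (cores, RAM in GB, assumed USD/hour).
# Specs are approximate 2009-era values; prices are illustrative only.
INSTANCES = {
    "m1.small":  (1, 1.7,  0.10),
    "m1.large":  (2, 7.5,  0.40),
    "m1.xlarge": (4, 15.0, 0.80),
    "c1.medium": (2, 1.7,  0.20),
    "c1.xlarge": (8, 7.0,  0.80),
}

def cheapest_fit(min_cores, min_ram_gb):
    """Return the cheapest instance meeting a core/memory requirement."""
    fits = [(price, name)
            for name, (cores, ram, price) in INSTANCES.items()
            if cores >= min_cores and ram >= min_ram_gb]
    return min(fits)[1] if fits else None

print(cheapest_fit(1, 1.0))   # modest, bursty workload
print(cheapest_fit(8, 4.0))   # many-core run such as Epigenomics
```

This mirrors the paper's later observation that small workloads are cheapest on m1.small while resource-intensive runs are best served by c1.xlarge.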

Performance results show that the I/O‑bound Montage benefits dramatically from a high‑performance parallel file system: on Abe with Lustre it runs fastest, while EC2 using only local disks suffers a 1.5× slowdown due to limited I/O bandwidth. Nevertheless, when the largest EC2 instance (c1.xlarge, 8 Xeon cores) is used, the runtime approaches that of the HPC node, indicating that sufficient CPU and memory can offset I/O limitations. The memory‑bound Broadband runs without swapping on any instance offering ≥1 GB RAM per core; performance scales roughly with memory size and network performance, again favoring the Lustre configuration for large intermediate files. The CPU‑bound Epigenomics is most sensitive to processor micro‑architecture: Xeon‑based c1.xlarge instances achieve ~20% lower runtimes than Opteron‑based m1.large instances, reflecting higher floating‑point throughput per cycle.

Cost analysis considered two data‑handling strategies. In the first, input data are transferred from an external site for each workflow execution; in the second, the data are uploaded once to an EBS volume and reused across multiple runs. For workflows with large inputs (Montage’s 4.2 GB, Broadband’s 6 GB), the EBS‑reuse model reduces total cost by more than 30% because data‑transfer charges dominate the expense. Moreover, the “pay‑as‑you‑go” pricing of EC2 allows fine‑grained cost optimization: short, bursty workloads are cheapest on the smallest instance (m1.small), while long‑running, resource‑intensive runs achieve the best cost‑per‑performance ratio on the largest instance (c1.xlarge).
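The advantage of the EBS-reuse strategy is simple arithmetic: per-run transfer charges grow linearly with the number of executions, while upload-once storage costs are nearly flat. A back-of-the-envelope sketch, using illustrative 2009-era rates (assumed here, not quoted from the paper):

```python
# Illustrative rates (assumptions, not the paper's exact figures):
TRANSFER_IN_PER_GB = 0.10   # USD per GB transferred into EC2
EBS_PER_GB_MONTH = 0.10     # USD per GB-month of EBS storage

def transfer_each_run(input_gb, runs):
    """Strategy 1: re-transfer the input data for every execution."""
    return input_gb * TRANSFER_IN_PER_GB * runs

def ebs_reuse(input_gb, runs, months=1.0):
    """Strategy 2: upload once, keep the data on a persistent EBS volume."""
    return input_gb * TRANSFER_IN_PER_GB + input_gb * EBS_PER_GB_MONTH * months

# e.g. Montage's ~4.2 GB input, executed 10 times within one month:
print(transfer_each_run(4.2, 10))
print(ebs_reuse(4.2, 10))
```

Under these assumed rates the reuse strategy is several times cheaper after ten runs; the break-even point depends only on how the transfer rate compares with the storage rate over the retention period.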

The authors conclude that commercial clouds can deliver performance comparable to traditional HPC systems when the underlying resources (CPU, memory, network, storage) are appropriately matched to the workflow’s characteristics. Virtualization overhead is modest, and the elasticity of cloud resources—combined with persistent, low‑cost storage such as EBS—offers significant economic advantages, especially for workflows that can reuse input data across runs. These findings suggest that scientific researchers should consider cloud platforms as viable, cost‑effective alternatives to dedicated HPC clusters, provided they carefully select instance types and storage strategies aligned with the specific I/O, memory, and compute profiles of their applications.

