Cloud Infrastructure Provenance Collection and Management to Reproduce Scientific Workflow Execution


The emergence of Cloud computing provides a new computing paradigm for scientific workflow execution. It offers dynamic, on-demand and scalable resources that enable the processing of complex workflow-based experiments. With the ever-growing size of experimental data and increasingly complex processing workflows, the need for reproducibility has also become essential. Provenance has been thought of as a mechanism to verify a workflow and to provide workflow reproducibility. One of the obstacles in reproducing an experiment execution is the lack of information about the execution infrastructure in the collected provenance. This information becomes critical in the context of the Cloud, in which resources are provisioned on demand by specifying resource configurations. Therefore, a mechanism is required that captures infrastructure information along with the provenance of workflows executing on the Cloud, to facilitate the re-creation of the execution environment. This paper presents a framework, ReCAP, along with proposed mapping approaches that aid in capturing Cloud-aware provenance information and help in re-provisioning execution resources on the Cloud with similar configurations. Experimental evaluation has shown the impact of different resource configurations on workflow execution performance, thereby justifying the need for collecting such provenance information in the context of the Cloud. The evaluation has also demonstrated that the proposed mapping approaches can capture Cloud information in various Cloud usage scenarios without causing performance overhead, and can also enable the re-provisioning of resources on the Cloud. Experiments were conducted using workflows from different scientific domains, such as astronomy and neuroscience, to demonstrate the applicability of this research to different workflows.


💡 Research Summary

The paper addresses a critical gap in reproducibility of scientific workflows executed on cloud platforms. While cloud computing offers on‑demand, scalable resources that enable complex, data‑intensive experiments, existing provenance systems typically capture only workflow logic, data lineage, and execution logs, neglecting the dynamic infrastructure details that can significantly affect performance and results. To bridge this gap, the authors introduce ReCAP, a framework that automatically records “cloud‑aware provenance” – detailed metadata about the virtual machines or containers provisioned for each workflow task.

ReCAP sits between the workflow manager (e.g., Pegasus, Airflow) and the cloud provider’s API. During execution it intercepts resource allocation calls and stores attributes such as CPU core count, memory size, storage type and capacity, network bandwidth, operating‑system image, and allocation timestamps. This infrastructure information is merged with the traditional provenance record, creating a unified provenance artifact that fully describes both the computational process and the execution environment.
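The paper does not reproduce ReCAP's implementation here, but the idea of merging infrastructure metadata into a traditional provenance record can be sketched as follows. All names in this snippet (`InfraRecord`, the field names, the sample task) are illustrative assumptions, not ReCAP's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class InfraRecord:
    """Hypothetical cloud-aware provenance attributes (names are illustrative)."""
    vcpus: int
    memory_gb: float
    storage_type: str
    os_image: str
    allocated_at: str  # allocation timestamp, ISO 8601

def merge_provenance(workflow_prov: dict, infra: InfraRecord) -> dict:
    """Attach captured infrastructure metadata to a workflow provenance record,
    yielding a unified artifact describing both process and environment."""
    merged = dict(workflow_prov)
    merged["infrastructure"] = asdict(infra)
    return merged

# Hypothetical usage: one task of an astronomy pipeline.
record = merge_provenance(
    {"task": "project_image", "inputs": ["m101.fits"]},
    InfraRecord(4, 8.0, "ssd", "ubuntu-20.04", "2015-06-01T12:00:00Z"),
)
```

In a real deployment the `InfraRecord` fields would be populated from the values ReCAP intercepts at the cloud provider's API, rather than supplied by hand.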

Three mapping strategies are proposed. The first, static mapping, relies on pre‑defined resource templates embedded in the workflow description. The second, dynamic mapping, captures real‑time metadata returned by the cloud service at allocation time. The third, hybrid mapping, combines static templates with dynamic metadata to reduce missing information while preserving user‑specified policies. The authors evaluate all three strategies across public (AWS, GCP), private (OpenStack), and hybrid cloud scenarios, reporting an average runtime overhead of less than 2 %.
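The hybrid strategy can be pictured as a simple merge: start from the user's static template and fill or override fields with live metadata returned by the cloud service. This is a minimal sketch under that assumption; the function and field names are hypothetical, not ReCAP's API:

```python
def hybrid_map(static_template: dict, dynamic_meta: dict) -> dict:
    """Hybrid mapping sketch: dynamic values returned at allocation time
    take precedence, while the static template fills in anything missing."""
    merged = dict(static_template)
    for key, value in dynamic_meta.items():
        if value is not None:  # ignore attributes the cloud API did not report
            merged[key] = value
    return merged

# Hypothetical example: the user asked for 2 vCPUs, the cloud actually
# allocated 4 and reported a placement zone the template never mentioned.
template = {"vcpus": 2, "memory_gb": 4.0, "os_image": "ubuntu-20.04"}
live = {"vcpus": 4, "memory_gb": None, "zone": "us-east-1a"}
merged = hybrid_map(template, live)
```

The precedence order (dynamic over static) reflects the paper's stated goal of reducing missing information while preserving user-specified policies as defaults.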

Experimental validation uses two representative scientific workflows: an astronomy image‑processing pipeline handling large FITS files, and a neuroscience fMRI analysis pipeline. By varying the underlying cloud configurations (e.g., scaling CPU cores from 2 to 8, adjusting memory limits), the study demonstrates that infrastructure choices can change total execution time by up to 35 % and can even cause performance regressions when resources are insufficient, leading to task retries and data retransfers. These findings substantiate the claim that provenance must include infrastructure details to enable faithful reproduction.

ReCAP also provides an automated “environment recreation” module. Given a prior execution’s provenance record, the system provisions identical virtual resources, deploys the same container images and scripts, and re‑executes the workflow with the original input data. Because software versions and dependencies are captured alongside hardware specifications, the recreated run yields results indistinguishable from the original, eliminating a major source of reproducibility failure.
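One core step of such a recreation module is matching the recorded hardware configuration against the flavors (instance types) a cloud currently offers. The sketch below shows that matching step only, under the assumption that flavors are plain dictionaries from some cloud API; it is not ReCAP's actual provisioning code:

```python
def select_flavor(flavors: list[dict], required: dict) -> dict:
    """Pick the smallest available flavor that meets or exceeds the CPU and
    memory configuration recorded in the provenance (hypothetical schema)."""
    candidates = [
        f for f in flavors
        if f["vcpus"] >= required["vcpus"]
        and f["memory_gb"] >= required["memory_gb"]
    ]
    if not candidates:
        raise RuntimeError("no flavor matches the recorded configuration")
    # Prefer the closest match to avoid over-provisioning on re-execution.
    return min(candidates, key=lambda f: (f["vcpus"], f["memory_gb"]))

# Hypothetical flavor catalogue and a recorded requirement of 4 vCPUs / 8 GB.
flavors = [
    {"name": "small", "vcpus": 2, "memory_gb": 4.0},
    {"name": "medium", "vcpus": 4, "memory_gb": 8.0},
    {"name": "large", "vcpus": 8, "memory_gb": 16.0},
]
choice = select_flavor(flavors, {"vcpus": 4, "memory_gb": 8.0})
```

Choosing the minimal matching flavor keeps the recreated environment as close as possible to the original, which matters given the performance sensitivity the experiments demonstrate.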

Overall, the paper makes three key contributions: (1) defining cloud‑aware provenance and integrating it with existing workflow provenance models; (2) presenting and empirically validating three low‑overhead mapping approaches for capturing infrastructure metadata; (3) demonstrating that the captured provenance can be used to automatically reconstruct the original execution environment, thereby achieving reproducible scientific results on the cloud.

Future work outlined includes extending ReCAP to multi‑cloud and serverless environments, incorporating provenance‑driven cost‑performance optimization, and aligning the provenance schema with emerging standards (e.g., W3C PROV) to facilitate broader adoption across scientific domains.

