The emergence of Cloud computing provides a new computing paradigm for scientific workflow execution. It offers dynamic, on-demand and scalable resources that enable the processing of complex workflow-based experiments. With the ever-growing size of experimental data and increasingly complex processing workflows, the need for reproducibility has also become essential. Provenance has been thought of as a mechanism to verify a workflow and to provide workflow reproducibility. One of the obstacles in reproducing an experiment execution is the lack of information about the execution infrastructure in the collected provenance. This information becomes critical in the context of the Cloud, in which resources are provisioned on-demand by specifying resource configurations. Therefore, a mechanism is required that captures infrastructure information along with the provenance of workflows executing on the Cloud, in order to facilitate the re-creation of the execution environment. This paper presents a framework, ReCAP, along with proposed mapping approaches that aid in capturing Cloud-aware provenance information and help in re-provisioning execution resources on the Cloud with similar configurations. Experimental evaluation has shown the impact of different resource configurations on workflow execution performance, thereby justifying the need to collect such provenance information in the context of the Cloud. The evaluation has also demonstrated that the proposed mapping approaches can capture Cloud information in various Cloud usage scenarios without causing performance overhead, and can enable the re-provisioning of resources on the Cloud. Experiments were conducted using workflows from different scientific domains, such as astronomy and neuroscience, to demonstrate the applicability of this research to different workflows.
The scientific community is experiencing a data deluge due to the generation of large amounts of data in modern scientific experiments, which include projects such as the Laser Interferometer Gravitational Wave Observatory (LIGO) (Abramovici et al., 1992), the Large Hadron Collider (LHC), and neuGRID (Munir et al., 2014, 2015). In particular, the neuGRID community is utilising scientific workflows to orchestrate the complex processing of its data analysis. A large pool of compute and data resources is required to process this data, which has been available through the Grid (Foster and Kesselman, 1999) and is now also being offered by Cloud-based infrastructures.
Cloud computing (Mell and Grance, 2011) has emerged as a new computing and storage paradigm, which is dynamically scalable and usually works on a pay-as-you-go cost model. It aims to share resources to store data and to host services transparently among users at a massive scale (Mei et al., 2008). Its ability to provide an on-demand, scalable computing infrastructure enables distributed processing of complex scientific workflows for the scientific community (Deelman et al., 2008). Juve and Deelman (2010) experimented with Cloud infrastructures to assess the feasibility of executing workflows on the Cloud.
An important consideration during this data processing is to gather data that provides detailed information about both the input and the processed output data, and about the processes involved, in order to verify and repeat a workflow execution. Such data is termed Provenance in the scientific literature. Provenance is defined as the derivation history of an object (Simmhan et al., 2005). This information can be used to debug and verify the execution of a workflow, and to aid in error tracking and reproducibility. This is of vital importance for scientists in order to make their experiments verifiable and repeatable. It enables them to iterate on the scientific method, to evaluate the process and results of other experiments, and to share their own experiments with other scientists (Azarnoosh et al., 2013). The execution of scientific workflows in Clouds brings to the fore the need to collect provenance information, which is necessary to ensure the reproducibility of these experiments.
A research study (Belhajjame et al., 2012) conducted to evaluate the reproducibility of scientific workflows has shown that around 80% of the workflows cannot be reproduced, and that for 12% of them the cause is a lack of information about the execution environment (Santana-Pérez and Pérez-Hernández, 2014). This information affects a workflow execution in multiple ways. A workflow execution cannot be reproduced if the underlying execution environment does not provide the libraries (i.e. software dependencies) that are required for workflow execution. Besides the software dependencies, hardware dependencies of an execution environment can also affect a workflow execution: they can affect a workflow's overall execution performance and also its job failure rate. This effect on experiment performance has also been highlighted by Kanwal et al. (2015). For instance, a data-intensive job can perform better with 2GB of RAM because it can accommodate more data in RAM, which is a faster medium than the hard disk. However, the job's performance will degrade if a resource with 1GB of RAM is allocated to it, as less data can be placed in RAM. Moreover, it is also possible that jobs will remain in waiting queues or fail during execution if their required hardware dependencies are not met. Therefore, it is important to collect the Cloud infrastructure or virtualization layer information along with the workflow provenance in order to recreate a similar execution environment and thus ensure workflow reproducibility. However, capturing such augmented provenance becomes a more challenging issue in the context of the Cloud, in which resources can be created or destroyed at runtime.
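To make the idea of infrastructure-augmented provenance concrete, the following is a minimal sketch of capturing an execution-environment snapshot alongside a workflow provenance record. The function name `capture_environment` and the field names are illustrative assumptions, not the ReCAP schema described in this paper; only Python standard-library calls are used.

```python
import json
import os
import platform

def capture_environment():
    """Collect a minimal snapshot of the execution environment.

    Hypothetical helper: the record fields (OS, architecture,
    interpreter version, CPU count) stand in for the software and
    hardware dependencies that workflow provenance would need in
    order to re-create a similar environment.
    """
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "cpu_count": os.cpu_count(),
    }

# Attach the environment snapshot to an (illustrative) provenance record.
provenance_record = {
    "workflow": "example-workflow",
    "environment": capture_environment(),
}
print(json.dumps(provenance_record, indent=2))
```

A real system would extend this with RAM size, installed library versions and, in a Cloud setting, the virtual-machine flavour and image identifiers.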
Cloud computing presents a dynamic environment in which resources are provisioned on-demand. To obtain a resource, a user submits resource configuration information as a provisioning request to the Cloud infrastructure, and a resource is allocated to the user if the infrastructure can meet the submitted configuration requirements. Moreover, the pay-as-you-go model in the Cloud constrains the lifetime of a Cloud resource: one can acquire a resource for a given period, but must pay for that period. This means that a resource is released once a task is finished or payment has ceased. In order to acquire the same resource again, one needs to know the configuration of the old resource. This is exactly the situation when repeating a workflow experiment on the Cloud: in order to repeat a workflow execution, a researcher must know the resource configurations used earlier in the Cloud, which enables them to re-provision similar resources and repeat the workflow execution.
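The re-provisioning step described above can be sketched as matching a previously recorded resource configuration against the "flavours" (instance types) a Cloud currently offers. The following is a hedged illustration under stated assumptions: the `Flavor` class, the flavour names, and the `pick_flavor` helper are hypothetical stand-ins, not the ReCAP implementation or any real Cloud API.

```python
from dataclasses import dataclass

@dataclass
class Flavor:
    """Illustrative Cloud instance type (vCPUs, RAM, disk)."""
    name: str
    vcpus: int
    ram_mb: int
    disk_gb: int

def pick_flavor(required, flavors):
    """Return the smallest offered flavor that satisfies a recorded
    resource configuration, or None if nothing qualifies.

    Choosing the smallest adequate flavor minimises over-provisioning
    while still re-creating an environment at least as capable as the
    one recorded in the provenance.
    """
    candidates = [
        f for f in flavors
        if f.vcpus >= required["vcpus"]
        and f.ram_mb >= required["ram_mb"]
        and f.disk_gb >= required["disk_gb"]
    ]
    return min(candidates,
               key=lambda f: (f.ram_mb, f.vcpus, f.disk_gb),
               default=None)

# Flavours the Cloud currently offers (illustrative values).
offered = [
    Flavor("m1.small", 1, 2048, 20),
    Flavor("m1.medium", 2, 4096, 40),
    Flavor("m1.large", 4, 8192, 80),
]
# Configuration recorded in the provenance of the earlier execution.
recorded = {"vcpus": 2, "ram_mb": 4096, "disk_gb": 30}

print(pick_flavor(recorded, offered).name)  # → m1.medium
```

In practice this matching would be driven by the captured Cloud-aware provenance and issued through a Cloud management API; the sketch only shows why the old configuration must be known before a similar resource can be requested.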