Pipeline-Centric Provenance Model
In this paper we propose a new provenance model that is tailored to a class of workflow-based applications. We motivate the approach with use cases from the astronomy community, generalize the class of applications to which the approach applies, and propose a pipeline-centric provenance model. Finally, we evaluate the storage savings the approach yields when applied to an astronomy application.
Research Summary
The paper addresses the growing challenge of managing provenance information for workflow-driven scientific applications, where traditional task-centric provenance models record detailed metadata for every individual step, input, output, and execution environment. While this fine-grained approach ensures complete traceability, it quickly becomes impractical for large-scale experiments that generate terabytes of intermediate data, such as the astronomical image-processing pipelines used by modern surveys.
To overcome this limitation, the authors propose a "pipeline-centric provenance model." The central idea is to treat the entire workflow definition, comprising its directed acyclic graph (DAG), scripts, and configuration files, as the primary provenance artifact, rather than each atomic task. The model captures four essential components: (1) the immutable pipeline definition itself, which fully describes the logical flow and data dependencies; (2) a snapshot of the execution environment, typically a container image together with a hash of all software libraries and OS settings; (3) cryptographic hashes of the final output products to guarantee integrity; and (4) automatically generated re-execution scripts that can reconstruct the original run from the stored definition and environment snapshot.
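The four components above can be sketched as a minimal record type. This is an illustrative sketch only, not the paper's actual schema; the field names, the `run-pipeline` command, and the `pipeline.yaml` filename are all hypothetical:

```python
import hashlib
from dataclasses import dataclass, field

def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest, used here for both environment and product hashes."""
    return hashlib.sha256(data).hexdigest()

@dataclass
class ProvenanceRecord:
    """The four components retained by a pipeline-centric model (illustrative)."""
    pipeline_definition: str                            # (1) immutable DAG/scripts/config
    environment_digest: str                             # (2) container image / library hash
    output_hashes: dict = field(default_factory=dict)   # (3) final product name -> SHA-256
    reexecution_script: str = ""                        # (4) generated replay command

def build_record(definition: str, env_digest: str, products: dict) -> ProvenanceRecord:
    """Hash only the final products; intermediates are deliberately not recorded."""
    hashes = {name: sha256_hex(data) for name, data in products.items()}
    # Placeholder replay command; real tooling would emit one for the
    # site's workflow manager.
    script = f"run-pipeline --definition pipeline.yaml --env {env_digest}"
    return ProvenanceRecord(definition, env_digest, hashes, script)
```

Note that the record grows with the number of final products, not with the number of tasks or intermediate files, which is the source of the storage savings discussed below.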
Intermediate results are not stored permanently; instead, they can be regenerated on demand by reāexecuting the relevant portion of the pipeline in the captured environment. This approach dramatically reduces storage requirements while preserving the ability to reproduce results exactly. The model also encourages sharing of pipeline definitions across collaborations, eliminating redundant storage of identical intermediate data.
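The regenerate-on-demand logic can be summarized in a few lines. This is a minimal sketch under the assumption of deterministic re-execution; the `regenerate` callable stands in for replaying the captured pipeline, which the summary does not specify in detail:

```python
import hashlib

def verify_or_regenerate(expected_hash: str, product_bytes, regenerate):
    """Return trusted product bytes, regenerating them if missing or corrupted.

    expected_hash: hex SHA-256 recorded when the pipeline originally ran.
    product_bytes: cached copy of the product, or None if it was discarded.
    regenerate: callable that replays the captured pipeline portion in the
                stored environment and returns the recomputed bytes
                (a hypothetical stand-in for real re-execution tooling).
    """
    if product_bytes is not None and \
            hashlib.sha256(product_bytes).hexdigest() == expected_hash:
        return product_bytes  # cached copy is intact, no work needed
    recomputed = regenerate()  # deterministic replay in the captured environment
    if hashlib.sha256(recomputed).hexdigest() != expected_hash:
        raise RuntimeError("re-execution did not reproduce the recorded output")
    return recomputed
```

The final hash check is what ties reproducibility to integrity: a successful replay that produces different bytes is treated as a failure, not silently accepted.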
The authors validate their approach with a concrete case study from the astronomy community: a Large Synoptic Survey Telescope (LSST) image-processing pipeline that includes raw image ingestion, calibration, source extraction, photometric analysis, and database loading. In the traditional task-centric provenance system, each of these stages generates hundreds of gigabytes of intermediate files, each accompanied by its own provenance record. When the pipeline-centric model is applied, only the pipeline definition, container image hash, and final product hashes are retained. Empirical measurements show an average storage reduction of about 85% compared to the baseline. The cost is a modest increase in re-execution time (approximately 20% longer) due to the need to rebuild the environment and recompute intermediate results, a trade-off the authors argue is acceptable given the substantial savings in storage.
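The trade-off reported above is easy to quantify. A small illustrative calculation, using the summary's 85% storage reduction and 20% runtime overhead (the baseline figures passed in are hypothetical):

```python
def pipeline_centric_tradeoff(baseline_storage_gb: float,
                              baseline_runtime_h: float,
                              storage_reduction: float = 0.85,
                              reexec_overhead: float = 0.20):
    """Apply the reported ~85% storage reduction and ~20% re-execution
    overhead to hypothetical baseline figures; returns (storage, runtime)."""
    retained_gb = baseline_storage_gb * (1.0 - storage_reduction)
    reexec_h = baseline_runtime_h * (1.0 + reexec_overhead)
    return retained_gb, reexec_h

# e.g. a hypothetical 1000 GB / 10 h baseline run:
# storage drops to ~150 GB while a full replay takes ~12 h.
```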
Beyond astronomy, the paper argues that the model is broadly applicable to any domain that relies on large, repeatable pipelines: climate modeling, genomics, high-energy physics, and more. By integrating with version-control systems (e.g., Git) the pipeline definition can be versioned alongside code, providing a unified provenance and source-code history. The authors outline future work, including the development of tools to automatically extract pipeline definitions from existing workflow managers, mechanisms for fine-grained access control and metadata sharing in multi-institution collaborations, and intelligent caching strategies that selectively retain high-value intermediate results.
In conclusion, the pipeline-centric provenance model reorients provenance thinking from "what individual tasks did" to "how the whole workflow is defined and executed." This shift yields dramatic storage efficiencies, maintains rigorous reproducibility, and offers a scalable framework that can be adopted across many data-intensive scientific disciplines.