A Terminology for Scientific Workflow Systems
The term scientific workflow has evolved over the last two decades to encompass a broad range of compositions of interdependent compute tasks and data movements. It has also become an umbrella term for processing in modern scientific applications. Today, many scientific applications can be considered as workflows made of multiple dependent steps, and hundreds of workflow management systems (WMSs) have been developed to manage and run these workflows. However, no turnkey solution has emerged to address the diversity of scientific processes and the infrastructure on which they are implemented. Instead, new research problems requiring the execution of scientific workflows with some novel feature often lead to the development of an entirely new WMS. A direct consequence is that many existing WMSs share some salient features, offer similar functionalities, and can manage the same categories of workflows but also have some distinct capabilities. As a result, researchers who develop workflows face the complex question of selecting a WMS. This selection can be driven by technical considerations, to find the system that is the most appropriate for their application and for the resources available to them, or by other factors such as reputation, adoption, strong community support, or long-term sustainability. To address this problem, a group of WMS developers and practitioners joined their efforts to produce a community-based terminology of WMSs. This paper summarizes their findings and introduces this new terminology to characterize WMSs. This terminology is composed of five axes: workflow characteristics, composition, orchestration, data management, and metadata capture. Each axis comprises several concepts that capture the prominent features of WMSs. Based on this terminology, this paper also presents a classification of 23 existing WMSs according to the proposed axes and terms.
💡 Research Summary
The paper addresses the difficulty scientists face when choosing among the hundreds of scientific workflow management systems (WMSs) that have emerged over the past two decades. Recognizing that no single turnkey solution fits the diverse computational processes and heterogeneous infrastructures (HPC, cloud, edge) used today, the authors convened a community of WMS developers and practitioners (the Workflows Community Initiative, WCI) to devise a shared terminology that captures the essential capabilities of these systems without being tied to implementation details.
Five orthogonal axes are defined:
- Workflow Characteristics – describes the nature of the workflows a system can handle, including whether execution is task‑driven or data‑driven, the coupling tightness of tasks, and dynamic features such as conditional branches or runtime interventions.
- Composition – focuses on how workflows are expressed and assembled, covering description methods (scripts, APIs, GUIs), abstraction levels (flat, hierarchical, modular), and support for sub‑workflows or reusable components.
- Orchestration – details the execution model, ranging from simple launchers to sophisticated event‑driven or cloud‑native orchestrators, and the scheduling strategies employed to allocate resources across distributed environments.
- Data Management – captures data movement and storage strategies, including file‑based, streaming, or in‑memory transfers, storage locality (local, shared, distributed, replicated), granularity, and pipeline support.
- Metadata Capture – enumerates auxiliary information collected during execution, such as provenance, performance metrics, anomaly detection, and overall workflow state tracking, which are crucial for reproducibility and debugging.
Each axis contains multiple sub‑terms that are not mutually exclusive; a single WMS can be described by a combination of these sub‑terms. Using this framework, the authors systematically classified 23 existing WMSs (including Nextflow, Pegasus, Apache Airflow, and Swift/T). For each system they identified which terms apply, revealing patterns of overlap and differentiation that were previously obscured by ad‑hoc taxonomies.
The analysis demonstrates that the proposed terminology is more expressive than earlier classification schemes, allowing researchers to match their specific workflow requirements (e.g., need for dynamic branching, high‑throughput data handling, rich provenance) with the most suitable WMS and underlying execution environment. Moreover, because the terminology is community‑driven, it can evolve alongside emerging systems, ensuring long‑term relevance.
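To make the matching idea concrete, the following is a minimal Python sketch, not the authors' method: it stores each system as a set of terms per axis (so terms within an axis are not mutually exclusive) and ranks systems by the fraction of required terms they cover. The system names `SystemA` and `SystemB`, the term strings, and the `coverage` scoring rule are all hypothetical illustrations; they do not reproduce the paper's actual classification of any real WMS.

```python
from dataclasses import dataclass

# The five axes named in the paper; the term strings used below are illustrative.
AXES = (
    "workflow_characteristics",
    "composition",
    "orchestration",
    "data_management",
    "metadata_capture",
)


@dataclass
class WMSProfile:
    """A system described as a set of (non-exclusive) terms per axis."""
    name: str
    terms: dict  # axis name -> set of term strings

    def __post_init__(self):
        unknown = set(self.terms) - set(AXES)
        if unknown:
            raise ValueError(f"unknown axes: {sorted(unknown)}")

    def supports(self, axis, term):
        return term in self.terms.get(axis, set())


def coverage(profile, requirements):
    """Fraction of required (axis, term) pairs that the profile supports."""
    pairs = [(axis, t) for axis, ts in requirements.items() for t in ts]
    if not pairs:
        return 1.0  # nothing required, so every system qualifies
    met = sum(profile.supports(axis, t) for axis, t in pairs)
    return met / len(pairs)


def rank(profiles, requirements):
    """Order profiles from best to worst coverage of the requirements."""
    return sorted(profiles, key=lambda p: coverage(p, requirements), reverse=True)


# Hypothetical systems and term assignments, for illustration only.
system_a = WMSProfile("SystemA", {
    "workflow_characteristics": {"data-driven", "conditional-branches"},
    "data_management": {"file-based"},
    "metadata_capture": {"provenance"},
})
system_b = WMSProfile("SystemB", {
    "workflow_characteristics": {"task-driven"},
    "data_management": {"streaming", "in-memory"},
    "metadata_capture": {"performance-metrics"},
})

# A researcher's needs, expressed in the same vocabulary.
requirements = {
    "workflow_characteristics": {"conditional-branches"},
    "data_management": {"streaming"},
    "metadata_capture": {"provenance"},
}

for p in rank([system_b, system_a], requirements):
    print(p.name, round(coverage(p, requirements), 2))
```

A plain coverage fraction treats every requirement as equally important. In practice a researcher might weight some terms as mandatory, or combine this score with the non-technical factors the abstract mentions, such as community support and long-term sustainability.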
In summary, the paper contributes a robust, consensus‑based vocabulary for describing scientific workflow systems, provides a comprehensive classification of existing tools, and offers a practical decision‑making aid for scientists navigating the complex WMS landscape.