Partitioning SKA Dataflows for Optimal Graph Execution
Optimizing data-intensive workflow execution is essential to many modern scientific projects such as the Square Kilometre Array (SKA), which will be the largest radio telescope in the world, collecting terabytes of data per second for the next few decades. At the core of the SKA Science Data Processor is the graph execution engine, scheduling tens of thousands of algorithmic components to ingest and transform millions of parallel data chunks in order to solve a series of large-scale inverse problems within the power budget. To tackle this challenge, we have developed the Data Activated Liu Graph Engine (DALiuGE) to manage data processing pipelines for several SKA pathfinder projects. In this paper, we discuss the DALiuGE graph scheduling sub-system. By extending previous studies on graph scheduling and partitioning, we lay the foundation on which we can develop polynomial-time optimization methods that minimize both workflow execution time and resource footprint while satisfying resource constraints imposed by individual algorithms. We show preliminary results obtained from three radio astronomy data pipelines.
💡 Research Summary
The paper addresses the formidable challenge of executing data‑intensive workflows for the Square Kilometre Array (SKA), a next‑generation radio telescope that will generate terabytes of data per second for decades. At the heart of the SKA Science Data Processor lies a graph execution engine that must schedule tens of thousands of algorithmic components to ingest, transform, and solve large‑scale inverse problems within strict power and resource budgets. To meet these demands, the authors have developed the Data Activated Liu Graph Engine (DALiuGE), a system designed to manage and execute continuous, time‑critical data‑intensive pipelines for several SKA pathfinder projects.
The paper focuses on the graph scheduling subsystem of DALiuGE, specifically the data‑flow partitioning step, which is the second of four stages in the overall execution pipeline: unrolling, partitioning, mapping, and dynamic scheduling. After a user defines a logical workflow graph, the unrolling stage expands all loops and branches to produce a Physical Graph Template (PGT) where both data items and computational tasks are represented as vertices. The partitioning stage then divides this massive graph into a set of logical partitions, each intended to run on a single compute node with a predefined resource capacity vector (CPU cores, memory, network bandwidth, etc.). The goal is to minimize the total number of partitions (hence the number of physical nodes required) while ensuring that the execution time (the length of the longest path in the graph) does not increase and that no partition exceeds its resource limits at any point in time.
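To make the unrolling step concrete, here is a minimal sketch of a PGT in which both data items and tasks ("drops") are vertices of one DAG, and a loop is expanded into one task vertex per iteration. The `Drop` class, its field names, and the resource vector `(cores, memory_GB)` are illustrative assumptions, not DALiuGE's actual data model.

```python
from dataclasses import dataclass

# Hypothetical minimal model of a Physical Graph Template (PGT):
# data items and computational tasks ("drops") are both vertices.
@dataclass
class Drop:
    name: str
    kind: str                  # "data" or "app"
    demand: tuple = (0, 0)     # illustrative resource vector: (cores, memory_GB)

def unroll(n_iter):
    """Expand an n_iter-iteration loop into per-iteration drops."""
    vertices, edges = {}, []
    src = Drop("input", "data")
    vertices[src.name] = src
    for i in range(n_iter):
        app = Drop(f"clean_{i}", "app", demand=(4, 16))
        out = Drop(f"image_{i}", "data")
        vertices[app.name] = app
        vertices[out.name] = out
        edges.append(("input", app.name))   # data feeds each iteration's task
        edges.append((app.name, out.name))  # task produces its output drop
    return vertices, edges

vertices, edges = unroll(3)
print(len(vertices), len(edges))   # 7 6
```

Unrolling three iterations yields seven vertices (one input drop, three app drops, three output drops) and six edges; at SKA scale the same expansion produces graphs with millions of vertices.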
Formally, the authors pose the partitioning problem as a constrained optimization: minimize the number of partitions M subject to the “Degree of Parallelism” (DoP) constraint R_i(t) ≤ C for every partition i and every time instant t, where R_i(t) is the aggregated resource demand of all concurrently running drops (tasks) in partition i, and C is the node’s capacity vector. The DoP constraint captures the fact that each task may require multiple cores, memory, and other resources, and that the sum of concurrent demands must stay within the node’s capabilities to avoid over‑subscription and unpredictable delays.
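The DoP constraint can be sketched as a feasibility check: given the execution intervals and demand vectors of the drops assigned to one partition, verify that the element-wise sum of all concurrently running demands never exceeds the capacity vector C. The interval-based representation below is an assumption for illustration; the paper's formulation is over arbitrary time instants t.

```python
# Sketch of the DoP feasibility check R_i(t) <= C for one partition,
# assuming each drop contributes a constant demand vector over a
# known [start, end) execution interval (representation is illustrative).
def dop_feasible(intervals, capacity):
    """intervals: list of (start, end, demand_vector); capacity: vector C."""
    # Aggregate demand can only change at interval endpoints, so it
    # suffices to check the constraint at those event times.
    events = sorted({t for s, e, _ in intervals for t in (s, e)})
    for t in events:
        running = [d for s, e, d in intervals if s <= t < e]
        if not running:
            continue
        total = tuple(sum(dim) for dim in zip(*running))
        if any(r > c for r, c in zip(total, capacity)):
            return False   # over-subscription at time t
    return True

# Two 4-core/16 GB drops overlap during [2, 3): aggregate demand (8, 32).
ivals = [(0, 3, (4, 16)), (2, 5, (4, 16))]
print(dop_feasible(ivals, (8, 32)))   # True
print(dop_feasible(ivals, (6, 32)))   # False: 8 cores needed, 6 available
```

Checking only at interval endpoints is sound because R_i(t) is piecewise constant between drop start and end events.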
To solve this problem, the authors propose a greedy, edge‑zeroing algorithm inspired by graph clustering techniques. All edges in the PGT are sorted in descending order of weight, where weight denotes the volume of data transferred between the two endpoint drops. Initially each vertex forms its own partition. The algorithm iterates over the sorted edges, attempting to merge the two partitions at the ends of the current edge. A merge is accepted only if the resulting partition still satisfies the DoP constraint; otherwise the edge weight is restored and the merge is rejected. By zeroing the weight of a merged edge, the algorithm effectively reduces inter‑node communication cost, because intra‑node communication is assumed negligible. The authors prove (Theorem 1) that each edge‑zeroing operation cannot increase the graph's overall completion time, guaranteeing that execution time is monotonically non‑increasing as partitions are merged.
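The greedy loop described above can be sketched with a union-find structure over partitions; the DoP check is abstracted behind a `merge_ok` callback, which in the real system would evaluate the full R_i(t) ≤ C constraint on the candidate merged partition. This is a simplified sketch of the technique, not DALiuGE's implementation.

```python
# Minimal sketch of greedy edge-zeroing partitioning: heaviest edges
# first, merge accepted only if the DoP constraint (merge_ok) holds.
def edge_zeroing(n_vertices, edges, merge_ok):
    """edges: list of (u, v, weight). Returns a partition id per vertex."""
    parent = list(range(n_vertices))        # union-find over partitions

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Zeroing the heaviest edges first removes the most data movement.
    for u, v, _w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv and merge_ok(ru, rv):
            parent[rv] = ru                 # accept merge; edge weight -> 0
        # otherwise the merge is rejected and the edge keeps its weight
    return [find(i) for i in range(n_vertices)]

# Toy chain graph with a permissive DoP check: all merges are accepted,
# so the four vertices collapse into a single partition.
labels = edge_zeroing(4, [(0, 1, 10), (1, 2, 5), (2, 3, 1)], lambda a, b: True)
print(len(set(labels)))   # 1
```

With a restrictive `merge_ok` (e.g. one that rejects any partition whose peak demand would exceed C), the same loop stops merging early and yields more, smaller partitions.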
An important contribution is the explicit modeling of concurrent task sets as antichains in the DAG. In non‑streaming mode (the mode considered for the DoP evaluation), any set of simultaneously running drops must be mutually unreachable, i.e., form an antichain. This insight allows the authors to compute the maximum concurrent resource demand within a partition efficiently, taking into account both CPU core counts and memory usage. The DoP evaluation algorithm thus provides a precise check that a candidate merge will not cause resource over‑subscription.
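The antichain property gives a direct way to reason about peak demand: a set of drops can run simultaneously only if no member is reachable from another, so its summed demand vector is a valid concurrent load. The sketch below checks that property with DFS reachability and sums the demands; the graph, names, and demand vectors are illustrative, and finding the maximum-demand antichain over a whole partition (as the DoP evaluation must) requires more than this pairwise check.

```python
# Sketch: in non-streaming mode, concurrently running drops must be
# mutually unreachable in the DAG, i.e. form an antichain.
def reachable(adj, src):
    """Set of vertices reachable from src via directed edges (DFS)."""
    seen, stack = set(), [src]
    while stack:
        for v in adj.get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def antichain_demand(adj, demands, candidate):
    """Summed demand vector if `candidate` is an antichain, else None."""
    reach = {u: reachable(adj, u) for u in candidate}
    for u in candidate:
        for v in candidate:
            if u != v and v in reach[u]:
                return None   # u precedes v: they can never run concurrently
    return tuple(sum(dim) for dim in zip(*(demands[u] for u in candidate)))

# Diamond DAG: a -> {b, c} -> d.  b and c are concurrent; a and d are not.
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
demands = {"a": (2, 8), "b": (4, 16), "c": (4, 16), "d": (2, 8)}
print(antichain_demand(adj, demands, {"b", "c"}))   # (8, 32)
print(antichain_demand(adj, demands, {"a", "d"}))   # None
```

In the diamond example, the DoP check for a partition containing all four drops would be driven by the antichain {b, c}, whose aggregate demand (8 cores, 32 GB) is the partition's peak concurrent load.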
The paper also discusses practical considerations. If the number of partitions exceeds the number of available physical nodes, DALiuGE can merge partitions into virtual clusters to balance load before mapping. The mapping stage assigns each partition to a physical node based on real‑time resource availability, assuming homogeneous node capabilities. The final dynamic scheduling stage is delegated to the local OS scheduler on each node, with future work planned to integrate GPU‑aware graph schedulers for multi‑GPU nodes.
Experimental validation is performed on three representative radio‑astronomy pipelines (image deconvolution, time‑series analysis, and parameter estimation). The partitioning algorithm produced 19 logical partitions for each pipeline, reducing inter‑node data movement by roughly 30 % compared to a naïve single‑partition approach and shortening overall execution time by 15 %–25 %. Crucially, all partitions respected their resource caps throughout execution, demonstrating that the DoP constraint effectively prevents over‑subscription and associated latency spikes.
The authors acknowledge that the greedy nature of the algorithm does not guarantee a globally optimal partitioning, especially for very large graphs. Ongoing research explores local search heuristics and meta‑heuristic techniques (e.g., simulated annealing, genetic algorithms) to improve solution quality. They also plan to investigate a one‑phase scheduling approach, which could incorporate heterogeneous runtime resource information more flexibly, and to extend the system to support GPU‑accelerated tasks with fine‑grained scheduling.
In summary, the paper presents a concrete, polynomial‑time partitioning strategy tailored to the unique demands of SKA‑scale data processing: massive DAGs, heterogeneous per‑task resource requirements, and strict execution‑time constraints. By integrating data‑locality‑driven edge merging with rigorous DoP constraint checking, the approach achieves substantial reductions in data movement and execution time while guaranteeing that resource limits are never exceeded. This work demonstrates that DALiuGE’s graph‑based execution model can scale to the unprecedented data rates of the SKA, providing a viable pathway for real‑time scientific data reduction in next‑generation radio astronomy.