A Time-driven Data Placement Strategy for a Scientific Workflow Combining Edge Computing and Cloud Computing
Compared to traditional distributed computing environments such as grids, cloud computing provides a more cost-effective way to deploy scientific workflows. Each task of a scientific workflow requires several large datasets that are located in different datacenters of the cloud computing environment, resulting in serious data transmission delays. Edge computing reduces these delays and allows a workflow's private datasets to be stored at fixed edge locations, but edge storage capacity is a bottleneck. It is a challenge to combine the advantages of edge computing and cloud computing to rationalize the data placement of scientific workflows and to optimize the data transmission time across different datacenters. Traditional data placement strategies maintain load balancing across a given number of datacenters, which results in a large data transmission time. In this study, a self-adaptive discrete particle swarm optimization algorithm with genetic algorithm operators (GA-DPSO) was proposed to optimize the data transmission time when placing data for a scientific workflow. This approach considers the characteristics of data placement in a combined edge-cloud environment, as well as the factors affecting transmission delay, such as the bandwidth between datacenters, the number of edge datacenters, and the storage capacity of edge datacenters. The crossover and mutation operators of the genetic algorithm were adopted to avoid the premature convergence of the traditional particle swarm optimization algorithm, which enhanced the diversity of population evolution and effectively reduced the data transmission time. The experimental results show that the data placement strategy based on GA-DPSO can effectively reduce the data transmission time during workflow execution in a combined edge-cloud environment.
💡 Research Summary
The paper addresses the growing problem of data‑transfer latency in scientific workflows that run on hybrid edge‑cloud infrastructures. In such workflows, each task typically requires several large input datasets that are stored across multiple geographically dispersed cloud data centers. Transferring these datasets over the network can dominate the overall execution time, especially when bandwidth is limited or when the same data must be fetched repeatedly by different tasks. Edge computing can alleviate this issue by placing frequently accessed data closer to the compute nodes, thereby shortening the physical distance and reducing transmission delay. However, edge nodes have limited storage capacity, which makes it non‑trivial to decide which datasets should reside at the edge and which should remain in the cloud. The authors therefore formulate a data‑placement problem whose objective is to minimize the total data‑transfer time incurred during the execution of a scientific workflow, while respecting constraints on edge storage capacity and the heterogeneous bandwidths among cloud‑cloud, cloud‑edge, and edge‑edge links.
To solve this combinatorial optimization problem, the authors propose a hybrid meta‑heuristic called GA‑DPSO (Genetic‑Algorithm‑enhanced Discrete Particle Swarm Optimization). The baseline algorithm, discrete PSO (DPSO), treats each candidate placement as a binary vector (1 = store at edge, 0 = store in cloud) and updates particle positions using a velocity‑based rule adapted to the discrete domain. DPSO is known for fast convergence but suffers from premature convergence to local optima. To counteract this, the authors embed two classic genetic‑algorithm operators: a crossover operator that exchanges portions of two parent vectors to generate offspring, and a mutation operator that flips a small number of bits at random. These operators are applied periodically during the swarm evolution, injecting diversity and allowing the search to escape stagnation. The algorithm's parameters (inertia weight, cognitive and social coefficients, crossover probability, and mutation probability) are tuned experimentally.
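The update step described above can be sketched as follows. This is a minimal illustration, not the paper's exact update rule: the function names (`crossover`, `mutate`, `evolve_particle`) and the specific choice of one‑point crossover applied against the personal and global bests are assumptions made for clarity.

```python
import random

def crossover(parent_a, parent_b, p_cross=0.8):
    """One-point crossover: with probability p_cross, splice a prefix of
    parent_a onto a suffix of parent_b; otherwise keep parent_a."""
    if random.random() < p_cross:
        point = random.randrange(1, len(parent_a))
        return parent_a[:point] + parent_b[point:]
    return parent_a[:]

def mutate(vector, p_mut=0.15):
    """Flip each bit independently with small probability, injecting the
    diversity that plain discrete PSO tends to lose."""
    return [bit ^ 1 if random.random() < p_mut else bit for bit in vector]

def evolve_particle(position, personal_best, global_best,
                    p_cross=0.8, p_mut=0.15):
    """One GA-style particle update: recombine the current position with
    the personal best, then with the global best, then mutate."""
    candidate = crossover(position, personal_best, p_cross)
    candidate = crossover(candidate, global_best, p_cross)
    return mutate(candidate, p_mut)
```

Recombining with the personal and global bests plays the role of PSO's cognitive and social terms, while the mutation step is what lets the swarm escape the stagnation the authors observed in pure DPSO.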
The problem model incorporates three key impact factors: (1) the bandwidth of each inter‑datacenter link, (2) the number of edge data centers available, and (3) the storage capacity of each edge data center. The objective function computes the transmission time for each required dataset as the dataset size divided by the bandwidth of the chosen path, summed over all workflow tasks. Additional constraints ensure that the total size of datasets assigned to any edge node does not exceed its capacity, and that each dataset is placed exactly once (either at an edge node or in the cloud). The model also accounts for data reuse: if a dataset is needed by multiple downstream tasks, storing it at the edge can avoid repeated transfers.
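As a concrete illustration of this objective, the sketch below scores one binary placement vector under a deliberately simplified model with a single edge node and one bandwidth per tier; the paper's full model has multiple edge datacenters and per‑link bandwidths, and the function name `placement_cost` is an assumption.

```python
def placement_cost(placement, sizes, bw_cloud, bw_edge, edge_capacity):
    """Total transmission time for a placement vector (1 = edge, 0 = cloud).

    Each dataset contributes size / bandwidth of its assigned tier.
    Placements that overflow the edge node's storage are infeasible
    and score infinity, enforcing the capacity constraint.
    """
    edge_load = sum(s for s, p in zip(sizes, placement) if p == 1)
    if edge_load > edge_capacity:
        return float("inf")  # capacity constraint violated
    return sum(s / (bw_edge if p == 1 else bw_cloud)
               for s, p in zip(sizes, placement))
```

In the real model, a dataset reused by several tasks would contribute one cloud transfer per consuming task unless it is cached at the edge, which is exactly why edge placement of shared datasets pays off.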
Experimental evaluation uses real scientific workflows (Montage, Epigenomics) and synthetic DAGs, combined with a range of network topologies. Scenarios vary the number of edge nodes (2–6), edge storage limits (10 GB–50 GB), and link bandwidths (e.g., 100 Mbps for edge‑edge, 500 Mbps for cloud‑edge). GA‑DPSO is compared against four baselines: (a) a traditional load‑balancing placement that distributes data evenly across a fixed number of data centers, (b) pure DPSO, (c) pure GA, and (d) a recent hybrid meta‑heuristic from the literature. Performance metrics include average total transmission time, worst‑case transmission time, number of iterations to convergence, and algorithm runtime.
Results show that GA‑DPSO consistently achieves the lowest transmission times. In bandwidth‑constrained settings with many edge nodes, the reduction reaches 20–35% relative to the load‑balancing baseline. Pure DPSO converges quickly but often gets trapped in sub‑optimal placements, while pure GA explores broadly but converges slowly, leading to higher overall runtime. The hybrid approach inherits the rapid convergence of PSO and the diversity‑preserving capability of GA, delivering both high solution quality and reasonable computational overhead. Sensitivity analysis reveals that a crossover probability of 0.7–0.9 and a mutation probability of 0.1–0.2 provide the best trade‑off between exploration and exploitation.
The authors identify several avenues for future work. First, extending the method to handle dynamic workflows where data sizes or task graphs evolve at runtime would require an online or incremental version of GA‑DPSO. Second, incorporating energy consumption and monetary cost into the objective could produce a multi‑objective placement strategy that balances latency, power, and expense. Third, real‑world deployment on actual edge‑cloud platforms would validate the simulation results and expose practical issues such as security, privacy, and fault tolerance. Finally, integrating predictive models for data access patterns could further improve placement decisions by anticipating future demand.
In summary, this study contributes a rigorously modeled data‑placement problem for hybrid edge‑cloud scientific workflows, a novel GA‑enhanced discrete PSO algorithm that effectively mitigates premature convergence, and extensive experimental evidence that the proposed approach significantly reduces data‑transfer latency compared with existing strategies. The work advances the state of the art in workflow scheduling and resource management for emerging distributed computing environments.