Using constraint programming to resolve the multi-source/multi-site data movement paradigm on the Grid
📝 Abstract
Efficient data movement is one of the most essential aspects of a distributed environment: it is needed both to achieve fast, coordinated data transfer to collaborating sites and to create a distribution of data over multiple sites. With such capabilities at hand, truly distributed task scheduling with minimal latencies would be within reach of internationally distributed collaborations (such as those in HENP) seeking to scavenge or maximize geographically spread computational resources. However, it is often not at all clear (a) how to move data when it is available from multiple sources, or (b) how to move data to multiple compute resources so as to achieve an optimal usage of available resources. We present a method for creating a Constraint Programming (CP) model of grid network data transfer, consisting of sites, links, and their attributes (such as bandwidth), which also incorporates user tasks into the objective function of the optimal solution. We explore and explain the trade-off between schedule generation time and divergence from the optimal solution, and show how to improve the solution-finding time, and render it viable, by using a search-tree time limit, approximations, restrictions such as symmetry breaking or grouping of similar tasks, or by generating a sequence of optimal schedules through splitting of the input problem. Simulation results for each case also include a well-known Peer-2-Peer model; both the time taken to generate a schedule and the time needed to execute it are compared against the CP optimal solution. We additionally present a possible implementation aimed at bringing distributed datasets (multiple sources) to a given site in minimal time.
📄 Content
Computationally challenging experiments such as those of the High Energy and Nuclear Physics (HENP) community have developed a distributed computing approach (a.k.a. the Grid computing model) to face the massive needs of their Peta-scale experiments. The era of data-intensive computing has opened a vast arena for computer scientists in which to solve practical and exciting problems. One such HENP experiment is STAR (Solenoidal Tracker at Relativistic Heavy Ion Collider), located at Brookhaven National Laboratory (USA).
In addition to typical Peta-scale challenges and large computational needs, this experiment, as a running experiment, acquires a new set of valuable real data every year, introducing another dimension, safe data transfer, to the problem. From the yearly data sets, the experiment may produce many physics-ready derived data sets, which differ in accuracy as the problem becomes better understood over time. Thus, demands for large-scale storage management and an efficient scheme to distribute data grow as a function of time, while, on the other hand, end users may need to access data sets from previous years, and consequently at any point in time. Coordination is needed to prevent random access from destroying efficiency.
The user’s task is typically embarrassingly parallel; that is, a single program can run N times on fractions of the whole data set, split into N sub-parts, without any impact on scientific reliability, accuracy, or reproducibility. For a computer scientist, the issue then becomes how to split the embarrassingly parallel task into N jobs in the most efficient manner, knowing that the data set is spread over the world, and/or how to spread a dataset and best place the data for maximal efficiency and fastest processing of the task.
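The splitting step itself is simple once the file list is known; a minimal sketch (file names and the round-robin policy are illustrative, not the experiment's actual scheme) could look like this:

```python
def split_dataset(files, n_jobs):
    """Partition a list of files into n_jobs roughly equal sub-lists,
    one per job of the embarrassingly parallel task."""
    chunks = [[] for _ in range(n_jobs)]
    for i, name in enumerate(files):
        chunks[i % n_jobs].append(name)  # round-robin keeps chunk sizes balanced
    return chunks

# Ten hypothetical data files split across three jobs; each job runs the
# same program on its own chunk, and results are merged afterwards.
jobs = split_dataset([f"event_{i}.root" for i in range(10)], 3)
```

The hard part, addressed in the rest of the paper, is not the split but deciding *where* each chunk's data should come from and go to.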
The purpose of this work is to design and develop an automated system that efficiently uses all available computational and storage resources. It will relieve end users of choosing among the possible ways to execute their tasks (which includes locating and transferring data to sites that appear optimal to the user) while preserving fairness. The users’ required knowledge of the whole system and of data transfer tools will be reduced to communication with the future planner, which will guarantee that its decision to spread the task and data sets over the chosen sites was, under the current circumstances, the most efficient and optimal.
Rather than trying to solve the problem directly from a task scheduling perspective within a grid environment, we split the problem into several stages. By isolating data transfer/placement and computational challenges from each other we get an opportunity to study the behavior of both sets of constraints separately.
Individual tasks depend on datasets whose size has to be considered as well, since the time required for their staging and transfer is also significant. Therefore, the first milestone is to design and develop the data transfer planner/scheduler. For a given dataset needed at some site, its aim is to create a plan whose objective is to make the files of the dataset available at that site within the shortest time. The next requirement is to define and achieve fair-share transfers within a multi-user environment: if one user has asked for a huge amount of data at some site, another user who asked for just one file shouldn’t have to wait until the first user’s plan is finished.
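To make the first milestone's objective concrete, here is a toy exhaustive planner (site names, file sizes, and bandwidths are made up for illustration; a real CP solver would prune this search rather than enumerate it). Each file is replicated at a subset of source sites, each source has a link of known bandwidth to the destination, and transfers sharing a link run sequentially; the plan chosen is the file-to-source assignment whose last transfer finishes earliest:

```python
from itertools import product

def best_plan(file_sizes, replicas, bandwidth):
    """Enumerate every file->source assignment and return the one with
    the minimal makespan (finish time of the last file transfer)."""
    files = list(file_sizes)
    best, best_makespan = None, float("inf")
    for choice in product(*(replicas[f] for f in files)):
        load = {}  # total bytes queued on each source's link
        for f, src in zip(files, choice):
            load[src] = load.get(src, 0) + file_sizes[f]
        makespan = max(load[s] / bandwidth[s] for s in load)
        if makespan < best_makespan:
            best, best_makespan = dict(zip(files, choice)), makespan
    return best, best_makespan

plan, t = best_plan(
    file_sizes={"a": 100, "b": 100, "c": 50},        # arbitrary units
    replicas={"a": ["BNL", "LBL"], "b": ["BNL"], "c": ["BNL", "LBL"]},
    bandwidth={"BNL": 10, "LBL": 5},
)
```

Even this tiny instance shows the combinatorial structure: the optimum routes file `c` over the slower LBL link to unload the BNL link that file `b` must use, which is exactly the kind of non-obvious decision the CP model is meant to make at scale.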
The next milestone generalizes data transfer planning between sites. The goal of this stage is not to transfer files to one particular site, but to transfer them to several destinations. More precisely, the planner’s goal is to achieve the presence of each file (from the user’s input task) at one out of all the possible destinations, while still keeping the objective in mind: to minimize the finish time of the last file transfer the user waits for.
The second milestone is highly correlated with the final milestone: scheduling the data transfers together with particular tasks (jobs) on a grid. A subtask is not finished when a file has been transferred to some destination site, but when the user’s job executed at that site (and dependent on this file) is finished. Thus, the planner still has the freedom of choosing a destination site for each file, but it has to consider that each site has specific computational performance characteristics. These attributes include, for example, the number of CPUs available at the site or its actual load, so it can be more effective to transfer some files over a slower link to a computationally high-performance site (or vice versa). The final objective is to minimize the finish time of the last user job. In this article we focus on the first milestone.
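The trade-off the final milestone introduces can be sketched in a few lines (all site names and numbers below are invented for the example): once job completion time is transfer time plus compute time, a slower link to a better-provisioned site may win.

```python
def job_finish_time(file_size, bandwidth, cpu_count, work_per_cpu=1000):
    """Illustrative cost model: time to move the file plus time for the
    site's CPUs to process it (all units arbitrary)."""
    transfer = file_size / bandwidth
    compute = work_per_cpu / cpu_count
    return transfer + compute

# (bandwidth, CPUs) per candidate site -- made-up values
sites = {"fast_link": (50, 4), "slow_link": (10, 100)}
best = min(sites, key=lambda s: job_finish_time(2000, *sites[s]))
# fast_link: 2000/50 + 1000/4   = 40 + 250 = 290
# slow_link: 2000/10 + 1000/100 = 200 + 10 = 210  -> the slow link wins
```

This is the effect the text describes: a destination choice that looks bad from a pure network viewpoint becomes optimal once compute attributes enter the objective.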
In the following part we will present a formal description of the problem and an approach based on the Constraint Programming technique, used in artificial intelligence and operations research, in which we search for an assignment of the given variables from their domains such that all constraints are satisfied.
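The core CP search over variable assignments can be illustrated with a minimal backtracking sketch (this is a generic textbook scheme, not the paper's solver; real CP engines add constraint propagation, search heuristics, and the search-tree time limits mentioned in the abstract):

```python
def backtrack(domains, constraints, assignment=None):
    """Assign each variable a value from its domain so that every
    constraint holds on the (possibly partial) assignment."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(domains):
        return dict(assignment)          # all variables assigned
    var = next(v for v in domains if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if all(c(assignment) for c in constraints):
            result = backtrack(domains, constraints, assignment)
            if result is not None:
                return result
        del assignment[var]              # undo and try the next value
    return None                          # no consistent assignment exists

# Tiny example: two transfers competing for one link must take different
# time slots (a constraint is only checked once both variables are set).
sol = backtrack(
    {"t1": [0, 1], "t2": [0, 1]},
    [lambda a: a.get("t1") is None or a.get("t2") is None
               or a["t1"] != a["t2"]],
)
```

In the paper's setting the variables would stand for per-file transfer decisions (source, link, slot) and the constraints would encode link capacities and bandwidths.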