Scheduling multiple divisible loads on a linear processor network

Reading time: 5 minutes

📝 Original Info

  • Title: Scheduling multiple divisible loads on a linear processor network
  • ArXiv ID: 0706.4038
  • Date: 2007-06-28
  • Authors: not specified in this summary. Min, Veeravalli, and Barlas are the authors of the original papers that this work critiques and builds upon.

📝 Abstract

Min, Veeravalli, and Barlas have recently proposed strategies to minimize the overall execution time of one or several divisible loads on a heterogeneous linear network, using one or more installments. We show on a very simple example that their approach does not always produce a solution and that, when it does, the solution is often suboptimal. We also show how to find an optimal schedule for any instance, once the number of installments per load is given. Then, we formally state that any optimal schedule has an infinite number of installments under a linear cost model such as the one assumed in the original papers. Therefore, such a cost model cannot be used to design practical multi-installment strategies. Finally, through extensive simulations we confirm that the best solution is always produced by the linear programming approach, while the solutions of the original papers can be far from optimal.

📄 Full Content

Efficiently scheduling the tasks of a parallel application onto the resources of a distributed computing platform is critical for achieving high performance. This scheduling problem has been studied for a variety of application models. Some popular models consider a set of independent tasks, with neither task synchronization nor inter-task communications. Among these models, some focus on the case in which the number of tasks and the task sizes can be chosen arbitrarily. This corresponds to an application consisting of an amount of computation, or load, that can be divided into any number of independent pieces of arbitrary sizes. This is a perfectly parallel job: any sub-task can itself be processed in parallel, on any number of workers. In practice, this model approximates an application that consists of (very) large numbers of identical, low-granularity computations. This divisible load model has been widely studied in the last several years, and Divisible Load Theory (DLT) has been popularized by the landmark book written in 1996 by Bharadwaj, Ghose, Mani and Robertazzi [4]. DLT has been applied to a large spectrum of scientific problems, including linear algebra [6], image processing [12,15], video and multimedia broadcasting [1,2], database searching [5], biological pattern-matching [14], and the processing of large distributed files [17].
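The core intuition behind the divisible load model can be made concrete with a minimal sketch (illustrative only, not from the paper): with workers of different speeds and no communication cost, giving each worker a fraction of the load proportional to its speed makes all workers finish simultaneously.

```python
# Illustrative sketch of the "perfectly parallel" divisible-load idea.
# Hypothetical helper, not code from the paper: n workers with speeds
# s_i (load units per second) and zero communication cost. Assigning
# worker i the fraction s_i / sum(s) of the load W makes every worker
# finish at the same time, W / sum(s).
def split_load(W, speeds):
    total = sum(speeds)
    return [W * s / total for s in speeds]

chunks = split_load(W=10.0, speeds=[1.0, 3.0, 6.0])
times = [chunk / s for chunk, s in zip(chunks, [1.0, 3.0, 6.0])]
# all three workers take the same time: W / sum(speeds) = 10.0 / 10.0
```

Real platforms add communication costs and network topology, which is exactly what makes the scheduling problem studied here non-trivial.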

Divisible load theory provides a practical framework for mapping independent tasks onto heterogeneous platforms. From a theoretical standpoint, the success of the divisible load model is mostly due to its analytical tractability. Optimal algorithms and closed-form formulas exist for the simplest instances of the divisible load problem. We are aware of only one NP-completeness result in DLT [20]. This is in sharp contrast with the theory of task graph scheduling, which abounds in NP-completeness theorems and inapproximability results.

Several papers in the Divisible Load Theory field consider master-worker platforms [4,8,3]. However, in communication-bound situations, a linear network of processors can lead to better performance: on such a platform, several communications can take place simultaneously, thereby enabling a pipelined approach. Recently, Min, Veeravalli, and Barlas have proposed strategies to minimize the overall execution time of one or several divisible loads on a heterogeneous linear processor network [18,19]. Initially, the authors targeted single-installment strategies, that is, strategies under which a processor receives its entire share of a load in a single communication. But for cases where their approach failed to produce single-installment strategies, they also considered multi-installment solutions.

In this paper, we first show on a very simple example (Section 3) that the approach proposed in [19] does not always produce a solution and that, when it does, the solution is often suboptimal. The fundamental flaw of the approach of [19] is that the authors optimize the schedule load by load, instead of attempting a global optimization. The load-by-load approach is suboptimal and unduly over-constrains the problem. On the contrary, we show (Section 4) how to find an optimal schedule for any instance, once the number of installments per load is given. In particular, our approach always finds the optimal solution in the single-installment case. We also formally state (Section 5) that under a linear cost model for communication and computation, as in [18,19], an optimal schedule has an infinite number of installments. Such a cost model can therefore not be used to design practical multi-installment strategies. Finally, in Section 6, we report the simulations that we performed in order to assess the actual efficiency of the different approaches. We now start by introducing the framework.
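To give a feel for the linear programming approach, here is a minimal sketch for a drastically simplified instance: one load, one installment per processor, a store-and-forward reading of the communications, and all processors available at time 0. This is a hypothetical illustration of the technique, not the paper's actual model or code; `w[i]` and `c[i]` denote assumed unit computation and transfer costs, and `scipy` is assumed to be available.

```python
# Hedged sketch: minimizing the makespan of one divisible load on a chain
# of m processors via linear programming, under a simplified single-load,
# single-installment, store-and-forward variant of the model.
import numpy as np
from scipy.optimize import linprog

def chain_schedule(w, c):
    """w[i]: time to compute one load unit on processor P_{i+1};
       c[i]: time to send one load unit over link l_{i+1}.
       Returns (fractions alpha, makespan T)."""
    m = len(w)
    # Variable vector x = [alpha_1, ..., alpha_m, T]; objective: minimize T.
    cost = np.zeros(m + 1)
    cost[-1] = 1.0
    A_ub, b_ub = [], []
    for i in range(m):
        row = np.zeros(m + 1)
        # Receive time of processor i: each earlier link k carries the data
        # destined to processors k+1..m, at c[k] per load unit.
        for k in range(i):
            for j in range(k + 1, m):
                row[j] += c[k]
        row[i] += w[i]      # own computation time: w[i] * alpha_i
        row[-1] = -1.0      # receive + compute <= T
        A_ub.append(row)
        b_ub.append(0.0)
    A_eq = [np.append(np.ones(m), 0.0)]  # the fractions sum to 1
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0])
    return res.x[:m], res.x[m]

alphas, makespan = chain_schedule(w=[1.0, 1.0], c=[0.5])
# Two identical processors, one link: the LP balances the finish times,
# giving alpha = (0.6, 0.4) and makespan 0.6.
```

The key point carried over from the paper is that, once the number of installments is fixed, all timing constraints are linear in the load fractions, so a global optimum can be obtained by a single linear program instead of a load-by-load procedure.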

We use a framework similar to that of [18,19]. The target architecture is a linear chain of m processors (P_1, P_2, ..., P_m). Processor P_i is available from time τ_i. It is connected to processor P_{i+1} by the communication link l_i (see Figure 1). The target application is composed of N loads, which are divisible, which means that each load can be split into an arbitrary number of chunks of any size, and these chunks can be processed independently. All the loads are initially available on processor P_1, which processes a fraction of them and delegates (sends) the remaining fraction to P_2. In turn, P_2 executes part of the load that it receives from P_1 and sends the rest to P_3, and so on along the processor chain. Communications can be overlapped with (independent) computations, but a given processor can be active in at most a single communication at any time-step: sends and receives are serialized (this is the full one-port model).
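The chain mechanics above can be sketched as a small evaluator that, given the load fractions, replays the forwarding timeline. This is a hypothetical helper under a simplified interpretation (single load, each processor fully receives its data before computing, and forwarding to the successor overlaps its own computation), not the paper's exact model.

```python
# Hedged sketch: finish times of a given set of load fractions on the chain
# P_1 -> P_2 -> ... -> P_m, under a simplified store-and-forward reading of
# the model described above.
def finish_times(alpha, w, c, tau=None):
    """alpha[i]: fraction processed by P_{i+1}; w[i]: unit compute time;
       c[i]: unit transfer time of link l_{i+1}; tau[i]: availability time."""
    m = len(alpha)
    if tau is None:
        tau = [0.0] * m
    finishes = []
    recv = 0.0  # time at which the current processor has all its data
    for i in range(m):
        start = max(recv, tau[i])
        finishes.append(start + w[i] * alpha[i])
        if i < m - 1:
            # forward everything destined to P_{i+2}..P_m over link l_{i+1};
            # this communication overlaps P_{i+1}'s own computation
            recv = start + c[i] * sum(alpha[i + 1:])
    return finishes

# With alpha = (0.6, 0.4), w = (1, 1), c = (0.5,): P_1 finishes at 0.6,
# while P_2 has its data by time 0.2 and also finishes around 0.6.
```

Such an evaluator only checks a given schedule; it is the linear program, not this replay, that produces the optimal fractions.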

Since the last processor P_m cannot start computing before having received its first message, it is useful for P_1 to distribute the loads in several installments.

Reference

This content is AI-processed based on open access ArXiv data.
