

We propose an algorithm for dynamic load balancing of synchronous Monte Carlo simulations on heterogeneous multiprocessor systems. The algorithm dynamically repartitions the simulation domain among the processors according to their actually available resources. A simple performance model is used to analyze the behaviour of synchronous Monte Carlo simulations on multiprocessor systems with distributed memory; under the model's assumptions, the proposed partitioning achieves optimal load balancing. The algorithm is implemented and tested for the two-dimensional Ising model.

The algorithm proceeds in three main stages: reading the input parameters, periodically repartitioning the domain, and communicating the resized subdomains between processors. The repartitioning is performed at regular intervals so that the decomposition adapts to changing processor resources, and a resize is triggered only when the predicted change exceeds a configurable threshold. Experimental results show significant performance improvements over runs without load balancing.

The performance model assumes that processor resources vary slowly compared with the time spent on one sweep, and uses a characteristic scale L of the lattice to quantify processing resources. It predicts that the optimal strategy is to size each subdomain in proportion to the computing power of its processor, so that all processors finish a sweep at the same time. The proposed algorithm applies this strategy repeatedly to track dynamic changes in processor resources.

The experiments confirm that the algorithm yields large performance gains over runs without load balancing, especially on strongly heterogeneous multiprocessor systems. Results for the two-dimensional Ising model demonstrate its effectiveness under dynamically changing conditions.


An Algorithm for Dynamic Load Balancing of Synchronous Monte Carlo Simulations on Multiprocessor Systems

Peter Altevogt∗, Andreas Linke

Institute for Supercomputing and Applied Mathematics (ISAM), Heidelberg Scientific Center, IBM Deutschland Informationssysteme GmbH, Vangerowstr. 18, 69115 Heidelberg, Germany. Tel.: 06221–59–4471

arXiv:hep-lat/9310021v1, 18 Oct 1993

Abstract: We describe an algorithm for dynamic load balancing of geometrically parallelized synchronous Monte Carlo simulations of physical models. This algorithm is designed for a (heterogeneous) multiprocessor system of the MIMD type with distributed memory. The algorithm is based on a dynamic partitioning of the domain of the algorithm, taking into account the actual processor resources of the various processors of the multiprocessor system.

Keywords: Monte Carlo Simulation; Geometric Parallelization; Synchronous Algorithms; Dynamic Load Balancing; Dynamic Resizing

IBM preprint 75.93.08, hep-lat/9310021

∗altevogt@dhdibm1.bitnet

1 Introduction

During the last years, Monte Carlo simulations of scientific problems have turned out to be of outstanding importance [1, 2, 3, 4, 5]. It is now a general belief within the community of computational scientists that multiprocessor systems of the MIMD^1 type with distributed memory are the most appropriate computer systems to provide the computational resources needed to solve the most demanding problems, e.g. in High Energy Physics, Material Sciences or Meteorology.

When implementing a synchronous parallel algorithm on a heterogeneous MIMD system with distributed memory (e.g. on a cluster of different workstations), a load balancing between the processors of the system (taking into account the actual resources available on each node) turns out to be crucial, because the processor with the least resources determines the speed of the complete algorithm. The heterogeneity of the MIMD system may result not only from heterogeneous hardware resources, but also from a heterogeneous use of homogeneous hardware resources (e.g. on a workstation cluster, several serial tasks may run on some of the workstations for some time in addition to the parallel application; this results in a temporary heterogeneity of the cluster, even if the workstations of the cluster are identical). This kind of heterogeneity can in general only be detected during the runtime of the parallel algorithm.

Therefore, the usual approach of geometric parallelization [6, 7], which parallelizes algorithms by a static decomposition of a domain into subdomains and associates each subdomain with a processor of the multiprocessor system, is not appropriate for a heterogeneous multiprocessor system. Instead, on a heterogeneous system this geometric parallelization should be done dynamically.

In the sequel, we consider geometrically parallelized Monte Carlo simulations consisting of update algorithms (e.g. Metropolis or heatbath algorithms) defined on, e.g., spins at the sites of a lattice (as in Ising models), matrices defined on the links of the lattice (as in Lattice Gauge Theories), etc. In general, a synchronization of the parallel processes takes place after each sweep through the lattice (each iteration). For this class of simulations, we will introduce an algorithm for dynamic load balancing. Implementing and testing the algorithm for the two-dimensional Ising model, we will show that this algorithm may drastically improve the performance of the Monte Carlo simulation on a heterogeneous multiprocessor system.

The paper consists of basically three parts. In the first part we introduce a simple model for analyzing the performance of synchronous Monte Carlo simulations on multiprocessor systems with distributed memory, in the second part we present our algorithm for the dynamic load balancing, and finally we present our numerical results.

^1 Multiple Instruction stream, Multiple Data stream. For an excellent overview of the various models of computation like SISD, SIMD, etc., see [8].

2 The Performance Model

We consider a heterogeneous^2 multiprocessor system consisting of $n$ processors. For a parallelized Monte Carlo simulation we measure^3 at times $t_{MC}$ the times $\Delta t_i^{t_{MC}}$ the simulation has taken for a fixed number of sweeps on each of the processors^4. The parallelization is done by geometric parallelization, associating with each of the processors a sublattice $i$ with a characteristic scale $L_i^{t_{MC}}$ (e.g. a characteristic side of the sublattice, its volume, etc.). These scales are chosen such that

\[ L_i^{t_{MC}} = c_i^{t_{MC}} \, L \tag{1} \]

holds. Here $L$ denotes the scale of the complete lattice, and the $c_i^{t_{MC}}$ are real numbers with $0 < c_i^{t_{MC}} < 1$ and $\sum_i c_i^{t_{MC}} = 1$^5. Using these parameters, we can calculate quantities $P_i^{t_{MC}}$, characterizing the computing resources of processor $i$ at time $t_{MC}$:

\[ P_i^{t_{MC}} := \frac{c_i^{t_{MC}}}{\Delta t_i^{t_{MC}}} \tag{2} \]

Assuming that the resources of the processors vary slowly compared with the time the simulation takes for one sweep,

\[ P_i^{t_{MC}+1} \sim P_i^{t_{MC}}, \tag{3} \]

we set

\[ P_i = P_i^{t_{MC}} = \mathrm{const.} \tag{4} \]

Now we reinterpret formula (2): for fixed $P_i$ we want to calculate a set $\{c_i^{t_{MC}+1}\}$ such that the time for the next sweep (excluding the time spent for communication)^6,

\[ \Delta t(\{c_i\}) := \max_i \Delta t_i(c_i) \quad \text{with} \quad \Delta t_i(c_i) := \frac{c_i}{P_i} \tag{5} \]

for $i = 1, \ldots, n$, has a minimal value:

\[ \Delta t_{\min} := \min_{\{c_i\}} \Delta t(\{c_i\}). \tag{6} \]

A necessary condition for this solution is obviously that all $\Delta t_i$ must be equal^7. Remembering the normalization condition on the constants $c_i$, we arrive at

\[ c_i = \frac{P_i}{\sum_i P_i}, \tag{7} \]

with $i = 1, \ldots, n$. Using (5), this results in

\[ \Delta t_{\min} = \frac{1}{\sum_i P_i}. \tag{8} \]

For a homogeneous system (all $P_i$ equal), (7) would give

\[ c_i = \frac{1}{n}. \tag{9} \]

For a heterogeneous system this choice of $\{c_i\}$ results in (using (5))

\[ \Delta t(\{c_i = \tfrac{1}{n}\}) = \frac{1}{n} \, \frac{1}{\min_i P_i}. \tag{10} \]

As the homogeneity $H$ of the multiprocessor system we therefore define the ratio of $\Delta t_{\min}$ to (10):

\[ H = \frac{n \, \min_i P_i}{\sum_i P_i}, \tag{11} \]

with $0 \le H \le 1$. The speedup $S$ that can be obtained by the dynamic resizing of the sublattices is the inverse of $H$:

\[ S = \frac{1}{H}. \tag{12} \]

Rewriting (8) in terms of $H$, we arrive at

\[ \Delta t_{\min} = \frac{1}{n} \, \frac{H}{\min_i P_i}. \tag{13} \]

^2 In the sense of the introduction.
^3 Using e.g. a library routine provided by the operating system.
^4 Here the times $\Delta t_i$ denote "wall clock times" measured in seconds, and $t_{MC}$ denotes the "internal" time of the simulation, measured in numbers of Monte Carlo sweeps.
^5 The initial choice of the parameters $L_i^{t_{MC}=0}$, respectively of the $c_i^{t_{MC}=0}$, is quite arbitrary; one could choose e.g. $c_i^{t_{MC}=0} = \frac{1}{n}$ for all $i$.
^6 From now on throughout this section we always consider the times $\Delta t$, $\Delta t_i$, etc. and the coefficients $\{c_i\}$ to be taken at $t_{MC}+1$. For the sake of clarity we therefore drop this index throughout this section.
^7 Assume e.g. $\Delta t_1 > \Delta t_2$. Then we could easily make $\Delta t_1$ smaller by a redefinition of $c_1$ and $c_2$ with $c_1 + c_2 = \mathrm{const.}$
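As a minimal illustration (ours, not from the paper) of how equations (7), (8), (11) and (12) combine, the following Python sketch computes the optimal partition fractions, the homogeneity and the attainable speedup from a set of measured processing powers $P_i$:

```python
def optimal_partition(powers):
    """Optimal domain split for measured processing powers P_i.
    Returns the fractions c_i (eq. 7), the minimal sweep time (eq. 8),
    the homogeneity H (eq. 11) and the attainable speedup S (eq. 12)."""
    total = sum(powers)
    c = [p / total for p in powers]            # eq. (7)
    dt_min = 1.0 / total                       # eq. (8)
    H = len(powers) * min(powers) / total      # eq. (11)
    return c, dt_min, H, 1.0 / H               # S = 1/H, eq. (12)

# Example: three equally fast processors plus one at a quarter speed.
c, dt_min, H, S = optimal_partition([1.0, 1.0, 1.0, 0.25])
print(c, H, S)   # H ~ 0.31, so dynamic resizing can buy a speedup of ~3.25
```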

For later comparison with our experimental data, let us look at the special case $P := P_1 = \ldots = P_{n-1} > P_n =: P_{\min}$. In this case we have for $\Delta t_{\min}$:

\[ \Delta t_{\min} = \Delta t_P \, \frac{n - H}{n - 1}, \tag{14} \]

with $\Delta t_P = \frac{1}{nP}$^8. Without load balancing, the time for the total simulation will be determined by the time spent on processor $n$:

\[ \Delta t_{\max} = \frac{1}{n P_n} = \Delta t_P \, \frac{n - H}{(n - 1) H} = \Delta t_{\min} \, \frac{1}{H}. \tag{15} \]

Figure 1 shows the "optimal" curve (using dynamic load balancing),

\[ \frac{\Delta t_P}{\Delta t_{\min}} = \frac{n - 1}{n - H}, \tag{16} \]

and the one obtained without any load balancing,

\[ \frac{\Delta t_P}{\Delta t_{\max}} = \frac{n - 1}{n - H} \, H, \tag{17} \]

for $n = 4$. These curves will be compared with our experimental results.

[Figure 1: Performance predicted by our model for 4 processors with and without load balancing, plotted as $\Delta t_P / \Delta t$ versus the homogeneity $H$.]

^8 Using $P_{\min} = \frac{(n-1)H}{n-H} \, P$.
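For comparison with Figure 1, here is a short sketch (again ours, not the paper's code) that evaluates the two model curves (16) and (17) for $n = 4$:

```python
def dt_ratio_balanced(H, n=4):
    """Eq. (16): Delta t_P / Delta t_min with dynamic load balancing."""
    return (n - 1) / (n - H)

def dt_ratio_unbalanced(H, n=4):
    """Eq. (17): Delta t_P / Delta t_max without load balancing."""
    return (n - 1) * H / (n - H)

for H in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"H={H:.1f}  balanced={dt_ratio_balanced(H):.3f}  "
          f"unbalanced={dt_ratio_unbalanced(H):.3f}")
# Both curves meet at H = 1 (homogeneous cluster); as H drops, the
# unbalanced performance collapses while the balanced curve stays
# close to (n - 1) / n.
```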

3 The Algorithm

In this section we describe our algorithm to perform the dynamic load balancing, based on the performance model described above.

3.1 The Input

- A characteristic scale $L$ of the lattice (e.g. a side length of the lattice).
- The number $n$ of processors of the multiprocessor system.
- The total number of iterations $n_{iter}$ to be done by the simulation.
- The number of iterations $n_{resize}$ after which a resizing of the sublattices may take place.
- A control parameter $\epsilon$ with $0 < \epsilon < 1$ to determine if a resizing should be done.

3.2 The Output

- A dynamic resizing of the domains associated with each processor of the multiprocessor system, taking into account the actual resources of the processors.

3.3 Formal Steps

1. Read the input.
2. Introduce characteristic scales $L_i^{t_{MC}}$ of the sublattices with $i = 1, \ldots, n$ and $t_{MC} = 1, \ldots, n_{iter}/n_{resize}$, where $i$ denotes the processors and $t_{MC}$ counts the number of resizings that have been done.
3. Calculate the initial characteristic sizes of the sublattices $L_i^{t_{MC}=0}$ for all processors according to $L_i^{t_{MC}=0} = \frac{L}{n}$.
4. Associate each of the sublattices with one of the processors.
5. Do on each processor $i = 1, \ldots, n$ (in parallel):
   - Set $t_{MC} = 0$.
   - For $m = 1, \ldots, n_{iter}$:
     - Perform an iteration of the Monte Carlo update algorithm on the sublattice^9.
     - If $(m \bmod n_{resize}) = 0$, then:
       - Measure the wall-clock time $\Delta t_i^{t_{MC}}$ spent on processor $i$ for doing the calculations, excluding the time spent for communications.
       - Calculate
         \[ P_i^{t_{MC}} := \frac{L_i^{t_{MC}}}{\Delta t_i^{t_{MC}}} \tag{18} \]
         to measure the actual resources of each node of the multiprocessor system.
       - Communicate the results to all processors of the multiprocessor system.
       - Calculate the new characteristic sizes $L_i^{t_{MC}+1}$ of the sublattices with
         \[ L_i^{t_{MC}+1} = \frac{P_i^{t_{MC}}}{\sum_{j=1}^{n} P_j^{t_{MC}}} \, L \quad (i = 1, \ldots, n-1) \tag{19} \]
         and
         \[ L_n^{t_{MC}+1} = L - \sum_{j=1}^{n-1} L_j^{t_{MC}+1}. \tag{20} \]
       - Resize the sublattices if
         \[ |L_i^{t_{MC}+1} - L_i^{t_{MC}}| > \epsilon L. \tag{21} \]
         This step may include the communication of parts of the sublattices between the processors and is certainly the critical part of the algorithm. We will introduce an algorithm for this resizing for a special case below (a code sketch of the whole rebalancing step follows this list).
       - Set $t_{MC} = t_{MC} + 1$.

^9 Possibly including communication with other processors.
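The rebalancing branch of step 5 can be condensed into a few lines of Python. This is a sketch of the decision logic only, under the assumption that the wall-clock measurement, the all-to-all communication and the actual slice redistribution happen elsewhere:

```python
def rebalance(L, sizes, dt, eps):
    """One load-balancing decision, following eqs. (18)-(21).
    L: characteristic scale of the full lattice; sizes: current L_i;
    dt: wall-clock times measured per processor; eps: control parameter."""
    n = len(sizes)
    P = [sizes[i] / dt[i] for i in range(n)]        # eq. (18)
    total = sum(P)
    new = [P[i] / total * L for i in range(n - 1)]  # eq. (19)
    new.append(L - sum(new))                        # eq. (20)
    # eq. (21): resize only when some change exceeds the threshold eps*L
    if any(abs(new[i] - sizes[i]) > eps * L for i in range(n)):
        return new      # caller redistributes slices accordingly
    return sizes        # below threshold: keep the current partition
```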

In the sequel we present an algorithm for resizing the sublattices for the special case that the splitting of the sublattices takes place only in one dimension. We use the host-node (respectively client-server) parallel programming paradigm (see [7]), associating each sublattice with a server process and leaving the more administration-oriented tasks (like reading the global parameters of the simulation, starting the server processes, etc.) to the host process^10. Let us assume the size of the lattice in the direction of the splitting to be $L$, and that the host process holds arrays $a[i]$ with $i = 1, \ldots, n+1$ for the "Monte Carlo times" $t_{MC}$ and $t_{MC}-1$, containing the first coordinate in that direction of the "slice" of the lattice associated with each processor:

\[ a[1] = 1 \le a[2] \le \ldots \le a[n+1] = L + 1. \tag{22} \]

^10 Of course these tasks could also be done by the server processes, resulting in the hostless paradigm of parallel programming. Therefore, our algorithm could also be implemented in a hostless model, and our limitation to the host-node model is not a loss of generality.

(In terms of the constants $\{c_i\}$ this would mean $c_i = \frac{a[i+1] - a[i]}{L}$.) Now the host process sends messages containing instructions to the node processes in two passes:

1. For $i = 2, \ldots, n$: if $(d := a^{t_{MC}}[i] - a^{t_{MC}-1}[i]) > 0$, send a message to server $i$ telling it to send its "first" $d$ slices to processor $i-1$.

2. For $i = n-1, n-2, \ldots, 1$: if $(d := a^{t_{MC}}[i+1] - a^{t_{MC}-1}[i+1]) < 0$, send a message to server $i$ telling it to send its "last" $|d|$ slices to processor $i+1$.

The node processes wait for messages from either the host process or from neighbouring node processes. If there are not enough slices available on a node process to be sent, the node process waits for a message from a neighbouring node process to receive additional slices. This two-pass schedule prevents deadlocks.
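A sketch of the host's two-pass message schedule in Python; `send_to_server` is a hypothetical stand-in for the PVM send actually used in the paper, and the paper's 1-indexed boundary array $a[1..n+1]$ is stored here 0-indexed as `a[0..n]`:

```python
def host_resize_messages(a_new, a_old, n, send_to_server):
    """Two-pass schedule for redistributing lattice slices (section 3).
    a_new, a_old: boundary arrays at t_MC and t_MC - 1, with a[0] = 1
    and a[n] = L + 1. send_to_server(i, msg) is a hypothetical
    messaging primitive standing in for the PVM send."""
    # Pass 1: a boundary that moved up means the server above it
    # hands its first d slices down to the processor below.
    for b in range(1, n):                 # paper's i = 2, ..., n
        d = a_new[b] - a_old[b]
        if d > 0:
            send_to_server(b + 1, ("send_first_slices", d, "to", b))
    # Pass 2: a boundary that moved down means the server below it
    # hands its last |d| slices up to the processor above.
    for b in range(n - 1, 0, -1):         # paper's i = n-1, ..., 1
        d = a_new[b] - a_old[b]
        if d < 0:
            send_to_server(b, ("send_last_slices", -d, "to", b + 1))
```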

If the resources of the processors of the multiprocessor system change very rapidly, multiple communications of data may be necessary and will drastically reduce the efficiency of this algorithm. But this is consistent with the fact that our complete approach to dynamic load balancing is anyhow only valid for systems with moderately varying resources, as was already pointed out at the beginning of section 2, see (3).

4 Results for the Two-Dimensional Ising Model

The above described algorithm has been implemented for the parallelized simulation of the two-dimensional Ising model on a cluster of four IBM RISC System/6000-550 workstations [7], using the PVM programming environment [9, 10]. Here we have a two-dimensional lattice which is divided into stripes. The objects defined on the lattice sites are spins (i.e. binary variables), and an iteration defined on these objects consists e.g. of a Metropolis algorithm to generate a new spin configuration on the lattice. Each stripe is associated with one workstation.
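For concreteness, here is a minimal pure-Python sketch (not the paper's optimized code) of one Metropolis sweep over such a stripe; the exchange of boundary rows with neighbouring processors, which the parallel implementation needs, is elided and replaced by periodic boundaries:

```python
import math
import random

def metropolis_sweep(spins, beta):
    """One Metropolis sweep over a stripe, stored as a list of rows of
    +/-1 spins (coupling J = 1). Periodic boundaries are used inside the
    stripe for simplicity; the real parallel code instead exchanges
    boundary rows with the neighbouring processors."""
    h, w = len(spins), len(spins[0])
    for x in range(h):
        for y in range(w):
            s = spins[x][y]
            # Sum of the four nearest neighbours (periodic in this sketch).
            nb = (spins[(x + 1) % h][y] + spins[(x - 1) % h][y] +
                  spins[x][(y + 1) % w] + spins[x][(y - 1) % w])
            dE = 2.0 * s * nb   # energy change if spin s is flipped
            if dE <= 0 or random.random() < math.exp(-beta * dE):
                spins[x][y] = -s
```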

The characteristic scales of the stripes are their widths, and the characteristic scale of the lattice is the sum of all widths. The cluster being completely homogeneous, the heterogeneous situation has been simulated by starting independent processes on one or several nodes of the cluster. This allows the heterogeneity of the multiprocessor system to be introduced in a controlled manner^11, i.e. to vary the homogeneity $H$ and measure (16) resp. (17) as functions of $H$. Our results are presented in figure 2 for a $1000 \times 1000$ and a $2000 \times 2000$ lattice. One clearly sees the qualitative agreement with the prediction of our performance model, see figure 1^12.

[Figure 2: Performance measured for 4 processors, shown as $\Delta t_P / \Delta t$ versus the homogeneity $H$ for the $1000 \times 1000$ and $2000 \times 2000$ lattices, each with and without load balancing.]

A different point of view consists of looking at the (mega) updates done by the Metropolis algorithm on each spin per second ("MUps")^13. These are presented for a $1000 \times 1000$ and a $2000 \times 2000$ lattice as a function of $H$ in figures 3 and 4, with the dynamic load balancing being done after a certain number of sweeps. It turns out that the optimal number of sweeps between load-balancing steps depends on the size of the problem.

[Figure 3: Performance measured in MUps for 4 processors for a $1000 \times 1000$ lattice, without load balancing and with load balancing after 1, 2, 3 and 50 sweeps.]

[Figure 4: Performance measured in MUps for 4 processors for a $2000 \times 2000$ lattice, without load balancing and with load balancing after 1, 2, 3 and 50 sweeps.]

^11 During the measurements cited below, the cluster has been dedicated to our application.
^12 Considering the fact that we have not included the time spent for communication in our model, a quantitative agreement between the theoretical and measured performance cannot be expected. An inclusion of the communication in our model would be very difficult and highly system dependent, e.g. because system parameters like latency and bandwidth may be complicated functions of the homogeneity $H$.
^13 These "MUps" constitute a benchmark for spin models.
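The MUps figure of merit is simply the number of spin updates per second in millions; a tiny sketch of the arithmetic, under the assumption of a square $L \times L$ lattice:

```python
def mups(L, sweeps, wall_seconds):
    """Mega spin-updates per second: each sweep updates all L*L spins."""
    return L * L * sweeps / (wall_seconds * 1e6)

print(mups(1000, 50, 10.0))  # 50 sweeps of a 1000 x 1000 lattice in 10 s -> 5.0
```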

5 Summary

We have introduced an algorithm for dynamic load balancing of synchronous Monte Carlo simulations on a heterogeneous multiprocessor system with distributed memory. Implementing this algorithm for the two-dimensional Ising model, we have shown that it may result in a speedup of a factor 5-6 for the above described class of geometrically parallelized algorithms. In many cases, the implementation of the algorithm is straightforward, with only little overhead in calculation and communication. For homogeneous systems, almost no performance is lost, because the algorithm detects via (21) that no resizing is necessary. For systems with slowly changing heterogeneity^14, the algorithm converges very fast, and its requirements concerning the computing environment are minimal: the system only has to provide a routine to measure the wall-clock time, and such a routine should be available on all operating systems. Considering the generality of the algorithm introduced above, it may also be usefully applied to problems other than Monte Carlo simulations, e.g. in parallel iterative methods for solving linear or nonlinear equations appearing in engineering problems^15.

^14 Compared to the time needed for one iteration (sweep).
^15 Here the domain consists of a lattice, with a matrix element being associated with each of the nodes of the lattice.

References

[1] M. Creutz (ed.), Quantum Fields on the Computer (World Scientific Publishing Co. Pte. Ltd., Singapore, 1992).

[2] K. Binder (ed.), The Monte Carlo Method in Condensed Matter Physics, Topics in Applied Physics, Vol. 71 (Springer-Verlag, Berlin, Heidelberg, 1992).

[3] K. Binder (ed.), Monte Carlo Methods in Statistical Physics, Topics in Current Physics, Vol. 7, 2nd edition (Springer-Verlag, Berlin, Heidelberg, 1986).

[4] K. Binder, D.W. Heermann, Monte Carlo Simulation in Statistical Physics: An Introduction, Springer Series in Solid-State Sciences, Vol. 80 (Springer-Verlag, Berlin, Heidelberg, 1988).

[5] J.P. Mesirov (ed.), Very Large Scale Computation in the 21st Century.

[6] D.W. Heermann and A.N. Burkitt, Parallel Algorithms in Computational Science, Springer Series in Information Sciences (Springer-Verlag, Berlin, Heidelberg, 1991).

[7] P. Altevogt, A. Linke, Parallelization of the Two-Dimensional Ising Model on a Cluster of IBM RISC System/6000 Workstations, Parallel Comp. 19, Vol. 9 (1993).

[8] S.G. Akl, The Design and Analysis of Parallel Algorithms, Prentice-Hall International Editions (Prentice-Hall, Inc., 1989).

[9] A. Beguelin, J.J. Dongarra, G.A. Geist, R. Manchek and V.S. Sunderam, A User's Guide to PVM Parallel Virtual Machine, Technical Report ORNL/TM-11826, Oak Ridge National Laboratory, July 1991.

[10] V.S. Sunderam, PVM: A Framework for Parallel Distributed Computing, Concurrency: Practice & Experience, Vol. 2, No. 4, Dec. 1990.

