Parallelizing Deadlock Resolution in Symbolic Synthesis of Distributed Programs
Authors: Fuad Abujarad (Michigan State University), Borzoo Bonakdarpour (VERIMAG), Sandeep S. Kulkarni (Michigan State University)
L. Brim and J. van de Pol (Eds.): 8th International Workshop on Parallel and Distributed Methods in verifiCation 2009 (PDMC'09). EPTCS 14, 2009, pp. 92-106, doi:10.4204/EPTCS.14.7. © F. Abujarad, B. Bonakdarpour & S. S. Kulkarni. This work is licensed under the Creative Commons Attribution License.

Fuad Abujarad, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA (abujarad@cse.msu.edu)
Borzoo Bonakdarpour, VERIMAG Centre Équation, 2 ave de Vignate, 38610 Gières, France (borzoo@imag.fr)
Sandeep S. Kulkarni, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA (sandeep@cse.msu.edu)

Abstract. Previous work has shown that there are two major complexity barriers in the synthesis of fault-tolerant distributed programs: (1) generation of the fault-span, the set of states reachable in the presence of faults, and (2) resolution of deadlock states, from where the program has no outgoing transitions. Of these, the former closely resembles model checking and, hence, techniques for efficient verification are directly applicable to it. We therefore focus on expediting the latter with the use of multi-core technology. We present two approaches for parallelization, considering different design choices. The first approach is based on the computation of equivalence classes of program transitions (called group computation) that are needed due to the issue of distribution (i.e., the inability of processes to atomically read and write all program variables). We show that in most cases the speedup of this approach is close to the ideal speedup, and in some cases it is superlinear. The second approach uses the traditional technique of partitioning deadlock states among multiple threads. However, our experiments show that the speedup of this approach is small.
Consequently, our analysis demonstrates that the simple approach of parallelizing the group computation is likely to be the effective method for using multi-core computing in the context of deadlock resolution.

Keywords: Program transformation, Symbolic synthesis, Multi-core algorithm, Distributed programs.

1 Introduction

Given the current trend in processor design, where the number of transistors keeps growing as directed by Moore's law but clock speed remains relatively flat, it is expected that multi-core computing will be the key to utilizing such computers most effectively. As argued in [12], it is expected that programs and protocols from distributed computing will be especially beneficial in exploiting such multi-core computers.

One of the crucial issues in distributed computing is fault-tolerance. Moreover, as part of maintenance, it may be necessary to modify a program to add tolerance to faults that were not considered in the original design. In such maintenance, the existing functional properties of the program must continue to be preserved during the addition of fault-tolerance, i.e., no bugs should be introduced in such addition. For this reason, it would be highly beneficial if one could add such fault-tolerance properties using automated techniques.

One difficulty in adding fault-tolerance using automated techniques, however, is its complexity. In our previous work [4], we developed a symbolic (BDD-based) algorithm for adding fault-tolerance to distributed programs specified in terms of transition systems with state spaces larger than 10^30. We also identified a set of bottlenecks that compromise the effectiveness of our algorithm.

(∗ This work was partially sponsored by the COMBEST European project, NSF CNS 0914913, and ONR Grant N00014-01-1-0744.)
Based on the analysis of the experimental results from [4], we observed that, depending upon the structure of the given fault-intolerant distributed program, the performance of synthesis suffers from two major complexity obstacles, namely generation of the fault-span (i.e., the set of reachable states in the presence of faults) and resolution of deadlock states. Our focus in this paper is to evaluate the effectiveness of different approaches that utilize multi-core computing to reduce the time complexity of adding fault-tolerance to distributed programs. In particular, we focus on the second problem, i.e., resolution of deadlock states. Deadlock resolution is especially crucial in the context of dependable systems, as it guarantees that the synthesized fault-tolerant program meets its liveness requirements even in the presence of faults.

A program may reach a deadlock state because faults perturb the program to a new state that was not considered in the fault-intolerant program. Or, it may reach a deadlock state because some program actions are removed (e.g., because they violate safety in the presence of faults). To resolve a deadlock state, we either need to provide recovery actions that allow the program to continue its execution, or we need to eliminate the deadlock state by preventing the program execution from reaching it.

To evaluate the effectiveness of multi-core computing, we first need to identify the bottleneck(s) where multi-core features can provide the maximum impact. To this end, we present two approaches for parallelization. The first approach is based on the distributed nature of the program being synthesized. In particular, when a new transition is added (respectively, removed), since the process executing it has only a partial view of the program variables, we need to add (respectively, remove) a group of transitions based on the variables that cannot be read by the process.
The second approach is based on partitioning deadlock states among multiple threads. We show that while in most cases the speedup of the first approach is close to the ideal speedup (and in some cases superlinear), the second approach provides only a small performance benefit. Based on the analysis of these results, we argue that the simple approach that parallelizes the group computation is likely to provide the maximum benefit in the context of deadlock resolution for the synthesis of distributed programs.

Contributions of the paper. Our contributions in this paper are as follows:
• We present two approaches for expediting the resolution of deadlock states in the automated synthesis of fault-tolerance.
• We analyze these approaches in terms of three classic examples from distributed computing: Byzantine agreement [15], agreement in the presence of both failstop and Byzantine faults, and token ring [3].
• We discuss different design choices considered in these two approaches.

Organization of the paper. The rest of the paper is organized as follows. In Section 2, we define distributed programs and specifications. We illustrate the issues involved in the synthesis problem in the context of Byzantine agreement in Section 3. We present our two approaches, the corresponding experimental results, and analysis in Sections 4 and 5. Finally, we discuss related work in Section 6 and conclude in Section 7.

2 Programs, Specifications and Problem Statement

In this section, we define the problem statement for adding fault-tolerance. We begin with a fault-intolerant program, say p, that is correct in the absence of faults. We let p be specified in terms of its state space, S_p, and a set of transitions, δ_p ⊆ S_p × S_p. Whenever it is clear from the context, we use p and its transitions δ_p interchangeably. A sequence of states, ⟨s_0, s_1, ...⟩ (denoted σ), is a computation of p iff (1) ∀j : 0 < j < length(σ) : (s_{j−1}, s_j) ∈ p, i.e., in each step of this sequence a transition of p is executed, and (2) if the sequence is finite and terminates in s_j, then ∀s' :: (s_j, s') ∉ p (a finite computation reaches a state from where there is no outgoing transition).

A special subset of S_p, say S, identifies an invariant of p. By this we mean that if a computation of p begins in a state where S is true, then (1) S is true at all states of that computation and (2) the computation is correct. Since the algorithm for the addition of fault-tolerance begins with a program that is correct in the absence of faults, we do not explicitly need the program specification in the absence of faults. Instead, the predicate S is used to determine states from where the fault-tolerant program could recover in the presence of faults.

The goal of an algorithm that adds fault-tolerance is to begin with a program p and its invariant S and derive a fault-tolerant program, say p′, and its invariant, say S′. Clearly, one additional input to such an algorithm is f, the class of faults to which tolerance is to be added. Faults are also specified as a subset of S_p × S_p. Note that this allows modeling of different types of faults, such as transients, Byzantine (see Section 3.1), crash faults, etc. Yet another input to the algorithm for adding fault-tolerance is a safety specification, say SPEC_bt, that should not be violated in the presence of faults. We let SPEC_bt also be specified by a set of bad transitions, i.e., SPEC_bt is a subset of S_p × S_p.¹ Thus, it is required that in the presence of faults, the program should not execute a transition from SPEC_bt.

Now we define the problem of adding fault-tolerance. Let the input program be p, invariant S, faults f, and safety specification SPEC_bt.
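To make these definitions concrete, here is a minimal explicit-state sketch (our own illustration with hypothetical names, not the paper's symbolic BDD encoding): a program p is a set of transitions over a state space, a finite computation must end in a state with no outgoing transition, and a deadlock state is exactly such a state.

```python
# Hypothetical explicit-state sketch: states are plain values, a program p is a
# set of transitions (pairs of states). The paper instead encodes p symbolically.

def is_computation(sigma, p):
    """A finite sequence sigma is a computation of p iff every step is a
    transition of p and the last state has no outgoing transition."""
    steps_ok = all((sigma[j - 1], sigma[j]) in p for j in range(1, len(sigma)))
    terminal_ok = all(s0 != sigma[-1] for (s0, _) in p)
    return steps_ok and terminal_ok

def deadlock_states(states, p):
    """States from which p has no outgoing transition."""
    has_outgoing = {s0 for (s0, _) in p}
    return {s for s in states if s not in has_outgoing}

# Tiny example: 0 -> 1 -> 2, where 2 has no successor.
p = {(0, 1), (1, 2)}
print(deadlock_states({0, 1, 2}, p))   # {2}
print(is_computation([0, 1, 2], p))    # True: ends in the deadlock state 2
print(is_computation([0, 1], p))       # False: stops although 1 has a successor
```

The same definitions carry over to the symbolic setting by replacing set membership with BDD operations.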
Since our goal is to add fault-tolerance only, we require that no new computations are added in the absence of faults. Thus, if the output after adding fault-tolerance is program p′ and invariant S′, then S′ should not include any states that are not in S; without this restriction, p′ can begin in a state from where the correctness of p is unknown. Likewise, if (s_0, s_1) is a transition of p′ and s_0 ∈ S′, then (s_0, s_1) must also be a transition of p; without this restriction, p′ will have new computations in the absence of faults. Also, if p′ has no outgoing transition from a state s_0 ∈ S′, then it must be the case that p also has no outgoing transition from s_0; without this restriction, p′ may deadlock in a state that has no correspondence in p.

Additionally, p′ should be fault-tolerant. Thus, during a computation of p′, if faults from f occur, the program may be perturbed to a state outside S′. Just as the invariant captures the boundary up to which the program can reach in the absence of faults, we can identify a boundary up to which the program can reach in the presence of faults. Let this boundary (called the fault-span) be T. Thus, if any transition of p′ or f begins in a state where T is true, then it must terminate in a state where T is true. Moreover, if p′ is permitted to execute for a long enough time without perturbation by a fault, then p′ should reach a state where its invariant S′ is true. Based on this discussion, we define the problem of adding fault-tolerance as follows:

Problem statement 2.1. Given p, S, f and SPEC_bt, identify p′ and S′ such that:

• (C1): Constraints on the invariant
  – S′ ≠ ∅,
  – S′ ⇒ S,
• (C2): Constraints on transitions within the invariant
  – (s_0, s_1) ∈ p′ ∧ s_0 ∈ S′ ⇒ ((s_1 ∈ S′) ∧ (s_0, s_1) ∈ p),
  – s_0 ∈ S′ ∧ (∀s_1 :: (s_0, s_1) ∉ p′) ⇒ (∀s_1 :: (s_0, s_1) ∉ p), and
• (C3): There exists T such that
  – S′ ⇒ T,
  – s_0 ∈ T ∧ (s_0, s_1) ∈ (p′ ∪ f) ⇒ s_1 ∈ T ∧ (s_0, s_1) ∉ SPEC_bt,
  – s_0 ∈ T ∧ ⟨s_0, s_1, ...⟩ is a computation of p′ ⇒ (∃j : 0 < j < length(⟨s_0, s_1, ...⟩) : s_j ∈ S′).

¹ As shown in [14], permitting more general specifications can significantly increase the complexity of synthesis. We also showed that representing the safety specification using a set of transitions is expressive enough for most practical programs.

3 Issues in Automated Synthesis of Fault-Tolerant Programs

In this section, we use the example of Byzantine agreement [15] (denoted BA) to describe the issues in the automated synthesis of fault-tolerant programs. Towards this end, in Section 3.1, we describe the inputs used for synthesizing the Byzantine agreement problem. Subsequently, in Section 3.2, we identify the need for explicit modeling of the read/write restrictions imposed by the nature of a distributed program. Finally, in Section 3.3, we describe how deadlock states get created while revising the program to add fault-tolerance, and we illustrate our approach for managing them.

3.1 Input for the Byzantine Agreement Problem

The Byzantine agreement problem (BA) consists of a general, say g, and three (or more) non-general processes, say j, k, and l. The agreement problem requires a process to copy the decision chosen by the general (0 or 1) and finalize (output) that decision (subject to some constraints). Thus, each process of BA maintains a decision d; for the general, the decision can be either 0 or 1, and for the non-general processes, the decision can be 0, 1, or ⊥, where the value ⊥ denotes that the corresponding process has not yet received a decision from the general. Each non-general process also maintains a Boolean variable f that denotes whether that process has finalized its decision.
For each process, a Boolean variable b shows whether or not the process is Byzantine; the read/write restrictions (described in Section 3.2) ensure that a process cannot determine whether other processes are Byzantine. A Byzantine process can output different decisions to different processes. Thus, a state of the program is obtained by assigning each variable listed below a value from its domain, and the state space of the program is the set of all possible states.

V = { d.g }                (the general's decision variable):      {0, 1}
  ∪ { d.j, d.k, d.l }      (the non-generals' decision variables): {0, 1, ⊥}
  ∪ { f.j, f.k, f.l }      (finalized?):                           {false, true}
  ∪ { b.g, b.j, b.k, b.l } (Byzantine?):                           {false, true}

Fault-intolerant program. To concisely describe the transitions of the (fault-intolerant) version of BA, we use guarded commands of the form g → st, where g is a predicate involving the above program variables and st updates the above program variables. The command g → st corresponds to the set of transitions {(s_0, s_1) : g is true in s_0 and s_1 is obtained by executing st in state s_0}. Thus, the transitions of a non-general process, say j, are specified by the following two actions:

BA_intol_j ::
  BA1_j :: (d.j = ⊥) ∧ (f.j = false) ∧ (b.j = false) → d.j := d.g
  BA2_j :: (d.j ≠ ⊥) ∧ (f.j = false) ∧ (b.j = false) → f.j := true

We include similar transitions for k and l as well. Note that the general does not need explicit actions; the action by which the general sends the decision to j is modeled by BA1_j.

Specification. The safety specification of BA requires validity and agreement. Validity requires that if the general is non-Byzantine, then the final decision of a non-Byzantine non-general must be the same as that of the general.
Additionally, agreement requires that the final decisions of any two non-Byzantine non-generals must be equal. Finally, once a non-Byzantine process finalizes (outputs) its decision, it cannot change it.

Faults. A fault transition can cause a process to become Byzantine if no other process is initially Byzantine. Also, a fault can arbitrarily change the d and f values of a Byzantine process. The fault transitions that affect a process, say j, of BA are as follows (we include similar actions for k, l, and g):

  F1 :: ¬b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l → b.j := true
  F2 :: b.j → d.j, f.j := 0|1, false|true

where d.j := 0|1 means that d.j can be assigned either 0 or 1. In the case of the general process, the second action does not change the value of any f-variable.

Goal of automated addition of fault-tolerance. Given the set of faults (F1 and F2), the goal of a synthesis algorithm is to start from the intolerant program (BA_intol_j) and generate the fault-tolerant program (BA_tolerant_j):

BA_tolerant_j ::
  BA1_j :: (d.j = ⊥) ∧ (f.j = false) ∧ (b.j = false) → d.j := d.g
  BA2_j :: (d.j ≠ ⊥) ∧ (f.j = false) ∧ (d.j = d.l ∨ d.j = d.k) → f.j := true
  BA3_j :: (d.l = 0) ∧ (d.k = 0) ∧ (d.j = 1) ∧ (f.j = 0) → d.j, f.j := 0, 0|1
  BA4_j :: (d.l = 1) ∧ (d.k = 1) ∧ (d.j = 0) ∧ (f.j = 0) → d.j, f.j := 1, 0|1

In the above program, the first action is identical to that of the intolerant program. The second action is restricted to execute only in states where another process has the same d value. Actions BA3 and BA4 fix the process decision through appropriate recovery.

3.2 Group Computation: The Need for Modeling Read/Write Restrictions

A process in a distributed program has a partial view of the program variables.
For example, in the context of the Byzantine agreement example from Section 3.1, process j is allowed to read R_j = {b.j, d.j, f.j, d.k, d.l, d.g} and it is allowed to write W_j = {d.j, f.j}. Observe that this modeling prevents j from knowing whether other processes are Byzantine. With such a read/write restriction, if process j were to include an action of the form "if b.k is true then change d.j to 0", then it must also include a transition of the form "if b.k is false then change d.j to 0". In general, if a transition (s_0, s_1) is to be included as a transition of process j, then we must also include a corresponding equivalence class of transitions (called a group of transitions) that differ only in terms of the variables that j cannot read. The same mechanism has to be applied for removing transitions as well.

More generally, let j be a process, let R_j (respectively, W_j) be the set of variables that j can read (respectively, write), where W_j ⊆ R_j, and let v(s_0) denote the value of variable v in the state s_0. Then, if (s_0, s_1) is included as a transition of j, we must also include the corresponding equivalence class of transitions of the form (s_2, s_3), where s_0 and s_2 (respectively, s_1 and s_3) are indistinguishable to j, i.e., they differ only in terms of the variables that j cannot read. This equivalence class of transitions for (s_0, s_1) is given by the following formula:

  group_j((s_0, s_1)) = ∨_{(s_2, s_3)} ( ∧_{v ∉ R_j} (v(s_0) = v(s_1) ∧ v(s_2) = v(s_3)) ∧ ∧_{v ∈ R_j} (v(s_0) = v(s_2) ∧ v(s_1) = v(s_3)) ).

3.3 Need for Deadlock Resolution

During synthesis, we analyze the effect of faults on the given fault-intolerant program and identify a fault-tolerant program that meets the constraints of Problem Statement 2.1.
This involves the addition of new transitions as well as the removal of existing transitions. In this section, we utilize the Byzantine agreement problem to illustrate how deadlock states get created during the execution of the synthesis algorithm, and we identify two general approaches (be they sequential or parallel) for resolving them.

• Deadlock scenario 1 and the use of recovery actions. One legitimate state, say s, of the Byzantine agreement program is a state where all processes are non-Byzantine, d.g is 0, and the decision of every non-general is ⊥. In this state, the general has chosen the value 0 and no non-general has received any value. From this state, the general can become Byzantine and change its value from 0 to 1 arbitrarily. Hence, a non-general can receive either 0 or 1 from the general. Clearly, starting from s, in the presence of faults (F1 and F2), the program BA_intol can reach a state, say s_1, where d.g = d.j = d.k = 0, b.g = true, d.l = 1, f.l = 0. From such a state, transitions of the fault-intolerant program violate agreement if they allow j (or k) and l to finalize their decisions. If we remove these safety-violating transitions, then there are no other transitions from state s_1. In other words, during synthesis, we find that state s_1 is a deadlock state. One can resolve this deadlock state by simply adding a recovery transition that changes d.l to 0.

• Deadlock scenario 2 and the need for elimination. Again, consider the execution of the program BA_intol in the presence of faults (F1 and F2), starting from state s in the previous scenario. From s, the program can also reach a state, say s_2, where d.g = d.j = d.k = 0, b.g = true, d.l = 1, f.l = 1; state s_2 differs from s_1 in the previous scenario in terms of the value of f.l. Unlike for s_1, since l has finalized its decision, we cannot resolve s_2 by adding safe recovery.
Since safe recovery from s_2 cannot be added, the only choice for designing a fault-tolerant program is to ensure that state s_2 is never reached, by removing, via backward reachability analysis, the transitions that reach s_2. However, the removal of such transitions can potentially create more deadlock states that have to be eliminated in turn.

To maximize the success of the synthesis algorithm, our approach to handling deadlock states is as follows: whenever possible, we add recovery transition(s) from a deadlock state to a legitimate state. However, if no recovery transition can be added from a deadlock state, we try to eliminate it by preventing the program from reaching that state. In this paper, we utilize parallelism to expedite these two aspects of deadlock resolution: adding recovery and eliminating deadlock states.

4 Approach 1: Parallelizing Group Computation

In this section, we present our approach for parallelizing the group computation to expedite the synthesis of fault-tolerant programs. First, in Section 4.1, we identify different design choices in devising our parallel algorithm. Then, in Section 4.2, we describe our approach for parallelizing the group computation. In Section 4.3, we provide experimental results. Finally, in Section 4.4, we analyze the experimental results to evaluate the effectiveness of parallelizing the group computation.

4.1 Design Choices

The structure of the group computation permits an efficient way to parallelize it. In particular, whenever some recovery transitions are added for dealing with a deadlock state, or some states are removed to ensure that a deadlock state is not reached, we can utilize multiple threads in a master-slave fashion to expedite the group computation. Our approach targets multi-processor/multi-core shared-memory infrastructures.
Although we did not specifically analyze the influence of local memory sharing on performance, we expect our solution to give similar results on multi-core and multi-processor architectures. During the analysis for utilizing multiple cores effectively, we made the following observations/design choices.

• Multiple BDD managers versus a reentrant BDD package. We chose to utilize a different instance of the BDD package for each thread. Thus, at the time of group computation, each thread obtains a copy of the BDD corresponding to the recovery transitions being added. In part, this is motivated by the fact that existing parallel implementations have shown limited speedup (cf. Section 6). Also, we argue that the increased space complexity of this approach is acceptable in the context of synthesis, since the time complexity of the synthesis algorithm is high (as opposed to model checking) and we often run out of time before we run out of space.

• Synchronization overhead. The group computation is rather fine-grained, i.e., the time to compute a group of recovery transitions to be added to an input program is small (100-500 ms on a typical machine). Hence, the overhead of creating multiple threads needs to be small. With this motivation, our algorithm creates the required set of threads up front and utilizes mutexes to synchronize them. This synchronization provides a significant benefit over creating and destroying threads for each group operation.

• Load balancing. Load balancing among several threads is desirable so that all threads take approximately the same amount of time to perform their tasks. To perform a group computation for recovery transitions being added, we need to evaluate the effect of the read/write restrictions imposed by each process. A static way to parallelize this is to let each thread compute the set of transitions caused by the read/write restrictions of a (given) subset of processes.
A dynamic way is to consider the set of processes for which a group computation is to be performed as a shared pool of tasks, and to allow each thread to pick a new task when it finishes the previous one. We find that, given the small duration of each group computation, static partitioning of the group computation works better than dynamic partitioning, since the overhead of dynamic partitioning is high.

4.2 Algorithm Description

Based on these design choices, the algorithm consists of three parts: initialization, assignment of tasks to worker threads, and computation of groups by the worker threads.

Initialization. In the initialization phase, the master thread creates all required worker threads by calling the algorithm InitiateThreads (cf. Algorithm 1). These threads stay idle until a group computation is required, and they terminate when the synthesis algorithm ends. Following the design choice for load balancing, the algorithm distributes the workload among the available threads statically (Lines 4-8). Then, it creates all the required worker threads (Line 10).

Algorithm 1 InitiateThreads
Input: noOfProcesses, noOfThreads.
 1: if noOfProcesses < noOfThreads then
 2:   return ERROR;
 3: end if
 4: for i := 0 to noOfThreads − 1 do
 5:   BDDMgr[i] := Clone(masterBDDManager);
 6:   startP[i] := ⌊i × noOfProcesses / noOfThreads⌋;
 7:   endP[i] := ⌊(i + 1) × noOfProcesses / noOfThreads⌋ − 1;
 8: end for
 9: for thID := 0 to noOfThreads − 1 do
10:   SpawnThread WorkerThread(thID);
11: end for

Tasks for worker threads. Initially, the algorithm WorkerThread (cf. Algorithm 2) locks the mutexes mutexStart and mutexStop (Lines 1-2). Then, it waits until the master thread unlocks the mutexStart mutex (Line 5). At this point, the worker thread starts computing the part of the group associated with this thread. This section of WorkerThread (Lines 7-15) is similar to group computation in the sequential setting, except that rather than finding the group for all processes, the WorkerThread algorithm finds the group for a subset of processes (Line 8). The function AllowWrite relaxes a predicate with respect to the variables that the corresponding process is allowed to modify. The function Transfer transfers a BDD from one manager to another manager. And the function FindGroup adds read restrictions to a group predicate. When the computation is completed, the worker thread notifies the master thread by unlocking the mutex mutexStop (Line 17).

Algorithm 2 WorkerThread
Input: thID.
    // Initial locking of the mutexes
 1: mutex_lock(thData[thID].mutexStart);
 2: mutex_lock(thData[thID].mutexStop);
 3: while true do
 4:   // Waiting for a signal from the master thread
 5:   mutex_lock(thData[thID].mutexStart);
 6:   gtr[thID] := false;
 7:   tPred := new array of size endP[thID] − startP[thID] + 1;
 8:   for i := 0 to endP[thID] − startP[thID] do
 9:     tPred[i] := thData[thID].trans ∧ allowWrite[i + startP[thID]].Transfer(BDDMgr[thID]);
10:     tPred[i] := FindGroup(tPred[i], i, thID);
11:   end for
12:   thData[thID].result := false;
13:   for i := 0 to endP[thID] − startP[thID] do
14:     thData[thID].result := thData[thID].result ∨ tPred[i];
15:   end for
16:   // Signaling the master thread that this thread is done
17:   mutex_unlock(thData[thID].mutexStop);
18: end while

Tasks for the master thread. Given a transition set tr, the master thread copies tr to each instance of the BDD package used by the worker threads (cf. Algorithm 3, Lines 3-5). Then it assigns a subset of the group computation to the worker threads (Lines 6-8) and unlocks them.
After the worker threads complete, the master thread collects the results and returns the group BDD associated with the input tr.

Algorithm 3 MasterThread
Input: transition set thisTr.
Output: transition group gAll.
 1: tr := thisTr;
 2: gAll := false;
 3: for i := 0 to noOfThreads − 1 do
 4:   thData[i].trans := tr.Transfer(BDDMgr[i]);
 5: end for
    // Signal all idle threads to start computing the group
 6: for i := 0 to noOfThreads − 1 do
 7:   mutex_unlock(thData[i].mutexStart);
 8: end for
    // Waiting for all threads to finish computing the group
 9: for i := 0 to noOfThreads − 1 do
10:   mutex_lock(thData[i].mutexStop);
11: end for
    // Merging the results from all threads
12: for i := 0 to noOfThreads − 1 do
13:   gAll := gAll ∨ thData[i].result;
14: end for
15: return gAll;

4.3 Experimental Results

In this section, we describe the experimental results in the context of Byzantine agreement (described in Section 3.1). Throughout this section, all experiments are run on a Sun Fire V40z with 4 dual-core Opteron processors and 16 GB RAM. The BDD representation of the Boolean formulae uses the C++ interface to the CUDD package developed at the University of Colorado [17]. Throughout this section, we refer to the original implementation of the synthesis algorithm (without parallelism) as the sequential implementation, and we use "X threads" to refer to the parallel algorithm that utilizes X threads. We note that the synthesis times differ between the sequential implementation in this paper and the one in [4] due to other, unrelated improvements to the sequential implementation itself; the sequential and parallel implementations differ only in terms of the modifications described in Section 4.2. We also note that our algorithm is deterministic and the testbed is dedicated.
Hence, the only non-deterministic factor in the synthesis time is the synchronization among threads. Based on our observations and experience, this factor has a negligible impact and, hence, multiple runs on the same data essentially reproduce the same results.

[Figure 1 appears here: two line plots, (a) deadlock resolution time and (b) total synthesis time, in seconds, against the number of processes (10-45), for the sequential implementation and for 2, 4, 8, and 16 threads.]

Figure 1: The time required to (a) resolve deadlock states and (b) synthesize a fault-tolerant program for several numbers of non-general processes of BA, using the sequential and parallel algorithms. BA has a state space of ≈ 4 × 10^(1.08x) and a reachable state space of ≥ 2 × 10^(0.78x), where x is the number of processes.

In Figure 1, we show the results of using the sequential approach versus the parallel approach (with multiple threads) to perform the synthesis. All the tests show that we gain a significant speedup. For example, in the case of 45 non-general processes and 8 threads, we gain a speedup of 6.1. We can also clearly see that the parallel 16-thread version is faster than the corresponding 8-thread version. This was surprising, given that only 8 cores are available. However, upon closer observation, we find that the group computation that is parallelized using threads is fine-grained. Hence, when the master thread uses multiple slave threads to perform the group computation, the slave threads complete quickly and therefore cannot utilize the available resources to the full extent. Hence, creating more threads (than available processors) can improve the performance further.
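To illustrate how the group construction of Section 3.2 can be partitioned statically across threads, the following sketch is our own simplification: explicit states represented as dictionaries stand in for BDDs, and Python threads spawned per call stand in for the pooled, mutex-synchronized workers of Algorithms 1-3. Each thread handles a contiguous slice of processes, mirroring the startP/endP partitioning of Algorithm 1, and the master unions the partial results.

```python
import itertools
import threading

def freeze(state):
    """Make a dict-state hashable."""
    return tuple(sorted(state.items()))

def group(j, s0, s1, domains, readable):
    """Equivalence class of (s0, s1) for process j (Section 3.2): transitions
    that agree with (s0, s1) on j's readable variables and hold each
    unreadable variable constant at every possible value."""
    unreadable = [v for v in domains if v not in readable[j]]
    if any(s0[v] != s1[v] for v in unreadable):
        return set()  # j cannot have changed a variable it cannot read
    out = set()
    for vals in itertools.product(*(domains[v] for v in unreadable)):
        subst = dict(zip(unreadable, vals))
        out.add((freeze({**s0, **subst}), freeze({**s1, **subst})))
    return out

def parallel_group(transitions, processes, domains, readable, n_threads):
    """Static partitioning as in Algorithm 1: thread i handles the slice
    of processes [i*n/t, (i+1)*n/t); the master unions per-thread results."""
    results = [set() for _ in range(n_threads)]

    def worker(i):
        lo = i * len(processes) // n_threads        # startP[i]
        hi = (i + 1) * len(processes) // n_threads  # endP[i] + 1
        for j in processes[lo:hi]:
            for (s0, s1) in transitions:
                results[i] |= group(j, s0, s1, domains, readable)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return set().union(*results)

# Two processes over variables x, y; process 0 reads only x, process 1 only y.
domains = {"x": [0, 1], "y": [0, 1]}
readable = {0: {"x"}, 1: {"y"}}
tr = [({"x": 0, "y": 0}, {"x": 1, "y": 0})]  # a transition changing only x
g = parallel_group(tr, [0, 1], domains, readable, 2)
# Process 0's group lifts the transition to both values of y (2 transitions);
# process 1's group is empty, since the transition writes x, which 1 cannot read.
print(len(g))  # 2
```

In the paper's setting, each per-thread union is a disjunction of BDDs built in that thread's own BDD manager, rather than a Python set union.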
4.4 Group Time Analysis

In this section, we focus on the effectiveness of the parallelization of group computation by considering the time it takes in the sequential and parallel implementations. Towards this end, we analyze the group computation time for the sequential and parallel implementations in the context of three examples: Byzantine agreement, agreement in the presence of failstop and Byzantine faults, and token ring [3]. The results for these examples are included in Tables 1-3. The number of cores used is equal to the number of threads.

To understand the speedup gained by our algorithm in Section 4.3, we evaluated the experimental results closely. As an example, consider the case of 32 BA processes. For the sequential implementation, the total synthesis time is 59.7 minutes, of which 55 are used for group computation. Hence, the ideal completion time with 4 cores is 18.45 minutes (55/4 + 4.7). By comparison, the actual time taken in our experiment was 19.1 minutes. Thus, the speedup gained using this approach is close to the ideal speedup.

In some cases, the speedup ratio is less than the number of threads. This is caused by the fact that each group computation takes a very small time and incurs an overhead for thread synchronization. Moreover, as mentioned in Section 3.3, due to the overhead of load balancing, we allocate the tasks of each thread statically. Thus, the load of different threads can be slightly uneven. We also observe that the speedup ratio increases with the number of processes in the program being synthesized. This implies that the parallel algorithm will scale to larger problem instances.

No. of     Reachable   Sequential    2 threads           4 threads           8 threads
Processes  States      Group Time    Time     Speedup    Time     Speedup    Time     Speedup
15         10^11       50            29       1.72       17       2.94       11       4.55
24         10^17       652           346      1.88       185      3.52       122      5.34
32         10^22       3347          1532     2.18       848      3.95       490      6.83
48         10^33       33454         14421    2.32       7271     4.60       3837     8.72

Table 1: Group computation time for Byzantine agreement.

An interesting as well as surprising observation is that when the state space is large enough, the speedup ratio exceeds the number of threads. This behavior is caused by the fact that, with parallelization, each thread works on smaller BDDs during the group computation. To understand this behavior, we conducted experiments where we created the threads to perform the group computation and forced them to execute sequentially by adding extra synchronization. We found that such a pseudo-sequential run took less time than a purely sequential run.

No. of     Reachable   Sequential    2 threads           4 threads           8 threads
Processes  States      Group Time    Time     Speedup    Time     Speedup    Time     Speedup
10         10^10       53            24       2.21       23       2.30       30       1.77
15         10^15       624           319      1.96       175      3.57       174      3.59
20         10^20       4473          2644     1.69       1275     3.51       1128     3.97
25         10^25       26154         11739    2.23       6527     4.01       5692     4.59

Table 2: Group computation time for the agreement problem in the presence of failstop and Byzantine faults.

No. of     Reachable   Sequential    2 threads           4 threads           8 threads
Processes  States      Group Time    Time     Speedup    Time     Speedup    Time     Speedup
30         10^14       0.32          0.15     2.12       0.10     3.34       0.12     2.75
40         10^19       0.84          0.36     2.34       0.22     3.84       0.23     3.59
50         10^23       1.82          0.68     2.68       0.39     4.66       0.42     4.37
60         10^28       3.22          1.22     2.63       0.67     4.80       0.64     5.01
70         10^33       5.36          1.91     2.80       1.06     5.05       0.86     6.23
80         10^38       7.77          2.94     2.64       1.53     5.09       1.23     6.30

Table 3: Group computation time for token ring.
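The ideal-time estimate used in the analysis above (only the group computation parallelizes; the serial remainder is unchanged) amounts to a one-line Amdahl-style formula. The function names below are illustrative:

```cpp
#include <cassert>

// Ideal completion time when only the group computation parallelizes:
// the group time divides by the core count, the serial remainder stays
// (the estimate used above for 32 BA processes: 55/4 + 4.7 = 18.45 min).
double ideal_time(double total, double group, int cores) {
    return group / cores + (total - group);
}

// Observed speedup: sequential time over parallel time.
double speedup(double sequential, double parallel) {
    return sequential / parallel;
}
```

For the 32-process BA case, ideal_time(59.7, 55.0, 4) gives 18.45 minutes, against 19.1 minutes measured.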
5 Approach 2: Alternative (Conventional) Approach

A traditional approach for parallelization in the context of resolving deadlock states, say ds, would be to partition the deadlock states among multiple threads and allow each thread to handle the partition assigned to it. For example, we can partition ds using the partition predicates prt_i, 1 ≤ i ≤ n, such that (prt_1 ∧ ds) ∨ ... ∨ (prt_n ∧ ds) = ds. Thus, if two threads are available during synthesis of the Byzantine agreement program, then we can let prt_1 = (d.j = 0) and prt_2 = (d.j ≠ 0).

Next, in Section 5.1, we discuss some of the design choices we considered for this approach. Subsequently, we describe experimental results in Section 5.2. We argue that for such an approach to work in synthesizing distributed programs, the group computation must itself be parallelized.

5.1 Design Choices

To efficiently partition deadlock states among threads, one needs to design a method such that (1) deadlock states are evenly distributed among worker threads, and (2) states considered by different threads for elimination have a small overlap during backtracking. Regarding the first constraint, we can partition deadlock states based on the values of some variable and evaluate the size of the corresponding BDDs by the number of minterms that satisfy the corresponding formula. Regarding the second constraint, we expect the overhead for such a split to be high, as it requires detailed analysis of program transitions. Hence, instead of satisfying this constraint, we choose to add limited synchronization among threads so that the overlap in the states explored by different threads is small. After partitioning, a thread works independently as long as it does not affect states visited by other threads. As discussed in Section 3.3, to resolve a deadlock state, each thread explores a part of the state space using backward reachability.
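The partition-predicate scheme above can be sketched with explicit state sets in place of BDDs. States as integers and predicates as Boolean functions are illustrative assumptions; the real implementation partitions symbolic formulae:

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <vector>

using States = std::set<int>;
using Pred = std::function<bool(int)>;

// Partition the deadlock states ds with predicates prt_1..prt_n:
// partition i receives { s in ds | prt_i(s) }.
std::vector<States> partition(const States& ds, const std::vector<Pred>& prts) {
    std::vector<States> parts(prts.size());
    for (int s : ds)
        for (std::size_t i = 0; i < prts.size(); ++i)
            if (prts[i](s)) parts[i].insert(s);
    return parts;
}

// Soundness check: the union of (prt_i ∧ ds) must give back ds exactly.
bool covers(const States& ds, const std::vector<States>& parts) {
    States u;
    for (const auto& p : parts) u.insert(p.begin(), p.end());
    return u == ds;
}
```

With two predicates playing the role of prt_1 = (d.j = 0) and prt_2 = (d.j ≠ 0), every deadlock state falls into exactly one partition and the covers check holds.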
Clearly, when the states visited by two threads overlap, we have two options: (1) perform synchronization so that only one thread explores any given state, or (2) allow two threads to explore the states concurrently and resolve any inconsistencies that may be created. We find that the first option by itself is very expensive, or even impossible, due to the fact that, with the use of BDDs, each thread explores a set of states specified by a BDD. And, since each thread begins with a set of deadlock states and performs backward reachability, there is a significant overlap among the states explored by different threads. Hence, the first option is likely to essentially reduce the parallel run to a sequential run. For this reason, we focus on the second option, where each thread explores the states concurrently. (We also used some heuristic-based synchronization, where we maintained a set of visited states that each thread checked before performing backward state exploration. This technique provided only a small performance benefit.)

Figure 2: Inconsistencies raised by concurrency. (a) Sequential elimination, before and after; (b) Cases 1 and 2, showing the states considered by Thread 1 and Thread 2, the merged result, and the fixed result.

Inconsistency Resolution. When threads explore states concurrently, some inconsistencies may be created. Next, we give a brief overview of the inconsistencies that may occur due to concurrent state exploration and manipulation by different threads, and we identify how we can resolve them. Towards this end, let s1 and s2 be two states that are considered for deadlock elimination, and let (s0, s1) and (s0, s2) be two program transitions for some s0.
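To make this setup concrete, the following toy model of sequential elimination (explicit integer states and transition pairs instead of BDDs, and no transition groups — all simplifying assumptions) shows how eliminating s1 and s2 strands their common predecessor s0, which then has to be eliminated in turn:

```cpp
#include <cassert>
#include <set>
#include <utility>

using States = std::set<int>;
using Trans = std::set<std::pair<int, int>>;   // (source, target) pairs

// Sequential backward elimination: removing the transitions into a deadlock
// state may strand a predecessor, which then becomes a deadlock state itself
// and is processed in turn. States in `invariant` are legitimate and are
// never eliminated. Returns the set of eliminated states.
States eliminate(Trans& trans, States deadlocks, const States& invariant) {
    States eliminated;
    while (!deadlocks.empty()) {
        int d = *deadlocks.begin();
        deadlocks.erase(deadlocks.begin());
        eliminated.insert(d);
        for (auto it = trans.begin(); it != trans.end();) {
            if (it->second == d) {
                int pred = it->first;
                it = trans.erase(it);          // cut the transition into d
                bool stranded = true;          // did pred lose its last exit?
                for (const auto& t : trans)
                    if (t.first == pred) { stranded = false; break; }
                if (stranded && !invariant.count(pred) && !eliminated.count(pred))
                    deadlocks.insert(pred);    // pred is a new deadlock state
            } else {
                ++it;
            }
        }
    }
    return eliminated;
}
```

With transitions {(s0,s1), (s0,s2), (s3,s0)}, deadlock states {s1, s2}, and s3 in the invariant, eliminating s1 and s2 cascades to s0 exactly as in Figure 2.a, while the invariant state s3 survives.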
A sequential elimination algorithm removes the transitions (s0, s1) and (s0, s2), which causes s0 to become a new deadlock state (cf. Figure 2.a). This in turn requires that the state s0 itself be made unreachable. If s0 is unreachable, then including the transition (s0, s1) in the synthesized program is harmless. In fact, it is desirable, since including this transition also causes the other transitions in the corresponding group to be included as well, and these grouped transitions might be useful in providing recovery from other states. Hence, the algorithm puts (s0, s1) and (s0, s2) (and the corresponding groups) back into the program being synthesized, and it continues by eliminating the state s0. However, when multiple worker threads, say th1 and th2, run concurrently, some inconsistencies may be created. We describe some of these inconsistencies and our approach to resolving them next.

Case 1. States s1 and s2 are in different partitions. Hence, th1 eliminates s1, which in turn removes the transition (s0, s1), and th2 eliminates s2, which removes the transition (s0, s2) (cf. Figure 2.b). Since each thread works on its own copy, neither thread tries to eliminate s0, as neither identifies s0 as a deadlock state. Subsequently, when the master thread merges the results returned by th1 and th2, s0 becomes a new deadlock state that has to be eliminated, while the group predicates of the transitions (s0, s1) and (s0, s2) have been removed unnecessarily. In order to resolve this case, we re-introduce all outgoing transitions that start from s0 and mark s0 as a state that has to be eliminated in subsequent iterations.

Case 2. Due to the backtracking behavior of the elimination algorithm, it is possible that th1 and th2 consider common states for elimination. In particular, if th1 considers s1 and th2 considers both s1 and s2 for elimination (cf.
Figure 2.b), then after merging the results, no new deadlock states are introduced. However, (s0, s1) would be removed unnecessarily. In order to resolve this case, we collect all the states that the worker threads failed to eliminate and replace all incoming transitions into those states.

5.2 Experimental Results

We also implemented this approach for parallelization. The results for the problem of Byzantine agreement are shown in Table 4. From these results, we notice that the improvement in performance was small.

                       Sequential                       Parallel Elimination with 2 threads
No. of     Reachable   Deadlock          Total          Deadlock          Total
Processes  States      Resolution Time   Synthesis Time Resolution Time   Synthesis Time
10         10^7        7                 9              8                 9
15         10^12       78                85             78                87
20         10^14       406               442            374               417
25         10^18       1,503             1,632          1,394             1,503
30         10^21       4,302             4,606          3,274             3,518
35         10^25       11,088            11,821         10,995            11,608
40         10^28       27,115            28,628         21,997            23,101
45         10^32       45,850            48,283         39,645            41,548

Table 4: The time required to synthesize a fault-tolerant program for several numbers of non-general processes of BA, sequentially and by partitioning deadlock states using parallelism.

6 Related Work

Automated program synthesis and revision has been studied from various perspectives. Inspired by the seminal work of Emerson and Clarke [6], Arora, Attie, and Emerson [2] propose an algorithm for synthesizing fault-tolerant programs from CTL specifications. Their method, however, does not address the addition of fault-tolerance to existing programs. Kulkarni and Arora [13] introduce enumerative synthesis algorithms for the automated addition of fault-tolerance to centralized and distributed programs. In particular, they show that the problem of adding fault-tolerance to distributed programs is NP-complete.
In order to remedy the NP-hardness of the synthesis of fault-tolerant distributed programs and overcome the state explosion problem, we proposed a set of symbolic heuristics [4], which allowed us to synthesize programs with state spaces of size 10^30 and beyond. Ebnenasir [5] presents a divide-and-conquer method for synthesizing failsafe fault-tolerant distributed programs. A failsafe program is one that does not need to satisfy its liveness specification in the presence of faults. Thus, the corresponding synthesis algorithm does not need to resolve deadlock states outside the invariant predicate. Moreover, Ebnenasir's synthesis method resolves deadlock states inside the invariant predicate in a sequential manner.

We have also presented an approach [1] for utilizing multi-core technology in the design of self-stabilizing programs, i.e., programs that ensure that, starting from an arbitrary state, they recover to a legitimate state. That work utilizes parallelization of group computation as well as another approach for expediting the design of stabilizing programs. However, due to the nature of the problem involved, parallelization of group computation is more effective in deadlock resolution than in the design of stabilizing programs [1].

Parallelization of symbolic reachability analysis has been studied in the model checking community from different perspectives. In [7, 8, 9], the authors propose solutions and analyze different approaches to parallelizing the saturation-based generation of state space in model checking. In particular, in [8], the authors show that in order to gain speedups in saturation-based parallel symbolic verification, one has to pay a penalty in memory usage of up to 10 times that of the sequential algorithm.
Other efforts range from simple approaches that essentially implement BDDs as two-tiered hash tables [16, 18] to sophisticated approaches relying on slicing BDDs [11] and on work-stealing techniques [10]. However, the resulting implementations show only limited speedups.

7 Conclusion

Summary. In this paper, we focused on improving the synthesis of fault-tolerant programs from their fault-intolerant versions. We considered two approaches for expediting the performance of the synthesis algorithm using multi-core computing. We showed that the approach of partitioning deadlock states provides only a small improvement, whereas the approach based on parallelizing the group computation (which is required by the distribution constraints of the program being synthesized) provides a significant benefit that is close to the ideal, i.e., equal to the number of threads used. Moreover, the performance analysis shows that this approach is scalable, in that if more cores were available, our approach could utilize them effectively.

Lessons Learnt. As shown in [4], there are two main bottlenecks in synthesizing fault-tolerant programs: generation of the fault-span, which is essentially a reachability problem that has been studied extensively in the context of model checking, and deadlock resolution, which corresponds to adding recovery paths from states reached in the presence of faults. The results in this paper show that a traditional approach (Section 5) of partitioning deadlock states provides only a small improvement. However, it helped identify an alternative approach for parallelization that is based on the distribution constraints imposed on the program being synthesized. The performance improvement with the use of the distribution constraints is significant; in fact, for most cases, the performance was close to the ideal speedup.
What this suggests is that for the task of deadlock resolution, a simple approach based on parallelizing the group computation caused by distribution constraints (as opposed to a reentrant BDD package that permits multiple concurrent threads, or partitioning of deadlock states, etc.) provides the biggest performance benefit. Moreover, the group computation itself occurs in every aspect of synthesis where new transitions have to be added for recovery, or where existing transitions have to be removed to prevent safety violations or to break cycles that prevent recovery to the invariant. Hence, the approach of parallelizing the group computation will be effective in the synthesis of distributed programs.

Impact. Automated synthesis has been widely believed to be significantly more complex than automated verification. When we evaluate the complexity of the automated synthesis of fault-tolerance, we find that it fundamentally includes two parts: (1) analyzing the existing program, and (2) transforming it to ensure that it meets the fault-tolerance properties. The first part closely resembles program verification, and techniques for efficient verification are directly applicable to it. What this paper shows is that the complexity of the second part can be significantly remedied by the use of parallelization in a simple and scalable fashion. Moreover, the typical inexpensive technology that is currently in use, or is likely to be available in the near future, is expected to consist of 2-16 core computers, and the first approach in this paper is expected to be the most suitable one for utilizing these multicore computers to the fullest extent. Also, since the group computation is caused by the distribution constraints of the program being synthesized, as discussed in Section 5, it is guaranteed to be required even with other techniques for expediting automated synthesis.
For example, it can be used in conjunction with the approach in Section 5, as well as with approaches that utilize symmetry among the processes being synthesized.

References

[1] F. Abujarad & S. S. Kulkarni (2009): Multicore Constraint-Based Automated Stabilization. In: International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS).
[2] A. Arora, P. C. Attie & E. A. Emerson (1998): Synthesis of Fault-Tolerant Concurrent Programs. In: Principles of Distributed Computing (PODC), pp. 173-182.
[3] A. Arora & S. S. Kulkarni (1998): Component Based Design of Multitolerant Systems. IEEE Transactions on Software Engineering 24(1), pp. 63-78.
[4] B. Bonakdarpour & S. S. Kulkarni (2007): Exploiting Symbolic Techniques in Automated Synthesis of Distributed Programs with Large State Space. In: IEEE International Conference on Distributed Computing Systems (ICDCS), pp. 3-10.
[5] A. Ebnenasir (2007): DiConic Addition of Failsafe Fault-Tolerance. In: Automated Software Engineering (ASE), pp. 44-53.
[6] E. A. Emerson & E. M. Clarke (1982): Using Branching Time Temporal Logic to Synthesize Synchronization Skeletons. Science of Computer Programming 2(3), pp. 241-266.
[7] J. Ezekiel & G. Lüttgen (2007): Measuring and Evaluating Parallel State-Space Exploration Algorithms. In: International Workshop on Parallel and Distributed Methods in Verification (PDMC).
[8] J. Ezekiel, G. Lüttgen & G. Ciardo (2007): Parallelising Symbolic State-Space Generators. In: Computer Aided Verification (CAV), pp. 268-280.
[9] J. Ezekiel, G. Lüttgen & R. Siminiceanu (2006): Can Saturation Be Parallelised? On the Parallelisation of a Symbolic State-Space Generator. In: International Workshop on Parallel and Distributed Methods of Verification (PDMC), pp. 331-346.
[10] O. Grumberg, T. Heyman, N. Ifergan & A. Schuster (2005): Achieving Speedups in Distributed Symbolic Reachability Analysis Through Asynchronous Computation. In: Correct Hardware Design and Verification Methods (CHARME), pp. 129-145.
[11] O. Grumberg, T. Heyman & A. Schuster (2006): A Work-Efficient Distributed Algorithm for Reachability Analysis. Formal Methods in System Design (FMSD) 29(2), pp. 157-175.
[12] M. Herlihy (2008): The Future of Distributed Computing: Renaissance or Reformation? In: Twenty-Seventh Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC 2008).
[13] S. S. Kulkarni & A. Arora (2000): Automating the Addition of Fault-Tolerance. In: Formal Techniques in Real-Time and Fault-Tolerant Systems (FTRTFT), pp. 82-93.
[14] S. S. Kulkarni & A. Ebnenasir (2005): Complexity Issues in Automated Synthesis of Failsafe Fault-Tolerance. IEEE Transactions on Dependable and Secure Computing.
[15] L. Lamport, R. Shostak & M. Pease (1982): The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems 4(3), pp. 382-401.
[16] K. Milvang-Jensen & A. J. Hu (1998): BDDNOW: A Parallel BDD Package. In: Formal Methods in Computer Aided Design (FMCAD), pp. 501-507.
[17] F. Somenzi: CUDD: Colorado University Decision Diagram Package. http://vlsi.colorado.edu/~fabio/CUDD/cuddIntro.html.
[18] T. Stornetta & F. Brewer (1996): Implementation of an Efficient Parallel BDD Package. In: Design Automation (DAC), pp. 641-644.