Rare Event Simulation for non-Markovian repairable Fault Trees

Carlos E. Budde (1), Marco Biagi (2), Raúl E. Monti (1), Pedro R. D'Argenio (3, 4, 5), and Mariëlle Stoelinga (1, 6)

(1) Formal Methods and Tools, University of Twente, Enschede, the Netherlands, {c.e.budde,r.e.monti,m.i.a.stoelinga}@utwente.nl
(2) Department of Information Engineering, University of Florence, Florence, Italy, marco.biagi@unifi.it
(3) FAMAF, Universidad Nacional de Córdoba, Córdoba, Argentina, dargenio@famaf.unc.edu.ar
(4) CONICET, Córdoba, Argentina
(5) Department of Computer Science, Saarland University, Saarbrücken, Germany
(6) Department of Software Science, Radboud University, Nijmegen, the Netherlands

Abstract. Dynamic fault trees (DFTs) are widely adopted in industry to assess the dependability of safety-critical equipment. Since many systems are too large to be studied numerically, the dependability of DFTs is often analysed using Monte Carlo simulation. A bottleneck here is that many simulation samples are required in the case of rare events, e.g. in highly reliable systems where components seldom fail. Rare event simulation (RES) provides techniques to reduce the number of samples in the case of rare events. We present a RES technique based on importance splitting to study failures in highly reliable DFTs. Whereas RES usually requires meta-information from an expert, our method is fully automatic: by cleverly exploiting the fault tree structure we extract the so-called importance function. We handle DFTs with Markovian and non-Markovian failure and repair distributions (for which no numerical methods exist) and show the efficiency of our approach on several case studies.

1 Introduction

Reliability engineering is an important field that provides methods and tools to assess and mitigate the risks related to complex systems. Fault tree analysis (FTA) is a prominent technique here.
Its application encompasses a large number of industrial domains that range from automotive and aerospace system engineering to energy and telecommunication systems and protocols.

* This work was partially funded by NWO, NS, and ProRail project 15474 (SEQUOIA), ERC grant 695614 (POWVER), EU project 102112 (SUCCESS), ANPCyT PICT-2017-3894 (RAFTSys), and SeCyT project 33620180100354CB (ARES).

Fault trees. A fault tree (FT) describes how component failures occur and propagate through the system, eventually leading to system failures. Technically, an FT is a directed acyclic graph whose leaves model component failures, and whose other nodes (called gates) model failure propagation. Using fault trees one can compute dependability metrics to quantify how a system fares w.r.t. certain performance indicators. Two common metrics are system reliability (the probability that there are no system failures during a given mission time) and system availability (the average percentage of time that a system is operational).

Static fault trees (aka standard FTs) contain a few basic gates, like AND and OR gates. This makes them easy to design and analyse, but also limits their expressivity. Dynamic fault trees (DFTs [21, 52]) are a common and widely applied extension of standard FTs, catering for more complex dependability patterns, like spare management and causal dependencies. To model these patterns, DFTs come with additional gates, for instance SPARE, PAND, and FDEP. Such gates make DFTs more difficult to analyse. In static FTs it only matters whether or not a component has failed, so they can be analysed with Boolean methods, such as binary decision diagrams [32]. Dynamic fault trees, on the other hand, crucially depend on the failure order, so Boolean methods are insufficient.

Moreover, on top of these two classes, repairable fault trees (RFTs [7]) permit components to be repaired after they have failed. This is crucial to model fault-tolerant systems more realistically. Yet repairs make analyses even harder: it does not suffice to know which components failed, or in which order, but also whether they are simultaneously failed. The general rule is that the more complex the formalism, the more realistic the model, and the harder the analyses. Fig. 2 is an RFT with a top AND gate, a SPARE (Rcab), and three leaves.

Fault tree analysis. The reliability/availability of a fault tree can be computed via numerical methods, such as probabilistic model checking. This involves exhaustive explorations of state-based models such as interactive Markov chains [48]. Since the number of states (i.e. system configurations) is exponential in the number of tree elements, analysing large trees remains a challenge today [32, 1]. Moreover, numerical methods are usually restricted to exponential failure rates and combinations thereof, like Erlang and acyclic phase type distributions [48].

Alternatively, fault trees can be analysed using (standard) Monte Carlo simulation (SMC [26, 48, 45], aka statistical model checking). Here, a large number of simulated system runs (samples) is produced. Reliability and availability are then statistically estimated from the resulting sample set. Such sampling does not involve storing the full state space. Therefore, SMC is much more memory efficient than numerical techniques. Furthermore, SMC is not restricted to exponential probability distributions. However, a known bottleneck of SMC is rare events: when the event of interest has a low probability (which is typically the case in highly reliable systems), millions of samples may be required to observe it. Producing these samples can take an unacceptably long simulation time.
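
To make the rare-event bottleneck concrete, the following is a minimal sketch of crude Monte Carlo estimation for a single Bernoulli event. It is not taken from the paper's tool chain: the function names, the seed, and the probabilities are hypothetical choices for illustration. The helper `prob_no_hit` computes the chance that a whole batch of samples never observes the event, in which case the estimate degenerates to zero.

```python
import random


def crude_monte_carlo(p_event, n_samples, rng):
    """Estimate p_event as the fraction of n_samples Bernoulli trials that hit."""
    hits = sum(1 for _ in range(n_samples) if rng.random() < p_event)
    return hits / n_samples


def prob_no_hit(p_event, n_samples):
    """Probability that none of n_samples independent trials observes the event."""
    return (1.0 - p_event) ** n_samples


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed, used only for repeatability
    n_samples = 100_000
    for p_event in (1e-2, 1e-4, 1e-6):
        estimate = crude_monte_carlo(p_event, n_samples, rng)
        missed = prob_no_hit(p_event, n_samples)
        print(f"p={p_event:g}  estimate={estimate:g}  P(no hit at all)={missed:.3f}")
```

For an event of probability 1e-8, even a million samples miss it entirely with probability of about 0.99, which is why the sample count has to grow roughly like 1/p before the estimate carries any information.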

Rare event simulation. To alleviate this problem, the field of rare event simulation (RES) provides techniques that reduce the number of samples [42]. The two leading techniques are importance sampling and importance splitting.

Importance sampling tweaks the probabilities in a model, then computes the metric of interest for the changed system, and finally adjusts the analysis results to the original model [28, 39]. As a simple example consider a coin whose probability of heads is $p = 1/80$. We can increase that probability to $p' = 1/8$, but count each occurrence of heads as $1/10$ rather than as 1. This is typically called a change of measure. Thus, if we draw $n = 1000$ samples with the increased probability $p'$, and we see 67 heads coming up, we estimate the probability of heads as $0.0067 = \frac{67}{1000} \cdot \frac{1}{10}$. In the limit $n \to \infty$, the expected number of heads that come up is the same for the original and the tweaked model (after the adjustment). However, sample outcomes have a lower variance in the tweaked model, so statistical analyses converge faster: fewer samples yield accurate estimations.

Importance splitting, deployed in this paper, relies on rare events that arise as a sequential combination of less rare intermediate events [34, 3]. We exploit this fact by generating more (partial) samples on paths where such intermediate events are observed. In the coin example, suppose we flip it eight times in a row, and say we are interested in observing at least three heads. If heads comes up at the first flip ($H$) then we are on a promising path. We can then clone (split) the current path $H$, generating e.g. 7 copies of it, each copy evolving independently from the second flip onwards. Say one of them observes three heads: the cloned $H$ plus two more.
Then each observation of the rare event (three heads) is counted as $1/7$ rather than as 1, to account for the splitting that spawned the clone. Now, if a clone observes a new head ($HH$), this is even more promising than $H$, so the splitting mechanism can be repeated. If we make 5 copies of the $HH$ clone, then observing the event of interest in any of these copies counts as $\frac{1}{35} = \frac{1}{7} \cdot \frac{1}{5}$. Alternatively, observing tails as second flip ($HT$) is less promising than heads. One could then decide not to split such a path.

This example highlights a key ingredient of importance splitting: the importance function, which indicates for each state how promising it is w.r.t. the event of interest. This function, together with other parameters such as thresholds [23], is used to choose e.g. the number of clones spawned when visiting a state. An importance function for our example could be the number of heads seen thus far. Another one could be that number multiplied by the number of coin flips yet to come. The goal is to give higher importance to states from which observing the rare event is more likely. The efficiency of an importance splitting implementation increases as the importance function better reflects this property.

Rare event simulation has been successfully applied in several domains [41, 54, 58, 5, 6, 55]. However, a key bottleneck is that it critically relies on expert knowledge. In particular for importance splitting, finding a good importance function is well known to be a highly non-trivial task [42, 31].

Our contribution: rare event simulation for fault trees. This paper presents an importance splitting method to analyse RFTs. In particular, we automatically derive an importance function by exploiting the description of a system as a fault tree. This is crucial, since the importance function is normally given manually in an ad hoc fashion by a domain or RES expert.
We use a variety of RES algorithms based on our importance function to estimate system unreliability and unavailability. Our approach can converge to precise estimations in increasingly reliable systems. This method has four advantages over earlier analysis methods for RFTs (which we overview in the related work, Sec. 7), namely: (1) we are able to estimate both the system reliability and availability; (2) we can handle arbitrary failure and repair distributions; (3) we can handle rare events; and (4) we can do it in a fully automatic fashion.

Technically, we build local importance functions for the (automata semantics of the) nodes of the tree. We then aggregate these local functions into an importance function for the full tree. Aggregation uses structural induction in the layered description of the tree. Using our importance function, we implement importance splitting methods to run RES analyses. We implemented our theory in a full-stack tool chain. With it, we computed confidence intervals for the unreliability and unavailability of several case studies. Our case studies are RFTs whose failure and repair times are governed by arbitrary continuous probability density functions (PDFs). Each case study was analysed for a fixed runtime budget and in increasingly resilient configurations. In all cases our approach could estimate the narrowest intervals for the most resilient configurations.

Organization of the paper. We first introduce the formal concepts used for our mathematical definitions in Secs. 2 and 3. Then, we detail our theory to implement RES for repairable DFTs with arbitrary PDFs in Sec. 4. For that, Sec. 4.1 introduces our (compositional) importance function, and Sec. 4.2 explains how to embed it into an automated framework for Importance Splitting RES. Next, Sec. 5 describes how we implement this theory in our tool chain. In Sec.
6 we show an extensive experimental evaluation that corroborates our expectations. We finally overview related work in Sec. 7, and conclude our contributions in Sec. 8.

2 Fault tree analysis

A fault tree $\triangle$ is a directed acyclic graph that models how component failures propagate and eventually cause the full system to fail. We consider repairable fault trees (RFTs), where failures and repairs are governed by arbitrary probability distributions.

Fig. 1: Fault tree gates and the repair box. Panels: (a) AND, (b) OR, (c) VOT_k, (d) PAND, (e) SPARE, (f) FDEP, (g) RBOX.

Basic elements. The leaves of the tree, called basic events or basic elements (BEs), model the failure of components. A BE $b$ is equipped with a failure distribution $F_b$ that governs the probability for $b$ to fail before time $t$, and a repair distribution $R_b$ governing its repair time. Some BEs are used as spare components: these (SBEs) replace a primary component when it fails. SBEs are also equipped with a dormancy distribution $D_b$, since spares fail less often when dormant, i.e. not in use. Only when an SBE becomes active is its failure distribution given by $F_b$.
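
As an illustration of a basic element whose failure and repair times follow arbitrary distributions, the sketch below simulates one BE in isolation and measures the fraction of time it is operational. Everything concrete here is an assumption made for the example: the Weibull failure times, the uniform repair times, their parameters, and the function names are hypothetical, and the BE is repaired immediately after failing (no repair box queue, no dormancy).

```python
import math
import random


def simulate_availability(sample_failure, sample_repair, n_cycles, rng):
    """Fraction of time a lone BE is operational over n_cycles fail/repair cycles."""
    up_time = 0.0
    down_time = 0.0
    for _ in range(n_cycles):
        up_time += sample_failure(rng)    # time to failure, drawn from F_b
        down_time += sample_repair(rng)   # time to repair, drawn from R_b
    return up_time / (up_time + down_time)


def weibull_failure(rng):
    """Hypothetical non-exponential F_b: Weibull with scale 100 and shape 1.5."""
    return rng.weibullvariate(100.0, 1.5)


def uniform_repair(rng):
    """Hypothetical R_b: repair takes between 1 and 3 time units."""
    return rng.uniform(1.0, 3.0)


def weibull_mean(scale, shape):
    """Mean of a Weibull distribution, used to cross-check the simulation."""
    return scale * math.gamma(1.0 + 1.0 / shape)


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed, used only for repeatability
    simulated = simulate_availability(weibull_failure, uniform_repair, 20_000, rng)
    mean_up = weibull_mean(100.0, 1.5)
    expected = mean_up / (mean_up + 2.0)  # mean repair time of uniform(1, 3) is 2
    print(f"simulated={simulated:.4f}  renewal-theory value={expected:.4f}")
```

For a single element the long-run value follows from the mean up and down times alone; the distribution shapes start to matter once several elements interact through gates and a shared repair box.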

Table 1: Glossary of acronyms and abbreviations

Scope                  Abbreviation  Meaning
General                PDF           Probability density function
                       CI            Confidence interval
                       FTA           Fault Tree Analysis
                       FT            Fault Tree
                       DFT           Dynamic Fault Tree
                       RFT           Repairable (Dynamic) Fault Tree
                       SMC           Standard Monte Carlo simulation
                       RES           Rare Event Simulation
                       IOSA          Input/Output Stochastic Automata with Urgency [19]
Tree gates (m inputs)  AND           Conjunction: m-ary AND
                       OR            Disjunction: m-ary OR
                       VOT_k         Voting: k out of m
                       PAND          Priority AND
                       SPARE         Spare: 1 primary BE, m-1 spare BEs
                       FDEP          Functional dependency: 1 trigger, m-1 dependent BEs
Other tree nodes       BE            Basic element
                       SBE           Spare basic element
                       RBOX          Repair box
Case studies           VOT           Voting gates (synthetic)
                       DSPARE        Double-spare gates (synthetic)
                       RWC           Railway cabinets [25, 46]
                       HVC           High voltage cab. (RWC subsys.)
                       RC            Relay cab. (RWC subsys.)
                       FTPP          Fault tolerant parallel processor [21]
                       HECS          Hypothetical example computer system [52]

Gates. Non-leaf nodes are called intermediate events and are labelled with gates, describing how combinations of lower failures propagate to upper levels. Fig. 1 shows their syntax. Their meaning is as follows: the AND, OR, and VOT_k gates fail if respectively all, one, or $k$ of their $m$ children fail (with $1 \le k \le m$). The latter is called the voting or k out of m gate. Note that VOT_1 is equivalent to an OR gate, and VOT_m is equivalent to an AND. The priority-and gate (PAND) is an AND gate that only fails if its children fail from left to right (or simultaneously). PANDs express failures that can only happen in a particular order, e.g. a short circuit in a pump can only occur after a leakage. SPARE gates have one primary child and one or more spare children: spares replace the primary when it fails. The FDEP gate has an input trigger and several dependent events: all dependent events become unavailable when the trigger fails.
FDEPs can model, for instance, network elements that become unavailable if their connecting bus fails.

Repair boxes. An RBOX determines which basic element is repaired next according to a given policy. Thus all its inputs are BEs or SBEs. Repair events of basic elements propagate along the tree analogously to failure events. Unlike gates, an RBOX has no output since it does not propagate failures.

Top level event. A full-system failure occurs if the top event (i.e. the root node) of the tree fails.

Fig. 2: Tiny RFT.

Example. The tree in Fig. 2 models a railway-signal system, which fails if its high voltage and relay cabinets fail [25, 46]. Thus, the top event is an AND gate with children HVcab (a BE) and Rcab. The latter is a SPARE gate with primary P and spare S. All BEs are managed by one RBOX with repair priority HVcab > P > S.

Notation. The nodes of a tree $\triangle$ are given by $\mathit{nodes}(\triangle) = \{0, 1, \ldots, n-1\}$. We let $v, w$ range over $\mathit{nodes}(\triangle)$. A function $\mathit{type}^{\triangle} : \mathit{nodes}(\triangle) \to \{\mathrm{BE}, \mathrm{SBE}, \mathrm{AND}, \mathrm{OR}, \mathrm{VOT}_k, \mathrm{PAND}, \mathrm{SPARE}, \mathrm{FDEP}, \mathrm{RBOX}\}$ yields the type of each node in the tree. A function $\mathit{chil}^{\triangle} : \mathit{nodes}(\triangle) \to \mathit{nodes}(\triangle)^*$ returns the ordered list of children of a node. If clear from context, we omit the superscript $\triangle$ from function names.

Semantics. The semantics of static fault trees, i.e. trees that only feature the static gates AND, OR and VOT_k, can be given as a Boolean function. For the gates PAND, SPARE, and FDEP the order of the failures matters, so a Boolean function does not suffice. Therefore, the semantics for (repairable) dynamic fault trees is given in terms of stochastic transition models, such as Markov automata, Petri nets, IOSA, etc. Following [38] we give semantics to RFTs as Input/Output Stochastic Automata (IOSA), so that we can handle arbitrary probability distributions.
Each state in the IOSA represents a system configuration, indicating which components are operational and which have failed. Transitions among states describe how the configuration changes when failures or repairs occur.

More precisely, a state in the IOSA is a tuple $x = (x_0, \ldots, x_{n-1}) \in S \subseteq \mathbb{N}^n$, where $S$ is the state space and $x_v$ denotes the state of node $v$ in $\triangle$. The possible values for $x_v$ depend on the type of $v$. The output $z_v \in \{0, 1\}$ of node $v$ indicates whether it is operational ($z_v = 0$) or failed ($z_v = 1$) and is calculated as follows:
– BEs (white circles in Fig. 1) have a binary state: $x_v = 0$ if BE $v$ is operational and $x_v = 1$ if it is failed. The output of a BE is its state: $z_v = x_v$.
– SBEs (gray circles in Fig. 1e) have two additional states: $x_v = 2$ if a dormant SBE $v$ is operational, and $x_v = 3$ if it is failed. Here $z_v = x_v \bmod 2$.
– ANDs have a binary state. Since the AND gate $v$ fails iff all children fail: $x_v = \min_{w \in \mathit{chil}(v)} z_w$. An AND gate outputs its internal state: $z_v = x_v$.
– OR gates are analogous to AND gates, but fail iff any child fails, i.e. $z_v = x_v = \max_{w \in \mathit{chil}(v)} z_w$ for OR gate $v$.
– VOT gates also have a binary state: a VOT_k gate (with $1 \le k \le m$) fails iff at least $k$ of its $m$ children fail, thus $z_v = x_v = 1$ if $k \le \sum_{w \in \mathit{chil}(v)} z_w$, and $z_v = x_v = 0$ otherwise.
– PAND gates admit multiple states to represent the failure order of the children. For PAND $v$ with two children we let $x_v$ equal: 0 if both children are operational; 1 if the left child failed, but the right one has not; 2 if the right child failed, but the left one has not; 3 if both children have failed, the right one first; 4 if both children have failed, otherwise. The output of PAND gate $v$ is $z_v = 1$ if $x_v = 4$ and $z_v = 0$ otherwise. PAND gates with more children are handled by exploiting $\mathrm{PAND}(w_1, w_2, w_3) = \mathrm{PAND}(\mathrm{PAND}(w_1, w_2), w_3)$.
– The leftmost input of SPARE gate $v$ is its primary BE.
  All other (spare) inputs are SBEs. SBEs can be shared among SPARE gates. When the primary of $v$ fails, it is replaced with an available SBE. An SBE is unavailable if it is failed, or if it is replacing the primary BE of another SPARE. The output of $v$ is $z_v = 1$ if its primary is failed and no spare is available. Else $z_v = 0$.
– An FDEP gate has no output. All inputs are BEs and the leftmost is the trigger. We consider non-destructive FDEPs [8]: if the trigger fails, the output of all other BEs is set to 1, without affecting the internal state. Since this can be modelled by a suitable combination of OR gates [38], we omit the details.

For example, the RFT from Fig. 2 starts with all elements operational, so the initial state is $x^0 = (0, 0, 2, 0, 0)$. If P then fails, $x_P$ and $z_P$ are set to 1 (failed) and S becomes $x_S = 0$ (active and operational spare), so the state changes to $x^1 = (0, 1, 0, 0, 0)$. The traces of the IOSA are given by $x^0 x^1 \cdots x^n \in S^*$, where a change from $x^j$ to $x^{j+1}$ corresponds to transitions triggered in the IOSA.

Nondeterminism. Dynamic fault trees may exhibit nondeterministic behaviour as a consequence of underspecified failure behaviour [17, 33]. This can happen e.g. when two SPAREs have a single shared SBE: if all elements are failed, and the SBE is repaired first, the failure behaviour depends on which SPARE gets the SBE. Monte Carlo simulation, however, requires fully stochastic models and cannot cope with nondeterminism. To overcome this problem we deploy the theory from [19, 38]. If a fault tree adheres to some mild syntactic conditions, then its IOSA semantics is weakly deterministic, meaning that all resolutions of the nondeterministic choices lead to the same probability value. In particular, we require that (1) each BE is connected to at most one SPARE gate, and (2) BEs and SBEs connected to SPAREs are not connected to FDEPs.
In addition to this, some semantic decisions have been fixed, e.g. the semantics of PAND is fully specified, and policies should be provided for RBOX and spare assignments.

Dependability metrics. An important use of fault trees is to compute relevant dependability metrics. Let $\{X_t\}_{t \ge 0}$ be the stochastic process induced by $\triangle$ [15], and let $X_{t,v}$ be the random variable that represents the (distribution of the) state of the top event $v$ of $\triangle$ at time $t$. We focus on two popular metrics:
• system reliability: the continuity of correct service, i.e. the probability of observing no top event failure before some mission time $T > 0$, viz. $\mathrm{REL}_T = \mathit{Prob}\left(\forall t \in [0, T] .\ X_{t,v} = 0\right)$;
• system availability: the proportion of time that the system remains operational in the long run, viz. $\mathrm{AVA} = \lim_{t \to \infty} \mathit{Prob}(X_{t,v} = 0)$.
System unreliability and unavailability are the complements of these metrics. That is: $\mathrm{UNREL}_T = 1 - \mathrm{REL}_T$ and $\mathrm{UNAVA} = 1 - \mathrm{AVA}$.

3 Stochastic simulation for Fault Trees

Input-Output Stochastic Automata (IOSA). IOSA [19, 18] are an extension of GSMP [49] amenable to compositional modelling. An IOSA is a state-transition system where the residence time in a state is governed by a PDF. IOSAs feature two ingredients that are crucial for our analysis: (1) residence times can be governed by arbitrary probability distributions described by real-valued clocks, and (2) discrete transitions are labelled by actions, and allow automata to communicate with each other.

To record the passage of time and control the occurrence of events, IOSA use real-valued variables called clocks. Clocks are set to a positive random value according to the (state-dependent) associated PDF. As time evolves, all clocks count down from their respective values at the same rate. When the value of a clock reaches zero it may trigger some action.
Thus, to model BE $e$ in a fault tree, we associate a clock to $F_e$ and another to $R_e$. As a matter of fact, each node in an FT is modelled as an IOSA automaton. The propagation of fail/repair events in the tree is done by (discrete, instantaneous) action synchronisation among automata. Formally:

Definition 1 (IOSA [19]). An Input/Output Stochastic Automaton with Urgency is a tuple $(S, A, C, \to, s_0, C_0)$ where: (i) $S$ is a denumerable set of states; (ii) $A$ is a denumerable set of labels partitioned into input labels $A^i$ and output labels $A^o$, where $A^u \subseteq A$ are urgent labels; (iii) $C$ is a finite set of clocks s.t. each $x \in C$ has an associated continuous probability measure $\mu_x$ with support on $\mathbb{R}_{>0}$; (iv) $\to\ \subseteq S \times C \times A \times C \times S$ is a transition function; (v) $s_0 \in S$ is the initial state; and (vi) $C_0 \subseteq C$ are clocks initialized in $s_0$.

Six constraints on $\to$ ensure that, in closed IOSA obtained from the parallel composition of all automata, nondeterminism is restricted to urgent actions. These semantic constraints are translated into the syntactic conditions previously mentioned for FTs. For insights see [19]; details on how to represent gates and basic elements with IOSA automata are in [38].

Modelling fault trees as IOSA allows us to perform Monte Carlo simulation: we generate traces, i.e. sequences of states $x^0, x^1, \ldots, x^m$ where each pair $x^j, x^{j+1}$ is the projection on $S^2$ of an element of $\to$ from Def. 1. Then, we estimate dependability metrics via statistical analyses on a set of sampled traces.

Standard Monte Carlo simulation (SMC). Monte Carlo simulation takes random samples from stochastic models to estimate a (dependability) metric of interest. For instance, to estimate the unreliability of a tree $\triangle$ we sample $N$ independent traces from its IOSA semantics.
An unbiased statistical estimator for $p = \mathrm{UNREL}_T$ is the proportion of traces observing a top level event, that is, $\hat{p}_N = \frac{1}{N} \sum_{j=1}^{N} X_j$ where $X_j = 1$ if the $j$-th trace exhibits a top level failure before time $T$ and $X_j = 0$ otherwise. The statistical error of $\hat{p}$ is typically quantified with two numbers $\delta$ and $\varepsilon$ s.t. $\hat{p} \in [p - \varepsilon, p + \varepsilon]$ with probability $\delta$. The interval $\hat{p} \pm \varepsilon$ is called a confidence interval (CI) with coefficient $\delta$ and precision $2\varepsilon$.

Such procedures scale linearly with the number of tree nodes and cater for a wide range of PDFs, even non-Markovian distributions. However, they encounter a bottleneck when estimating rare events: if $p \approx 0$, very few traces observe $X_j = 1$. Therefore, the variance of estimators like $\hat{p}$ becomes huge, and CIs become very broad, easily degenerating to the trivial interval $[0, 1]$. Increasing the number of traces alleviates this problem, but even standard CI settings (where $\varepsilon$ is relative to $p$) require sampling an unacceptable number of traces [42]. For instance, choosing $\delta = 0.95$ and $\varepsilon = \frac{p}{10}$ ("95% confidence and 10% relative error") requires $N > 384/p$ samples. Thus if $\mathrm{UNREL}_T \approx 10^{-8}$, one needs $N > 3.84 \times 10^{10}$ traces, making the simulation times unacceptably long. Rare event simulation techniques solve this specific problem.

Rare Event Simulation (RES). RES techniques [42] increase the number of traces that observe the rare event, e.g. a top level event in an RFT. Two prominent classes of RES techniques are importance sampling, which adjusts the PDFs of failures and repairs, and importance splitting (ISPLIT [36]), which samples more (partial) traces from states that are closer to the rare event. We focus on ISPLIT due to its flexibility with respect to the probability distributions.

ISPLIT can be efficiently deployed as long as the rare event $\gamma$ can be described as a nested sequence of less-rare events $\gamma = \gamma_M \subsetneq \gamma_{M-1} \subsetneq \cdots \subsetneq \gamma_0$.
This decomposition allows ISPLIT to study the conditional probabilities $p_k = \mathit{Prob}(\gamma_{k+1} \mid \gamma_k)$ separately, to then compute $p = \mathit{Prob}(\gamma) = \prod_{k=0}^{M-1} \mathit{Prob}(\gamma_{k+1} \mid \gamma_k)$. Moreover, ISPLIT requires all conditional probabilities $p_k$ to be much greater than $p$, so that estimating each $p_k$ can be done efficiently with SMC.

The key idea behind ISPLIT is to define the events $\gamma_k$ via a so-called importance function $I : S \to \mathbb{N}$ that assigns an importance to each state $s \in S$. The higher the importance of a state, the closer it is to the rare event $\gamma_M$. Event $\gamma_k$ collects all states with importance at least $\ell_k$, for a certain sequence of threshold levels $0 = \ell_0 < \ell_1 < \cdots < \ell_M$. Formally: $\gamma_k = \{ s \in S \mid I(s) \ge \ell_k \}$.

In other words, a higher importance is assigned to states from which it is more likely to observe the rare event. That is, for $s, s' \in S$ s.t. $s \in \gamma_k$ and $s' \in \gamma_{k'}$, one wants $I(s) < I(s')$ iff $k < k'$. Because then, in the nested sequence of events $\gamma_0 \supsetneq \cdots \supsetneq \gamma_M$, each step $\gamma_{k-1} \to \gamma_k$ makes it more likely to observe the rare event. Therefore, choosing many thresholds ($M \gg 0$) all very close to each other, one ensures that $\mathit{Prob}(\gamma_k \mid \gamma_{k-1}) \gg 0$ for all $0 < k \le M$. Simply put, one makes a lot of baby steps in the right direction.

To exploit the importance function $I$ in the simulation procedure, ISPLIT samples more (partial) traces from states with higher importance. Two well-known methods are deployed and compared in this paper: Fixed Effort and RESTART.

Fixed Effort (FE [23]) samples a predefined number of traces in each region $S_k = \gamma_k \setminus \gamma_{k+1} = \{ s \in S \mid \ell_{k+1} > I(s) \ge \ell_k \}$. Thus, starting at $\gamma_0$ it first estimates the proportion of traces that reach $\gamma_1$, i.e. $p_0 = \mathit{Prob}(\gamma_1 \mid \gamma_0) = \mathit{Prob}(S_0)$. Next, from the states that reached $\gamma_1$ new traces are generated to estimate $p_1 = \mathit{Prob}(S_1)$, and so on until $p_M$.
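
The Fixed Effort scheme just described can be sketched on a toy model. The model is an assumption made purely for illustration and does not come from the paper: a biased random walk on the integers that starts at 1, where the rare event is reaching `top` before hitting 0, the importance of a state is the state itself, and there is one threshold per integer level. All function names are hypothetical. The closed-form gambler's-ruin probability is included only to sanity-check the estimate.

```python
import random


def fixed_effort(top, up_prob, effort, rng):
    """Estimate P(reach `top` before 0 | start at 1) with `effort` traces per level."""
    estimate = 1.0
    entry_states = [1]  # states from which the current region is explored
    for level in range(1, top):
        next_threshold = level + 1
        reached = []  # entry states into the next region
        for _ in range(effort):
            state = rng.choice(entry_states)
            # A trace ends when it crosses the next threshold or dies at 0.
            while 0 < state < next_threshold:
                state += 1 if rng.random() < up_prob else -1
            if state == next_threshold:
                reached.append(state)
        if not reached:
            return 0.0  # no trace made it up: the estimate collapses to zero
        estimate *= len(reached) / effort  # conditional probability of this level
        entry_states = reached
    return estimate


def exact_probability(top, up_prob):
    """Gambler's-ruin formula for the same event (valid for up_prob != 0.5)."""
    ratio = (1.0 - up_prob) / up_prob
    return (ratio - 1.0) / (ratio ** top - 1.0)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed, used only for repeatability
    print("fixed effort:", fixed_effort(6, 0.3, 2000, rng))
    print("exact value :", exact_probability(6, 0.3))
```

Each level only has to estimate a conditional probability of moderate size, so a few thousand traces per level suffice even though the overall probability is below one percent. In this toy walk every entry state of a level coincides with the threshold itself; in a fault tree the saved entry states generally differ, which is why the sketch keeps them in a list.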

Fixed Effort thus requires that (i) each trace has a clearly defined "end," so that estimations of each $p_k$ finish with probability 1, and (ii) all rare events reside in the uppermost region. In particular, using FE for steady-state analysis (e.g. to estimate UNAVA) requires regeneration theory [23], which is hard to apply to non-Markovian models.

Fig. 3: Importance Splitting algorithms Fixed Effort & RESTART. Panels: (a) FE-5 for $\mathit{Prob}(\neg ✘ \; U \; ✔)$; (b) RST-ES for $\mathrm{UNREL}_T$.

Example. Fig. 3a shows Fixed Effort estimating the probability to visit states labelled ✔ before others labelled ✘. States ✔ have importance $> 13$, and thresholds $\ell_1, \ell_2 = 4, 10$ partition the state space in regions $\{S_i\}_{i=0}^{2}$ s.t. all ✔ $\in S_2$. The effort is 5 simulations per region, for all regions: we call this algorithm FE-5. In region $S_0$, 2 simulations made it from the initial state to threshold $\ell_1$, i.e. they reached some state with importance 4 before visiting a state ✘. In $S_1$, starting from these two states, 3 simulations reached $\ell_2$. Finally, 2 out of 5 simulations visited states ✔ in $S_2$. Thus, the estimated rare event probability of this run of FE-5 is $\hat{p} = \prod_{i=0}^{2} \hat{p}_i = \frac{2}{5} \cdot \frac{3}{5} \cdot \frac{2}{5} = 9.6 \times 10^{-2}$.

RESTART (RST [57, 56]) is another RES algorithm, which starts one trace in $\gamma_0$ and monitors the importance of the states visited. If the trace up-crosses threshold $\ell_1$, the first state visited in $S_1$ is saved and the trace is cloned, aka split (see Fig. 3b). This mechanism rewards traces that get closer to the rare event. Each clone then evolves independently, and if one up-crosses threshold $\ell_2$ the splitting mechanism is repeated. Instead, if a state with importance below $\ell_1$ is visited, the trace is truncated (✗ in Fig. 3b). In general, each clone is truncated as soon as it visits a state with importance lower than its level of creation.
This penalises traces that move away from the rare event. To avoid truncating all traces, the one that spawned the clones in region $S_k$ can go below importance $\ell_k$. To deploy an unbiased estimator for $p$, RESTART measures how much splitting was required to visit a rare state [56]. In particular, RESTART does not need the rare event to be defined as $\gamma_M$ [53], and it was devised for steady-state analysis [57] (e.g. to estimate UNAVA) although it can also be used for transient studies as depicted in Fig. 3b [54].

4 Importance Splitting for FTA

The effectiveness of ISPLIT crucially relies on the choice of the importance function $I$ as well as the threshold levels $\ell_k$ [36]. Traditionally, these are given by domain and/or RES experts, requiring a lot of domain knowledge. This section presents a technique to obtain $I$ and the $\ell_k$ automatically for an RFT.

4.1 Compositional importance functions for Fault Trees

By the core idea behind importance splitting, states that are more likely to lead to the rare event should have a higher importance. To achieve this, the key lies in defining an importance function $I$ and thresholds $\ell_k$ that are sensitive to both the state space $S$ and the transition probabilities of the system. For us, $S \subseteq \mathbb{N}^n$ are all possible states of a repairable fault tree (RFT). Its top event fails when certain nodes fail in a certain order, and remain failed before certain repairs occur. To exploit this for ISPLIT, the structure of the tree must be embedded into $I$.

The strong dependence of the importance function $I$ on the structure of the tree is easy to see in the following example. Take the RFT $\triangle$ from Fig. 2 and let its current state $x$ be s.t. P is failed and HVcab and S are operational. If the next event is a repair of P, then the new state $x'$ (where all basic elements are operational) is farther from a failure of the top event.
Hence, a good importance function should satisfy $I(x) > I(x')$. Conversely, if the next event had been a failure of S leading to state $x''$, then one would want that $I(x) < I(x'')$. The key observation is that these inequalities depend on the structure of $\triangle$ as well as on the failures/repairs of basic elements. Indeed, if instead of an AND the top event were a PAND gate (call this tree $\triangle^*$), the importance function should behave in the exact opposite way. That is, in tree $\triangle^*$ one wants that $I(x) > I(x'')$, since in $x''$ the right child of the top PAND has failed before the left child. When this happens, PAND gates go into an out-of-order internal state, and cannot output a failure. So the same step from $x$ to $x''$ has completely different meanings for $\triangle$ and for $\triangle^*$, as a result of their structure being different.

In view of the above, any attempt to define an importance function for an arbitrary fault tree $\triangle$ must put its gate structure at the forefront. In Table 2 we introduce a compositional heuristic for this, which defines local importance functions distinguished per node type. The importance function associated to node $v$ is $I_v : \mathbb{N}^n \to \mathbb{N}$. We define the global importance function of the tree ($I_\triangle$ or simply $I$) as the local importance function of the top event node of $\triangle$.

Thus, $I_v$ is defined in Table 2 via structural induction in the fault tree. It is defined so that it assigns to a failed node $v$ its highest importance value. Functions with this property deploy the most efficient ISPLIT implementations [36], and some RES algorithms (e.g. Fixed Effort) require this property [23]. In the following we explain our definition of $I_v$.

If $v$ is a failed BE or SBE, then its importance is 1; else it is 0. This matches the output of the node, thus $I_v(x) = z_v$. Intuitively, this reflects how failures of basic elements are positively correlated to top event failures.
The importance of AND, OR, and VOT$_k$ gates depends exclusively on their input. The importance of an AND is the sum of the importance of its children scaled by a normalisation factor. This reflects that AND gates fail when all their children fail, and each failure of a child brings an AND closer to its own failure, hence increasing its importance. Instead, since OR gates fail as soon as a single child fails, their importance is the maximum importance among their children. The importance of a VOT$_k$ gate is the sum of the $k$ (out of $m$) children with highest importance value.

Table 2: Compositional importance function for RFTs.

  type(v)    I_v(x)
  BE, SBE    $z_v$
  AND        $\mathrm{lcm}_v \cdot \sum_{w \in chil(v)} \frac{I_w(x)}{\max_{I_w}}$
  OR         $\mathrm{lcm}_v \cdot \max_{w \in chil(v)} \left\{ \frac{I_w(x)}{\max_{I_w}} \right\}$
  VOT_k      $\mathrm{lcm}_v \cdot \max_{W \subseteq chil(v),\, |W| = k} \left\{ \sum_{w \in W} \frac{I_w(x)}{\max_{I_w}} \right\}$
  SPARE      $\mathrm{lcm}_v \cdot \max\left( \sum_{w \in chil(v)} \frac{I_w(x)}{\max_{I_w}},\; z_v \cdot m \right)$
  PAND       $\mathrm{lcm}_v \cdot \max\left( \frac{I_l(x)}{\max_{I_l}} + ord \cdot \frac{I_r(x)}{\max_{I_r}},\; z_v \cdot 2 \right)$
             where $ord = 1$ if $x_v \in \{1, 4\}$ and $ord = -1$ otherwise

  with $\max_{I_v} = \max_{x \in S} I_v(x)$ and $\mathrm{lcm}_v = \mathrm{lcm}\{ \max_{I_w} \mid w \in chil(v) \}$

Omitting normalisation may yield an undesirable importance function. To understand why, suppose a binary AND gate $v$ with children $l$ and $r$, and define $I^{naive}_v(x) = I_l(x) + I_r(x)$. Suppose that $I_l$ takes its highest value in $\max_{I_l} = 2$ while $I_r$ in $\max_{I_r} = 6$, and assume that states $x$ and $x'$ are s.t. $I_l(x) = 1$, $I_r(x) = 0$, $I_l(x') = 0$, $I_r(x') = 3$. This means that in both states one child of $v$ is "good-as-new" and the other is "half-failed", and hence the system is equally close to failing in both cases. Hence we expect $I^{naive}_v(x) = I^{naive}_v(x')$ when actually $I^{naive}_v(x) = 1 \neq 3 = I^{naive}_v(x')$. Instead, $I_v$ operates with $\frac{I_l(x)}{\max_{I_l}}$ and $\frac{I_r(x)}{\max_{I_r}}$, which can be interpreted as the "percentage of failure" of the children of $v$.
To make these numbers integers we scale them by $\mathrm{lcm}_v$, the least common multiple of their max importance values. In our case $\mathrm{lcm}_v = 6$ and hence $I_v(x) = I_v(x') = 3$. Similar problems arise with all gates, hence normalisation is applied in general.

SPARE gates with $m$ children (including the primary) behave similarly to AND gates: every failed child brings the gate closer to failure, as reflected in the left operand of the max in Table 2. However, SPAREs fail when their primaries fail and no SBEs are available, e.g. because they are being used by another SPARE. This means that the gate could fail in spite of some children being operational. To account for this we exploit the gate output: multiplying $z_v$ by $m$ we give the gate its maximum value when it fails, even when this happens due to unavailable but operational SBEs.

For a PAND gate $v$ we have to carefully look at the states. If the left child $l$ has failed, then the right child $r$ contributes positively to the failure of the PAND and hence to the importance function of the node $v$. If instead the right child has failed first, then the PAND gate will not fail and hence we let it contribute negatively to the importance function of $v$. Thus, we multiply $\frac{I_r(x)}{\max_{I_r}}$ (the normalised importance function of the right child) by $-1$ in the latter case (i.e. when state $x_v \notin \{1, 4\}$). Instead, the left child always contributes positively. Finally, the max operation is two-fold: on the one hand, $z_v \cdot 2$ ensures that the importance value remains at its maximum while failing (PANDs remain failed even after the left child is repaired); on the other, it ensures that the smallest value possible is 0 while operational (since importance values cannot be negative).
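As a minimal sketch of how the static rows of Table 2 compose, the Python fragment below evaluates $I_v$ for BE, AND, OR, and VOT nodes and reproduces the binary AND example with $\max_{I_l} = 2$ and $\max_{I_r} = 6$. The class `Node`, the function names, and the encoding of a state as the set of failed basic elements are illustrative assumptions and not part of the tool chain; SPARE and PAND gates are left out because they also need the gate-internal state $x_v$.

```python
from dataclasses import dataclass, field
from functools import reduce
from math import gcd
from typing import List, Set


@dataclass
class Node:
    """Fault tree node; `kind` is "BE", "AND", "OR" or "VOT" (hypothetical encoding)."""
    name: str
    kind: str
    children: List["Node"] = field(default_factory=list)
    k: int = 0  # number of failed children needed by a VOT gate


def lcm_children(v: Node) -> int:
    """lcm_v: least common multiple of the children's maximum importance."""
    return reduce(lambda a, b: a * b // gcd(a, b),
                  (max_importance(w) for w in v.children), 1)


def max_importance(v: Node) -> int:
    """max_{I_v}: value reached when every normalised child term equals 1."""
    if v.kind == "BE":
        return 1
    terms = {"AND": len(v.children), "OR": 1, "VOT": v.k}[v.kind]
    return lcm_children(v) * terms


def importance(v: Node, failed: Set[str]) -> int:
    """I_v(x), where the state x is given as the set of failed basic elements."""
    if v.kind == "BE":
        return 1 if v.name in failed else 0  # z_v
    scale = lcm_children(v)
    # lcm_v * I_w(x) / max_{I_w}, kept as an exact integer
    scaled = [importance(w, failed) * (scale // max_importance(w))
              for w in v.children]
    if v.kind == "AND":
        return sum(scaled)
    if v.kind == "OR":
        return max(scaled)
    if v.kind == "VOT":
        return sum(sorted(scaled, reverse=True)[:v.k])  # k largest terms
    raise ValueError(f"unsupported gate type: {v.kind}")


# Binary AND whose children have maximum importance 2 and 6, as in the text.
left = Node("l", "AND", [Node(f"a{i}", "BE") for i in range(2)])
right = Node("r", "AND", [Node(f"b{i}", "BE") for i in range(6)])
top = Node("v", "AND", [left, right])

x = {"a0"}                    # I_l = 1, I_r = 0
x_prime = {"b0", "b1", "b2"}  # I_l = 0, I_r = 3
print(importance(top, x), importance(top, x_prime))  # prints "3 3"
```

Both "half-failed" states receive importance 3, whereas the unnormalised sum would give 1 and 3. Multiplying by `scale // max_importance(w)` keeps the arithmetic in integers, which is the role of the lcm factor.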
4.2 Automatic importance splitting for FTA

Our compositional importance function is based on the distribution of operational/failed basic elements in the fault tree, and their failure order. This follows the core idea of importance splitting: the more failed BEs/SBEs (in the right order), the closer a tree is to its top event failure.

However, ISPLIT is about running more simulations from states with higher probability to lead to rare states. This is only partially reflected by whether basic element $b$ is failed. Probabilities lie also in the distributions $F_b$, $R_b$, $D_b$. These distributions govern the transitions among states $x \in S$, and can be exploited for importance splitting. We do so using the two-phased approach of [12, 13], which in a first (static) phase computes an importance function, and in a second (dynamic) phase selects the thresholds from the resulting importance values.

In our current work, the first phase runs breadth-first search in the IOSA module of each tree node. This computes node-local importance functions, which are aggregated into a tree-global $I$ using our compositional function in Table 2.

The second phase involves running "pilot simulations" on the importance-labelled states of the tree. Running simulations exercises the fail/repair distributions of BEs/SBEs, imprinting this information in the thresholds $\ell_k$. Several algorithms can do such selection of thresholds. They operate sequentially, starting from the initial state (a fully operational tree), which has importance $i_0 = 0$. For instance, Expected Success [11] runs $N$ finite-life simulations. If $K < \frac{N}{2}$ simulations reach the next smallest importance $i_1 > i_0$, then the first threshold will be $\ell_1 = i_1$. Next, $N$ simulations start from states with importance $i_1$, to determine whether the next importance $i_2$ should be chosen as threshold $\ell_2$, and so on.
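The sequential rule just described can be sketched as follows. This is a simplified illustration, not the Expected Success implementation of [11]: the function name `select_thresholds` and the `pilot` callback (which returns how many of `n` runs started at one importance level reach a target level) are hypothetical, and the sketch assumes that, when a candidate importance is rejected, the next pilot runs still start from the last accepted level, a detail that the description above leaves open.

```python
from typing import Callable, List, Sequence


def select_thresholds(importances: Sequence[int],
                      pilot: Callable[[int, int, int], int],
                      n: int) -> List[int]:
    """Pick as thresholds the importance values reached by fewer than n/2 pilot runs.

    importances: increasing importance values, starting with the initial one (i_0).
    pilot(start, target, n): number of runs, out of n, going from `start` to `target`.
    """
    thresholds: List[int] = []
    start = importances[0]
    for candidate in importances[1:]:
        k = pilot(start, candidate, n)
        if k < n / 2:            # reaching `candidate` is hard enough: split here
            thresholds.append(candidate)
            start = candidate    # later pilot runs start from the new threshold
    return thresholds


# Toy pilot with made-up success counts out of n = 100 runs.
counts = {(0, 1): 80, (0, 2): 40, (2, 3): 30, (3, 4): 90}


def fake_pilot(start: int, target: int, n: int) -> int:
    return counts[(start, target)]


print(select_thresholds([0, 1, 2, 3, 4], fake_pilot, 100))  # prints [2, 3]
```

In a real setting `pilot` would run simulations of the model, so the selected thresholds reflect the fail and repair distributions rather than fixed counts.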
Expected Success also computes the effort per splitting region $S_k = \{ x \in S \mid \ell_{k+1} > I(x) \geqslant \ell_k \}$. For Fixed Effort, "effort" is the base number of simulations to run in region $S_k$. For RESTART, it is the number of clones spawned when threshold $\ell_{k+1}$ is up-crossed. In general, if $K$ out of $N$ pilot simulations make it from $\ell_{k-1}$ to $\ell_k$, then the $k$-th effort is $\lceil N/K \rceil$. This is chosen so that, during RES estimations, one simulation makes it from threshold $\ell_{k-1}$ to $\ell_k$ on average.

Thus, using the method from [12, 13] based on our importance function $I_\triangle$, we compute (automatically) the thresholds and their effort for tree $\triangle$. This is all the meta-information required to apply importance splitting RES [23, 22, 12].

5 Tool chain implementation

We implemented the theory introduced in Sec. 4 in a full-stack tool chain. Its inputs are plain text files in the Galileo textual format [51, 16, 50], a widespread syntax to describe fault trees [9, 20, 37]. Galileo was not designed for repairs, and has limited support for non-Markovian distributions: we thus extend it to fit our needs. Fig. 4 shows the tool chain: a converter parses the RFT defined in extended Galileo; it generates an IOSA model, property queries, and a compositional importance function (using Table 2); from this input, the FIG tool can implement RES (importance splitting) algorithms, and use them to estimate system unreliability and unavailability.

Fig. 4: Tool chain. The RFT model (extended Galileo) is fed to the RFT → IOSA converter, which outputs the IOSA semantic model, the property query (metric), and the importance function; FIG takes these and computes the metrics.

5.1 Extensions to Galileo

Standard Galileo supports exponential, log-normal, and Weibull PDFs. We use the keyword EXT_failPDF to define arbitrary failure distributions. In Code 1, the SPARE gate (Gate2) has its primary (BE_C) and one spare (BE_D), whose resp. fail PDFs are Rayleigh ($\sigma = 0.06$) and exponential ($\lambda = 0.0011$).
We also allow the dormancy PDF of an SBE to be independent of its fail PDF. For this we add the EXT_dormPDF keyword to define an arbitrary dormancy PDF. Thus we define the dormancy of BE_D as an Erlang($k = 3$, $\lambda = 9$) in Code 1. In the current implementation, a new time to failure is sampled (from the corresponding PDF) when the SBE is activated upon failure of the primary BE. This is a simplification since we work with potentially non-Markovian distributions; more realistic implementations are proposed as future work in Sec. 8.

    toplevel "Gate2";
    "Gate2" wsp "BE_C" "BE_D";
    "BE_C" EXT_failPDF=rayleigh(6.0E-2);
    "BE_D" lambda=1.11E-3 EXT_dormPDF=erlang(3,9);

Code 1: SBEs and arbitrary PDFs in extended Galileo

Finally, we also extend Galileo with the keywords repairbox_priority and EXT_repairPDF. These respectively define arbitrary repair policies for the RBOX elements, and the repair PDFs of BEs and SBEs. All BEs in Code 2 are repairable, with repair time uniformly distributed on the real intervals [8, 24] and [8, 12]. The last line of the code defines the RBOX of the system, which handles one repair at a time. Its repair policy determines which BE to choose when more than one is failed at the same time. For instance, if all BEs fail and the RBOX "finishes repairing" BE_G, it will next repair BE_E (before BE_F).

    toplevel "Gate3";
    "Gate3" and "BE_E" "BE_F" "BE_G";
    "BE_E" lambda=6.0E-5 EXT_repairPDF=uniform(8,24);
    "BE_F" lambda=7.0E-5 EXT_repairPDF=uniform(8,24);
    "BE_G" lambda=6.0E-5 EXT_repairPDF=uniform(8,12);
    "RB1" repairbox_priority "BE_E" "BE_F" "BE_G";

Code 2: Repairs in extended Galileo

5.2 Converter: RFT → IOSA

We also implemented the IOSA semantics for RFT [38]. For this we developed a Java textual converter whose input is an RFT defined in extended Galileo. The converter outputs the IOSA semantics of the tree, and the composition function for the corresponding local importance functions using Table 2.

Algorithm 1: Conversion from RFT to IOSA

    procedure RFTtoIOSA(in, out[3])
        rft ← parseRFT(in)
        convertDynamicGates(rft)
        cif ← templateImpFun(rft)
        out[0] ← convertTree(rft)
        out[0] ← convertRBOX(rft, out[0])
        out[1] ← generateProperties(out[0])
        out[2] ← generateImpFun(cif, out[0])
    end procedure

The conversion procedure is shown as Algorithm 1. The AST parsed from the RFT is first processed to convert the dynamic PAND and FDEP gates as described in Sec. 4.1. From the resulting tree we implement (a template of) the importance function following Table 2. Next, the IOSA for each tree node is computed following [38]. Once all automata names are thus defined, the RBOX is built and added to the (now final) IOSA semantics. The property queries (system unreliability and unavailability) are then defined as well. Finally, the importance function template is filled with the IOSA automata names, and returned as a complete importance function.

Regarding the queries, unreliability and unavailability are encoded as variants of PCTL [27] and CSL [2] that FIG can take as input. For instance, say we want to estimate the unreliability of the RFT defined in Code 2 at $T = 15.5$. If the converter defined the variable count, internal to the IOSA module corresponding to the Gate3 AND, then generateProperties() produces a PCTL query as in Code 3.
    properties
      P( U<=15.5 Gate3.count==3 )
    endproperties

Code 3: Unreliability property query for Code 2

5.3 FIG: RES to estimate rare dependability metrics

The FIG tool was devised to study temporal logic queries of IOSA models [10], described either in their native syntax or in the JANI model exchange format [14]. Using RES embedded in statistical model checking, FIG computes (arbitrary) CIs that estimate the degree to which a model complies with a property specification.

FIG was designed for automatic RES, implementing the algorithms from [12, 13] to derive an importance function from the system model. It can select an aggregation operator, to compose the local functions computed for the modules of the system. This, however, depends on the property query, and does not lead to high quality importance functions for FTA, where the structure is in the tree and not in the query (see [10] and our discussion in Sec. 4.1).

FIG can also be given a composition function as input, to aggregate the local importance functions of the system modules. This is the feature used by our RFT → IOSA converter, as detailed in Sec. 5.2. Thus, from the IOSA model and the importance function produced by our converter, FIG performs RES to compute CIs around the dependability metrics queried.

6 Experimental evaluation

6.1 General setup

Using our tool chain, we have verified the efficiency of the theory introduced in Sec. 4. We experimented on 26 repairable non-Markovian DFTs using different simulation algorithms: 1. Standard Monte Carlo (SMC); 2. RESTART with thresholds selected via the Sequential Monte Carlo algorithm [10, 11] for different splitting values (RST-$n$ for $n = 2, 3, 5, 8, 11$); 3. RESTART with thresholds selected via the Expected Success algorithm [11] (RST-ES); and 4.
Fixed Effort [23, 11] for different numbers of runs performed in each importance region (FE-$n$ for $n = 8, 12, 16, 24, 32$). RES algorithms were implemented using the importance function defined in Table 2, by following the theory from [12, 13, 11] to choose thresholds and splitting values automatically.

We ran our experiments on two types of nodes of a SLURM cluster running 64-bit Linux (Ubuntu, kernel 3.13.0-168): korenvliet nodes have Intel® Xeon® E5-2630 v3 CPUs @ 2.40 GHz and 64 GB of DDR4 RAM @ 1600 MHz; caserta has Intel® Xeon® E7-8890 v4 CPUs @ 2.20 GHz and 2 TB of DDR4 RAM @ 1866 MHz.

6.2 Division of experimental instances

We experimented on seven case studies. These were originally Markovian and without repairs [21, 52, 25, 46]. To turn them into non-Markovian RFTs we added RBOX elements and modified their fail and repair PDFs as detailed in Sec. 6.3. Moreover, to delineate the performance boost of our theory in the analysis of rare dependability metrics, we tested each case study in increasingly resilient configurations. For this, we parameterised them: a higher value of the parameter in a case study implies a more resilient system, i.e. smaller unavailability or unreliability values. The values of the parameters are given in Table 3 and described below. Figs. 8 and 9 show that the rarer the metric, the more efficient our RES implementation becomes w.r.t. SMC.

Table 3: General overview of experimental setting. The columns P., Est., and TO. are the subcolumns of Difficulty.

  Metric          Case study     P.   Est.         TO.    Simulation algorithms
  UNAVAILABILITY  VOT            2    8.47×10^-4   5 m    SMC, RST-{2,3,5,8,11}
                                 3    1.94×10^-5   30 m
                                 4    4.70×10^-7   3 h
                  HECS           1    6.26×10^-3   5 s    SMC, RST-ES, RST-{2,5,8,11}
                                 2    6.11×10^-5   20 s
                                 3    1.56×10^-6   2 m
                                 4    1.16×10^-7   10 m
                                 5    2.02×10^-8   1 h
                  RC             3    3.73×10^-5   30 s   SMC, RST-ES, RST-{2,5,8,11}
                                 4    3.39×10^-6   5 m
                                 5    5.07×10^-7   30 m
                                 6    1.02×10^-7   2 h
                  RWC            1    4.88×10^-4   30 s   SMC, RST-{2,5,8,11}
                                 2    3.15×10^-5   5 m
                                 3    3.03×10^-6   30 m
                                 4    4.55×10^-7   2 h
  UNRELIABILITY   DSPARE         3    7.03×10^-4   5 m    SMC, RST-{2,3,5,8,11}, FE-{8,16,32}
                                 4    6.08×10^-5   30 m
                                 5    7.31×10^-6   3 h
                  HECS           2    1.98×10^-3   20 s   SMC, RST-{2,5,8,11}, FE-{8,16,32}
                                 3    3.60×10^-5   5 m
                                 4    2.35×10^-6   30 m
                                 5    2.61×10^-7   3 h
                  FTPP (triad)   4    1.20×10^-2   30 s   SMC, RST-{2,5}, FE-{8,12,16,24}
                                 5    2.49×10^-4   4 m
                                 6    6.34×10^-7   40 m
                  HVC            4    1.11×10^-2   90 s   SMC, RST-{2,5,8,11}, FE-8
                                 5    4.61×10^-4   5 m
                                 6    3.44×10^-5   30 m
                                 7    4.17×10^-6   2 h
                  RWC            2    7.03×10^-4   5 m    SMC, RST-{2,5,8,11}
                                 3    6.08×10^-5   30 m
                                 4    7.31×10^-6   2 h

For each parametric case study we compare the simulation algorithms mentioned above. We estimate system unavailability and unreliability at time $T = 10^3$. For each combination of metric, fault tree, and algorithm (an instance) we computed CIs of 95% confidence level around the point estimate for the metric. To do so, we ran simulations with FIG for predefined wall-clock runtimes (that depend on the case and parameter as detailed in Table 3), and built 10 CIs for each instance. We then compared the average width of the CIs per instance. The algorithm with the most precise (narrowest) intervals was the most efficient to compute that metric on that tree. In Sec. 6.4 we show that for the most resilient configurations of all case studies, RES algorithms implemented from our importance functions are more efficient (and as automatic) than SMC.

An overview of the full experimental setting is given in Table 3. The parameterised configurations of all case studies are detailed in Difficulty. Its subcolumns are: [P.] that gives the parameter value of each case study (see Sec. 6.3); [TO.] for "Time-Out," i.e.
the simulation runtime, higher for the more resilient configurations of a case study to let the algorithms sample some rare event; and [Est.] that gives the point estimate averaged over all values†, ranging over all simulation algorithms and the 10 (repeated and independent) computations performed for each tree and metric.

Note that the Time-Out chosen for a (parameterised) case study may be insufficient for certain algorithms to observe any rare event, e.g. for SMC. If that happens, the stochastic model checker FIG reports a "null estimate" [0, 0]. Moreover, the simulation of random events depends on the RNG (and the seed) used by FIG, so different runs may yield different CIs. To account for these factors when assessing the outcome of each instance, we computed each CI 10 times. This gives us three dimensions to assess the performance of an algorithm in an instance: (i) how many times it yielded a not-null estimate, (ii) what the average width of the resulting CIs was for that case study and parameter (considering not-null estimates only), and (iii) what the variance of those widths was.

For example, running simulations for 2 minutes, we estimated the unavailability of the parameterised case study "HECS-3." Using SMC we computed 10 independent CIs. The same was done for each of RST-{2,5,8,11} and RST-ES. Results are shown as whisker-bar plots in Sec. 6.4. Each bar corresponds to the CIs computed for an instance, i.e. a specific algorithm on one case study with a certain parameter. The height of the bar is the mean CI width for the 10 iterations of the algorithm (discarding null estimates). The whiskers on top of it are the standard deviation of these widths, and a bold number at its base (e.g. 3, 10) indicates how many iterations of the algorithm yielded not-null estimates.

The CIs themselves (whose width we compare in Sec. 6.4) for unavailability of HECS-2, HECS-3, and HECS-4 are shown in Fig. 5.
The three horizontal lines are their corresponding unavailability values: $6.11 \times 10^{-5}$, $1.56 \times 10^{-6}$, and $1.16 \times 10^{-7}$. Here we tested algorithms SMC, RST-ES, and RST-{2,5,8,11}. To explore RES diversity, yet keep the amount of experimentation manageable, in other cases we tested different algorithms (see Table 3).

† We removed outliers using a modified Z-score with $m = 2$ [29].

Fig. 5: CIs for unavailability of HECS (algorithms: SMC, RST-2, RST-5, RST-8, RST-11, RST-ES)

Fig. 6: CIs for unreliability of HVC (algorithms: SMC, RST-2, RST-5, RST-8, RST-11, FE-8)

The 10 CIs computed per instance are separated in Fig. 5 by vertical grey dashed lines. The CIs are the coloured vertical error-bars. Some of them are the trivial real interval [0, 1] and appear as vertical coloured lines; e.g. all iterations of SMC for HECS-4 except for the 4th, 8th, and 10th. Sometimes only one extreme of the interval is not trivial, e.g. the 4th and 10th iterations of SMC for HECS-4, which respectively yielded $[0, 6.31 \times 10^{-7}]$ and $[0, 5.07 \times 10^{-7}]$. When not even a point estimate was computed for an iteration, the CI is missing completely from the plot, e.g. the 8th iteration of SMC for HECS-4.

Fig. 6 shows the CIs computed in the same way for unreliability of the HVC case study. In this case we experimented with the algorithms SMC, FE-8, and RST-{2,5,8,11}. It can be seen that for HVC-6 (the downmost horizontal line at $3.44 \times 10^{-5}$) only the third iteration of SMC yielded a complete CI, and it is very wide. In contrast, algorithms like RST-2 and RST-8 always converged to reasonable CI estimates in 30 m of simulation runtime for this system configuration. This is the trend with all experiments: as expected, the rarer the metric, the wider the CIs computed in the time limit via SMC, and at some point it becomes infeasible to converge to non-trivial CIs.
In contrast, RES algorithms (implemented from our importance function) can still compute CIs in the most extreme situations that we tested with our case studies. This is conveyed in Sec. 6.4 via whisker-bar plots, which show the average width of CIs achieved per instance.

6.3 The case studies

We briefly describe the seven parametric case studies: VOT and DSPARE were devised for this work, to check whether RES is efficient on such tree structures and probe different simulation runtime limits; FTPP and HECS were taken from the literature on FTA [21, 52]; and RC, HVC, and RWC concern industrial railroad systems [26, 25, 46]. The structure of all these systems is presented in Fig. 7; the fail, repair, and dormancy PDFs of their BEs and SBEs are given in Table 4.

VOT (Fig. 7a) The first case study is a binary AND gate whose children are VOT gates, whose children are basic elements. All BEs are connected to a single RBOX, which first repairs children of VOT-A and, if all are operational, then repairs any failed child of VOT-B. VOT-A is a VOT$_{k_A}$ gate with $n_A$ children, and analogously for VOT-B, where $n_A = n_B - 1$, $k_A = n_A - 3$, and $k_B = n_B - 2$. From the $n_A$ children of VOT-A, $n_B - 4$ BEs are also children of VOT-B. VOT is parameterised on $n_B = 8, 9, 10$ resp. for VOT-{2,3,4}. Thus in VOT-2 in Fig. 7a, VOT-A is a VOT$_4$ with 7 children, and VOT-B is a VOT$_6$ with 8 children.

DSPARE (Fig. 7b) This is a ternary AND gate whose children are SPARE gates, whose children are basic elements. SBEs are shared among all SPAREs: each gate has a unique primary BE and $n$ spare BEs. The parameterisation is on $n \in \{3, 4, 5\}$: Fig. 7b shows DSPARE-3. All basic elements are connected to a single RBOX, whose priority is to first repair failed SBEs and, if all are operational, then repair failed BEs.

HECS (Fig.
7d) The Hypothetical Example Computer System is a classic case study from FTA literature [52]. We study the variant with two memory-unit interfaces, which affect the memory banks via functional dependencies. Furthermore, we define one RBOX for each subsystem: Interface, Memory, Processors, and Bus. HECS is parameterised on the number of parallel buses (BEs $B_k$) and shared spare processors (SBEs $PS_b$). For $n \in \{1, 2, 3, 4, 5\}$ HECS-$n$ has $n$ shared spare processors and $2n$ parallel buses; Fig. 7d depicts HECS-2.

FTPP (Fig. 7f) The Fault Tolerant Parallel Processor is another classic FT. We implemented the grouped cold-spare variant from [21], where all triads depend on all network elements, and there is an independent SBE per triad. We study an individual triad: the tree root is thus a VOT$_2$ with 3 SPAREs as children (see Fig. 7f). We defined independent repair boxes for the network and processing elements; the RBOX in charge of processors prioritises the repair of primary BEs. FTPP is parameterised on the number of (shared) SBEs of the triad. For $n \in \{4, 5, 6\}$ FTPP-$n$ has $n$ shared SBEs; Fig. 7f depicts FTPP-4.

RC (Fig. 7e) The Relay Cabinets subsystem is one of the components of the railways cabinets example in Fig. 2. It is a VOT$_k$ gate with $k + 2$ children: SPAREs with one (independent) SBE besides their primary BE. There is a single RBOX for all basic elements, which prioritises repairs of primary BEs. RC is parameterised in the number of SPAREs that need to fail to cause a top event: $k \in \{3, 4, 5, 6\}$; Fig. 7e depicts RC-3.

HVC (Fig. 7c) High Voltage Cabinets is the other main component of the railways cabinet example. This is a VOT$_2$ gate with 4 SPARE children. Here however the SBEs are shared among all SPAREs: HVC is parameterised in the amount of these SBEs, $n \in \{4, 5, 6, 7\}$, with Fig. 7c depicting HVC-4. The single RBOX is analogous to the one in RC.

RWC (Fig. 7g) The full Railways Cabinet case study combines RC and HVC with a VOT. The SPAREs of RC are direct children of this gate, whereas the high voltage cabinets are interfaced via an OR. The parameterisation of RWC, $m \in \{1, 2, 3, 4\}$, combines those of its subsystems. RWC-$m$ uses RC-$(m + 1)$ and HVC-$(m + 2)$; Fig. 7g depicts RWC-2.

Fig. 7: Case studies used for experimentation. (a) VOT, (b) DSPARE, (c) HVC, (d) HECS, (e) RC, (f) FTPP, (g) RWC.

Table 4: Basic elements of the case studies

  Case study  Basic element  Fail time PDF      Rep. PDF           Dorm. PDF
  VOT         BE-A           lnor(4.37, 0.33)   uni(0.4, 0.95)
              BE-B           wei(4.5, 0.0125)   uni(0.4, 0.95)
  DSPARE      BE             exp(0.07)          uni(1.0, 2.0)
              SBE            exp(0.07)          uni(1.0, 2.0)      exp(0.035)
  HECS        SW             exp(4.5×10^-12)    uni(28.0, 56.0)
              HW             exp(1.0×10^-10)    uni(28.0, 56.0)
              MI_i           exp(5.0×10^-9)     uni(21.0, 28.0)
              M_j            exp(6.0×10^-8)     uni(21.0, 28.0)
              B_k            exp(8.7×10^-4)     lnor(4.45, 0.24)
              P_a            exp(1.0×10^-3)     lnor(4.45, 0.24)
              PS_b           exp(1.5×10^-3)     lnor(4.45, 0.24)   dir(∞)
  FTPP        NE_i           lnor(6.5, 0.5)     nor(150.0, 50.0)
              B_j            exp(2.8×10^-2)     nor(15.0, 3.0)
              SBE_k          exp(2.8×10^-2)     nor(15.0, 3.0)     dir(∞)
  RC          BE_i           exp(0.04)          nor(2.0, 0.7)
              SBE_j          exp(0.04)          nor(2.0, 0.7)      exp(0.5)
  HVC         BE_i           ray(1.999)         uni(0.15, 0.45)
              SBE_j          ray(1.999)         uni(0.15, 0.45)    erl(3.0, 0.25)

  Abbreviation  Distribution
  dir(x)        Dirac(x)
  exp(λ)        exponential(λ)
  erl(k, λ)     Erlang(k, λ)
  uni(a, b)     uniform on the real interval [a, b]
  ray(σ)        Rayleigh(σ)
  wei(k, λ)     Weibull(k, λ)
  nor(µ, σ)     normal(µ, σ)
  lnor(µ, σ)    log-normal(µ, σ)

6.4 Results of experimentation: comparing CI widths

Using SMC and RESTART we computed UNAVA for VOT-{2,3,4}, HECS-{1,2,3,4,5}, RC-{3,4,5,6}, and RWC-{1,2,3,4}. FE was not used since it requires regeneration theory for steady-state analysis [23], which is not always feasible with non-Markovian models.

Fig. 8: CI precision for system unavailability. (a) VOT, (b) HECS, (c) RC, (d) RWC.

The mean widths of the CIs achieved per instance (using 95% confidence level) are shown in Fig. 8. For example for VOT-2 (Fig. 8a), 10 independent computations with SMC ran in caserta for 5 min, and all converged to not-null CIs (10). The mean width of these CIs was $1.40 \times 10^{-4}$ and their standard deviation $7.96 \times 10^{-6}$. For VOT-3, all SMC computations yielded not-null CIs (after 30 min) with an average precision of $9.62 \times 10^{-6}$ and standard deviation $1.52 \times 10^{-6}$. For VOT-4 all SMC simulations yielded null CIs after 3 hours of simulation. Instead, RST-2 converged to 10, 10, and 5 not-null CIs resp.
for VOT-{2,3,4}, with mean widths (and standard deviation): $1.24 \times 10^{-4}$ ($1.19 \times 10^{-5}$), $5.09 \times 10^{-6}$ ($1.48 \times 10^{-6}$), and $1.79 \times 10^{-7}$ ($3.19 \times 10^{-8}$). Thus for the VOT case study, RST-2 was consistently more efficient than SMC, and the efficiency gap increased as UNAVA became rarer.

This trend repeats in all experiments: as expected, the rarer the metric, the wider the CIs computed in the time limit, until at some point it becomes very hard to converge to not-null CIs at all (especially for SMC). For the least resilient configuration of each case study, SMC can be competitive or even more efficient than some ISPLIT variants. For instance for VOT-2 and HECS-1 in Figs. 8a and 8b, all computations converged to not-null CIs for all algorithms, but SMC exhibits less variable CI widths, viz. smaller whiskers. This is reasonable: truncating and splitting traces in RESTART adds (i) simulation overhead that may not pay off to estimate not-so-rare events, and on top of it (ii) correlations of cloned traces that share a common history, increasing the variability among independent runs. On the other hand and as expected, SMC loses this competitiveness for all case studies as failures become rarer, here when UNAVA $\leqslant 1.0 \times 10^{-5}$. This holds nicely for the biggest case studies: HECS-5‡ (a 42-node RFT whose IOSA has 126 non-clock variables, i.e. $\approx 2.89 \times 10^{38}$ states, with 57 clocks of exponential, uniform, and log-normal PDFs) and RWC-4 (42 nodes, 181 variables, i.e. $\approx 6.93 \times 10^{73}$ states, 62 clocks of exponential, Erlang, Rayleigh, uniform, and normal PDFs).

We also estimated the UNREL$_{1000}$ of DSPARE-{3,4,5}, RWC-{2,3,4}, FTPP-{4,5,6}, HVC-{4,...,7}, and HECS-{2,...,5} using SMC, RESTART, and FE. For HVC (only) we ran 20 experiments per tree, 10 in each cluster node. Fig. 9 shows the results.
[Fig. 9: CI precision for system unreliability. Panels: (a) DSPARE, (b) HECS, (c) FTPP, (d) RWC, (e) HVC.]

‡ RST-8 for HECS-5 escapes this trend: analysing the execution logs it was found that FIG crashed during the second computation.

The overall trend shown for unreliability estimations is similar to the previous unavailability cases. Here however it was possible to use Fixed Effort, since every simulation has a clearly defined end at time $T = 10^3$. It is thus interesting to compare the efficiency of RESTART vs. FE: we note for example that some variants of FE performed considerably better than any other approach in the most resilient configurations of FTPP and HECS. It is nevertheless difficult to draw general conclusions from Figs. 9a to 9b, since some variants that performed best in a case study (e.g. FE-16 in HECS) did worse in others (e.g. FTPP, where the best algorithms were FE-8,12). Furthermore, FE-8, which is always better than SMC when $\mathrm{UNREL}_{1000} < 10^{-3}$, did not perform very well in HVC, where the algorithms that achieved the narrowest and most not-null CIs were RST-5,11. Such cases notwithstanding, FE is a solid competitor of RESTART in our benchmark.
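As a toy illustration of Fixed Effort (not the FIG implementation, and not a fault tree), the sketch below estimates the probability that a downward-biased random walk started at 1 reaches a level `top` before hitting 0. The assumptions are ours: the importance function is the walk's position, thresholds sit at every integer level, the same effort is used at each stage, and hitting 0 plays the role of the clearly defined end that the time bound T provides in the experiments. All function and parameter names are hypothetical.

```python
import random


def fixed_effort(p_up: float, top: int, effort: int, rng: random.Random) -> float:
    """Fixed Effort estimate of P(reach `top` before 0 | start at 1)."""
    estimate = 1.0
    for level in range(1, top):
        hits = 0
        # Launch `effort` traces from the current threshold.
        for _ in range(effort):
            state = level
            # Simulate until the next threshold is up-crossed (success)
            # or the walk is absorbed at 0 (failure).
            while 0 < state <= level:
                state += 1 if rng.random() < p_up else -1
            if state > level:
                hits += 1
        if hits == 0:
            # No trace reached the next threshold: the estimate is null.
            return 0.0
        # Multiply by the estimated conditional probability of this stage.
        estimate *= hits / effort
    return estimate


def exact_probability(p_up: float, top: int) -> float:
    """Closed-form gambler's-ruin probability, used as a sanity check."""
    if p_up == 0.5:
        return 1.0 / top
    r = (1.0 - p_up) / p_up
    return (1.0 - r) / (1.0 - r ** top)


if __name__ == "__main__":
    rng = random.Random(2024)
    print("fixed effort:", fixed_effort(0.3, 8, 5000, rng))
    print("exact       :", exact_probability(0.3, 8))
```

Because the walk moves one step at a time, every trace enters a stage from the same state, so the product of per-stage success fractions estimates the rare-event probability directly. In a fault tree model the entry states of a stage differ, and an implementation would have to store and resample them; that part is omitted here.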
Another relevant point of study is the optimal effort e for RST-e or FE-e, which shows no clear trend in our experiments. Here, e is a “global effort” used by these algorithms, equal for all $S_k$ regions. e also alters the way in which the thresholds selection algorithm Sequential Monte Carlo (SEQ [13]) selects the $\ell_k$. The lack of guidelines to select a value for e that works well across different systems was raised in [10]. This motivated the development of Expected Success (ES [11]), which selects efforts individually per $S_k$ (or $\ell_k$). Thus, in RST-ES, a trace up-crossing threshold $\ell_k$ is split according to the individual effort $e_k$ selected by ES. In the benchmark of [11], which consists mostly of queueing systems, ES was shown superior to SEQ. However, experimental outcomes on DFTs in this work are different: for UNAVA, RST-ES yielded mildly good results for HECS and RC; for the other case studies and for all $\mathrm{UNREL}_{1000}$ experiments, RST-ES always yielded null CIs. It was found that the effort selected for most thresholds $\ell_k$ was either too small (so splitting in $e_k$ was not enough for the RST-ES trace to reach $\ell_{k+1}$) or too large (so there was a splitting/truncation overhead). This point is further addressed in the conclusions.

Beyond comparisons among the specific algorithms, be these for RES or for selecting thresholds, it seems clear that our approach to FTA via ISPLIT delivers the expected results. For each parameterised case study $CS_p$, we could find a value of the parameter p where the level of resilience is such that SMC is less efficient than our automatically-constructed ISPLIT framework. This is particularly significant for big DFTs like HECS and RWC, whose complex structure could be exploited by our importance function.

7 Related work

Most work on DFT analysis assumes discrete [52, 4] or exponentially distributed [17, 35] components failure.
Furthermore, components repair is seldom studied in conjunction with dynamic gates [7, 4, 48, 35, 37]. In this work we address repairable DFTs, whose failure and repair times can follow arbitrary PDFs. More in detail, RFTs were first formally introduced as stochastic Petri nets in [7, 40]. Our work stands on [38], which reviews [40] in the context of stochastic automata with arbitrary PDFs. In particular we also address non-Markovian continuous distributions: in Sec. 6 we experimented with exponential, Erlang, uniform, Rayleigh, Weibull, normal, and log-normal PDFs. Furthermore and for the first time, we consider the application of [40, 38] to study rare events.

Much effort in RES has been dedicated to study highly reliable systems, deploying either importance splitting or sampling. Typically, importance sampling can be used when the system takes a particular shape. For instance, a common assumption is that all failure (and repair) times are exponentially distributed with parameters $\lambda^i$, for some $\lambda \in \mathbb{R}$ and $i \in \mathbb{N}_{>0}$. In these cases, a favourable change of measure can be computed analytically [24, 28, 39, 41, 58, 47].

In contrast, when the fail/repair times follow less-structured distributions, importance splitting is more easily applicable. As long as a full system failure can be broken down into several smaller components failures, an importance splitting method can be devised. Of course, its efficiency relies heavily on the choice of importance function. This choice is typically done ad hoc for the model under study [53, 36, 55]. In that sense [30, 31, 12, 13] are among the first to attempt a heuristic derivation of all parameters required to implement splitting. This is based on formal specifications of the model and property query (the dependability metric). Here we extended [12, 13, 10], using the structure of the fault tree to define composition operands.
With these operands we aggregate the automatically-computed local importance functions of the tree nodes. This aggregation results in an importance function for the whole model.

8 Conclusions

We have presented a theory to deploy automatic importance splitting (ISPLIT) for fault tree analysis of repairable dynamic fault trees (RFTs). This Rare Event Simulation approach supports arbitrary probability distributions of components failure and repair. The core of our theory is an importance function $I_\triangle$ defined structurally on the tree. From such function we implemented ISPLIT algorithms, and used them to estimate the unreliability and unavailability of highly-resilient RFTs. Departing from classical approaches, which define importance functions ad hoc using expert knowledge, our theory computes all metadata required for RES from the model and metric specifications. Nonetheless, we have shown that for a fixed simulation time budget and in the most resilient RFTs, diverse ISPLIT algorithms can be automatically implemented from $I_\triangle$, and always converge to narrower confidence intervals than standard Monte Carlo simulation.

There are several paths open for future development. First and foremost, we are looking into new ways to define the importance function, e.g. to cover more general categories of FTs such as fault maintenance trees [44]. It would also be interesting to look into possible correlations among specific RES algorithms and tree structures, that yield the most efficient estimations for a particular metric. Moreover, we have defined $I_\triangle$ based on the tree structure alone. It would be interesting to further include stochastic information in this phase, and not only afterwards during the thresholds-selection phase. Regarding thresholds, the relatively bad performance of the Expected Success algorithm shows a spot for improvement.
In general, we believe that enhancing its statistical properties should alleviate the behaviour mentioned in Sec. 6.4. Moreover, techniques to increase trace independence during splitting (e.g. resampling) could further improve the performance of the ISPLIT algorithms. Finally, we are investigating enhancements in IOSA and our tool chain, to exploit the ratio between fail and dormancy PDFs of SBEs in warm SPARE gates.

Acknowledgments

The authors thank José and Manuel Villén-Altamirano for fruitful discussions, who helped to better understand the application scope of our approach.

References

1. Abate, A., Budde, C.E., Cauchi, N., Hoque, K.A., Stoelinga, M.: Assessment of maintenance policies for smart buildings: Application of formal methods to fault maintenance trees. PHM Society European Conference 4(1) (2018)
2. Baier, C., Katoen, J., Hermanns, H.: Approximate symbolic model checking of continuous-time markov chains. In: CONCUR 1999. pp. 146–161 (1999). https://doi.org/10.1007/3-540-48320-9_12
3. Bayes, A.J.: Statistical techniques for simulation models. Australian computer journal 2(4), 180–184 (1970)
4. Beccuti, M., Raiteri, D., Franceschinis, G., Haddad, S.: Non deterministic repairable fault trees for computing optimal repair strategy. In: VALUETOOLS 2008 (2010). https://doi.org/10.4108/ICST.VALUETOOLS2008.4411
5. Blanchet, J., Mandjes, M.: Rare event simulation for queues. In: Rubino and Tuffin [43], pp. 87–124. https://doi.org/10.1002/9780470745403.ch5
6. Blom, H.A.P., Bakker, G.J.B., Krystul, J.: Rare event estimation for a large-scale stochastic hybrid system with air traffic application. In: Rubino and Tuffin [43], pp. 193–214. https://doi.org/10.1002/9780470745403.ch9
7. Bobbio, A., Raiteri, D.: Parametric fault trees with dynamic gates and repair boxes. In: RAMS’04. pp. 459–465 (2004). https://doi.org/10.1109/RAMS.2004.1285491
8. Boudali, H., Crouzen, P., Haverkort, B.R., Kuntz, M., Stoelinga, M.: Architectural dependability evaluation with arcade. In: DSN’08. pp. 512–521 (2008). https://doi.org/10.1109/DSN.2008.4630122
9. Boudali, H., Dugan, J.B.: A new bayesian network approach to solve dynamic fault trees. In: Annual Reliability and Maintainability Symposium, 2005. Proceedings. pp. 451–456 (2005). https://doi.org/10.1109/RAMS.2005.1408404
10. Budde, C.E.: Automation of Importance Splitting Techniques for Rare Event Simulation. Ph.D. thesis, Universidad Nacional de Córdoba, Córdoba, Argentina (2017)
11. Budde, C.E., D’Argenio, P.R., Hartmanns, A.: Better automated importance splitting for transient rare events. In: SETTA. LNCS, vol. 10606, pp. 42–58. Springer (2017)
12. Budde, C.E., D’Argenio, P.R., Hermanns, H.: Rare event simulation with fully automated importance splitting. In: EPEW 2015. LNCS, vol. 9272, pp. 275–290. Springer (2015). https://doi.org/10.1007/978-3-319-23267-6_18
13. Budde, C.E., D’Argenio, P.R., Monti, R.E.: Compositional construction of importance functions in fully automated importance splitting. In: VALUETOOLS 2016. pp. 30–37 (2017). https://doi.org/10.4108/eai.25-10-2016.2266501
14. Budde, C.E., Dehnert, C., Hahn, E.M., Hartmanns, A., Junges, S., Turrini, A.: JANI: Quantitative model and tool interaction. In: TACAS. LNCS, vol. 10206, pp. 151–168. Springer (2017). https://doi.org/10.1007/978-3-662-54580-5_9
15. Coppit, D., Sullivan, K.J., Dugan, J.B.: Formal semantics of models for computational engineering: a case study on dynamic fault trees. In: ISSRE 2000. pp. 270–282 (2000). https://doi.org/10.1109/ISSRE.2000.885878
16. Coppit, D., Sullivan, K.J.: Galileo: A tool built from mass-market applications. In: Software Engineering, 2000. Proceedings of the 2000 International Conference on. pp. 750–753. IEEE (2000)
17. Crouzen, P., Boudali, H., Stoelinga, M.: Dynamic fault tree analysis using input/output interactive markov chains. In: DSN 2007. pp. 708–717 (2007). https://doi.org/10.1109/DSN.2007.37
18. D’Argenio, P.R., Lee, M.D., Monti, R.E.: Input/output stochastic automata - compositionality and determinism. In: FORMATS 2016. LNCS, vol. 9884, pp. 53–68 (2016). https://doi.org/10.1007/978-3-319-44878-7_4
19. D’Argenio, P.R., Monti, R.E.: Input/Output Stochastic Automata with Urgency: Confluence and weak determinism. In: ICTAC. LNCS, vol. 11187, pp. 132–152. Springer (2018). https://doi.org/10.1007/978-3-030-02508-3_8
20. Distefano, S., Puliafito, A.: Dependability modeling and analysis in dynamic systems. In: 2007 IEEE International Parallel and Distributed Processing Symposium. pp. 1–8 (2007). https://doi.org/10.1109/IPDPS.2007.370601
21. Dugan, J.B., Bavuso, S.J., Boyd, M.A.: Fault trees and sequence dependencies. In: Annual Proceedings on Reliability and Maintainability Symposium. pp. 286–293 (1990). https://doi.org/10.1109/ARMS.1990.67971
22. Garvels, M.J.J., van Ommeren, J.K.C.W., Kroese, D.P.: On the importance function in splitting simulation. European Transactions on Telecommunications 13(4), 363–371 (2002). https://doi.org/10.1002/ett.4460130408
23. Garvels, M.J.J.: The splitting method in rare event simulation. Ph.D. thesis, Department of Computer Science, University of Twente (2000)
24. Goyal, A., Shahabuddin, P., Heidelberger, P., Nicola, V.F., Glynn, P.W.: A unified framework for simulating markovian models of highly dependable systems. IEEE Transactions on Computers 41(1), 36–51 (1992). https://doi.org/10.1109/12.123381
25. Guck, D., Spel, J., Stoelinga, M.: DFTCalc: Reliability centered maintenance via fault tree analysis (tool paper). In: ICFEM 2015. pp. 304–311. Springer (2015)
26. Guck, D., Katoen, J.P., Stoelinga, M., Luiten, T., Romijn, J.: Smart railroad maintenance engineering with stochastic model checking. In: Proceedings of the Second International Conference on Railway Technology: Research, Development and Maintenance, Railways 2014. Civil-Comp Proceedings, Civil-Comp Press (2014). https://doi.org/10.4203/ccp.104.299
27. Hansson, H., Jonsson, B.: A logic for reasoning about time and reliability. Formal Aspects of Computing 6(5), 512–535 (1994). https://doi.org/10.1007/BF01211866
28. Heidelberger, P.: Fast simulation of rare events in queueing and reliability models. ACM Trans. Model. Comput. Simul. 5(1), 43–85 (1995). https://doi.org/10.1145/203091.203094
29. Iglewicz, B., Hoaglin, D.: How to Detect and Handle Outliers. ASQC basic references in quality control, ASQC Quality Press (1993)
30. Jegourel, C., Legay, A., Sedwards, S.: Importance splitting for statistical model checking rare properties. In: CAV. pp. 576–591. Springer Berlin Heidelberg (2013)
31. Jégourel, C., Legay, A., Sedwards, S., Traonouez, L.: Distributed verification of rare properties with lightweight importance splitting observers. CoRR abs/1502.01838 (2015)
32. Junges, S., Guck, D., Katoen, J.P., Rensink, A., Stoelinga, M.: Fault trees on a diet. In: SETTA 2015. LNCS, vol. 9409, pp. 3–18. Springer (2015). https://doi.org/10.1007/978-3-319-25942-0_1
33. Junges, S., Guck, D., Katoen, J., Stoelinga, M.: Uncovering dynamic fault trees. In: DSN 2016. pp. 299–310. IEEE (2016). https://doi.org/10.1109/DSN.2016.35
34. Kahn, H., Harris, T.E.: Estimation of particle transmission by random sampling. National Bureau of Standards applied mathematics series 12, 27–30 (1951)
35. Katoen, J.P., Stoelinga, M.: Boosting Fault Tree Analysis by Formal Methods, pp. 368–389. Springer (2017). https://doi.org/10.1007/978-3-319-68270-9_19
36. L’Ecuyer, P., Le Gland, F., Lezaud, P., Tuffin, B.: Splitting techniques. In: Rubino and Tuffin [43], pp. 39–61. https://doi.org/10.1002/9780470745403.ch3
37. Liu, Y., Wu, Y., Kalbarczyk, Z.: Smart maintenance via dynamic fault tree analysis: A case study on singapore MRT system. In: DSN 2017. pp. 511–518 (2017). https://doi.org/10.1109/DSN.2017.50
38. Monti, R.E.: Stochastic Automata for Fault Tolerant Concurrent Systems. Ph.D. thesis, Universidad Nacional de Córdoba, Argentina (2018)
39. Nicola, V.F., Shahabuddin, P., Nakayama, M.K.: Techniques for fast simulation of models of highly dependable systems. IEEE Transactions on Reliability 50(3), 246–264 (2001). https://doi.org/10.1109/24.974122
40. Raiteri, D., Iacono, M., Franceschinis, G., Vittorini, V.: Repairable fault tree for the automatic evaluation of repair policies. In: DSN 2004. pp. 659–668 (2004). https://doi.org/10.1109/DSN.2004.1311936
41. Ridder, A.: Importance sampling simulations of markovian reliability systems using cross-entropy. Annals of Operations Research 134(1), 119–136 (2005). https://doi.org/10.1007/s10479-005-5727-9
42. Rubino, G., Tuffin, B.: Introduction to rare event simulation. In: Rare Event Simulation Using Monte Carlo Methods [43], pp. 1–13. https://doi.org/10.1002/9780470745403.ch1
43. Rubino, G., Tuffin, B. (eds.): Rare Event Simulation Using Monte Carlo Methods. John Wiley & Sons, Ltd (2009)
44. Ruijters, E., Guck, D., Drolenga, P., Peters, M., Stoelinga, M.: Maintenance analysis and optimization via statistical model checking - evaluating a train pneumatic compressor. In: QEST 2016. LNCS, vol. 9826, pp. 331–347. Springer (2016)
45. Ruijters, E., Guck, D., van Noort, M., Stoelinga, M.: Reliability-centered maintenance of the electrically insulated railway joint via fault tree analysis: A practical experience report. In: DSN 2016. pp. 662–669. IEEE Computer Society (2016)
46. Ruijters, E., Reijsbergen, D., de Boer, P.T., Stoelinga, M.: Rare Event Simulation for Dynamic Fault Trees. In: Computer Safety, Reliability, and Security. pp. 20–35. Springer International Publishing (2017)
47. Ruijters, E., Reijsbergen, D., de Boer, P.T., Stoelinga, M.: Rare event simulation for dynamic fault trees. Reliability Engineering & System Safety 186, 220–231 (2019). https://doi.org/10.1016/j.ress.2019.02.004
48. Ruijters, E., Stoelinga, M.: Fault tree analysis: A survey of the state-of-the-art in modeling, analysis and tools. Computer science review 15, 29–62 (2015)
49. Schassberger, R.: Insensitivity of Steady-State Distributions of Generalized Semi-Markov Processes. Part I. Ann. Probab. 5(1), 87–99 (1977). https://doi.org/10.1214/aop/1176995893
50. Sullivan, K.J., Dugan, J.B.: Galileo user’s manual & design overview. https://www.cse.msu.edu/~cse870/Materials/FaultTolerant/manual-galileo.htm (1998), v2.1-alpha
51. Sullivan, K., Dugan, J., Coppit, D.: The Galileo fault tree analysis tool. In: 29th Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352). pp. 232–235 (1999). https://doi.org/10.1109/FTCS.1999.781056
52. Vesely, W., Stamatelatos, M., Dugan, J., Fragola, J., Minarick, J., Railsback, J.: Fault tree handbook with aerospace applications. NASA Office of Safety and Mission Assurance (2002), version 1.1
53. Villén-Altamirano, J.: RESTART method for the case where rare events can occur in retrials from any threshold. Int. J. Electron. Commun. 52, 183–189 (1998)
54. Villén-Altamirano, J.: Importance functions for RESTART simulation of highly-dependable systems. Simulation 83(12), 821–828 (2007). https://doi.org/10.1177/0037549707081257
55. Villén-Altamirano, J.: RESTART vs splitting: A comparative study. Performance Evaluation 121-122, 38–47 (2018). https://doi.org/10.1016/j.peva.2018.02.002
56. Villén-Altamirano, M., Martínez-Marrón, A., Gamo, J., Fernández-Cuesta, F.: Enhancement of the accelerated simulation method RESTART by considering multiple thresholds. In: Proc. 14th Int. Teletraffic Congress, Teletraffic Science and Engineering, vol. 1, pp. 797–810. Elsevier (1994). https://doi.org/10.1016/B978-0-444-82031-0.50084-6
57. Villén-Altamirano, M., Villén-Altamirano, J.: RESTART: a method for accelerating rare event simulations. In: Queueing, Performance and Control in ATM (ITC-13). pp. 71–76. Elsevier (1991)
58. Xiao, G., Li, Z., Li, T.: Dependability estimation for non-markov consecutive-k-out-of-n: F repairable systems by fast simulation. Reliability Engineering & System Safety 92(3), 293–299 (2007). https://doi.org/10.1016/j.ress.2006.04.004
