SOS: Safe, Optimal and Small Strategies for Hybrid Markov Decision Processes


Authors: Pranav Ashok, Jan Křetínský, Kim Guldstrand Larsen, Adrien Le Coënt, Jakob Haahr Taankvist, Maximilian Weininger

Pranav Ashok¹, Jan Křetínský¹, Kim Guldstrand Larsen², Adrien Le Coënt², Jakob Haahr Taankvist², and Maximilian Weininger¹

¹ Technical University of Munich, Germany
² Aalborg University, Denmark

Abstract. For hybrid Markov decision processes, Uppaal Stratego can compute strategies that are safe for a given safety property and (in the limit) optimal for a given cost function. Unfortunately, these strategies cannot be exported easily since they are computed as a very long list. In this paper, we demonstrate methods to learn compact representations of the strategies in the form of decision trees. These decision trees are much smaller, more understandable, and can easily be exported as code that can be loaded into embedded systems. Despite the size compression and actual differences to the original strategy, we provide guarantees on both safety and optimality of the decision-tree strategy. On top, we show how to obtain yet smaller representations, which are still guaranteed safe, but achieve a desired trade-off between size and optimality.

1 Introduction

Cyber-physical systems are often safety-critical and hence strong guarantees on their safety are paramount. Furthermore, resource efficiency and the quality of the delivered service are strong requirements; the behaviour needs to be optimized with respect to these objectives, while of course staying within the bounds of what is still safe. In order to achieve this, controllers of such systems can be either implemented manually or automatically synthesized. In the former case, due to the complexity of the system, coming up with a controller that is safe is difficult, even more so with the additional optimization requirement.
In the latter case, the synthesis may succeed with significantly less effort, though the requirement of both safety and optimality is still a challenge for current synthesis methods. However, due to the size of the systems, the produced controllers may be very complex and hard to understand, implement, modify, or even just output. Indeed, even for moderately sized systems, we can easily end up with gigabytes-long descriptions of their controllers (in the algorithmic context called strategies). In this paper, we show how to provide a more compact representation, which can yield acceptably short and simple code for resource-limited embedded devices, and consequently can be more easily understood, maintained, modified, and debugged, and the requirements are better traceable in the final controller.

To this end, as the formalism for the compact representation we choose decision trees [42]. This representation is typically several orders of magnitude smaller than the classical explicit description and is also known for its interpretability and understandability [42, 48, 8]. The resulting encoded strategy may differ from the original one, but despite that, and despite being smaller, it is still guaranteed to be safe and as nearly-optimal as the original one. Moreover, we can trade off an additional decrease in size for a decrease in performance (getting farther from the optimum) to a desired degree, while maintaining safety.

Fig. 1: The two cars, Ego and Front. We control Ego and the environment controls Front. Both cars have an acceleration and a velocity. In addition, we know the distance between the cars.

Example 1. As a running example and one of the case studies, we use the following example introduced in [36] and expanded with stronger safety guarantees in [35]. We consider two cars Ego and Front, depicted in Figure 1.
We control Ego, whereas Front is controlled by the environment. Ego is driving behind Front, and both cars have a discrete input (the acceleration) and a continuous state (the velocity). The goal of the adaptive cruise control in Ego is, first, to stay safe (by keeping the distance between the cars greater than a given safe distance), and second, to drive as close to Front as possible, i.e. to optimize the aggregated distance between the cars. We use Uppaal Tiga [1] to get a safe strategy for Ego as in [35], and then Uppaal Stratego [18] to learn a (near-)optimal strategy for a desired cost function, given the constraints from the safe strategy. The resulting strategy is output as a list with almost 6 million configurations. Using our new methods, we obtain a decision tree representing the strategy that has only about 2713 nodes. Additionally, we can trade performance to reduce the size even further, e.g. by increasing the aggregated distance reasonably we can reduce the size to 1247 nodes.

Our contribution:
– We design and implement a framework Stratego+ to transform safe and (near-)optimal strategies into their decision-tree representation, preserving safety and the same level of optimality, while being much smaller.
– We provide several transformations and ways to decrease the size yet further while preserving safety, but relaxing the optimality to a desired extent.
– We test our methods on three case studies, where we show size reductions of up to three orders of magnitude, and quantify the additional size-performance trade-off.

Our techniques can be used to represent (finite-memory non-stochastic) strategies for arbitrary systems exhibiting non-determinism (e.g. Markov decision processes, timed/concurrent/stochastic games). This paper demonstrates the technique on hybrid Markov decision processes, as that is the formalism used in Uppaal Stratego.
Related work: The problem of computing strategies for hybrid systems has been extensively studied in the past years. Most approaches rely on abstraction techniques: the continuous and infinite state space of the system is represented with a finite number of symbols, e.g. discrete points [25, 51], sets of states [15], etc. However, it is still hard to deal with uncontrollable components; some approaches exist, such as robust control [27] or contract-based design [52], but they usually consider the uncontrollable component as a bounded perturbation and do not tackle stochastic behaviour. The tool PESSOA [49, 39] can synthesize controllers for cyber-physical systems represented by a set of smooth differential equations with a specification in a fragment of Linear Temporal Logic (LTL). Abstraction techniques are used in [28] for synthesizing strategies for a class of hybrid systems that involve random phenomena together with discrete and continuous behaviours. Discrete, stochastic dynamical systems are considered in [55], where the synthesis of strategies with respect to LTL objectives is made possible with an abstraction-refinement method. In [23] a number of benchmarks for hybrid system verification has been proposed, including a room heating benchmark. In [16] Uppaal SMC was applied to the performance evaluation of several strategies proposed in the benchmark; however, there was no focus on safety in this approach. In our work, the safety strategy synthesis relies on a discretization of the continuous variables, leading to a decidable problem that can be handled by Uppaal Tiga, but we furthermore provide safety guarantees for the original system with the use of a Timed Game abstraction based on a guaranteed Euler scheme [37].
In artificial intelligence, compact (factored) representations of Markov decision processes (MDPs) have been developed using dynamic Bayesian networks [6, 32], probabilistic STRIPS [34], algebraic decision diagrams [31], and also decision trees [6]. For a detailed survey of compact representations see [4]. Formalisms used to represent MDPs can, in principle, be used to represent strategies as well. In particular, variants of decision trees are probably the most used [6, 13, 33]. Decision trees have also been used in connection with real-time dynamic programming and reinforcement learning [5, 45]. In the context of verification, MDPs are often represented using variants of (MT)BDDs [20, 29, 40], and strategies by BDDs [56]. Learning a compact decision-tree representation of a strategy has been investigated in [38] for the case of body sensor networks, in [8] for finite (discrete) MDPs, and in [9] for finite games, but only with Boolean variables. Moreover, these decision trees can only predict a single action for a state configuration, whereas in this work we allow the trees to predict more than one action for a single configuration. In control theory, [57] proves that the problem of computing a size-optimal determinization of controllers is NP-complete and hence discusses various heuristic-based determinization algorithms. None of these works consider the optimization aspect, which, being a soft constraint, enables the trade-offs. Permissive strategies have been studied in e.g. [2, 7, 21].

2 Preliminaries

2.1 Hybrid Markov Decision Processes

We describe the mathematical modelling framework. The correspondence to the Uppaal models is straightforward.

Definition 1 (HMDP). A hybrid Markov decision process (HMDP) M is a tuple (C, U, X, F, δ) where:
1. the controller C is a finite set of (controllable) modes C = {c_1, ..., c_k},
2.
the uncontrollable environment U is a finite set of (uncontrollable) modes U = {u_1, ..., u_l},
3. X = (x_1, ..., x_n) is a finite tuple of continuous (real-valued) variables,
4. for each c ∈ C and u ∈ U, F_{c,u} : R_{>0} × R^X → R^X is the flow-function that describes the evolution of the continuous variables over time in the combined mode (c, u), and
5. δ is a family of probability functions δ_γ : U → [0, 1], where γ = (c, u, x) is a global configuration. More precisely, δ_γ(u′) is the probability that u in the global configuration γ = (c, u, x) will change to the uncontrollable mode u′.

In the following, we denote by C the set of global configurations C × U × R^X of the HMDP M. The above notion of HMDP actually describes an infinite-state Markov decision process [44], where the choice of mode for the controller is made periodically and the choice of mode for the uncontrollable environment is made probabilistically according to δ. Note that abstracting δ_γ to its support δ̂_γ = {u | δ_γ(u) > 0} turns M into a (traditional) hybrid two-player game. The inclusion of δ allows for a probabilistic refinement of the uncontrolled environment in this game. Such a refinement is irrelevant for the purposes of guaranteeing safety; however, it will be useful for optimizing the cost of operating the system. Indeed, rather than optimizing only the worst-case performance, we wish to optimize the overall expected behaviour.

Strategies. A (memoryless and possibly non-deterministic) strategy σ for the controller C is a function σ : C → 2^C, i.e. given the current configuration γ = (c, u, x), the expression σ(γ) returns the set of allowed actions in that configuration; in our setting, the actions are the controllable modes to be used for the duration of the next period.
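To make Definition 1 concrete, the following is a minimal sketch of an HMDP as a Python data structure. The types (modes as strings, continuous states as tuples of floats) and the toy instance are illustrative assumptions, not the Uppaal-internal representation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[float, ...]  # a valuation of the continuous variables X

@dataclass
class HMDP:
    C: List[str]                                   # controllable modes
    U: List[str]                                   # uncontrollable modes
    X: List[str]                                   # continuous variable names
    # F[(c, u)](tau, x) -> x': flow of x over delay tau in combined mode (c, u)
    F: Dict[Tuple[str, str], Callable[[float, State], State]]
    # delta(gamma): probability of each next uncontrollable mode,
    # for a global configuration gamma = (c, u, x)
    delta: Callable[[Tuple[str, str, State]], Dict[str, float]]

# Toy instance: one continuous variable that grows or shrinks linearly.
toy = HMDP(
    C=["acc", "dec"], U=["env"], X=["d"],
    F={("acc", "env"): lambda t, x: (x[0] + t,),
       ("dec", "env"): lambda t, x: (x[0] - t,)},
    delta=lambda gamma: {"env": 1.0},   # a single, certain environment mode
)
```

A strategy in this encoding is then simply a function mapping a configuration `(c, u, x)` to a set of modes from `toy.C`.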
Non-deterministic strategies are also called permissive since they permit many actions instead of prescribing one. The evolution of the system over time is defined as follows. Let γ = (c, u, x) and γ′ = (c′, u′, x′). We write γ −τ→ γ′ in case c′ = c, u′ = u and x′ = F_{(c,u)}(τ, x). A run is an interleaved sequence π ∈ C × (R × C × C × C)* of configurations and relative time-delays of some given period P:

π = γ_0 :: P :: α_1 :: β_1 :: γ_1 :: P :: α_2 :: β_2 :: γ_2 :: P :: ···

Then π is a run according to the strategy σ if after each period P the following sequence of discrete (instantaneous) changes is made:
1. the values of the continuous variables are updated according to the flow of the current mode, i.e. γ_{i−1} = (c_{i−1}, u_{i−1}, x_{i−1}) −P→ (c_{i−1}, u_{i−1}, x_i) =: α_i;
2. the environment changes to any possible new mode, i.e. β_i = (c_{i−1}, u_i, x_i) where δ_{α_i}(u_i) > 0;
3. the controller changes mode according to the strategy σ, i.e. γ_i = (c_i, u_i, x_i) with c_i ∈ σ(β_i).

Safety. A strategy σ is said to be safe with respect to a set of configurations S ⊆ C if for any run π according to σ all configurations encountered are within S, i.e. α_i, β_i, γ_i ∈ S for all i, and also γ′_i ∈ S whenever γ_i −τ→ γ′_i with τ ≤ P. Note that the notion of safety does not depend on the actual δ, only on its supports. Recall that almost-sure safety, i.e. safety with probability 1, coincides with sure safety. We use a guaranteed set-based Euler method introduced in [35] to ensure safety of a strategy not only at the configurations where we make decisions, but also in the continuum in between them. We refer the reader to Appendix A.2 for a brief reminder of this method.

Optimality. Under a given deterministic (i.e. permitting one action in each configuration) strategy σ, the game M becomes a completely stochastic process M|σ, inducing a probability measure on sets of runs.
In case σ is non-deterministic, or permissive, the non-determinism in M|σ is resolved uniformly at random. On such a process, we can evaluate a given optimization function. Let H ∈ N be a given time-horizon, and D a random variable on runs; then E^{M,γ}_{σ,H}(D) ∈ R_{≥0} is the expected value of D on the space of runs of M|σ of length H starting in the configuration γ. (Note that there is a bijection between the length of the run and time, as the time between each step, P, is constant.) As an example of D, consider the integrated deviation of a continuous variable, e.g. the distance between Ego and Front, with respect to a given target value. Consequently, given a (memoryless non-deterministic) safety strategy σ_safe with respect to a given safety set S, we want to find a deterministic sub-strategy σ_opt (i.e. a strategy that for every configuration returns a (non-strict) subset of the actions allowed by the safe strategy) that optimizes (minimizes or maximizes) E^{M,γ}_{σ_safe,H}(D).

2.2 Decision Trees

From the perspective of machine learning, decision trees (DT) [42] are a classification tool assigning classes to data points. A data point is a d-dimensional vector v = (v_1, v_2, ..., v_d) of features, with each v_i drawing its value from some set D_i. If D_i is an ordered set, then the corresponding feature is called ordered or numerical (e.g. velocity ∈ R); otherwise, it is called categorical (e.g. color ∈ {red, green, blue}). A (multi-class) DT can represent a function f : ∏_{i=1}^d D_i → A, where A is a finite set of classes. A (single-label) DT over the domain D = ∏_{i=1}^d D_i with labels A is a tuple T = (T, ρ, θ), where T is a finite binary tree, ρ assigns to every inner node predicates of the form x_i ∼ c with ∼ ∈ {≤, =} and c ∈ D_i, and θ assigns to every leaf node a list of natural numbers [m_1, m_2, ..., m_{|A|}].
For every v ∈ D, there exists a decision path from the root node to some leaf ℓ_v. We say that v satisfies a predicate ρ(t) if ρ(t) evaluates to true when its variables are evaluated as given by v. Given v and an inner node t with a predicate ρ(t), the decision path selects either the left or right child of t based on whether v satisfies ρ(t) or not. For v from the training set, we say that the leaf node ℓ_v contains v. Then m_a of a leaf is the number of points contained in the leaf and classified a in the training set. Further, the classes assigned by a DT to a data point v (from or outside of the training set) are given by arg max θ(ℓ_v) = {i | ∀j ≤ |A|. θ(ℓ_v)_i ≥ θ(ℓ_v)_j}, i.e. the most frequent classes in the respective leaf.

Decision trees may also predict sets of classes instead of a single class. Such a generalization (representing functions of the type ∏_{i=1}^d D_i → 2^A) is called a multi-label decision tree. In these trees, θ assigns to every leaf node a list of tuples [(n_1, y_1), (n_2, y_2), ..., (n_{|A|}, y_{|A|})], where n_a, y_a ∈ N are the numbers of data points in the leaf not labelled by class a and labelled by class a, respectively. The (multi-label) classification of a data point is then typically given by the majority rule, i.e. it is classified as a if n_a < y_a.

A DT may be constructed using decision-tree learning algorithms such as ID3 [46], C4.5 [47] or CART [10]. These algorithms take as input a training set, i.e. a set of vectors whose classes are already known, and output a DT classifier. The tree construction starts with a single root node containing all the data points of the training set. The learning algorithms explore all possible predicates p = x_i ∼ c, which split the data points of this node into two sets, X_p and X_{¬p}. The predicate that minimizes the sum of entropies of the two sets is selected.
These sets are added as child nodes to the node being split, and the whole process is repeated, splitting each node further and further until the entropy of the node becomes 0, i.e. all data points belong to the same class. Such nodes are called pure nodes. This construction is extended to the multi-label setting by some of the algorithms. A multi-label node is called pure if there is at least one class that is endorsed by all data points in that node, i.e. ∃a ∈ A : n_a = 0.

If the tree is grown until all leaves have zero entropy, then the classifier memorizes the training data exactly, leading to overfitting [42]. This might not be desirable if the classifier is trained on noisy data or if it needs to predict classes of unknown data. The learning algorithms hence provide some parameters, known as hyperparameters, which may be tuned to generalize the classifier and improve its accuracy. Overfitting is not an issue in our setup, where we want to learn the strategy function (almost) precisely. However, we can use the hyperparameters to produce even smaller representations of the function, at the "expense" of no longer being entirely precise. One hyperparameter of interest in this paper is the minimum split size k. It can be used to stop splitting nodes once the number of data points in them becomes smaller than k. By setting a larger k, the size of the tree decreases, usually at the expense of increasing the entropy of the leaves. (The entropy of a set X is H(X) = −∑_{a∈A} [p_a log_2(p_a) + (1 − p_a) log_2(1 − p_a)], where p_a is the fraction of samples in X belonging to class a; see [14] for more details.) There also exist several pruning techniques [41, 22], which remove either leaves or entire subtrees after the construction of the DT.
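The entropy-guided split selection described above can be sketched as follows for the multi-label setting. This is an illustrative toy implementation, not the paper's tooling; in particular, candidate thresholds are drawn from the data values themselves, whereas algorithms such as CART typically use midpoints between consecutive values.

```python
from math import log2

def entropy(points, actions):
    # points: list of (configuration, set-of-allowed-actions) pairs;
    # per-class binary entropy, summed over all actions
    h = 0.0
    for a in actions:
        p = sum(1 for _, acts in points if a in acts) / len(points)
        for q in (p, 1.0 - p):
            if 0.0 < q < 1.0:
                h -= q * log2(q)
    return h

def best_split(points, actions, dims):
    # try every axis-aligned predicate x_i <= c and keep the one that
    # minimizes the sum of entropies of the two resulting sets
    best = None
    for i in range(dims):
        for c in sorted({cfg[i] for cfg, _ in points}):
            lo = [p for p in points if p[0][i] <= c]
            hi = [p for p in points if p[0][i] > c]
            if not lo or not hi:
                continue
            score = entropy(lo, actions) + entropy(hi, actions)
            if best is None or score < best[0]:
                best = (score, i, c)
    return best  # (score, dimension, threshold), or None if no split exists
```

On the toy strategy of Fig. 3 below (features: distance, velocity), the best root split is on the distance dimension, separating the three {dec}-only configurations from the rest.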
2.3 Standard Uppaal Stratego Workflow

The process of obtaining an optimized safe strategy σ_opt using Uppaal Stratego is depicted as the grey boxes in Fig. 2. First, the HMDP M is abstracted into a 2-player (non-stochastic) timed game TG, ignoring any stochasticity of the behaviour. Next, Uppaal Tiga is used to synthesize a safe strategy σ_safe : C → 2^C for TG and the safety specification ϕ, which is specified using a simplified version of timed computation tree logic (TCTL) [1]. After that, the safe strategy is applied to M to obtain M|σ_safe. It is now possible to perform reinforcement learning on M|σ_safe in order to learn a sub-strategy σ_opt that will optimize a given quantitative cost, given as any run-based expression containing e.g. discrete variables, locations, clocks, hybrid variables. For more details, see [17, 19].

3 Stratego+

In this section, we present the new Uppaal Stratego+ framework and then elucidate each of its components.

3.1 New Workflow

Fig. 2: Uppaal Stratego+ workflow. The dark orange nodes are the additions to the original workflow, which now involves DT learning; the yellow-shaded area delimits the desired safe, optimal, and small strategy representations.

Uppaal Stratego+ extends the standard workflow in two ways. Firstly, in the top row, we generate the DT T_opt that exactly represents σ_opt, yielding a small representation of the strategy. The DT learning algorithm can make use of two (hyper-)parameters k and p which may be used to prune the DT; this approach is described in Section 3.4. While pruning reduces the size of the DT, the resultant tree no longer represents the strategy exactly.
Hence it is not possible to prune a DT representing a deterministic strategy, like the σ_opt described in the first row of the workflow, as safety would be violated. However, for our second extension we apply the DT learning algorithm to the non-deterministic, permissive strategy σ_safe, resulting in T^{k,p}_{σ_safe}. This DT is less permissive, and thereby smaller, since the pruning disallows certain actions; yet it still represents a safe strategy (details in Section 3.4). Next, as in the standard workflow, this less permissive safe strategy is applied to the game, and Stratego is used to get a near-optimal strategy σ^{k,p}_opt for the modified game M|T^{k,p}_{σ_safe}. In the end, we again construct a DT exactly representing the optimal strategy, namely T^{k,p}_opt.

Note that in the game restricted to T^{k,p}_{σ_safe} fewer actions are allowed than when it is restricted only to σ_safe, and hence the resulting strategy could perform worse. For example, let σ_safe allow decelerating or remaining neutral for some configuration, while T^{k,p}_{σ_safe} pruned the possibility to remain neutral and only allows decelerating. Thus, σ_opt remains neutral, whereas σ^{k,p}_opt has to decelerate and thereby increase the distance that we try to minimize.

In both cases, the resulting DT is safe by construction, since we allow the DT to predict only pure actions (actions allowed by all configurations in a leaf; see the next section for the formal definition). We convert these trees into code consisting of nested if-statements, which can easily be loaded onto embedded systems.

3.2 Representing strategies using DT

A DT with domain C and labels C can learn a (non-deterministic) strategy σ : C → 2^C. The strategy is provided as a list of tuples of the form (γ, {a_1, ..., a_k}), where γ is a global configuration and {a_1, ..., a_k} is the set of actions permitted by σ.
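The nested-if export mentioned in Section 3.1 can be sketched as a simple recursive tree walk. The dict-based node layout and the C-like output format are illustrative assumptions; the actual exported code shape is not fixed by the paper.

```python
def to_code(node, indent=0):
    # Emit a decision tree as nested if-statements (C-like text).
    # Leaves return the set of allowed (pure) actions.
    pad = "    " * indent
    if node["leaf"]:
        return f"{pad}return {sorted(node['actions'])};\n"
    var, c = node["pred"]  # predicate `var <= c`
    return (f"{pad}if ({var} <= {c}) {{\n"
            + to_code(node["true"], indent + 1)
            + f"{pad}}} else {{\n"
            + to_code(node["false"], indent + 1)
            + f"{pad}}}\n")

# A two-leaf tree in the spirit of the running example:
tree = {"leaf": False, "pred": ("distance", 6),
        "true": {"leaf": True, "actions": {"dec"}},
        "false": {"leaf": True, "actions": {"dec", "neu"}}}
code = to_code(tree)
```

The resulting string begins with `if (distance <= 6) {` and contains one `return` per leaf, which is straightforward to embed in controller code.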
The training data points are given by the integer configurations γ ∈ C (safety for non-integer points is guaranteed by the Euler method; see Section 2.1), and the set of classes for each γ is given by σ(γ). Consequently, a multi-label decision-tree learning algorithm as described in Section 2.2 can be run on this dataset to obtain a tree T_σ representing the strategy σ. Each node of the tree contains the set of configurations that satisfy the decision path traced from the root of the tree to the node. The leaf attribute θ gives, for each action a, the number of configurations in the leaf where the strategy disallows and allows a, respectively. For example, consider a node with 10 configurations and θ = [(0, 10), (2, 8), (9, 1)]. This means that the first action is allowed by all 10 configurations in the node, the second action is disallowed by 2 configurations and allowed by 8, and the third action is disallowed by 9 configurations and allowed by only 1.

Since we want the DT to exactly represent the strategy, we need to run the learning algorithm until the entropy of all the leaves becomes 0, i.e. all configurations of the leaf agree on every action. More formally, given a leaf ℓ with n configurations, we require θ(ℓ)_a = (0, n) or θ(ℓ)_a = (n, 0) for every action a. We call an action that all configurations allow a pure action.

The table on the left of Fig. 3 shows a toy strategy. Based on the values of distance d and velocity v, it permits a subset of the action set {dec, neu, acc}. A corresponding DT encoding is displayed on the right of Fig. 3.

  distance  velocity  actions
  2         51        {dec}
  3         20        {dec}
  5         30        {dec}
  7         1         {dec, neu}
  20        46        {dec, neu}
  25        25        {dec, neu, acc}
  45        70        {dec, neu}

[Figure 3, right: a decision tree with root predicate distance ≤ 6 (true branch: leaf θ = [(0, 3), (3, 0), (3, 0)]), whose false branch tests distance ≤ 22.5 (true branch: leaf θ = [(0, 2), (0, 2), (2, 0)]), whose false branch tests distance ≤ 35 (true branch: leaf θ = [(0, 1), (0, 1), (0, 1)]; false branch: leaf θ = [(0, 1), (0, 1), (1, 0)]).]

Fig.
3: A sample dataset (left) and a (multi-label) decision tree generated from the dataset (right). The leaf nodes contain the lists of tuples assigned by θ; the inner nodes contain the predicates assigned by ρ.

3.3 Interpreting a DT as a strategy

To extract a strategy from a DT, we proceed as follows. Given a configuration C, we pick the leaf ℓ_C associated with it by evaluating the predicates and following a path through the DT. Then we compute θ(ℓ_C) = [(n_1, y_1), (n_2, y_2), ..., (n_{|A|}, y_{|A|})], where n_a, y_a ∈ N are the numbers of data points in the leaf not labelled by class a and labelled by class a, respectively. The classes assigned to ℓ_C are exactly its pure actions, i.e. {a | (0, y_a) ∈ θ(ℓ_C)}.

Note that allowing only pure actions is necessary in order to preserve safety. We do not follow the common (machine-learning) way of assigning classes to the nodes based on the majority criterion, i.e. that the majority of the data points in the node allow the action, because then the decision tree might prescribe unsafe actions just because they were allowed in most of the configurations in the node. This is also the reason why the DT-learning algorithm described in the previous section needs to run until the entropy of all leaves becomes 0.

3.4 Learning smaller, yet safe DT

We now describe how to learn a DT for a safe strategy that is smaller than the exact representation, but still preserves safety. A tree obtained using off-the-shelf DT learning algorithms is unlikely to exactly represent the original strategy. We use two different methods to achieve the goal: firstly, we use the standard hyperparameter named minimum split size, and secondly, we introduce a new post-processing algorithm called safe pruning. Both methods rely on the given strategy being non-deterministic/permissive, i.e. permitting several actions in a leaf.
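The pure-action extraction of Section 3.3 is a one-liner over the leaf attribute θ; the sketch below uses action indices as class names for illustration.

```python
def pure_actions(theta):
    # theta: list of (n_a, y_a) per action; an action is assigned iff it is
    # pure, i.e. no configuration in the leaf disallows it (n_a == 0),
    # never by majority vote
    return {a for a, (n_a, y_a) in enumerate(theta) if n_a == 0 and y_a > 0}
```

For the example leaf of Section 3.2 with θ = [(0, 10), (2, 8), (9, 1)], only the first action is pure, even though the second is allowed by a majority of the configurations.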
(This is because DT learning algorithms are usually configured to avoid overfitting on the dataset.)

(1) Using minimum split size. The splitting process can be stopped before the entropy becomes 0. We do this by introducing a parameter k, which determines the minimum number of data points required in a node to consider splitting it further. During the construction of the tree, a node is usually split if its entropy is greater than 0. When k is set to an integer greater than 2, a node is split only if both its entropy is greater than 0 and the number of data points (configurations) in the node is at least k. The strategy given by such a tree is safe as long as it predicts only pure actions, i.e. actions a with n_a = 0. In order to obtain a fully expanded tree, k may be set to 2 (in nodes with fewer than 2 configurations, there is nothing to split). For larger k, the number of pure actions in the leaves decreases. Ultimately, for too large a k, we would obtain a tree that has some leaf nodes not containing any pure actions. In such a case, the strategy represented by the DT would not be well-defined, as for some data point no action could be picked. However, this can be detected immediately during the construction.

[Figure 4: Illustration of safe pruning applied to a node with predicate x ≤ 5 and two leaf children. Leaf A has θ-entries (0, 7): dec, (7, 0): neu, (7, 0): acc; leaf B has (0, 7): dec, (0, 7): neu, (3, 4): acc. The pure action of leaf A is just dec; for B it is both dec and neu. Safe pruning replaces the nodes with a single leaf C with (0, 14): dec, (7, 7): neu, (10, 4): acc, where only dec is a pure action.]
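The effect of the minimum split size k can be illustrated on a tiny 1-D tree builder: a node is split only if it is impure and holds at least k points. The median-value cut and the dict node layout are simplifying assumptions for the sketch (CART explores all thresholds, as in Section 2.2), but the stopping rule is the one described above.

```python
def build(points, k):
    # points: list of (x, set-of-allowed-actions); returns a tree whose
    # "size" field counts its nodes
    label_sets = {frozenset(acts) for _, acts in points}
    if len(points) < k or len(label_sets) == 1:
        return {"leaf": True, "size": 1}   # pure, or too small to split
    xs = sorted(x for x, _ in points)
    cut = xs[len(xs) // 2]                  # median value as the threshold
    lo = [p for p in points if p[0] <= cut]
    hi = [p for p in points if p[0] > cut]
    if not lo or not hi:                    # degenerate cut: stop splitting
        return {"leaf": True, "size": 1}
    left, right = build(lo, k), build(hi, k)
    return {"leaf": False, "size": 1 + left["size"] + right["size"]}
```

On the distances of the Fig. 3 toy strategy, raising k from 2 to 5 shrinks the tree from 7 nodes to 3, at the price of impure (less permissive, possibly action-less) leaves.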
Algorithm 1 Safe Pruning
1: procedure Safe-Pruning(DT T_σ = (T, ρ, θ), p ∈ N)
2:   for i ← 1..p do
3:     N ← {n ∈ T | LEFT(n) and RIGHT(n) are leaves}
4:       ▷ candidate nodes for pruning
5:     for each n ∈ N do
6:       c_l ← LEFT(n), c_r ← RIGHT(n)
7:       if θ(c_l) ∩ θ(c_r) ≠ ∅ then
8:         ▷ prune and keep the common classification
9:         Convert n to a leaf node
10:        θ(n) ← θ(c_l) ∩ θ(c_r)
11:        Remove c_l and c_r from T

(2) Using safe pruning. Another way of obtaining a smaller tree is a procedure that prunes the leaves of the produced tree by merging them while preserving safety. For example, consider the decision node on the left of Figure 4 with two children that are leaves A and B. For A, only the action dec is pure (i.e. allowed by all configurations in the leaf), while for B both dec and neu are pure. Since the sets of pure actions of the two leaf nodes intersect, we can safely remove both A and B and replace the decision node with a new leaf node C that contains only those actions that are in the intersection, in this case only dec. Algorithm 1 describes the pruning process formally. If θ returns only safe actions, then the tree obtained after pruning is guaranteed to represent a safe strategy, although a less permissive one. The algorithm may be run for multiple (possibly 0) rounds, denoted by p, at most until we get a "fully pruned" tree representing a safe but deterministic strategy.

We denote by T^{k,p}_{σ_safe} the decision tree for σ_safe constructed by only splitting nodes with k or more data points, followed by p rounds of safe pruning. Clearly, the more permissive the original strategy is, the more we can prune using safe pruning. When generating T^{k,p}_{σ_safe}, we use a modified implementation of the CART decision-tree learning algorithm implemented in the DecisionTreeClassifier class of the Python-based machine learning library Scikit-learn [43].
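One round of Algorithm 1 can be sketched on a dict-based tree where each leaf stores its set of pure actions directly (a simplification of θ): an inner node whose two children are both leaves is merged into one leaf exactly when their pure-action sets intersect.

```python
def safe_prune(node, p):
    # p rounds of safe pruning, as in Algorithm 1
    for _ in range(p):
        node = _prune_round(node)
    return node

def _prune_round(node):
    if node["leaf"]:
        return node
    lo, hi = node["lo"], node["hi"]
    if lo["leaf"] and hi["leaf"]:           # candidate node for pruning
        common = lo["pure"] & hi["pure"]
        if common:                          # keep the common classification
            return {"leaf": True, "pure": common}
        return node                         # disjoint pure actions: keep split
    return {"leaf": False, "pred": node["pred"],
            "lo": _prune_round(lo), "hi": _prune_round(hi)}
```

Applied once to the Fig. 4 node (A pure in {dec}, B pure in {dec, neu}, predicate x ≤ 5), it collapses to a single leaf allowing only dec; a node whose children share no pure action is left untouched, so safety is never traded away.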
Since we construct the DT from a safe strategy, as long as we let the DT-encoded strategy have at least one pure action in each leaf, the strategy will remain safe. With this in mind, we can freely change the parameters of the DecisionTreeClassifier class. However, in our experiments we picked only the minimum split size k from the Scikit parameters as a demonstrative example, as well as our newly introduced p. The methods described in this paper would work with other parameters as well.

3.5 Comparing DTs to Binary Decision Diagrams

A Binary Decision Diagram (BDD, e.g. [11]) is a popular data structure that can be used to represent Boolean functions f : B^n → B. It may also be used to represent strategies by encoding configurations and actions into a suitable form via bit-blasting, i.e. converting them into propositional formulae. For example, the configuration-action pair ((x = 6, y = 2), a_0) can be represented as (x_2 ∧ x_1 ∧ ¬x_0 ∧ ¬y_2 ∧ y_1 ∧ ¬y_0 ∧ a_0), if it is known that the maximum value that x and y can take is less than 8 (3 bits). A strategy can be seen as a disjunction ⋁_{γ, a ∈ σ(γ)} (γ, a) of all configuration-action pairs (γ, a) permitted by the strategy σ. Such an encoding allows for an easy conversion into a BDD.

Though theoretically straightforward, there are some practical concerns involved in constructing the BDD. Mainly, the ordering of the variables in the BDD can drastically change its size. While computing the optimal ordering so as to have the smallest BDD is an NP-complete problem [3], various heuristics exist that can be used to get better orderings. We use the CUDD package [54] to construct the BDD, along with Rudell's sifting reordering technique [50].

The main disadvantage of DTs compared to BDDs is that isomorphic subgraphs are not merged (DTs are trees, BDDs are directed acyclic graphs); and even if merging were allowed, it would not save much.
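The bit-blasting step described above can be sketched as follows; the string-literal representation (with "!" for negation) is an illustrative stand-in for the propositional formula fed to the BDD package.

```python
def bit_blast(name, value, bits):
    # Encode a bounded integer as one literal per bit, most significant
    # bit first; "!" marks a negated literal in this toy representation.
    return [f"{name}{i}" if (value >> i) & 1 else f"!{name}{i}"
            for i in reversed(range(bits))]

# The paper's example pair ((x = 6, y = 2), a0) with 3 bits per variable:
enc = bit_blast("x", 6, 3) + bit_blast("y", 2, 3) + ["a0"]
```

The conjunction of the literals in `enc` is exactly the cube (x_2 ∧ x_1 ∧ ¬x_0 ∧ ¬y_2 ∧ y_1 ∧ ¬y_0 ∧ a_0) from the text; the full strategy is the disjunction of one such cube per permitted configuration-action pair.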
Indeed, since a DT may choose different predicates on the same level (which is an advantage in contrast to a BDD with a fixed variable ordering), isomorphic subgraphs occur rarely. There are further advantages of DTs, related to learning, that make them more compact than BDDs in some contexts, e.g. [8, 9]. Firstly, they can be learnt fast, using the entropy-based heuristic, compared to the graph processing and variable re-ordering of BDDs. Secondly, a DT can ignore "don't-care" inputs; these inputs are encodings of things that are not valid configuration-action pairs, in the sense that either the action is not available in the configuration or that it is not a valid configuration at all. In contrast, a BDD has to explicitly either allow or disallow these inputs. Thirdly, DT learning can also be used to represent the strategy imprecisely using a smaller DT, which can be model checked for safety. For the modifications described in Section 3.4, we do not even need to re-verify safety, because this property is preserved by both our size reduction techniques. Fourthly, a DT can use a much wider class of predicates, compared to single-bit tests for a bit representation in a BDD. This final point is also a reason (together with the smaller size) why a DT is a more understandable representation than a BDD [8, 9]. We also illustrate this point on a case study in Remark 1.

4 Case Studies and Experimental Results

In this section, we evaluate the techniques discussed above on three different case studies: (1) the adaptive cruise control model introduced in the motivation; (2) a two-tank case study introduced in [30]; and (3) the heating system of a two-room apartment adapted from [26]. Table 1 compares representations for our case studies obtained in different ways. We discuss results for the three case studies, denoted cruise, twotanks, and tworooms respectively.
Additionally, the first line displays cruise without the integrated Euler method, to illustrate the effect of the Euler method on the final size. All the representations are safe and as optimal as σopt produced by Uppaal Stratego. For each of the models, we display the following information: the third column lists the number of items in the explicit list representation of σopt output by Uppaal Stratego. The fourth column lists the number of those items that are actually relevant, i.e. sets of configurations where an actual decision is to be made. The fifth and sixth columns list the sizes of the BDD and DT representations learnt from σopt, i.e. the upper path in Fig. 2. For BDDs, since the initial ordering plays a role in the size of the final result despite applying the re-ordering heuristics, we ran 40 experiments for each model with random initial variable orderings. For creating BDDs, we used the free Python library tulip-control/dd as an interface to CUDD. We conclude that both BDDs and DTs reduce the size by several orders of magnitude. DTs are slightly better in all cases, and 2 orders of magnitude smaller in the tworooms model. Note that reliably achieving good results when constructing the BDD relies on repeating the construction several times; since constructing a single BDD and applying the heuristics [50] already took roughly 10 times longer than DT learning, a DT can be obtained one or two orders of magnitude faster than a BDD, depending on how many times one tries constructing the BDD. Further, for the two tanks, only the DT realizes that the strategy is actually trivial. The main reason why the BDD fails to spot this is its inability to ignore "don't-care" inputs, addressed in Section 3.5. Table 2 shows how the size of the DT can be further reduced by the bottom path of Fig. 2, when the "exact representation" criterion is relaxed.
It displays the performance, i.e. the aggregated distance to the Front car, and the size of T^{k,p}_opt for different combinations of the pruning parameters k and p.

Table 1: Sizes of the different representations: explicit list as output by Uppaal Stratego, the relevant part of the list, BDD displaying [minimum/median/maximum] over the 40 trials, and DT according to the upper path in Fig. 2.

Model             #Variables  Stratego list  Relevant list  BDD [min/med/max]       DT T_opt size
cruise non-Euler           5      1,790,034        308,216  [3,718/5,066/5,890]             2,899
cruise                     7      5,931,154        304,752  [3,470/4,728/4,742]             2,713
twotanks                   9         23,182         23,182  [65/69/91]                          1
tworooms                  11      1,924,708        509,715  [16,370/20,214/25,909]            487

Table 2: Tables displaying the number |T^{k,p}_opt| of nodes of T^{k,p}_opt (left) and the expected performance E^{M,γ}_{σ,H}(D) (right) for various k and p, i.e. using the bottom path of Fig. 2, for the cruise model. Higher performance corresponds to a lower number.

Size |T^{k,p}_opt|:
Min split size (k)   p = 0   p = 1   p = 2
  2                  2,713   1,725   1,267
 10                  2,705   1,733   1,249
 20                  2,667   1,733   1,131
 30                  2,657   1,695     993
 40                  2,627   1,669   1,015
 50                  2,557   1,695   1,003
 60                  2,635   1,489     963
 70                  2,613   1,441     955
 80                  2,519   1,537     915
 90                  2,455   1,323     923
100                  1,929   1,023     877

Performance E^{M,γ}_{σ,H}(D):
Min split size (k)   p = 0    p = 1    p = 2
  2                  2,627    3,618    4,240
 10                  2,696    3,596    4,210
 20                  2,778    3,625   14,039
 30                  2,778    3,589   14,108
 40                  2,778    3,600   14,096
 50                  2,825    3,614   14,037
 60                  2,905    3,673   14,074
 70                  2,898    3,714   14,095
 80                  2,907    3,717   14,092
 90                  3,006    3,741   14,077
100                  3,030   14,061   14,292

Recall that using no pruning (k = 2, p = 0) yields the same DT as the upper path of Fig. 2, i.e. T^{2,0}_opt = T_opt. We observed that for cruise, increasing the values of k and p buys a reduction in the size of the DT against a reduction in performance.
For instance, using k = 80, p = 0, one can decrease the size to 2485 (by 8.4%) while deteriorating the performance to 2907 (by 10%). Allowing for half the performance (double the aggregated distance), one can make the DT even smaller than half of its original size, e.g. by setting k = 10, p = 2. The shading and colouring of the table display different "trade-off zones", each with comparable savings/losses. The same conclusions hold for cruise non-Euler; see the similar Table 4 in Appendix A.3. For tworooms (Table 3), the best performance is observed not with k = 2, p = 0, but with k = 50, p = 0. We conjecture that the less permissive safe strategy assists Stratego in performing the optimisation faster by reducing the size of the search space. As a result, here we get a both smaller and more performant strategy. In the case of twotanks, T_opt already has only a single node, hence no further reductions are possible.

Table 3: Tables displaying the number |T^{k,p}_opt| of nodes of T^{k,p}_opt (left) and the expected performance E^{M,γ}_{σ,H}(D) (right) for various k and p, i.e. using the bottom path of Fig. 2, for the tworooms model. Higher performance corresponds to a lower number.

Size |T^{k,p}_opt|:
Min split size (k)   p = 0   p = 1   p = 2   p = 3
  2                    543     403     283     191
 10                    525     387     271     185
 50                    497     365     251     171
125                    445     317     219     151
250                    387     265     179     123
500                    323     211     139      97
750                    277     175     111      77

Performance E^{M,γ}_{σ,H}(D):
Min split size (k)   p = 0   p = 1   p = 2   p = 3
  2                  2,096   2,353   2,821   3,156
 10                  2,156   2,460   3,285   3,283
 50                  1,989   2,778   3,287   3,281
125                  2,374   2,053   3,280   3,284
250                  2,283   2,071   3,288   3,282
500                  2,563   2,155   3,280   3,282
750                  2,333   2,210   3,279   3,286

Remark 1. Interestingly, domain knowledge can reduce the DT size further and make the representation more understandable. Indeed, for the cruise model we were able to construct a DT with only 25 nodes, designing our predicates based on the car kinematics.
For example, the expected time until the front car reaches minimal velocity if it only decelerates from now on (1) plays an important role in the decision making and (2) can be easily expressed by solving the standard kinematics equation v(t) = v_current − a_dec · t. The resulting DT (illustrated in Appendix A.4) is thus very small and easy to interpret, as each of the few nodes has a clear kinematic interpretation. DTs thus open the possibility for strategy representation to profit from predicate/invariant synthesis.

5 Conclusion

We have provided a framework for producing small representations of safe and (near-)optimal strategies, without compromising safety. As to (near-)optimality, we can choose between two options: (i) not compromising it, or (ii) finding a suitable trade-off between compromising it (causing drops of performance) and additional size reductions. Compared to the original sizes, we achieve orders-of-magnitude reductions, allowing for efficient usage of the strategies in e.g. embedded devices. Compared to the BDD representation, the size of the DT representation is smaller and can be computed faster; additionally, trivial solutions are represented by trivial DTs. DTs are also more readable, as argued in [9, 8]. A detailed examination of the latter point in the hybrid context remains future work. Further, candidates for more complex predicates could be automatically generated based on given domain knowledge or learnt from the data, similarly to invariants from program runs [24, 53]. As illustrated in Remark 1, this could lead to further reductions in size and improved understandability. Additionally, isomorphic/similar subtrees could be merged as in decision diagrams, and further optimizations for algebraic decision diagrams [57] could be employed. Finally, we plan to visualize the DT representation of the strategies directly in Uppaal Stratego+ for convenience of the users.

References

1. G. Behrmann, A.
Cougnard, A. David, E. Fleury, K. G. Larsen, and D. Lime. Uppaal-Tiga: Time for playing games! In CAV, 2007.
2. J. Bernet, D. Janin, and I. Walukiewicz. Permissive strategies: from parity games to safety games. ITA, 2002.
3. B. Bollig and I. Wegener. Improving the variable ordering of OBDDs is NP-complete. IEEE Transactions on Computers, 1996.
4. C. Boutilier, T. L. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. J. Artif. Intell. Res., 11:1–94, 1999.
5. C. Boutilier and R. Dearden. Approximating value trees in structured dynamic programming. In ICML, 1996.
6. C. Boutilier, R. Dearden, and M. Goldszmidt. Exploiting structure in policy construction. In IJCAI, 1995.
7. P. Bouyer, N. Markey, J. Olschewski, and M. Ummels. Measuring permissiveness in parity games: Mean-payoff parity games revisited. In ATVA, 2011.
8. T. Brázdil, K. Chatterjee, M. Chmelik, A. Fellner, and J. Křetínský. Counterexample explanation by learning small strategies in Markov decision processes. In CAV, 2015.
9. T. Brázdil, K. Chatterjee, J. Křetínský, and V. Toman. Strategy representation by decision trees in reactive synthesis. In TACAS, 2018.
10. L. Breiman. Classification and regression trees. Routledge, 2017.
11. R. E. Bryant. Symbolic manipulation of boolean functions using a graphical representation. In DAC, 1985.
12. J. C. Butcher and N. Goodwin. Numerical methods for ordinary differential equations. Wiley, 2008.
13. D. Chapman and L. P. Kaelbling. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In IJCAI. Morgan Kaufmann, 1991.
14. A. Clare and R. D. King. Knowledge discovery in multi-label phenotype data. In Principles of Data Mining and Knowledge Discovery, 2001.
15. A. Le Coënt, J. A. D. Sandretto, A. Chapoutot, and L. Fribourg. An improved algorithm for the control synthesis of nonlinear sampled switched systems. Formal Methods in System Design, 53(3):363–383, 2018.
16. A. David, D. Du, K. G. Larsen, M. Mikucionis, and A. Skou. An evaluation framework for energy aware buildings using statistical model checking. SCIENCE CHINA Information Sciences, 55(12):2694–2707, 2012.
17. A. David, P. G. Jensen, K. G. Larsen, A. Legay, D. Lime, M. G. Sørensen, and J. H. Taankvist. On time with minimal expected cost! In ATVA, 2014.
18. A. David, P. G. Jensen, K. G. Larsen, M. Mikucionis, and J. H. Taankvist. Uppaal Stratego. In TACAS, 2015.
19. A. David, P. G. Jensen, K. G. Larsen, M. Mikučionis, and J. H. Taankvist. Uppaal Stratego. In TACAS, 2015.
20. L. de Alfaro, M. Z. Kwiatkowska, G. Norman, D. Parker, and R. Segala. Symbolic model checking of probabilistic processes using MTBDDs and the Kronecker representation. In TACAS, 2000.
21. K. Dräger, V. Forejt, M. Z. Kwiatkowska, D. Parker, and M. Ujma. Permissive controller synthesis for probabilistic systems. In TACAS, 2014.
22. F. Esposito, D. Malerba, and G. Semeraro. Decision tree pruning as a search in the state space. In ECML, 1993.
23. A. Fehnker and F. Ivančić. Benchmarks for hybrid systems verification. In HSCC, 2004.
24. P. Garg, C. Löding, P. Madhusudan, and D. Neider. ICE: A robust framework for learning invariants. In CAV, 2014.
25. A. Girard. Controller synthesis for safety and reachability via approximate bisimulation. Automatica, 48(5):947–953, 2012.
26. A. Girard. Low-complexity quantized switching controllers using approximate bisimulation. Nonlinear Analysis: Hybrid Systems, 10:34–44, 2013.
27. A. Girard and S. Martin. Synthesis for constrained nonlinear systems using hybridization and robust controllers on simplices. IEEE Trans. Automat. Contr., 57(4):1046–1051, 2012.
28. E. M. Hahn, G. Norman, D. Parker, B. Wachter, and L. Zhang. Game-based abstraction and controller synthesis for probabilistic hybrid systems. In QEST, 2011.
29. H. Hermanns, M. Z. Kwiatkowska, G. Norman, D. Parker, and M. Siegle. On the use of MTBDDs for performability analysis and verification of stochastic systems. J. Log. Algebr. Program., 56(1-2):23–67, 2003.
30. I. A. Hiskens. Stability of limit cycles in hybrid systems. In HICSS, 2001.
31. J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. SPUDD: Stochastic planning using decision diagrams. In UAI, 1999.
32. M. Kearns and D. Koller. Efficient reinforcement learning in factored MDPs. In IJCAI, 1999.
33. D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In IJCAI, 1999.
34. N. Kushmerick, S. Hanks, and D. Weld. An algorithm for probabilistic least-commitment planning. In AAAI, 1994.
35. K. G. Larsen, A. Le Coënt, M. Mikučionis, and J. Taankvist. Guaranteed control synthesis for continuous systems in Uppaal Tiga. In CyPhy, 2018.
36. K. G. Larsen, M. Mikucionis, and J. H. Taankvist. Safe and optimal adaptive cruise control. In Correct System Design, 2015.
37. A. Le Coënt, F. De Vuyst, L. Chamoin, and L. Fribourg. Control synthesis of nonlinear sampled switched systems using Euler's method. In SNR, 2017.
38. S. Liu, A. Panangadan, A. Talukder, and C. S. Raghavendra. Compact representation of coordinated sampling policies for body sensor networks. In 2010 IEEE Globecom Workshops, 2010.
39. R. Majumdar, E. Render, and P. Tabuada. Robust discrete synthesis against unspecified disturbances. In HSCC, 2011.
40. A. S. Miner and D. Parker. Symbolic representations and analysis of large probabilistic systems. In Validation of Stochastic Systems, Lecture Notes in Computer Science. Springer, 2004.
41. J. Mingers. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227–243, 1989.
42. T. M. Mitchell. Machine Learning. McGraw-Hill, Inc., 1997.
43. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. VanderPlas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
44. M. L. Puterman. Markov Decision Processes. J. Wiley and Sons, 1994.
45. L. D. Pyeatt. Reinforcement learning with decision trees. In Applied Informatics, pages 26–31, 2003.
46. J. R. Quinlan. Induction of decision trees. Machine Learning, 1986.
47. J. R. Quinlan. C4.5: Programs for Machine Learning. 1993.
48. P. J. Riddle, R. Segal, and O. Etzioni. Representation design and brute-force induction in a Boeing manufacturing domain. Applied Artificial Intelligence, 8, 1994.
49. P. Roy, P. Tabuada, and R. Majumdar. Pessoa 2.0: a controller synthesis tool for cyber-physical systems. In HSCC, 2011.
50. R. Rudell. Dynamic variable ordering for ordered binary decision diagrams. In ICCAD, 1993.
51. M. Rungger and M. Zamani. SCOTS: A tool for the synthesis of symbolic controllers. In HSCC, 2016.
52. A. Saoud, A. Girard, and L. Fribourg. On the composition of discrete and continuous-time assume-guarantee contracts for invariance. In ECC, 2018.
53. R. Sharma, S. Gupta, B. Hariharan, A. Aiken, and A. V. Nori. Verification as learning geometric concepts. In SAS, 2013.
54. F. Somenzi. CUDD: CU decision diagram package release 2.4.2. 2009.
55. M. Svoreňová, J. Křetínský, M. Chmelík, K. Chatterjee, I. Černá, and C. Belta. Temporal logic control for stochastic linear systems using abstraction refinement of probabilistic games. Nonlinear Analysis: Hybrid Systems, 23:230–253, 2017.
56. R. Wimmer, B. Braitling, B. Becker, E. M. Hahn, P. Crouzen, H. Hermanns, A. Dhama, and O. Theel. Symblicit calculation of long-run averages for concurrent probabilistic systems.
In QEST, 2010.
57. I. S. Zapreev, C. Verdier, and M. Mazo. Optimal symbolic controllers determinization for BDD storage. In ADHS, 2018.

A Appendix

A.1 Description of case studies

Two rooms model. This case study is based on a simple model of a two-room apartment, heated by one heater in each room. In this example, the objective is to control the temperature of both rooms by switching the heaters in the rooms on or off. There is heat exchange between the two rooms and with the environment (external temperature, which varies randomly within a set).

Cruise Control. Two cars Ego and Front are driving on a road as shown in Figure 1. We are capable of controlling Ego but not Front. Both cars can drive a maximum of 20 m/s forward and a maximum of 10 m/s backwards. The cars have three different possible accelerations: −2 m/s², 0 m/s² and 2 m/s², between which they can switch instantly. For the cars to be safe, there should be a distance of at least 5 m between them. Any distance less than 5 m between the cars is considered unsafe. Ego's sensors can detect the position of Front only within 200 metres. If the distance between the cars is more than 200 metres, then Front is considered to be far away. In this example, the aim is to synthesize a strategy for the controllable car Ego such that it always stays far enough from the uncontrollable car Front. For the adaptive cruise control model, we differentiate between the model that is only safe at time points where an action is possible, called cruise non-Euler, and the model improved with the Euler method to be safe at all time points between two actions, called just cruise.

Two Tanks. The two-tank system, illustrated in Figure 5, is a linear example taken from [30]. The system consists of two tanks and two valves. The first valve (Q1 in Fig. 5) adds to the inflow of tank 1 and the second valve (Q2) is a drain valve for tank 2.
There is a constant outflow from tank 2 caused by a pump (QB), as well as a constant inflow into tank 1 (Q0). There is also a flow from tank 1 to tank 2 (QA) which depends on the water level in tank 1. The system is linearised at a desired operating point. The objective is to keep the water level in both tanks within limits using a discrete open/close switching strategy for the valves.

Fig. 5: The two tank example.

A.2 Guaranteed over-approximation using the Euler Method

We briefly recall the technique used in [35] to compute strategies that are safe at all time points, not just at multiples of the period P. Let us consider an HMDP M = (C, U, X, F, δ) where the dynamics of each continuous variable x_i ∈ X is subject to a differential equation of the form

    ẋ_i = f_i(c, u, x),                                   (1)

with c ∈ C and u ∈ U. For every γ = (c, u, x) and τ ∈ R, the update γ →_τ (c, u, F_(c,u)(τ, x)) must match the solution of the differential equation (1), i.e.

    F_(c,u)(τ, x) = x + ∫₀^τ f(c, u, x) dt.               (2)

It is easy to compute the solution of a linear differential equation [12]. However, in the general case, computing the integral on the right-hand side of (2) is not possible exactly. Approximate solutions can be obtained with the use of numerical schemes, such as Euler or Runge-Kutta schemes, but in order to ensure absolute safety of the system, they should be associated with guaranteed error bounds. Furthermore, one has to ensure that the system is safe between time steps (γ' ∈ S whenever γ →_τ γ' with τ ≤ P). In order to achieve that, [35] proceeds as follows. First, the given HMDP M = (C, U, X, F, δ) is extended into an HMDP M' = (C, U, (X_min, X_max), F', δ'), where X_min = (x_1^min, ..., x_n^min) and X_max = (x_1^max, ..., x_n^max) form a lower and an upper bound on the actual valuation of X at all times, and δ' is an abstract transition function for the opponent.
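The idea of a guaranteed (set-based) Euler step can be illustrated by a minimal sketch; here f is a hypothetical scalar dynamics and err a hypothetical precomputed error bound ([35, 37] derive sound bounds from the actual dynamics, which this toy fragment does not attempt):

```python
def interval_euler_step(f, x_min, x_max, tau, err):
    """One set-based Euler step: propagate lower/upper bounds on x over time tau.
    Simplifying assumption: the Euler map x -> x + tau*f(x) is nondecreasing in x
    on [x_min, x_max]; err bounds the local error of the Euler scheme."""
    lo = x_min + tau * f(x_min) - err
    hi = x_max + tau * f(x_max) + err
    return lo, hi

# Example with the stable linear dynamics x' = -0.5 * x:
f = lambda x: -0.5 * x
lo, hi = interval_euler_step(f, 1.0, 2.0, tau=0.1, err=0.01)
# Under the assumptions above, every trajectory starting in [1.0, 2.0]
# lies within [lo, hi] at time 0.1.
```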
From a given global configuration γ' = (c, u, (x_min, x_max)) of M', δ' enables all possible transitions that have non-zero probability from γ = (c, u, x) in M for some value of x between x_min and x_max, i.e. δ' = {u | δ_(c,u,x)(u) > 0, x_min ≤ x ≤ x_max}. Then the flow function F'_(c,u) : R_{>0} × R^(X_min,X_max) → R^(X_min,X_max) implements a guaranteed (set-based) Euler method which ensures the following property: whenever
– (w_min, w_max) = F'_(c,u)(P, (v_min, v_max)) and
– w = F_(c,u)(t, v) for some t ≤ P and v_min ≤ v ≤ v_max,
then also w_min ≤ w ≤ w_max. For further details, we refer the reader to [37, 35]. From the augmented HMDP M', we can build a standard timed game [35] TG by under-approximating X_min to its floor integer part, and X_max to its ceiling integer part. Consequently, one can use Uppaal Tiga to synthesize a strategy that ensures the safety of the initial HMDP. A simulation showing the bounding of the continuous trajectory of M by its timed-game abstraction TG is given in Fig. 6.

Fig. 6: Simulation of the distance between the cars; the red line is the continuous trajectory of the HMDP M, the blue and green lines are the integer trajectories of the timed game TG bounding it within time.

A.3 More experimental results

Table 4 shows the trade-off between size and performance for the cruise non-Euler model. Figure 7 shows a simulation run of tworooms.

Table 4: Tables displaying the number |T^{k,p}_opt| of nodes of T^{k,p}_opt (left) and the expected performance E^{M,γ}_{σ,H}(D) (right) for various k and p, i.e. using the bottom path of Fig. 2, for the cruise non-Euler model. A lower number corresponds to a higher performance.
Size |T^{k,p}_opt|:
Min split size (k)   p = 0   p = 1   p = 2
  2                  2,977   1,693   1,329
 10                  2,981   1,685   1,329
 20                  2,911   1,739   1,289
 30                  2,919   1,735   1,127
 40                  2,847   1,649   1,055
 50                  2,839   1,665   1,003
 60                  2,579   1,629     961
 70                  2,543   1,513     939
 80                  2,467   1,463     909
 90                  2,427   1,401     845
100                  2,419   1,041     897

Performance E^{M,γ}_{σ,H}(D):
Min split size (k)   p = 0    p = 1    p = 2
  2                  1,855    2,744    3,592
 10                  1,889    2,735    3,597
 20                  1,932    2,845   13,448
 30                  1,937    2,843   13,490
 40                  1,975    2,844   13,494
 50                  1,976    2,859   13,450
 60                  2,015    2,843   13,637
 70                  2,018    2,862   10,145
 80                  2,050    2,976   13,685
 90                  2,107    3,024   13,716
100                  2,211   11,532   13,876

Fig. 7: Simulation of the temperature of Room 2, rT2 (red line). The blue and green lines are the integer trajectories of the timed game TG bounding it within time.

A.4 Handcrafted Small DT for Adaptive Cruise Control

Listing 1.1: Handcrafted C code describing the safe and optimal strategy for cruise

    #include <assert.h>
    #include <stdbool.h>

    float check(int d, int vEgo, int vFront, int aEgo, int aFront);
    float d_t(float d, float vEgo, float vFront, float aEgo, float aFront, float t);

    bool strategy(int act, int vEgo, int d, int vFront, int aFront, int aEgo) {
        if (check(d, vEgo, vFront, 2, aFront) > 5) {
            return act == 1; /* accelerate */
        } else if (check(d, vEgo, vFront, 0, aFront) > 5) {
            return act == 0; /* neutral */
        } else {
            assert(check(d, vEgo, vFront, -2, aFront) > 5);
            return act == 2; /* decelerate */
        }
    }

    float check(int d, int vEgo, int vFront, int aEgo, int aFront) {
        /* One unit of aEgo while Front plays the worst case */
        float d1, d2, d3, t1, t2, vFrontNext, vEgoNext;
        t1 = (vFront + 10) / 2.0f; /* time for Front to reach -10 m/s at -2 m/s^2 */
        if (t1 > 0.5) {
            d1 = d_t(d, vEgo, vFront, aEgo, -2, 1);
            vFrontNext = vFront - 2;
            vEgoNext = vEgo + aEgo;
        } else {
            d1 = d_t(d, vEgo, vFront, aEgo, 0, 1);
            vFrontNext = vFront;
            vEgoNext = vEgo + aEgo;
        }
        /* Decelerating while Front decelerates to -10 */
        if (t1 > 1) {
            d2 = d_t(d1, vEgoNext, vFrontNext, -2, -2, t1 - 1);
            vEgoNext = vEgoNext - 2 * (t1 - 1);
        } else {
            d2 = d1;
        }
        /* Decelerating while Front stays at -10 */
        t2 = (vEgoNext + 10) / 2; /* time for Ego to reach -10 m/s at -2 m/s^2 */
        if (t2 > 0) {
            d3 = d_t(d2, vEgoNext, -10, -2, 0, t2);
        } else {
            d3 = d2;
        }
        return d3;
    }

    float d_t(float d, float vEgo, float vFront, float aEgo, float aFront, float t) {
        return d + 0.5 * (aFront - aEgo) * t * t + (vFront - vEgo) * t;
    }
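The helper d_t is just the constant-acceleration kinematics for the relative distance between the two cars. A quick plausibility check of the formula (a sketch, not part of the paper's code):

```python
def d_t(d, v_ego, v_front, a_ego, a_front, t):
    # Relative distance after t seconds under constant accelerations,
    # mirroring the C helper from Listing 1.1.
    return d + 0.5 * (a_front - a_ego) * t * t + (v_front - v_ego) * t

# Equal velocities and accelerations: the gap stays constant.
assert d_t(100, 10, 10, 0, 0, 5) == 100
# Front decelerates at 2 m/s^2 while Ego keeps speed: the gap shrinks quadratically.
print(d_t(100, 10, 10, 0, -2, 5))  # 75.0
```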
