Safe Reinforcement Learning Using Robust MPC
Authors: Mario Zanon, Sebastien Gros
Abstract—Reinforcement Learning (RL) has recently impressed the world with stunning results in various applications. While the potential of RL is now well-established, many critical aspects still need to be tackled, including safety and stability issues. These issues, while secondary for the RL community, are central to the control community, which has been widely investigating them. Model Predictive Control (MPC) is one of the most successful control techniques because, among others, of its ability to provide such guarantees even for uncertain constrained systems. Since MPC is an optimization-based technique, optimality has also often been claimed. Unfortunately, the performance of MPC is highly dependent on the accuracy of the model used for predictions. In this paper, we propose to combine RL and MPC in order to exploit the advantages of both and, therefore, obtain a controller which is optimal and safe. We illustrate the results with two numerical examples in simulations.

Index Terms—Reinforcement Learning, Robust Model Predictive Control, safe policies

I. INTRODUCTION

Reinforcement Learning (RL) is a technique for solving problems involving Markov Decision Processes (MDP) [1]. In RL, rather than modeled state transition probabilities, samples and observed costs (or rewards) are used. RL algorithms enabled computers beating Chess and Go masters [2], and robots learning to walk or fly without supervision [3], [4]. For each state s, the optimal action a is computed as the optimal feedback policy π(s) for the real system, either directly (policy search methods) [5], [6] or indirectly (SARSA, Q-learning) [7]. In the latter, the optimal policy π(s) is indirectly obtained as the minimizer of the so-called action-value function Q(s, a) over the action or input a.
In both cases, either π or Q are typically approximated by a function approximator: Deep Neural Networks (DNN) are very commonly used for that purpose in recent applications. While RL has demonstrated in practice a huge potential, properties that are typically expected from a controller, such as, e.g., some form of stability and safety, are hard to guarantee, especially when relying on a DNN as a function approximator. In general, any safety-enforcing approach requires either (a) to collect data from the real system, thus incurring catastrophic events, (b) to use a model of the system to generate a sufficient amount of stochastic simulations, or (c) to model the uncertainty underlying the system, as we will detail later.

Some approaches have been developed in order to guarantee some form of safety: see, e.g., the excellent survey in [8] and references therein. Most approaches, however, do not strictly guarantee that a given set of constraints is never violated, but rather that violations are rare events. Some approaches propose to project the action resulting from a DNN onto a safe set: in [9] each constraint is approximated as a ReLU or an additional cost, and in [10] a QP is used. In both approaches the nominal prediction is used and uncertainty is neglected in the predictions, similarly to the approach of [11]. Projection approaches have been analyzed in the general case in [12]. The combination of learning and control techniques has been proposed in, e.g., [13], [14], [15], [16], [17]. The combination of RL and the linear quadratic regulator has been presented in [18], [19]. To the best of our knowledge, [11], [20], [21] are the first works proposing to use NMPC as a function approximator in RL.

M. Zanon is with the IMT School for Advanced Studies Lucca, Italy. S. Gros is with the Department of Engineering Cybernetics, NTNU, Norway.
While strategies for providing some form of safety have been developed [8], to the best of the authors' knowledge, none of these approaches is able to strictly satisfy some set of constraints at all times. Rather, constraint violation is strongly penalized in [11] and in some of the approaches in [8]. The only contribution providing robust constraint satisfaction guarantees is [22], where a linear feedback policy is learned. In this paper we propose an RL formulation based on MPC which addresses the issue of safety, which we did not rigorously enforce in [11], [20]. We summarize next two existing approaches and the scheme we propose, which combines them:

a) Robust MPC, business-as-usual: a low-dimensional, computationally tractable uncertainty set is first identified and then used to formulate a robust MPC problem, see (20).

b) RL, business-as-usual: penalizing violations with a suitably high cost lets the optimization procedure yield a policy which tends not to violate the constraints. Safety is typically not strictly guaranteed and only few results provide weak guarantees.

c) Safe RL MPC: in the approach we propose, a robust MPC problem is formulated similarly to (a). Similarly to (b), RL updates the parametrization of the robust MPC scheme, and of the safety constraint, to reduce conservatism while preserving safety.

Safe RL-MPC is based on the approach first advocated in [11], [20]. The scheme can be seen from two alternative points of view: (a) MPC is used as a function approximator within RL in order to provide safety and stability guarantees; and (b) RL is used in order to tune the MPC parameters, thus improving closed-loop performance in a data-driven fashion. Since safety is fundamental not only during exploitation, but also during exploration, we also address the issue of guaranteeing constraint satisfaction during this phase.
Another important contribution of this paper is the development of an efficient way to deal with the enormous amount of data typically collected by autonomous systems. In particular, the introduction of a nominal (potentially inaccurate) linear model allows one to significantly reduce the amount of stored data. Further efficiency is obtained by exploiting convexity and using a low-dimensional approximation of the uncertainty set.

We first present RL at a conceptual level and propose adaptations in order to make RL applicable to the safety-enforcing setup. The main issue to be tackled is related to the safety constraints, which need to be enforced when updating the function approximator parameter. We formulate the update by resorting to a constrained optimization problem, similarly to what has been proposed in [20]. The proposed safe RL can be directly applied to Q-learning, but actor-critic techniques require some adaptation when the input space is continuous and restricted by safety requirements, as discussed in [23].

Contributions: This paper proposes an approach to combine MPC and RL so as to guarantee safety. In order to limit the complexity of the algorithm, we (a) rely on a linear system with bounded disturbances for predictions, and (b) propose an ad-hoc formulation of robust constraint satisfaction which relies on the linear model to greatly reduce the amount of data to be stored. Finally, adaptations to the standard RL algorithms to guarantee that the parameter update does not jeopardize safety are introduced.

The paper is structured as follows. We introduce the problem of safe RL in general terms in Section II, while the rest of the paper specializes to the case of linear systems. We propose a tailored function approximator based on robust MPC in Section III and discuss the efficient use of data in Section IV.
The necessary modifications to the standard RL algorithms are proposed in Section V and the whole framework is tested in simulations in Section VI. Conclusions and an outline of future research are given in Section VII.

Notation: a is a scalar, a ∈ R^{n_a} is a vector with components a_i, A is a matrix with rows A_i, and set-valued quantities are denoted by calligraphic letters. For any set, |·| denotes its cardinality. For any function f(x), we define f(X) := { f(x) | x ∈ X } and denote f(x) ≤ 0, ∀ x ∈ X as f(X) ≤ 0. The only exceptions are J, V, Q, which are scalar functions but denoted by capital letters in the literature.

II. BACKGROUND AND SAFE RL FORMULATION

In this paper, we consider real system dynamics described as a Markov Process (MP) with continuous state s and action a, with state transitions s, a → s_+ having the underlying conditional probability density

    P[s_+ | s, a].                                            (1)

We furthermore consider a deterministic policy delivering the control input as a = π(s), resulting in the state distribution τ_π. The RL problem then reads as

    π* := arg min_π J(π) := E_{τ_π} [ Σ_{k=0}^∞ γ^k ℓ(s_k, π(s_k)) ],   (2)

where ℓ is called the stage cost in optimal control and −ℓ the instantaneous reward in RL. The scalar γ is a discount factor, typically smaller than 1 in RL and equal to 1 in MPC. Note that in (2) we provide the definition for non-episodic settings, but the developments of the paper readily apply to episodic settings. The value function V*(s) is the optimal cost, obtained by applying the optimal policy π*, i.e.,

    V*(s_0) = E_{τ_{π*}} [ Σ_{k=0}^∞ γ^k ℓ(s_k, π*(s_k)) | s_0 ],     (3)

and the Bellman equation defines the action-value function

    Q*(s, a) := ℓ(s, a) + γ E[ V*(s_+) | s, a ],              (4)

where the expectation is taken over the state transition (1). The various forms of RL use parametric approximations V_θ, Q_θ, π_θ of V*, Q*, π* in order to find the parameter θ*
which (approximately) solves (2), either directly or by approximately solving (4) in a sample-based fashion. For both approaches we summarize this as

    θ* := min_θ Σ_{k=0}^n ψ(s_{k+1}, s_k, a_k, θ),            (5)

where function ψ depends on the specific algorithm, e.g.,

    ψ(s_{k+1}, s_k, a_k, θ) = ( ℓ(s_k, a_k) + γ V_{θ_k}(s_{k+1}) − Q_θ(s_k, a_k) )²   (6)

and n = 1 in recursive Q-learning formulations; and

    E_{τ_{π_θ}} [ ψ(s_{k+1}, s_k, a_k, θ) ] = J(π_θ)          (7)

in policy gradient approaches [1]. To optimize performance, the system ought to be controlled using the best available policy π_θ; this is referred to as exploitation. However, in order for Problem (5) to be well-posed in general, it is necessary to also collect data by deviating from π_θ and implementing a different policy π_e; this is referred to as exploration.

Among others, one of the main difficulties related to RL is safety enforcement, e.g., collision avoidance. In this paper we define safety through a set of constraints

    ξ(s, π̂(s)) ≤ 0, ∀ s ∈ supp τ_{π̂},                        (8)

where we denote by supp τ_{π̂} the support of the distribution of the MP (1) subject to policy π̂. Ideally, condition (8) should be satisfied at all times with unitary probability, both during exploitation, i.e., π̂ = π_θ, and exploration, i.e., π̂ = π_e ≠ π_θ. Note that (8) can only hold if process (1) has bounded support.

Enforcing (8) poses two major challenges: (i) either the support of (1) or the support of τ_{π̂} must be known or estimated; (ii) given knowledge of either support, a policy satisfying (8) must be designed. Problem (i) is fundamental, since one can never have the guarantee of being able to observe the full support of (1). Arguably, a reasonable approach can be to approximate the support based on the information extracted from the available samples, or on a prior, or on both.
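As a concrete illustration of the update (5)-(6), the following sketch (hypothetical, not from the paper) performs one semi-gradient step on the squared temporal-difference residual. A simple quadratic function approximator stands in for the MPC-based Q_θ of Section III, and all numbers are illustrative.

```python
import numpy as np

def q_value(theta, s, a):
    # Quadratic function approximator Q_theta(s, a) = theta[0]*s^2 + theta[1]*a^2 + theta[2]
    return theta[0] * s**2 + theta[1] * a**2 + theta[2]

def q_grad(theta, s, a):
    # Gradient of Q_theta with respect to theta
    return np.array([s**2, a**2, 1.0])

def td_update(theta, s, a, s_next, stage_cost, gamma=0.95, alpha=0.01):
    # Residual of Eq. (6): ell + gamma * min_a' Q(s+, a') - Q(s, a),
    # with the inner minimization done on a coarse action grid
    v_next = min(q_value(theta, s_next, ap) for ap in np.linspace(-1, 1, 21))
    delta = stage_cost + gamma * v_next - q_value(theta, s, a)
    # Semi-gradient step decreasing the squared residual
    return theta + alpha * delta * q_grad(theta, s, a)

theta = np.array([1.0, 1.0, 0.0])
theta = td_update(theta, s=0.5, a=0.2, s_next=0.4, stage_cost=0.29)
```

Note that in (6) the target uses the previous iterate θ_k; here a single update is shown, so the same θ appears in both terms.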
We assume that the collected data is informative, such that, in the limit of an infinite amount of data, the support is reconstructed exactly for policy π_θ. A theoretical justification of this is beyond the scope of this paper. In order to construct an approximation of the MP support, we first define the dispersion set confining the state transitions as

    S_+(s, a) = { s_+ | P[s_+ | s, a] > 0 }.                  (9)

Fig. 1: Schematics of the proposed setup: data are used to construct the SDC based on W̄ and to evaluate the cost in (5). This cost depends on Q_θ, V_θ obtained from MPC, and ℓ. MPC controls the system. The signal toggling between exploitation and exploration is omitted to avoid confusion, and is a signal sent from RL to switch between MPC (20) and (43).

Since S_+ is not known a priori, we introduce a parametrized approximation Ŝ_+ based on function g_θ, given by

    Ŝ_+(s, a, θ) := { s_+ | g_θ(s_+, s, a) ≤ 0 }.             (10)

In order to enforce safety, Ŝ_+ must be an outer approximation of the set S_+, i.e., θ must be chosen such that

    Ŝ_+(s, a, θ) ⊇ S_+(s, a), ∀ s, a.                         (11)

We label this condition the Safe-Design Constraint (SDC), since it restricts the values that θ can take based on safety concerns. In order to discuss safety in mathematically simple terms, let us introduce the worst-case mass of (1) outside of Ŝ_+, defined as

    χ(θ) = sup_{s,a} E[ 1( s_+ ∉ Ŝ_+(s, a, θ) ) | s, a ] ∈ [0, 1],   (12)

where the expected value is taken over (1). For a given θ, (11) is ensured if χ(θ) = 0. However, χ(θ) is known only if we assume the real system dynamics (1) are known.
Otherwise, in the Bayesian context, χ(θ) ought to be treated as a random variable, reflecting our imperfect knowledge of it, conditioned on the current data D (i.e., knowledge) we have of the system. We therefore consider

    η(D) = P[ χ(θ) = 0 | D ],                                 (13)

which provides a formal definition of the probability that Ŝ_+ captures the real system dispersion given the data D. While computing η is difficult in the general case, its definition allows us to discuss in rigorous terms the fundamental limitation provided next.

Fundamental Limitation 1: For any D, if Ŝ_+ ⊂ R^{n_s}, it is impossible to guarantee that

    η(D) = 1                                                  (14)

without introducing any additional assumption on (1).

In other words, given a finite set of samples and, possibly, prior but not absolute knowledge of the system, one does not have enough information to construct a set containing all future samples, unless this set is R^{n_s}. This is a fundamental limitation of robust constraint satisfaction and, therefore, it is independent of the proposed approach. In the following, we will therefore discuss η-safety, underlining that (14) cannot be achieved in practice. Arguably, by introducing additional assumptions restricting the function space to which (1) can belong, one can envision overcoming this limitation. The investigation of this topic is beyond the scope of this paper and will be the subject of future research.

In order to formulate some form of SDC with limited information, we introduce the sample-based form of (11) as

    s_{k+1} ∈ Ŝ_+(s_k, a_k, θ), ∀ k.                          (15)

For problem (ii), the main challenge is to find values of θ such that the policy π_θ strictly satisfies the safety constraints. We define safety based on the dispersion set propagation under policy π_θ, which reads as

    S^{π_θ}_{k+1} := Ŝ_+(S^{π_θ}_k, π_θ(S^{π_θ}_k), θ), S^{π_θ}_0 = {s_0}.   (16)
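The sample-based SDC (15) reduces to a simple membership test once Ŝ_+ has a concrete shape. The sketch below is illustrative, not from the paper: it assumes a nominal linear model s_+ ≈ A s + B a + b and an axis-aligned box W_θ (cf. (23) later), so that checking (15) amounts to checking that every observed model residual lies in the box; all numbers are hypothetical.

```python
import numpy as np

# Hypothetical ingredients (not from the paper): a nominal linear model
# s+ = A s + B a + b and an axis-aligned box W_theta = [-w_bar, w_bar],
# so that S_hat_+(s, a, theta) = A s + B a + b + W_theta.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
b = np.zeros(2)
w_bar = np.array([0.02, 0.05])  # theta_W: half-widths of the box

def sdc_satisfied(transitions):
    # Sample-based SDC (15): every observed s+ must lie in S_hat_+(s, a, theta)
    for s, a, s_plus in transitions:
        w = s_plus - (A @ s + B @ a + b)  # model residual
        if np.any(np.abs(w) > w_bar):
            return False
    return True

data = [
    (np.array([0.0, 0.0]), np.array([1.0]), np.array([0.015, 0.14])),
    (np.array([0.1, 0.2]), np.array([0.0]), np.array([0.11, 0.21])),
]
```

A transition whose residual leaves the box signals that θ_W must be enlarged, which is precisely what constraint (18b) below encodes for the RL update.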
Definition 1 (η-safe policy): For a given data set D and a given set of initial conditions S_0, a policy π_θ is labeled η-safe for initial states s_0 ∈ S_0 if it satisfies

    ξ(S^{π_θ}_k, π_θ(S^{π_θ}_k)) ≤ 0, ∀ k ≥ 0.                (17)

In general, there can exist initial states for which a safe policy cannot exist [24], and Definition 1 characterizes policies that preserve safety for a given initial state.

Providing safety guarantees is arguably an open problem when using DNNs as function approximators. However, this problem has been studied in control theory, and one successful design technique is robust MPC [25], [26], [27]. Therefore, instead of building the function approximations based on the commonly used DNN approaches, we will use robust MPC, within an extended version of the RL-MPC scheme proposed in [11], [20]. Note that it has been proven in [11] that the optimal policy π, value and action-value functions V* and Q* can be recovered exactly by function approximations based on MPC, provided that their parametrization is rich enough, even in case the model used in MPC is different from (1).

In order to guarantee safety, robust MPC would ideally rely on the propagation of the dispersion set (16). Unfortunately, this poses severe computational challenges, and an auxiliary (time-varying) policy π^MPC_{θ,k} is preferred in order to recover a computationally tractable formulation [26]. Policy π^MPC_{θ,k} is typically selected as an open-loop input profile corrected by an affine feedback, see Section III-A. We remark that MPC delivers π_θ = π^MPC_{θ,0}. By construction, robust MPC delivers a policy π_θ which satisfies constraints ξ at all future times, provided that the SDC (15) holds for all k. Even though the dispersion set propagation S^{π^MPC_{θ,k}}_k is computed based on π^MPC_{θ,k}, the constraints are guaranteed to hold also for S^{π_θ}_k.

Since the shape of Ŝ_+ impacts the closed-loop performance, one can let safe RL adapt Ŝ_+.
However, this requires one to enforce (15) explicitly in RL. The safe RL problem is then formulated as

    θ* := min_θ Σ_{k=0}^n ψ(s_{k+1}, s_k, a_k, θ)             (18a)
    s.t.  s_{k+1} ∈ Ŝ_+(s_k, a_k, θ), ∀ k.                    (18b)

By enforcing (18b), RL is explicitly made aware that some parameter updates are unsafe and, therefore, not feasible. Provided that the SDC holds, MPC delivers a safe policy by construction. Though in principle the SDC has to be enforced for each sample, in Section IV we propose an approach to largely reduce the amount of constraints.

Remark 1: The RL Problem (18) is typically solved using sensitivity-based methods, hence we need to differentiate the function approximator with respect to the parameter θ. In our case, we need to differentiate the robust MPC problem. This will be detailed in Section III-B.

The proposed safe RL framework performs the following steps: (a) at every time instant MPC is solved and differentiated; the MPC input is applied to the system; state transitions are observed and data is collected to form the sample-based SDC (15); (b) the RL problem (18) is solved (possibly at a lower sampling rate than MPC) and parameter θ is updated whenever possible. A scheme is displayed in Figure 1, with reference to the section where each component is discussed.

In this section, we have established the safe RL framework; in the next sections, we will discuss (a) how to implement robust MPC, (b) how to differentiate it in order to be able to solve Problem (18), and (c) how to manage constraint (18b) in a data-efficient fashion. In this paper, we address (a)-(c) by relying on a linear model of (1) to enforce safety. On the one hand, this choice makes robust constraint satisfaction tractable and not excessively demanding in terms of computations. On the other hand, any nonlinearity present in the system will be accounted for as a perturbation, therefore introducing some conservatism.
We remark that nonlinear robust MPC formulations have been developed and can be deployed within the proposed algorithmic framework. The main drawbacks of these formulations are: (a) some form of conservatism cannot be avoided and (b) the computational burden typically becomes prohibitive.

III. ROBUST MPC BASED ON INVARIANT SETS

Since guaranteeing robust constraint satisfaction in the general nonlinear case is extremely difficult [26], in the remainder of the paper we focus on the case of an affine model. We describe the safety constraints (8) as the inner approximation

    C s + D a + c̄ ≤ 0.                                        (19)

We formulate a Q-function approximator based on classic robust linear MPC [26], [25]:

    Q_θ(s, a) := min_z  Σ_{k=0}^{N−1} γ^k ( [x_k; u_k]ᵀ H [x_k; u_k] + hᵀ [x_k; u_k] )
                        + γ^N ( x_Nᵀ P x_N + pᵀ x_N )          (20a)
    s.t.  x_0 = s, u_0 = a,                                   (20b)
          x_{k+1} = A x_k + B u_k + b,  k ∈ I_0^{N−1},        (20c)
          C x_k + D u_k + c_k ≤ 0,      k ∈ I_0^{N−1},        (20d)
          G x_N + g ≤ 0,                                      (20e)

where z := (x_0, u_0, …, u_{N−1}, x_N). Note that we use x, u to distinguish the MPC predictions from the actual state and action realizations s, a, which define the initial constraint (20b). The dynamic constraints (20c) assume a nominal model without any perturbation. The tube-based approach then treats the system stochasticity, model uncertainties and safety constraint (19) by performing a suitable tightening of the path constraints (20d), i.e., c_k ≥ c̄ is used. The terminal constraint (20e) is introduced to guarantee that the path constraints will never be violated at any future time k > N. The value function and optimal policy are obtained as [11]

    V_θ(s) := min_a Q_θ(s, a),  π_θ(s) := arg min_a Q_θ(s, a).   (21)

In practice, V_θ and π_θ are computed jointly by solving (20) without enforcing the constraint u_0 = a.
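To make the structure of (20)-(21) concrete, the following sketch solves a drastically simplified instance: the inequality constraints (20d)-(20e) and the linear cost terms are omitted, so the problem reduces to a finite-horizon LQ problem solved by condensing the dynamics (20c). The system matrices and weights are illustrative, not from the paper.

```python
import numpy as np

# Simplified sketch of (20): inequalities dropped, so V_theta and pi_theta
# come from an unconstrained finite-horizon LQ problem. Numbers illustrative.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])
N, gamma = 20, 1.0

def mpc_value_and_policy(s):
    nu = B.shape[1]
    # Condensing: stack predictions as x = Phi s + Gam u (u stacked over horizon)
    Phi = np.vstack([np.linalg.matrix_power(A, k) for k in range(1, N + 1)])
    Gam = np.zeros((2 * N, nu * N))
    for k in range(1, N + 1):
        for j in range(k):
            Gam[2*(k-1):2*k, j*nu:(j+1)*nu] = np.linalg.matrix_power(A, k-1-j) @ B
    Qbar = np.kron(np.diag(gamma ** np.arange(1, N + 1)), Q)
    Rbar = np.kron(np.diag(gamma ** np.arange(N)), R)
    H = Gam.T @ Qbar @ Gam + Rbar
    f = Gam.T @ Qbar @ Phi @ s
    u = np.linalg.solve(H, -f)          # optimal stacked input sequence
    x = Phi @ s + Gam @ u
    V = s @ Q @ s + u @ Rbar @ u + x @ Qbar @ x
    return V, u[:nu]                    # V_theta(s) and pi_theta(s) = u_0

V, u0 = mpc_value_and_policy(np.array([1.0, 0.0]))
```

Fixing u_0 = a instead of optimizing over it would yield Q_θ(s, a) as in (20b); the tube-based tightening of c_k is discussed in Section III-A.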
The parameter θ to be adapted by RL may include any of the vectors and matrices defining the MPC scheme (20), i.e., generally

    θ = { H, h, P, p, A, B, b, c̄, K, θ_W }.                   (22)

Parameters H, P are typically assumed positive-definite to guarantee the solvability of (20). Parameters K and θ_W do not appear explicitly in (20), and are used to compute the constraint tightening and the terminal set, which will be introduced next: we will first present the computation of the constraint tightening, and then discuss the computation of the sensitivities of V_θ, Q_θ, π_θ with respect to θ.

A. Recursive Robust Constraint Satisfaction

In order to guarantee constraint satisfaction for the real system (1) using predictions given by the nominal model (20c), robust MPC explains the difference between predictions and actual state transitions by means of additive noise w ∈ W_θ, with W_θ satisfying

    Ŝ_+(s, a, θ) = A s + B a + b + W_θ ⊇ S_+(s, a).           (23)

We remark that, by using the affine model (20c), conservatism is introduced, as any nonlinearity present in (1) will be accounted for by w. Set W_θ is parametrized by parameter θ_W. Common choices in robust MPC are to parametrize W_θ using ellipsoids or polytopes; in this paper we consider the latter.

In order to perform the constraint tightening, i.e., the computation of c_k, we rely on the approach first proposed in [25]. We introduce the prediction error of the nominal model (20c):

    E_{k+1} = (A − BK) E_k + W_θ,  E_0 = {0},

where set E_k predicts an outer approximation of the dispersion set around the predicted trajectory, i.e.,

    S^{π^MPC_{θ,k}}_k ⊆ x_k + E_k,  k = 0, …, N,

where π^MPC_{θ,k} := a_k − K e_k, e_k ∈ E_k, and

    S^{π^MPC_θ}_{k+1} = Ŝ_+(S^{π^MPC_{θ,k}}_k, a_k − K E_k, θ),  S^{π^MPC_{θ,k}}_0 = {s_k}.

The feedback matrix K is introduced in order to model the fact that any closed-loop strategy will compensate for perturbations on the nominal model.
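The error-set recursion E_{k+1} = (A − BK) E_k + W_θ can be propagated cheaply when W_θ is an axis-aligned box. The sketch below (illustrative numbers, hypothetical feedback gain K, not from the paper) propagates the interval hull of each E_k by its half-widths; this is an outer approximation of the exact set recursion, which is all the tightening requires.

```python
import numpy as np

# Interval-hull sketch of E_{k+1} = A_K E_k + W_theta for a box
# W_theta = [-w_bar, w_bar]. Propagating half-widths through |A_K|
# gives an outer (interval) approximation of each E_k.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
K = np.array([[1.0, 1.5]])            # hypothetical stabilizing feedback
A_K = A - B @ K
w_bar = np.array([0.02, 0.05])        # half-widths of the box W_theta

def error_halfwidths(N):
    e = np.zeros(2)                   # E_0 = {0}
    out = [e]
    for _ in range(N):
        e = np.abs(A_K) @ e + w_bar   # interval hull of A_K E_k + W_theta
        out.append(e.copy())
    return out

E = error_halfwidths(10)
```

Each E[k] gives, per state component, the worst-case deviation of the real state from the nominal prediction x_k, which feeds directly into the tightening (25).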
For ease of notation, we define C_K := C − DK, A_K := A − BK. Robust constraint satisfaction is then obtained provided that

    C x_k + D u_k + C_K e_k + c̄ ≤ 0,  ∀ e_k ∈ E_k,            (24)

such that c_k is obtained by adding the worst-case realization of C_K E_k to c̄. For each component i of the path constraint at time k we define

    d_{i,k} := max_e (C_K)_i e  s.t.  e ∈ E_k
             = max_w (C_K)_i Σ_{j=0}^{k−1} (A_K)^j w_j  s.t.  w_j ∈ W_θ.   (25)

We lump all components d_{i,k} in vector d_k. Then, constraint satisfaction is obtained for all w_k ∈ W_θ if c_k = c̄ + d_k. If W_θ is a polytope, then (25) can be formulated as an LP; this implies that (a) the constraint tightening is relatively cheap to compute, and (b) as detailed in Section IV, the SDC enforcement becomes easier to derive.

In order to guarantee that Problem (20) remains feasible at all times for all w_k ∈ W_θ, one needs to impose ad-hoc terminal conditions. More specifically, the terminal set X_f := { x | G x + g ≤ 0 } should be robustly invariant and output admissible, i.e., there exists a terminal control law κ_f such that s_{k+1} ∈ X_f and C s_k + D κ_f(s_k) + c̄ ≤ 0 for every s_k ∈ X_f. We consider a linear control law κ_f(s) = −K s, coinciding with the linear feedback used to stabilize the prediction error e. In order to compute the set X_f, we define

    X_0 := { x | C_K x + c_0 ≤ 0 },
    X_k := { x ∈ A_K X_{k−1} ⊕ W_θ | C_K x + c_k ≤ 0 }.

Note that, by (24)-(25), x_0 ∈ X_k implies

    C_K (x_j + e_j) + c̄ ≤ 0,  ∀ e_j ∈ E_j,  ∀ j ∈ I_0^k.

The set X_∞ is the Maximal Robust Positive Invariant (MRPI) set [28]:

    s_k ∈ X_∞  ⇒  A_K^{j−k} s_k ∈ X_∞,  C_K s_j + c̄ ≤ 0,  ∀ j > k.

Additionally, whenever the system is stable and the origin is in the interior of the constraint set, the MRPI set is finitely determined [28, Theorem 6.3], i.e., ∃ k_0 < ∞ s.t. X_{k_0} ≡ X_{k_0+i} for all i ≥ 1. The stability requirement further motivates the introduction of feedback through matrix K.
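For a polytopic W_θ, each term of (25) is a small LP. The sketch below (illustrative matrices, hypothetical K, not from the paper) computes d_{i,k} for a box W_θ with `scipy.optimize.linprog`, exploiting the separability over j that Lemma 2 below formalizes.

```python
import numpy as np
from scipy.optimize import linprog

# Tightening LP (25) for a box uncertainty set W_theta = {w : |w| <= w_bar}.
# The maximization splits over the disturbance at each prediction step j.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
K = np.array([[1.0, 1.5]])                 # hypothetical feedback gain
C = np.array([[1.0, 0.0], [-1.0, 0.0]])    # illustrative bounds on the first state
D = np.zeros((2, 1))
w_bar = np.array([0.02, 0.05])
A_K, C_K = A - B @ K, C - D @ K

def tightening(i, k):
    # d_{i,k} = sum_j max_w (C_K)_i A_K^j w  s.t.  -w_bar <= w <= w_bar
    d = 0.0
    for j in range(k):
        c_vec = C_K[i] @ np.linalg.matrix_power(A_K, j)
        # linprog minimizes, so negate the objective to maximize c_vec @ w
        res = linprog(-c_vec, bounds=list(zip(-w_bar, w_bar)))
        d += -res.fun
    return d

d_0 = tightening(0, 1)   # tightening of the first constraint at k = 1
```

For a box set each LP also has the closed form |(C_K)_i A_K^j| w̄, but the LP formulation carries over unchanged to general polytopes.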
As proven in [28], for W_θ polyhedral, G, ḡ are given by

    G := [ C_K ; C_K A_K ; … ; C_K A_K^{k_0} ],  ḡ := [ c_0 ; c_1 ; … ; c_{k_0} ],   (26)

where we stress that c_k = c̄ + d_k, such that, by using κ_f(s) = −K s, one can spare a large amount of computations, ḡ being already computed. The condition G x_N + ḡ ≤ 0 would then guarantee robust constraint satisfaction for all future times if s_N = x_N. However, s_N = x_N + e_N, such that also the terminal constraint (20e) must be tightened. Analogously to the case of the path constraints, we define g = ḡ + h with

    h_i := max_w G_i Σ_{j=0}^{k_0−1} A_K^j w_j  s.t.  w_j ∈ W_θ.   (27)

Remark 2: Typically, constraints that can never become active are removed from (26), so as to reduce the dimension.

Safety is guaranteed by the following result on tube MPC.

Proposition 1 (Recursive Feasibility): Assume that X_f := { x | G x + g ≤ 0 } is RPI and Problem (20) is feasible at time k = 0. Then Problem (20) is feasible for all w_k ∈ W_θ and all times k ≥ 0. If, moreover, (23) holds, the real system (1) satisfies the safety constraint (19) at all times k ≥ 0.

Proof: The proof can be found in, e.g., [25].

This proof is valid in case θ is kept fixed. A discussion on how to enforce recursive feasibility upon updates of θ is proposed in Section V, Remark 9.

While the framework of robust linear MPC based on MRPI sets is well established, the computation of the parametric sensitivities of an MPC problem, required to deploy most RL methods, is not common. In particular, in the case of a tube-based formulation, also the constraint tightening procedure needs to be differentiated. This is also not common and deserves to be discussed in detail. We therefore devote the next subsection to the computation of the derivative of the value and action-value functions with respect to the parameter θ.
B. Differentiability

In order to be able to deploy RL algorithms to adapt the parameter θ, we need to be able to differentiate the MPC scheme and, therefore, the constraint definition with respect to θ. In principle, θ could include A, B, K, but also all other parameters of the MPC formulation (22). In order to compute ∇_θ c_k, ∇_θ G, ∇_θ g, one can use results from parametric optimization to obtain the following lemmas [29].

Consider a parametric NLP with cost φ_θ, primal-dual variable y and parameter θ. We refer to [30] for the definition of the Lagrangian l⁰_θ(y), KKT conditions, Linear Independence Constraint Qualification (LICQ), Second-Order Sufficient Conditions (SOSC) and Strict Complementarity (SC). For a fixed active set, the KKT conditions reduce to the equality r⁰_θ(y) = 0.

Lemma 1: Consider a parametric optimization problem with optimal primal-dual solution y*. Assume that LICQ, SOSC and SC hold at y*. Then, the following holds:

    ∇_θ φ_θ = ∇_θ l⁰_θ(y),  (∂r⁰_θ/∂y) (d y*/dθ) = ∂r⁰_θ/∂θ.

Proof: The result can be found in, e.g., [29].

Corollary 1: Assume that LICQ, SOSC and SC hold at the optimal solution of (20). Then, the value function V_θ, action-value function Q_θ and optimal solution y* (therefore also the policy π) are differentiable with respect to the parameter θ, with

    ∇_θ V_θ(s) = ∇_θ l̄_θ(y),  ∇_θ Q_θ(s, a) = ∇_θ l_θ(y),     (28a)
    (∂r_θ/∂y) (d y*/dθ) = ∂r_θ/∂θ,                            (28b)

where l_θ is the Lagrangian of Problem (20), l̄_θ is the Lagrangian when constraint u_0 = a is eliminated, and r_θ denotes the KKT conditions for the optimal active set.

Remark 3: When solving an LP, QP or NLP using a second-order method, e.g., active-set or interior-point, the most expensive operation is the factorization of the KKT matrix ∂r_θ/∂y. Once the matrix is factorized, the solution of the linear system is computationally inexpensive.
Therefore, the sensitivities of the solution are in general much cheaper to evaluate than solving the problem itself. The sensitivity of the optimal value function is even simpler to compute, since it consists in the differentiation of the Lagrangian, see (28a).

In (28), r_θ, l_θ, and l̄_θ depend on c_k, g which, in turn, depend on θ, as they are optimal values of the parametric optimization problems (25) and (27). Consequently, one needs to evaluate ∇_θ c_k, ∇_θ g. In the following, we further detail the application of Lemma 1 to this case. We consider only Problem (25), since the derivation for (27) is analogous. First, we state the separability and, therefore, parallelizability of the computation of d_k in the following lemma.

Lemma 2: Each component of d_k can be computed as

    d_{i,k} = Σ_{j=0}^{k−1} d_{i,k,j},  where  d_{i,k,j} := max_{w_j} (C_K)_i A_K^j w_j  s.t.  w_j ∈ W_θ.   (29)

Proof: Each term in the sum Σ_{j=0}^{k−1} A_K^j w_j depends only on the variable w_j, and the problem is fully separable.

Then, Lemma 1 can be applied to obtain

    d d_k/dθ = ( d d_{1,k}/dθ, …, d d_{n_c,k}/dθ ),
    d d_{i,k}/dθ = Σ_{j=0}^{k−1} d d_{i,k,j}/dθ,
    d d_{i,k,j}/dθ = d l^d_{θ,i,j}/dθ,

where l^d_{θ,i,j} is the Lagrangian of Problem (29). Provided that a second-order method is used for solving Problem (29), the matrix factorization is available and can be reused to compute the sensitivities at a negligible cost. For any function f(c_k(θ)), the chain rule yields

    d f/dθ = (d f/d c_k)(d c_k/dθ) = (d f/d c_k)(d d_k/dθ).

Remark 4: As underlined above, Problems (29) can be solved in parallel, not only for each prediction time k, but also for each component i. Moreover, if an active-set solver is used, the active set can be initialized and the matrix factorization reused, such that often there will be no need to recompute the factorization, and the computations can be done in an extremely efficient manner.
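The Lagrangian-based sensitivity of Lemma 1, applied to the separable LPs (29), can be checked numerically. For a box W_θ with half-widths θ_W (illustrative numbers, not from the paper), each LP has value |(C_K)_i A_K^j| θ_W, and differentiating its Lagrangian gives the gradient |(C_K)_i A_K^j| with respect to θ_W; the sketch verifies this against finite differences.

```python
import numpy as np

# Sensitivity check for one LP (29) with a box W_theta = [-theta_W, theta_W]:
# value d_{i,k,j} = |(C_K)_i A_K^j| @ theta_W, gradient |(C_K)_i A_K^j|.
A_K = np.array([[0.995, 0.0925], [-0.1, 0.85]])   # illustrative A - B K
C_K_i = np.array([1.0, 0.0])                       # illustrative row of C_K

def d_ikj(theta_W, j):
    row = C_K_i @ np.linalg.matrix_power(A_K, j)
    return np.abs(row) @ theta_W      # closed-form LP value for a box set

def grad_d_ikj(theta_W, j):
    row = C_K_i @ np.linalg.matrix_power(A_K, j)
    return np.abs(row)                # Lagrangian-based sensitivity, cf. (28a)

theta_W = np.array([0.02, 0.05])
eps = 1e-6
fd = np.array([(d_ikj(theta_W + eps * np.eye(2)[m], 1)
                - d_ikj(theta_W, 1)) / eps for m in range(2)])
```

Since the LP value is piecewise linear in θ_W, the finite-difference and Lagrangian-based gradients agree wherever the active set (here, the active box vertex) does not change.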
Finally, Problems (29) are very low-dimensional and are therefore solved extremely quickly.

C. Guaranteeing MPC Feasibility and LICQ

We detail next two issues that can easily be encountered when deploying RL based on MPC (20). Since these are common, a simple solution that has become a standard in MPC is readily available.

a) MPC feasibility: Since the set of possible perturbations is not known a priori, but rather approximated as W_θ based on the collected samples, it cannot be excluded that some future sample w_k will not be inside the set, i.e., w_k ∉ W_θ. The set approximation must then be modified to include the new sample (see Section IV-B and (40)), but recursive feasibility is potentially lost, no action is computed and the controller stops working.

b) Sensitivity computation: The sensitivity computation is valid only if LICQ holds. However, Problem (20) is not guaranteed to satisfy LICQ.

We propose to address both issues by using a common approach in MPC, i.e., a constraint relaxation for Constraints (20d) and (20e) with an exact penalty [31]: variables σ_k are introduced, the constraints are modified as

    C x_k + D u_k + c_k ≤ σ_k,                                (30a)
    G x_N + g ≤ σ_N,  σ_k ≥ 0,  k ∈ I_1^N,                    (30b)

and the term Σ_{k=1}^N ρᵀ σ_k is added to the cost.

Remark 5: Constraints only involving the controls do not need to be relaxed, while any constraint involving the states should not be imposed at k = 0, since then LICQ cannot be guaranteed even if the relaxation proposed above is deployed.

We formalize the result in the next proposition.

Proposition 2: Assume that Constraints (20d) and (20e) are relaxed as per (30) and the term Σ_{k=1}^N ρᵀ σ_k is added to the cost. Then, if ρ < ∞ is large enough, the solution is unchanged whenever feasible, and recursive feasibility and LICQ are guaranteed.

Proof: The first result, i.e., solution equivalence whenever feasible, and recursive feasibility, is well-known [31] and [32, Theorem 14.3.1].
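The exact-penalty mechanism of (30) can be illustrated on a toy problem (hypothetical, not the MPC scheme itself): a constraint is relaxed with a slack σ ≥ 0 penalized linearly by ρ. When ρ exceeds the constraint's Lagrange multiplier, the relaxed solution coincides with the original one and σ = 0, while an infeasible instance would simply yield σ > 0 instead of failure.

```python
import numpy as np
from scipy.optimize import minimize

# Toy exact-penalty relaxation, cf. (30): minimize (x - 2)^2 s.t. x <= 1,
# relaxed as x <= 1 + sigma, sigma >= 0, with penalty rho * sigma.
# The multiplier of x <= 1 at the optimum is 2, so any rho > 2 is "exact".
rho = 10.0

def relaxed(x0=0.0):
    obj = lambda v: (v[0] - 2.0) ** 2 + rho * v[1]
    cons = [{"type": "ineq", "fun": lambda v: 1.0 + v[1] - v[0]},  # x <= 1 + sigma
            {"type": "ineq", "fun": lambda v: v[1]}]               # sigma >= 0
    res = minimize(obj, [x0, 0.0], constraints=cons, method="SLSQP")
    return res.x

x, sigma = relaxed()
```

Here the solution matches the unrelaxed optimum x = 1 with σ = 0, illustrating Proposition 2's claim that the solution is unchanged whenever the original problem is feasible.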
Regarding LICQ, we first note that in a formulation without Constraints (20d) and (20e) LICQ holds by construction: Constraints (20b) and (20c) can be eliminated by condensing [33], yielding a problem with $N n_u$ unconstrained variables. The introduction of any linearly independent set of pure control constraints then does not jeopardize LICQ by construction. Assume now that a linearly dependent constraint of the form (20d) or (20e) with Jacobian $\nu$ is introduced. By introducing the slack variable $\sigma$, the new Jacobian becomes $[\nu\ \ {-1}]$, i.e., it gains a nonzero entry in the slack column, which makes it by construction linearly independent of $[\nu\ \ 0]$ and, consequently, of the other constraints in the problem.

As discussed in Section II, safety holds with probability $\mathbb{P}[(11)\,|\,\theta] < 1$. This is an intrinsic issue of any safety-enforcing control scheme, and the proposed relaxation only solves the issue of avoiding infeasibility for the MPC scheme. This is further discussed in Section V-B.

IV. SAFE DESIGN CONSTRAINT AND DATA MANAGEMENT

As explained in the previous section, safety is obtained if all possible state transitions are correctly captured in the MPC formulation, i.e., if $\mathbb{W}_\theta$ and, therefore, $g_\theta$ is correctly identified. In other words, based on the collected data, the SDC must be enforced in order to guarantee that the uncertainty described by $\mathbb{W}_\theta$ is representative of the real system (1). In this section we propose a sample-based formulation of the SDC (11) to be used within the RL formulations (5).

Many control systems are typically operated at high sampling rates and, consequently, data is collected at high rates. In order to deal in real time with the large amounts of data that accumulate, it is necessary to (a) retain only strictly relevant data, and (b) compress the available information using appropriately defined data structures.
In the following, we first discuss the data structures involved in RL-MPC and then discuss how to make efficient use of data in the context of the proposed MPC formulation.

A. Set Membership and SDC

Consider the (possibly very large) set of state transitions observed on the real system:
$$\mathcal{D} = \{(s_1, a_1, s_2), \ldots, (s_n, a_n, s_{n+1})\}. \tag{31}$$
The problem of enforcing the SDC (15) is related to that of estimating the dispersion set $\mathcal{S}^+$ from data, which has been studied in the context of set-membership system identification (SMSI) [34]. Essentially, the dispersion set must satisfy
$$s^+ \in \hat{\mathcal{S}}^+(s, a, \theta), \qquad \forall\, (s, a, s^+) \in \mathcal{D}, \tag{32}$$
i.e., $\theta$ must satisfy the SDC (15), and therefore belongs to the set
$$\mathcal{S}_\mathcal{D} := \{\theta \mid g_\theta(s^+, s, a) \le 0, \ \forall\, (s, a, s^+) \in \mathcal{D}\}. \tag{33}$$
We should underline here the difference between $\hat{\mathcal{S}}^+$ defined in (10) and $\mathcal{S}_\mathcal{D}$. The former is an outer approximation, parametrized by $\theta$, of the set of all realizations $s^+$ given state and action $s, a$. The latter instead describes the set of parameters $\theta$ such that all state transitions from $\mathcal{D}$ are contained in $\hat{\mathcal{S}}^+$.

In SMSI, parameter $\theta$ is typically selected so as to obtain the smallest possible set $\mathcal{S}^+$. In the absence of specific information on the control task to be executed, this is arguably a very reasonable approach. However, given a specific control task, better performance might be obtained by selecting parameter $\theta$ so as to approximate some part of $\mathcal{S}^+$ accurately, even at the cost of increasing the volume of $\hat{\mathcal{S}}^+$. We therefore provide the following definition.

Definition 2 (Set Membership Optimality): Equivalently to (2), we define the closed-loop cost $J(\pi_\theta)$ associated with the policy $\pi_\theta$ learned by RL when relying on the set $\hat{\mathcal{S}}^+(s, a, \theta)$. Then, parameter $\theta$ is (locally) optimal for the control task iff
$$\nexists\, \bar\theta \in \mathcal{B}_\epsilon(\theta) \ \ \mathrm{s.t.}\ \ J(\pi_{\bar\theta}) < J(\pi_\theta),$$
where $\mathcal{B}_\epsilon(\theta)$ denotes a ball of radius $\epsilon$ centered at $\theta$.
In other words, the optimal parameter minimizes the cost subject to the SDC (15). Based on this definition, the optimal set approximation is obtained by an SMSI procedure which selects $\theta$ to maximize the closed-loop performance of the policy $\pi_\theta$. In this paper, we aim at doing so by means of RL.

Every time the RL problem (18) is solved, one must ensure that $\theta \in \mathcal{S}_\mathcal{D}$, i.e., $|\mathcal{D}|$ constraints need to be included in the problem formulation. With large amounts of data, this could make the problem computationally intractable. In the following, we analyze how to tackle this issue.

While the previous developments did not require any assumption on function $g_\theta$, in order to efficiently manage data we will assume hereafter that $g_\theta$ is affine in $s, a$. Note that this choice is consistent with the choice of an affine model. The SDC then becomes
$$\mathcal{S}_\mathcal{D} := \{\theta \mid M(s^+ - A s - B a - b) \le m, \ \forall\, (s, a, s^+) \in \mathcal{D}\},$$
for some $M, m$ possibly part of $\theta$. Since both $\mathcal{S}_\mathcal{D}$ and $\mathbb{W}_\theta$ are defined based on $g_\theta$, the parameters $M, m$ of the two sets coincide. Therefore, the SDC directly defines the uncertainty set $\mathbb{W}_\theta$ used by MPC to compute safe policies. Note that $A, B, b, M, m$ can all be included in $\theta$ and, therefore, adapted by RL. Which parameters to adapt or keep constant is a design choice, see also Remark 7.

B. Model-Based Data Compression

In this section we discuss how to compress the available data without loss of information, so as to significantly reduce the complexity of safe RL. It will become clear that the nominal model (20c) plays a very important role in this context, as it makes it possible to organize the data such that it can be efficiently exploited. Unfortunately, this efficiency is lost if the model parameters are updated. Approaches to circumvent this issue can be devised and are the subject of ongoing research. We begin the analysis by providing the following definition.
Definition 3 (Optimal Data Compression): Given the selected parametrization and MPC formulation, an optimal data compression selects a dataset $\bar{\mathcal{D}} \subseteq \mathcal{D}$ such that
$$\bar{\mathcal{D}} = \arg\min_{\hat{\mathcal{D}}}\ |\hat{\mathcal{D}}| \quad \mathrm{s.t.}\quad \mathcal{S}_{\hat{\mathcal{D}}} \equiv \mathcal{S}_\mathcal{D}. \tag{34}$$
Hence an optimal data compression retains the minimum amount of data required to represent the set $\mathcal{S}_\mathcal{D}$.

In the following, we assume that $A, B, b$ are fixed and exploit the model to achieve an optimal data compression. The introduction of the nominal model (20c) allows us to restructure the data and store only the noise $\mathbb{W} := \{w_0, \ldots, w_n\}$, obtained from
$$w_k = s_{k+1} - (A s_k + B a_k + b). \tag{35}$$
By using (35), we rewrite the SDC as
$$\mathcal{S}_\mathcal{D} = \mathcal{S}_\mathbb{W} := \{\theta \mid M w \le m, \ \forall\, w \in \mathbb{W}\}, \tag{36}$$
with $\theta_\mathbb{W} = (M, m)$ a component of $\theta$. By exploiting the model we can therefore reduce the dimension of the dataset space from $2 n_s + n_a$ to $n_s$, since $(s, a, s^+) \in \mathbb{R}^{n_s} \times \mathbb{R}^{n_a} \times \mathbb{R}^{n_s}$ and $w \in \mathbb{R}^{n_s}$. Note that this is only possible if one neglects the dependence of $\mathbb{W}_\theta$ (and therefore $\mathbb{W}$) on $s, a$. The dispersion set approximation related to (36) is then
$$\hat{\mathcal{S}}^+(s, a, \theta) = \{A s + B a + b + w \mid \forall\, w \ \mathrm{s.t.}\ M w \le m\}, \tag{37}$$
such that the Problems (25), (27) used in the constraint tightening are LPs. Additionally, it is cheap to (a) evaluate whether $s^+ \in \hat{\mathcal{S}}^+$ by using (35) and (b) verify that $\theta \in \mathcal{S}_\mathbb{W}$, i.e., that $M w_k \le m$ holds, since both operations only require a few matrix-vector operations. However, while (a) requires a single evaluation of the inequality in (37), (b) requires evaluating the inequality for each data point in (36).

The use of the model and the convexity assumption make it possible to further compress the data: any point in the interior of the convex hull of the set $\mathbb{W}$ does not provide any additional information regarding (36), such that the convex hull $\bar{\mathbb{W}} := \mathrm{Conv}(\mathbb{W})$ carries all necessary information. Constructing the facet representation of the convex hull can be a rather expensive operation, which can in general not be done online.
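A minimal sketch of the model-based compression (35) and the membership checks (36) just described (our own illustration; all names are assumptions):

```python
import numpy as np

def compress(A, B, b, data):
    """(35): map transitions (s, a, s+) to noise samples
    w = s+ - (A s + B a + b), reducing the stored dimension
    from 2*ns + na down to ns."""
    return np.array([s_next - (A @ s + B @ a + b) for s, a, s_next in data])

def sdc_holds(M, m, W, tol=1e-9):
    """(36): theta = (M, m) satisfies the SDC iff M w <= m for
    every stored noise sample w (a few matrix-vector products)."""
    return all(np.all(M @ w <= m + tol) for w in W)
```

Both operations are cheap; the cost of the second grows only linearly with the number of stored samples.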
However, checking whether a sample $\hat w$ lies inside the convex hull $\bar{\mathbb{W}}$ of a set of samples $\mathbb{W}$ can be done without building the facet representation of the convex hull. To this end, we define the LP
$$\zeta := \min_{z}\ \sum_{i=1}^{|\bar{\mathbb{W}}|} z_i \quad \mathrm{s.t.}\quad \hat w = \sum_{i=1}^{|\bar{\mathbb{W}}|} z_i w_i, \quad z \ge 0, \tag{38}$$
and exploit the following result.

Proposition 3: Assume that $0 \in \mathrm{Conv}(\mathbb{W})$. Then $\zeta \le 1 \Leftrightarrow \hat w \in \mathrm{Conv}(\mathbb{W})$, with $\zeta$ from (38).

Proof: The definition of convex hull implies $\hat w \in \mathrm{Conv}(\mathbb{W}) \Rightarrow \exists\, z \ge 0$, $\|z\|_1 = 1$ s.t. $\hat w = \sum_{i=1}^{|\bar{\mathbb{W}}|} z_i w_i$, which proves $\zeta \le 1 \Leftarrow \hat w \in \mathrm{Conv}(\mathbb{W})$. This also covers the implication $\zeta = 1 \Rightarrow \hat w \in \mathrm{Conv}(\mathbb{W})$. The implication $\zeta < 1 \Rightarrow \hat w \in \mathrm{Conv}(\mathbb{W})$ is proven by noting that $\zeta w_i \in \mathrm{Conv}(\mathbb{W})$, $\forall\, \zeta \in [0, 1]$. The case $\zeta < 1$ is thus immediately reconducted to the case $\zeta = 1$.

Formulation (38) is very convenient, as it is always feasible provided that $\mathbb{W}$ spans the full space $\mathbb{R}^{n_w}$, which is a minimal reasonable requirement in this context, since it guarantees that every vector in $\mathbb{R}^{n_w}$ can be obtained as a linear combination of the $w_i$. Note also that the assumption $0 \in \mathrm{Conv}(\mathbb{W})$ is a rather mild requirement on the accuracy of the model, as it always holds when standard system identification techniques are deployed to estimate the model parameters $A, B, b$.

We prove the efficiency of the convex hull approach in the following theorem.

Theorem 1 (Convex Hull Optimality): Given the dispersion set $\hat{\mathcal{S}}^+$ defined in (37), the convex hull $\bar{\mathbb{W}}$ of the state transition noise $\mathbb{W}$ is an optimal data compression.

Proof: The definition of $\bar{\mathbb{W}}$ implies that any point in $\mathbb{W}$ can be obtained as a convex combination of points in $\bar{\mathbb{W}}$. By definition we have that $g^w_\theta(w_1) \le 0$, $g^w_\theta(w_2) \le 0$ for all $w_1, w_2 \in \mathbb{W}$. For $\beta \in [0, 1]$ we define $w_\beta := \beta w_1 + (1 - \beta) w_2$, such that $g^w_\theta(w_\beta) \le \beta\, g^w_\theta(w_1) + (1 - \beta)\, g^w_\theta(w_2) \le 0$, since convexity of $\bar{\mathbb{W}}$ is equivalent to convexity of $g^w_\theta$.
This entails that $w_\beta \in \mathbb{W}$, such that any set $\hat{\mathcal{S}}^+$ computed using $\bar{\mathbb{W}}$ satisfies $s_{k+1} \in \hat{\mathcal{S}}^+(s_k, u_k, \theta)$, $\forall\, (s_{k+1}, s_k, u_k) \in \mathcal{D}$, i.e., $\forall\, w \in \mathbb{W}$. Finally, by removing any data point from $\bar{\mathbb{W}}$ one cannot guarantee that $\mathcal{S}_{\bar{\mathbb{W}}} = \mathcal{S}_\mathbb{W}$, such that $\bar{\mathbb{W}}$ solves (34).

In order to construct the convex hull $\bar{\mathbb{W}}$, once a new sample $w_k$ is available, we check whether $w_k \in \bar{\mathbb{W}}$ by solving (38). In case $\zeta \le 1$, $\bar{\mathbb{W}}$ does not need to be updated; otherwise, the new point is added to $\mathbb{W}$, such that $\bar{\mathbb{W}}$ is implicitly updated. Note that the LP (38) has linear complexity in the number of vertices $|\bar{\mathbb{W}}|$.

Remark 6: As data is collected, $|\mathcal{D}|$ grows indefinitely. Even though $\bar{\mathbb{W}}$ is optimal, $|\bar{\mathbb{W}}|$ can also become indefinitely large, though arguably at a lower rate than $|\mathcal{D}|$. This is a fundamental issue of any sample-based SDC or set-membership approach, unless additional assumptions are introduced. Simple strategies, such as limiting the maximum number of vertices/facets used for an approximation $\hat{\mathbb{W}}$ of the convex hull, could be devised at the price of introducing conservatism. Such investigations are beyond the scope of this paper and require future research.

Remark 7: Extra care must be taken if the model parameters $A, B$ are updated. Indeed, if one updates $A, B, b$ to $\tilde A, \tilde B, \tilde b$, then the new noise satisfies $\tilde w_k = s_{k+1} - \tilde A s_k - \tilde B a_k - \tilde b$, such that the noise update $\tilde w_k - w_k = (A - \tilde A) s_k + (B - \tilde B) a_k + b - \tilde b$ is state-action dependent for all $\tilde A \ne A$, $\tilde B \ne B$. Therefore, any change in those parameters requires recomputing the noise vector $w_k$ for all recorded state-action pairs. Updates of parameter $b$, instead, can be performed without much complication and simply entail a state-action-independent shift of the noise, i.e., $\tilde{\mathbb{W}} = \mathbb{W} + b - \tilde b$.
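The membership test (38) and the incremental hull update described above can be sketched as follows (our own illustration with `scipy`; names are assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def hull_membership(w_hat, W):
    """LP (38): zeta = min 1^T z  s.t.  w_hat = sum_i z_i w_i, z >= 0.

    W is an (n_points, n_w) array of stored points. By Proposition 3
    (assuming 0 lies in Conv(W)), w_hat is in Conv(W) iff zeta <= 1.
    """
    res = linprog(np.ones(W.shape[0]), A_eq=W.T, b_eq=w_hat,
                  bounds=(0, None))
    if not res.success:        # no nonnegative combination exists: outside
        return np.inf
    return res.fun

def update_hull(W, w_new, tol=1e-9):
    """Store the new noise sample only if it leaves the current hull."""
    if hull_membership(w_new, W) <= 1.0 + tol:
        return W
    return np.vstack([W, w_new])
```

Note that each test is a single LP in $|\bar{\mathbb{W}}|$ variables; no facet representation is ever built.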
C. Further Observations on the Sample-Based SDC

The convex hull $\bar{\mathbb{W}}$ is the smallest set encompassing all observed samples, i.e., it is the smallest set satisfying the SDC (15). Therefore, it is optimal both in terms of volume and in terms of cost, i.e., in the sense of Definition 2. Hence, one might be tempted to select $\mathbb{W}_\theta = \bar{\mathbb{W}}$ to reduce the conservatism of MPC (20) as much as possible. However, $\bar{\mathbb{W}}$ is typically composed of a very large number of facets, which renders the constraint-tightening procedure very costly and results in a terminal constraint of high dimension. In practice, a set $\mathbb{W}_\theta$ of fixed and low complexity is preferred, hence the importance of enforcing set membership optimality as per Definition 2, i.e., through RL.

Thus far we have not discussed how the set $\mathbb{W}_\theta$ is represented. The choice of representation becomes important, however, when dealing with large amounts of data. Moreover, the parametrization of $\mathbb{W}_\theta$ should be selected consistently with the algorithm used for solving the robust MPC problem. Convex polytopes can be parametrized using the so-called (a) facet or (b) vertex representation. In case (a), the set is parametrized as $\mathbb{W}_\theta := \{w \mid M w \le m\}$ with parameter $\theta_\mathbb{W} = (M, m)$. In case (b), the set is parametrized as $\mathbb{W}_\theta := \{w \mid w \in \mathrm{Conv}(\{v_0, \ldots, v_m\})\}$ with parameter $\theta_\mathbb{W} = (v_0, \ldots, v_m)$, i.e., the vertices of the polytope.

Differently from $\mathbb{W}_\theta$, for the convex hull $\bar{\mathbb{W}}$ we use the vertex representation. This allows a simpler and computationally cheaper construction and incremental update of $\bar{\mathbb{W}}$. This advantage, however, comes at the price of a more costly evaluation of $w_k \in \bar{\mathbb{W}}$ with respect to a facet representation. Finally, this makes it simple to enforce the SDC (36), which becomes
$$\mathcal{S}_{\bar{\mathbb{W}}} = \left\{\theta \mid M w \le m, \ \forall\, w \in \bar{\mathbb{W}}\right\}. \tag{39}$$
The question of which representation is most convenient for the convex hull $\bar{\mathbb{W}}$ is still open and will be further investigated in future research, which will also consider a combination of the facet and vertex representations.

We provide next some fundamental observations regarding $\bar{\mathbb{W}}$, its cardinality and its relationship with the nominal model.

Remark 8: As a consequence of Fundamental Limitation 1, in any sample-based context it is possible that a new sample $w_k$ falls outside the convex hull of the previous samples, i.e., $w_k \notin \bar{\mathbb{W}}$, such that potentially $M_i w_k \ge m_i$ for some $i$. In this case, one needs to instantaneously adapt the SDC (15). A straightforward adaptation of $\mathcal{S}_{\bar{\mathbb{W}}}$ is obtained as
$$m_i \leftarrow \max(m_i, M_i w_k). \tag{40}$$
This enlargement of the uncertainty set $\mathbb{W}_\theta$ entails an enlargement of the dispersion set, such that the constraints will be tightened further. This can jeopardize the recursive feasibility of MPC (20), such that it is necessary to deploy the constraint relaxation proposed in Section III-C.

V. SAFE RL-MPC IMPLEMENTATION

Having introduced all necessary components of the RL-MPC scheme, in this section we focus on the safe RL problem, discuss the RL problem in more detail and present some open research questions.

Most recursive RL approaches use only the current sample and update the parameter recursively as
$$\theta_{k+1} = \theta_k + \alpha\, (\theta^* - \theta_k), \tag{41}$$
with $\theta^* = \theta_k + \nabla_\theta \psi(s_{k+1}, s_k, a_k, \theta)$ and $\psi$ defined, e.g., as per (6) or (7). This means that the update is the solution of (or a step of stochastic gradient descent on) an unconstrained optimization problem. Since we need to enforce the SDC, the safe RL problem (18) yielding $\theta^*$ is constrained. By relying on the developments of Section IV, we formulate (18) as
$$\min_\theta\ \psi(s_{k+1}, s_k, a_k, \theta), \tag{42a}$$
$$\mathrm{s.t.}\ \ H \succ 0, \quad P \succ 0, \tag{42b}$$
$$\phantom{\mathrm{s.t.}\ \ } M w \le m, \quad \forall\, w \in \bar{\mathbb{W}}. \tag{42c}$$
Problem (42a) (without constraints) can be seen as a standard RL formulation [20].
The positive-definiteness constraints (42b) are introduced to make sure that the MPC problem is properly formulated and easily and efficiently solvable. The SDC (15) is imposed in (42c). If the model parameters $A, B$ are updated, the SDC cannot be implemented as (15), but rather as $\theta \in \mathcal{S}_\mathcal{D}$, with $\mathcal{S}_\mathcal{D}$ given by (33).

Remark 9: Any update of parameter $\theta$ results in a modification of the set $\mathbb{W}_\theta$, such that the existence of a solution to the MPC problem could be jeopardized. Several options can be envisioned, including: (a) updating $\theta$ only when feasible, (b) reducing the step size until feasibility is recovered, (c) enforcing feasibility as an additional constraint in (42). Any of the strategies (a)-(c) guarantees the feasibility of the updated robust MPC scheme since, provided that the MPC problem is correctly formulated, by Proposition 1 initial feasibility entails recursive feasibility. In turn, this entails safety of the parameter update. It is worth mentioning that we did not encounter feasibility issues in the simulations we performed. Nevertheless, future research will investigate this issue in depth.

Remark 10: Both $Q_\theta$ and $J(\pi_\theta)$ depend on the constraint-tightening procedure, such that their first-order sensitivities can be discontinuous. Consequently, the first-order sensitivities of $\psi$ are also non-smooth. This is the case when SC does not hold either in the MPC problem (20) or in the constraint-tightening problems (25), (27), i.e., when some constraints are weakly active, such that infinitesimal perturbations could cause an active-set change. In principle, this could create problems for continuous optimization algorithms. However, the set on which SC does not hold has zero measure, such that the probability that a sample falls onto one of these points is zero, and the RL solution is unaffected.
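The instantaneous SDC adaptation (40) and the damped parameter update (41) are simple vector operations; a minimal sketch (our own, names hypothetical):

```python
import numpy as np

def enlarge_sdc(M, m, w_new):
    """(40): m_i <- max(m_i, M_i w_new), the smallest enlargement of
    {w | M w <= m} that covers the unexpected sample w_new."""
    return np.maximum(m, M @ w_new)

def damped_update(theta, theta_star, alpha=0.1):
    """(41): theta_{k+1} = theta_k + alpha (theta* - theta_k)."""
    return theta + alpha * (theta_star - theta)
```

In the safe scheme, `theta_star` would come from solving the constrained problem (42) rather than from an unconstrained gradient step.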
Since the main concern of this paper is safety, we present next how to guarantee safety also during exploration, i.e., when the action applied to the system is not given by (20). Afterwards, we further discuss open research questions and possible ways of addressing them.

A. Safe Exploration

The proposed formulation guarantees safety during exploitation, i.e., whenever the policy is given by (21). However, specific care needs to be taken during exploration, i.e., when the optimal policy is perturbed, since constraint satisfaction must not be jeopardized in this phase either. Exploration is typically performed by picking a random action among all feasible actions with a given probability distribution. The main difficulty in enforcing safety is related to the complexity of the set of safe actions
$$\mathcal{A}(s_0) := \{a_0 \mid \exists\, a_1, \ldots \ \mathrm{s.t.}\ C s_k + D a_k + \bar c \le 0, \ \forall\, w \in \bar{\mathbb{W}}\}.$$
Since this set is implicitly approximated within robust MPC, this issue can be tackled by using a modified version of the robust MPC Problem (20):
$$\min_{x, u}\ (20\mathrm{a}) + f(u_0, q) \tag{43a}$$
$$\mathrm{s.t.}\ \ x_0 = s, \quad (20\mathrm{c}), (20\mathrm{d}), (20\mathrm{e}), \tag{43b}$$
with either $f(u_0, q) := \rho \|u_0 - q\|$ or $f(u_0, q) := q^\top u_0$, and a randomly chosen $q$.

Algorithm 1 RL-MPC
1: if Explore then
2:     Get $a$ from MPC (43)
3: else
4:     Get $\pi_\theta(s)$ from MPC (20)
5: Observe $s, a \to s^+$
6: Compute $w$, solve (38) to update $\bar{\mathbb{W}}$
7: Get $Q_\theta(s, a)$, $\nabla_\theta Q_\theta(s, a)$, $V_\theta(s^+)$
8: if Solve RL then
9:     Compute RL step: e.g., by solving (42)
10: if Update $\theta$ feasible then
11:     Recompute constraint tightening (25), (27)

Note that performing constrained exploration in policy gradient methods can produce a biased gradient estimation. Simple strategies to tackle this issue are proposed in [23].
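A toy stand-in for the exploration problem (43) with $f(u_0, q) := q^\top u_0$ (our own sketch; the explicit polytope below replaces the safe action set that robust MPC only describes implicitly):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def explore_action(A_safe, b_safe):
    """Minimize a random linear cost q^T u0 over a polytopic
    approximation {u | A_safe u <= b_safe} of the safe action set:
    the result is a randomized, but always safe, exploratory action."""
    q = rng.standard_normal(A_safe.shape[1])   # random exploration direction
    res = linprog(q, A_ub=A_safe, b_ub=b_safe, bounds=(None, None))
    assert res.success
    return res.x

# Example: safe actions known to lie in the box -1 <= u <= 1.
A_box = np.vstack([np.eye(2), -np.eye(2)])
b_box = np.ones(4)
```

For a generic direction $q$, the LP returns a vertex of the safe set, so the exploration never leaves it by construction.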
Theorem 2: Consider a convergent RL scheme solving Problem (42) where (a) the robust MPC (20) is used as function approximator; (b) the available data is handled as detailed in Section IV; and (c) exploration is performed according to (43). Assume that all new data yields $w_k \in \bar{\mathbb{W}}$. Then, the RL scheme is safe in the sense of Definition 1, i.e.,
$$\mathbb{P}[\exists\, k \ \mathrm{s.t.}\ \xi(s_k, a_k) > 0] \le 1 - \eta(\mathcal{D}).$$
Additionally, the scheme is optimal in the sense of Definition 3.

Proof: We observe that $\eta(\mathcal{D})$ quantifies the probability that a scenario is unaccounted for in the construction of $\hat{\mathcal{S}}^+$. Moreover, we observe that if $\hat{\mathcal{S}}^+$ does not account for all possible scenarios, then $w_k \notin \bar{\mathbb{W}}$ can happen for some $k$. As a result, one can verify that
$$\mathbb{P}\big[\exists\, k \ \mathrm{s.t.}\ w_k \notin \bar{\mathbb{W}}\big] \le 1 - \eta(\mathcal{D}).$$
We note that, by Proposition 1, $w_k \in \bar{\mathbb{W}}$ implies that the robust MPC (20) is recursively feasible. Moreover, exploration is performed using (43), i.e., (20) with a modified cost, such that recursive feasibility is also preserved by Proposition 1. This entails that a constraint violation can only occur if $w_k \notin \bar{\mathbb{W}}$. The converse, however, is not true in general, since there might exist $w_k \notin \bar{\mathbb{W}}$ which does not entail a constraint violation. Consequently, the probability that the constraints are violated at any time is bounded from above by $1 - \eta(\mathcal{D})$:
$$\mathbb{P}[\exists\, k \ \mathrm{s.t.}\ \xi(s_k, a_k) > 0] \le \mathbb{P}\big[\exists\, k \ \mathrm{s.t.}\ w_k \notin \bar{\mathbb{W}}\big] \le 1 - \eta(\mathcal{D}).$$
Data-compression optimality (Definition 3) is a direct consequence of Theorem 1.

Remark 11: We ought to stress that assuming that RL converges to a minimum of $\mathbb{E}[\psi]$ is standard. However, assuming convergence to a local minimum of $J$ can be a strong assumption, which is typically not met by $Q$-learning and SARSA, while actor-critic methods are typically expected to converge to a local minimum of $J$ for the given parametrization.
Optimality in the sense of Definition 2 (set membership optimality) can then be claimed for convergent actor-critic methods, while it is reasonable to expect a certain degree of suboptimality with $Q$-learning.

We provide an overview of the computations performed by RL-MPC in Algorithm 1. Note that we introduced lines 8 and 10 since the RL problem and, consequently, the recomputation of the constraint tightening and terminal set might in principle be updated less often than the feedback sampling time. In that case, a batch RL approach could be used instead of a recursive one, and the real-time requirements would only concern MPC. Additionally, we stress that in our Matlab implementation, which was only partially optimized for efficiency, the constraint-tightening procedure on line 11 required approximately twice the computation time of the MPC solves on lines 2 and 4. The computation of the RL step on line 9 was approximately 5 times faster.

B. Discussion on the Proposed Approach

We discuss next a few open research questions and comment on possible extensions of the proposed framework.

1) Safety: Our safety definition is based on the assumption that all future samples satisfy $w_k \in \bar{\mathbb{W}}$. In practice this assumption, though standard in robust MPC and SMSI, can be rather strong. However, this is a fundamental problem of safety: one can never guarantee a priori that a bounded set contains all possible future realizations of an unknown stochastic process. In Section III-C we have proposed a simple and practical approach to retain MPC feasibility even in the case $w_k \notin \bar{\mathbb{W}}$. However, with this approach safety is potentially lost every time a sample falls outside the convex hull $\bar{\mathbb{W}}$. Safety, however, is typically quickly recovered, as the RL parameter is instantaneously adjusted to account for the new sample.
As also highlighted in Section II and Remark 8, this temporary loss of safety is a fundamental issue for any sample-based approach, unless additional assumptions are introduced. While one could envision deriving some measure of reliability of the identified set in order to provide stronger guarantees, such an investigation is beyond the scope of this paper and will be the subject of future research.

2) Approximation Quality: It has been proven in [11, Theorem 1] that, provided that the MPC parametrization is rich enough, the correct value and action-value functions can be recovered exactly. In the context of this paper, however, the parametrization of the noise set is approximate by construction. Consider partitioning the parameter vector as $\theta = (\theta_c, \theta_\mathbb{W})$, where $\theta_c$ is the parameter vector directly defining the cost and constraint functions, similarly to the parametrization of [11], while $\theta_\mathbb{W}$ is the parameter vector parametrizing the function $g_\theta := g_{\theta_\mathbb{W}}$. Then, provided that the parametrization of the cost and constraint functions through $\theta_c$ is rich enough, the value and action-value functions and the policy can be recovered exactly on the feasible domain of the robust MPC (20). Therefore, as opposed to the result of [11], the equivalence only holds on the subset of the feasible domain of the RL problem which can be described by the chosen parametrization of $g_\theta$.

Since the hypothesis of a perfect parametrization is unrealistic in most relevant applications, $Q$-learning and SARSA will in general not deliver the best performance that can be attained with the selected parametrization, since these algorithms aim at fitting the action-value function rather than directly optimizing performance. It is therefore appealing to resort to policy gradient approaches, which seek the direct minimization of $J$ by manipulating $\theta$.
While in principle the proposed approach is well suited for policy gradient methods (both stochastic and deterministic), the main difficulty that needs to be handled with extra care is the exploration strategy to be deployed, which in general introduces a bias in the gradients, such that convergence to a local optimum is hindered. This topic is investigated in [23].

3) Model Adaptation: In principle, one could let RL update any of the parameters (22) of the MPC scheme (20), including the model parameters $A, B, b$. The model used by MPC plays an important role in (a) the definition of the value and action-value functions and (b) the conservatism of the uncertainty set approximation $\mathbb{W}_\theta$. As observed in Remark 7, letting RL adapt the model parameters results in a large increase in computations. The feedback matrix $K$, however, can be adapted by RL without major issues. Since it has an impact on the size of the uncertainty propagation and, through $c_k$ and $g$, on the cost, letting RL adapt $K$ is expected to further reduce the closed-loop cost by reducing the tightening of the specific constraint components $c_{k,i}$, $g_j$ corresponding to the constraints which are active in the execution of the control task. Note that selecting a $K$ that is optimal for the given task is an open issue, which can be addressed in a data-driven fashion using the approach we propose.

In [11, Theorem 1] and [20, Corollary 2] it is proven that RL can find $Q_\theta = Q$, $V_\theta = V$ and $\pi_\theta = \pi$ even without adjusting the model, provided that the parametrization is descriptive enough. However, the parametrization is typically low-dimensional and cannot be expected to fulfill this requirement. Therefore, the questions of whether adapting the model provides an advantage and of how to best adapt it are still open.
One possibility to improve the approximation quality could be to carry a fixed model parametrization $A_0, B_0, b_0$ to be used for safety, together with one or several (possibly nonlinear) models $f_{i,\theta}(x_i, u_i)$, each associated with its own cost, to construct a more accurate prediction of the future cost distribution in a scenario-tree fashion. The investigation of alternative formulations will be the subject of future research.

4) Stability: The results on robust MPC are stronger than the one provided in Proposition 1: asymptotic stability is proven under the assumptions of $\gamma = 1$ and a positive-definite stage cost, usually called a tracking cost, as opposed to an economic cost. In the case of a tracking cost, the presence of a discount factor in the MPC formulation complicates the stability analysis. While one could foresee formulating (20) with $\gamma = 1$, even though $\gamma < 1$ in the RL problem, it is not immediately clear to what extent this inconsistency impacts the ability to learn the correct policy. While providing approximation guarantees for $Q$-learning is arguably nontrivial, policy gradient methods will at least provide the best policy for the selected parametrization; in this context, that would be the best policy with stability guarantees. By exploiting the results of [35] it should be possible to prove that stabilizing linear control laws can be recovered exactly for linear systems. However, a thorough investigation of this topic is the subject of ongoing research.

The distinction between tracking and economic cost is relevant in relation to the stability properties of the closed-loop system: a thorough discussion on economic MPC can be found in [36], [37], [38]. The linear-quadratic case, which is particularly relevant for our framework, has been analyzed in, e.g., [35], [39], [40], [41], [42].
We only recall here that all the conclusions drawn in [11] apply directly to our setup, i.e., the economic case can be reduced to the tracking case by additionally learning an initial cost.

VI. SIMULATION RESULTS

We test our approach on two examples in simulation. First, we consider a linear system, so as to easily interpret the results. Then we consider a realistic example from the chemical industry: a nonlinear evaporation process.

A. Linear System

We consider a linear system with 2 states, such that we can easily visualize the behavior of RL-MPC. Consider a simple linear system with dynamics and stage cost
$$s^+ = \begin{bmatrix} 1 & 0.1 \\ 0 & 1 \end{bmatrix} s + \begin{bmatrix} 0.05 \\ 0.1 \end{bmatrix} a + w, \qquad \ell(s, a) = \begin{bmatrix} s - s_r \\ a - a_r \end{bmatrix}^\top \mathrm{diag}(1,\, 0.01,\, 0.01) \begin{bmatrix} s - s_r \\ a - a_r \end{bmatrix},$$
where $s = (p, v)$. We formulate a problem with prediction horizon $N = 20$ and introduce the state and control constraints $-1 \le s \le 1$, $-10 \le a \le 10$. The real noise set is a regular octagon. In order to illustrate the ability of RL to adapt the approximation of the noise set, we parametrize $\mathbb{W}_\theta$ as a polytope with 4 facets. We compute the terminal cost matrix $P$ and feedback gain $K$ using the LQR formulation resulting from the nominal model and stage cost. The terminal feedback $K$ is used for constraint tightening, as per Section III.

We simulate the RL-MPC scheme over 200 time steps in a scenario without exploration, in which the reference is
$$p_r(t) = \begin{cases} 1 & 25 \le t \le 120 \\ -1 & \text{otherwise} \end{cases}, \qquad v_r(t) = 0, \qquad a_r(t) = 0.$$
Since the setpoint reference is moving, for simplicity we impose the terminal constraint centered around the origin. While this does not affect recursive feasibility, practical stability is harder to prove in this case. A thorough analysis of this aspect goes beyond the scope of this paper, and we simply recall that such a terminal constraint induces a leaving arc in the MPC optimal control problem.
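For reference, the LQR terminal ingredients of this example can be reproduced as follows (our own reconstruction with `scipy`, not the authors' code):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Nominal model and stage-cost weights of the linear example.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.05],
              [0.1]])
Q = np.diag([1.0, 0.01])    # weights on s = (p, v)
R = np.array([[0.01]])      # weight on a

# Terminal cost P from the discrete-time Riccati equation and the
# associated LQR gain K, used for constraint tightening (Section III).
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```

The nominal closed loop $s^+ = (A - BK)s$ is then asymptotically stable, i.e., $\rho(A - BK) < 1$.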
This leaving-arc situation has been analyzed in [41], [40], [43] in the context of economic MPC, concluding that, under suitable assumptions verified by the system considered here, practical stability is obtained.

We update $\theta = (M, m)$ according to (41)-(42) with $\alpha = 0.1$ and $\psi$ given by (6) ($Q$-learning). Problem (42) is solved to full convergence at each step, such that: (a) globalization techniques guarantee that $\theta^*$ is a better fit than $\theta_k$ for the sample at hand; (b) positive-definiteness of the cost yields a well-posed MPC formulation; and (c) the choice of parameter $\alpha$ is directly related to the horizon of a moving-average approximation of the expected value.

The simulation results are displayed in Figures 2 and 3 in the form of snapshots comparing a representation of the solver at two different time instants. In particular, the constraint, RPI and terminal sets are represented together with the trajectories and constraint tightening. Alongside these quantities, we also display the uncertainty set with the drawn samples, their convex hull and the approximation that RL learned. Initially, RL adapts the rather conservative approximation, thereby enlarging the terminal set by reducing the required constraint tightening. When the setpoint moves, there is initially no gain in modifying the set adaptation, until at $k = 34$ the tightened constraint $p \le 1$ becomes active. At this point, RL starts adapting the set approximation to better capture the shape of the uncertainty set in its top-right part, which is opposite to the bottom-left corner that was better approximated before. Consequently, the terminal set is shifted towards the reference and the constraint $p \le 1$ is tightened less, with an infinite-horizon difference of approximately $0.019$. Since the tightening steady state is quickly reached, this difference is visible in Figure 3. Note that $m$ does in principle not need to be adjusted, as any adjustment of $m$ can be equivalently obtained by suitably rescaling $M$.
We performed the same simulations with θ = M and obtained qualitatively equivalent results, whose detailed presentation we therefore omit. Finally, note that the convex hull of W is a polytope with 28 facets, as opposed to the approximation W_θ used here, which has only 4 facets. We ran an additional simulation in which we let θ = (M, K), such that RL also adapts the feedback matrix K used for constraint tightening. We remark that the problem of designing the terminal feedback and corresponding set X_∞ is nontrivial and many approaches have been developed. However, to the best of the authors' knowledge, none of these approaches explicitly accounts for the specific control task to be executed; rather, they aim at minimizing the constraint tightening or maximizing the size of the RPI set. The simulation results are similar to the previous case. However, RL acts on K to enlarge the RPI and terminal sets while reducing the required constraint tightening, therefore obtaining an increase in closed-loop performance.

Fig. 2: Snapshots at k = 0 and k = 24. Top figure (state space): state constraint set (light blue), state-input constraints using feedback matrix K (green), RPI set (red), terminal constraint set X_f (cyan for k = 24, transparent for k = 0). Predicted trajectory at k = 24: initial state (red dot), predicted trajectory (solid black line), reference s_r (black circle), uncertainty tube (yellow). Bottom figure (noise space): true uncertainty set (transparent octagon), noise samples (black dots), vertices of their convex hull (red dots), uncertainty set approximations W_θ (cyan sets, with k = 0 in the background). A better approximation W_θ (k = 24) enlarges X_f.

Fig. 3: Snapshots at k = 34 and k = 109: same convention as Figure 2, with predicted trajectory at k = 109. Both the RPI and terminal sets are moved closer to the setpoint by a better approximation W_θ for the specific control task (see the bottom plot). Moreover, next to s_r the constraints are tightened more at k = 36 than afterwards.

B. Evaporation Process

Consider the evaporation process modeled in [44], [45] and used in [46], [39] to demonstrate the potential of economic MPC in the nominal case. The model has states x = (X2, P2) (concentration and pressure), controls u = (P100, F200) (pressure and flow), and dynamics given by [46]

M Ẋ2 = F1 X1 − F2 X2,   C Ṗ2 = F4 − F5,   (44)

where

T2 = a P2 + b X2 + c,   T3 = d P2 + e,
λ F4 = Q100 − F1 Cp (T2 − T1),   T100 = f P100 + g,
Q100 = UA1 (T100 − T2),   UA1 = h (F1 + F3),
Q200 = UA2 (T3 − T200) / (1 + UA2 / (2 Cp F200)),
F100 = Q100 / λs,   λ F5 = Q200,   F2 = F1 − F4.

The model parameters are given in [39]. Concentration X1, flow F1, and temperatures T1, T200 are stochastic. In this example, we model them as uniform distributions with X1 = 5 ± 0.5, F1 = 10 ± 0.5, T1 = 40 ± 4, T200 = 25 ± 5. Additionally, the controller must satisfy the bounds (25, 40) ≤ (X2, P2) ≤ (100, 80) on the states and 100 ≤ (P100, F200) ≤ 400 on the controls. The stage cost is given by ℓ(x, u) = 10.09 (F2 + F3) + 600 F100 + 0.6 F200, which entails that the nominal model is optimally operated at the steady state x_s = (25, 49.74), u_s = (191.71, 215.89), with stage cost ℓ(x_s, u_s) = ℓ_s. The system is linearized at x_s, u_s to obtain a linear nominal model. The terminal set is centered at x_c = (29, 53.57), u_c = (223.76, 221.61), which is a steady state for the linearized dynamics. Since the safety constraints are already linear, we introduce the discount factor γ = 0.99 and formulate a robust linear MPC problem of the form (20). To that end, we use the linearization at x_s, u_s and the quadratic cost obtained by applying the tuning procedure proposed in [39].
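As a sanity check, the dynamics (44) and their algebraic relations can be implemented directly. The function name and the parameter values below are assumptions for illustration: the values follow the standard evaporator benchmark (Newell-Lee / Amrit et al.), while the tuning actually used in the paper is given in [39]:

```python
def evaporator_rhs(x, u, w, p):
    """Right-hand side of the evaporation-process dynamics (44).

    x = (X2, P2), u = (P100, F200); w = (X1, F1, T1, T200) collects the
    stochastic inputs.  p is a dict of model parameters.
    """
    X2, P2 = x
    P100, F200 = u
    X1, F1, T1, T200 = w
    # Algebraic relations
    T2 = p['a'] * P2 + p['b'] * X2 + p['c']
    T3 = p['d'] * P2 + p['e']
    T100 = p['f'] * P100 + p['g']
    UA1 = p['h'] * (F1 + p['F3'])
    Q100 = UA1 * (T100 - T2)
    F4 = (Q100 - F1 * p['Cp'] * (T2 - T1)) / p['lam']
    Q200 = p['UA2'] * (T3 - T200) / (1.0 + p['UA2'] / (2.0 * p['Cp'] * F200))
    F5 = Q200 / p['lam']
    F2 = F1 - F4
    # Differential equations (44)
    dX2 = (F1 * X1 - F2 * X2) / p['M']
    dP2 = (F4 - F5) / p['C']
    return dX2, dP2

# Standard evaporator benchmark values (assumed; see [39] for the exact tuning).
DEFAULT_P = dict(a=0.5616, b=0.3126, c=48.43, d=0.507, e=55.0,
                 f=0.1538, g=90.0, h=0.16, F3=50.0, Cp=0.07,
                 lam=38.5, UA2=6.84, M=20.0, C=4.0)
```

With these values, evaluating the right-hand side at the nominal disturbances and the steady state x_s = (25, 49.74), u_s = (191.71, 215.89) yields derivatives close to zero, consistent with the optimal operation claim above.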
With the given stage cost and linear model, we obtain the LQR feedback K, which we use to stabilize the model error in computing the constraint tightening (25) and the terminal set (27). We then use RL to adjust the parameter θ = (h, p, M, K), i.e., the cost gradient, the uncertainty set, and the feedback matrix K. Since we formulate the robust MPC problem (20) using a positive-definite cost, while the RL cost ℓ is economic, we also learn a quadratic initial cost, as briefly discussed in Section V-B and fully justified in [11]. We let the Q-learning algorithm run for 10^4 samples with α = 10^−2. We note that the parameters are no longer adjusted towards the end of the learning procedure, indicating convergence of the algorithm. The RPI and terminal sets, as well as the uncertainty set approximation W_θ, are displayed in Figure 4 at the beginning and at the end of the learning process. One can note that W_θ becomes larger in order to better approximate the top-left part of the uncertainty set. In combination with the adjustment of the feedback matrix K, this allows the terminal set to be shifted towards the reference, as shown in the top plot.

Fig. 4: Evaporation process. Top two figures: RPI and terminal set at the beginning and end of the learning process, same color convention as Figure 2. Bottom figure: the uncertainty set approximation W_θ.

VII. CONCLUSIONS AND FUTURE WORK

In this paper we have presented an RL algorithm which is guaranteed to be safe, in the sense of strictly satisfying a set of prescribed constraints given the available data. We have discussed both an innovative function approximation based on robust MPC and an efficient management of data which makes it possible to deal with very large amounts of data in real time.
While linear MPC can be successfully applied also to nonlinear systems, as demonstrated with the second example, in the case of strong nonlinearities linear MPC might fail at providing satisfactory performance and even safety. The proposed framework paves the road for several extensions: (a) one can easily foresee the use of robust MPC based on scenario trees (which is also suitable for nonlinear models); (b) the use of stochastic or deterministic policy gradient is expected to further improve performance, as discussed in Section II; (c) a formulation using the computational geometry approach for robustness and scenario trees for a refined cost approximation could be envisioned. These research directions are the subject of ongoing research.

REFERENCES AND NOTES

[1] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484–503, 2016.
[3] S. Wang, W. Chaovalitwongse, and R. Babuska, "Machine learning algorithms in bipedal robot control," IEEE Trans. Syst., Man, Cybern. C, vol. 42, no. 5, pp. 728–743, Sep. 2012.
[4] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, "An application of reinforcement learning to aerobatic helicopter flight," in Advances in Neural Information Processing Systems 19. MIT Press, 2007.
[5] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Proceedings of the 12th International Conference on Neural Information Processing Systems, ser. NIPS'99.
Cambridge, MA, USA: MIT Press, 1999, pp. 1057–1063.
[6] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proceedings of the 31st International Conference on Machine Learning, ser. ICML '14, 2014, pp. I-387–I-395.
[7] C. Watkins, "Learning from delayed rewards," Ph.D. dissertation, King's College, Cambridge, 1989.
[8] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.
[9] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, "Safe exploration in continuous action spaces," 2018. [Online]. Available: https://arxiv.org/abs/1801.08757
[10] T. Pham, G. De Magistris, and R. Tachibana, "OptLayer - practical constrained optimization for deep reinforcement learning in the real world," in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 6236–6243.
[11] S. Gros and M. Zanon, "Data-driven economic NMPC using reinforcement learning," IEEE Transactions on Automatic Control, 2018 (in press).
[12] S. Gros, M. Zanon, and A. Bemporad, "Safe reinforcement learning via projection on a safe set: How to achieve optimality?" in 21st IFAC World Congress, 2020 (submitted).
[13] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, "Learning-based model predictive control for safe exploration and reinforcement learning," 2018, published on arXiv.
[14] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin, "Provably safe and robust learning-based model predictive control," Automatica, vol. 49, no. 5, pp. 1216–1226, 2013.
[15] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot, "Robust constrained learning-based NMPC enabling reliable mobile robot path tracking," The International Journal of Robotics Research, vol. 35, no. 13, pp. 1547–1563, 2016.
[16] F. Berkenkamp, M. Turchetta, A. Schoellig, and A.
Krause, "Safe model-based reinforcement learning with stability guarantees," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 908–918.
[17] R. Murray and M. Palladino, "A model for system uncertainty in reinforcement learning," Systems & Control Letters, vol. 122, pp. 24–31, 2018.
[18] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32–50, 2009.
[19] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, "Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers," IEEE Control Systems, vol. 32, no. 6, pp. 76–105, 2012.
[20] M. Zanon, S. Gros, and A. Bemporad, "Practical reinforcement learning of stabilizing economic MPC," in Proceedings of the European Control Conference, 2019 (accepted).
[21] B. Amos, I. D. J. Rodriguez, J. Sacks, B. Boots, and J. Z. Kolter, "Differentiable MPC for end-to-end planning and control," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS'18. USA: Curran Associates Inc., 2018, pp. 8299–8310.
[22] S. Dean, S. Tu, N. Matni, and B. Recht, "Safely learning to control the constrained linear quadratic regulator," in 2019 American Control Conference (ACC), July 2019, pp. 5582–5588.
[23] S. Gros and M. Zanon, "Safe reinforcement learning based on robust MPC and policy gradient methods," IEEE Transactions on Automatic Control, 2019 (submitted).
[24] J. B. Rawlings and D. Q. Mayne, Model Predictive Control: Theory and Design. Nob Hill, 2012, ch. Postface to model predictive control: theory and design.
[25] L. Chisci, J. Rossiter, and G.
Zappa, "Systems with persistent disturbances: predictive control with restricted constraints," Automatica, vol. 37, pp. 1019–1028, 2001.
[26] D. Q. Mayne, "Model predictive control: Recent developments and future promise," Automatica, vol. 50, no. 12, pp. 2967–2986, 2014.
[27] D. Mayne, "Robust and stochastic MPC: Are we going in the right direction?" IFAC-PapersOnLine, vol. 48, no. 23, pp. 1–8, 2015, 5th IFAC Conference on Nonlinear Model Predictive Control NMPC 2015.
[28] I. Kolmanovsky and E. Gilbert, "Theory and computation of disturbance invariant sets for discrete-time linear systems," Math. Probl. Eng., vol. 4, no. 4, pp. 317–367, 1998.
[29] C. Büskens and H. Maurer, Online Optimization of Large Scale Systems. Berlin, Heidelberg: Springer, 2001, ch. Sensitivity Analysis and Real-Time Optimization of Parametric Nonlinear Programming Problems, pp. 3–16.
[30] J. Nocedal and S. Wright, Numerical Optimization, 2nd ed., ser. Springer Series in Operations Research and Financial Engineering. Springer, 2006.
[31] P. Scokaert and J. Rawlings, "Feasibility issues in linear model predictive control," AIChE Journal, vol. 45, no. 8, pp. 1649–1659, 1999.
[32] R. Fletcher, Practical Methods of Optimization, 2nd ed. Chichester: Wiley, 1987.
[33] H. Bock, "Recent advances in parameter identification techniques for ODE," in Numerical Treatment of Inverse Problems in Differential and Integral Equations, P. Deuflhard and E. Hairer, Eds. Boston: Birkhäuser, 1983, pp. 95–121.
[34] D. Bertsekas and I. Rhodes, "Recursive state estimation for a set-membership description of uncertainty," IEEE Transactions on Automatic Control, vol. 16, pp. 117–128, 1971.
[35] M. Zanon, S. Gros, and M. Diehl, "Indefinite linear MPC and approximated economic MPC for nonlinear systems," Journal of Process Control, vol. 24, pp. 1273–1281, 2014.
[36] M. Diehl, R. Amrit, and J.
Rawlings, "A Lyapunov function for economic optimizing model predictive control," IEEE Transactions on Automatic Control, vol. 56, no. 3, pp. 703–707, March 2011.
[37] R. Amrit, J. Rawlings, and D. Angeli, "Economic optimization using model predictive control with a terminal cost," Annual Reviews in Control, vol. 35, pp. 178–186, 2011.
[38] M. A. Müller, D. Angeli, and F. Allgöwer, "On necessity and robustness of dissipativity in economic model predictive control," IEEE Transactions on Automatic Control, vol. 60, no. 6, pp. 1671–1676, 2015.
[39] M. Zanon, S. Gros, and M. Diehl, "A tracking MPC formulation that is locally equivalent to economic MPC," Journal of Process Control, 2016.
[40] M. Zanon and T. Faulwasser, "Economic MPC without terminal constraints: Gradient-correcting end penalties enforce asymptotic stability," Journal of Process Control, vol. 63, pp. 1–14, 2018.
[41] T. Faulwasser and M. Zanon, "Asymptotic stability of economic NMPC: The importance of adjoints," in Proceedings of the IFAC Nonlinear Model Predictive Control Conference, 2018.
[42] L. Grüne and R. Guglielmi, "Turnpike properties and strict dissipativity for discrete time linear quadratic optimal control problems," SIAM Journal on Control and Optimization, vol. 56, no. 2, pp. 1282–1302, 2018.
[43] L. Grüne, "Economic receding horizon control without terminal constraints," Automatica, vol. 49, pp. 725–734, 2013.
[44] F. Y. Wang and I. T. Cameron, "Control studies on a model evaporation process - constrained state driving with conventional and higher relative degree systems," Journal of Process Control, vol. 4, pp. 59–75, 1994.
[45] C. Sonntag, O. Stursberg, and S. Engell, "Dynamic optimization of an industrial evaporator using graph search with embedded nonlinear programming," in Proc. 2nd IFAC Conf. on Analysis and Design of Hybrid Systems (ADHS), 2006, pp. 211–216.
[46] R. Amrit, J. B. Rawlings, and L. T.
Biegler, "Optimizing process economics online using model predictive control," Computers & Chemical Engineering, vol. 58, pp. 334–343, 2013.

Sébastien Gros received his Ph.D. degree from EPFL, Switzerland, in 2007. After a journey by bicycle from Switzerland to the Everest base camp in full autonomy, he joined an R&D group hosted at Strathclyde University focusing on wind turbine control. In 2011, he joined KU Leuven, where his main research focus was on optimal control and fast MPC for complex mechanical systems. He joined the Department of Signals and Systems at Chalmers University of Technology, Göteborg, in 2013, where he became associate professor in 2017. He is now full professor at NTNU, Norway, and guest professor at Chalmers. His main research interests include numerical methods, real-time optimal control, reinforcement learning, and the optimal control of energy-related applications.

Mario Zanon received the Master's degree in Mechatronics from the University of Trento, and the Diplôme d'Ingénieur from the Ecole Centrale Paris, in 2010. After research stays at KU Leuven, the University of Bayreuth, Chalmers University, and the University of Freiburg, he received the Ph.D. degree in Electrical Engineering from KU Leuven in November 2015. He held a Post-Doc researcher position at Chalmers University until the end of 2017 and is now Assistant Professor at the IMT School for Advanced Studies Lucca. His research interests include numerical methods for optimization, economic MPC, optimal control and estimation of nonlinear dynamic systems, in particular for aerospace and automotive applications.