Towards Safe Reinforcement Learning Using NMPC and Policy Gradients: Part II - Deterministic Case
Authors: Sebastien Gros, Mario Zanon
Abstract—In this paper, we present a methodology to deploy the deterministic policy gradient method, using actor-critic techniques, when the optimal policy is approximated using a parametric optimization problem, where safety is enforced via hard constraints. For continuous input spaces, imposing safety restrictions on the exploration needed to deploy the deterministic policy gradient method poses some technical difficulties, which we address here. We will in particular investigate policy approximations based on robust Nonlinear Model Predictive Control (NMPC), where safety can be treated explicitly. For the sake of brevity, we will detail the construction of the safe scheme in the robust linear MPC context only. The extension to the nonlinear case is possible but more complex. We will additionally present a technique to maintain the system safety throughout the learning process in the context of robust linear MPC. This paper has a companion paper treating the stochastic policy gradient case.

Index Terms—Safe Reinforcement Learning, robust Model Predictive Control, stochastic policy gradient, interior-point method.

I. INTRODUCTION

Reinforcement Learning (RL) is a powerful tool for tackling Markov Decision Processes (MDP) without depending on a detailed model of the probability distributions underlying the state transitions. Indeed, most RL methods rely purely on observed state transitions and realizations of the stage cost L(s, a) ∈ ℝ assigning a performance to each state-input pair s, a (the inputs are often labelled actions in the RL community). RL methods seek to increase the closed-loop performance of the control policy deployed on the MDP as observations are collected.
RL has drawn increasingly large attention thanks to its accomplishments, such as, e.g., making it possible for robots to learn to walk or fly without supervision [19], [1]. Most RL methods are based on learning the optimal control policy for the real system either directly or indirectly. Indirect methods typically rely on learning a good approximation of the optimal action-value function underlying the MDP. The optimal policy is then indirectly obtained as the minimizer of the value-function approximation over the inputs a. Direct RL methods, based on policy gradients, seek to adjust the parameters θ of a given policy π_θ such that it yields the best closed-loop performance when deployed on the real system. An attractive advantage of direct RL methods over indirect ones is that they are based on formal necessary conditions of optimality for the closed-loop performance of π_θ, and therefore asymptotically (for a large enough data set) guarantee the (possibly local) optimality of the parameters θ [18], [17]. (Sébastien Gros is with the Department of Cybernetics, NTNU, Norway. Mario Zanon is with the IMT School for Advanced Studies Lucca, Lucca 55100, Italy.)

RL methods often rely on Deep Neural Networks (DNN) to carry the policy approximation π_θ. While effective in practice, control policies based on DNNs provide limited opportunities for formal verification of the resulting closed-loop behavior, and for imposing hard constraints on the evolution of the state of the real system. The development of safe RL methods, which aims at tackling this issue, is currently an open field of research [12]. In this paper, we investigate the use of constrained parametric optimization problems to carry the policy approximation. The aim is to impose safety by means of hard constraints in the optimization problem.
Most RL methods require exploration, i.e., the inputs applied to the real system must differ from the policy π_θ in order to identify changes in the policy parameters θ that can yield a higher closed-loop performance. Exploration is typically performed via stochastic disturbances of the policy. We will show in this paper that the presence of hard constraints distorts the statistics of the exploration, and that some corrections must in theory be introduced in the classic tools underlying the deterministic policy gradient method to account for this distortion. We propose computationally efficient tools to implement these corrections, based on parametric Nonlinear Programming techniques and interior-point methods.

Robust Nonlinear Model Predictive Control (NMPC) is arguably an ideal candidate for forming the constrained optimization problem supporting the policy approximation. Robust NMPC techniques provide safety guarantees on the closed-loop behavior of the system by explicitly accounting for the presence of (possibly stochastic) disturbances and model inaccuracies. A rich theoretical framework is available on the topic [14]. The policy parameters θ will then appear as parameters in the NMPC model(s), cost function and constraints. Updates of the policy parameters θ will then be driven by the deterministic policy gradient method to increase the NMPC closed-loop performance, and constrained by the requirement that the NMPC model inaccuracies must be adequately accounted for in forming the robust NMPC scheme. For the sake of brevity and simplicity, we will detail these questions in the specific linear MPC case. The extension to the nonlinear case is arguably possible, but more complex.

This paper has a companion paper [10] treating the same problem in the context of the stochastic policy gradient approach. The two papers share the same background material and some similar techniques.
However, the theory allowing the deployment of the two policy gradient techniques is intrinsically different.

The paper is structured as follows. Section II provides some background material. Section III details the safe policy approximation we propose to use. Section IV establishes the basic properties that a safe exploration must fulfil in order to be able to build a correct policy gradient estimation with standard RL tools. Section V presents an optimization-based approach to generate an exploration satisfying these properties. Section VI discusses a technique to enforce safety in the RL-based learning process in the context of robust MPC. Section VII proposes a simulation example using the principles developed in this paper.

II. BACKGROUND ON MARKOV DECISION PROCESSES

This section provides background material on Markov Decision Processes (MDP), and on their restriction to a safe set. We also provide a brief introduction to the deterministic policy gradient method.

A. Markov Decision Processes

In the following, we will consider that the dynamics of the real system are described as a Markov Process (MP), with state transitions having the underlying conditional probability density:

P[s⁺ | s, a]  (1)

denoting the probability density of the state transition s, a → s⁺. We will furthermore consider deterministic policies:

a = π(s)  (2)

associating an input (a.k.a. action) a ∈ ℝ^{n_a} to any feasible state s ∈ ℝ^{n_s}. In the following, it will additionally be useful to introduce the concept of a stochastic policy:

π[a | s] : ℝ^{n_a} × ℝ^{n_s} → ℝ₊  (3)

denoting the probability density of selecting a given input a for a given state s. It is useful to observe that any deterministic policy (2) can be defined as a stochastic policy using:

π[a | s] = δ(a − π(s))  (4)

where δ is the Dirac delta function. All the definitions below then readily apply to both (3) and (2) by using (4).
Let us consider the distribution of the MP resulting from the state transition (1) and policy (3):

P[s_k | π] = ∫ ∏_{i=0}^{k−1} P[s_{i+1} | s_i, a_i] π[a_i | s_i] P[s_0] ds_{0,…,k−1} da_{0,…,k−1}  (5)

where P[s_0] denotes the probability distribution of the initial conditions s_0 of the MP. We can then define the discounted expected value of the MP distribution under policy π, labelled E_π[·], which reads as:

E_π[ζ] := Σ_{k=0}^{∞} ∫ γ^k ζ(s_k, a_k) P[s_k | π] π[a_k | s_k] ds_k da_k  (6)

for any function ζ. In the following we will assume the local stability of the MP under the selected policies. More specifically, we assume that π is such that:

lim_{π̃→π} E_{π̃}[ζ] = E_π[ζ],  (7)

for any function ζ such that both sides of the equality are finite. Assumption (7) underlies standard RL algorithms, though it is often left implicit, and allows us to draw equivalences between a policy and disturbances of that policy, which is required in the context of policy gradient methods. It can be construed as a local regularity assumption on E_π[·] that, e.g., holds if the system dynamics in closed loop with policy π are stable. For a given stage cost function L(s, a) and a discount factor γ ∈ [0, 1], the performance of a policy π is given by the discounted cost:

J(π) = E_π[L]  (8)

The state transition (1), stage cost L and discount factor γ define a Markov Decision Process, with an underlying optimal policy given by:

π⋆ = arg min_π J(π)  (9)

It is useful to underline here that, while (9) may have several (global) solutions, any fully observable MDP admits a deterministic policy π⋆ among its solutions.
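As a concrete illustration, the discounted cost (8) can be estimated by Monte Carlo rollouts of the Markov Process under a fixed deterministic policy. The sketch below assumes a hypothetical scalar system, policy and stage cost (none of which appear in the paper), purely to make the definitions executable:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95

def step(s, a):
    # hypothetical scalar Markov Process: s+ = 0.9 s + a + noise
    return 0.9 * s + a + 0.1 * rng.standard_normal()

def policy(s):
    # hypothetical deterministic policy (2)
    return -0.5 * s

def stage_cost(s, a):
    # hypothetical stage cost L(s, a)
    return s**2 + a**2

def estimate_J(n_rollouts=200, horizon=200):
    # Monte Carlo estimate of the discounted cost (8),
    # J(pi) = E_pi[ sum_k gamma^k L(s_k, a_k) ], with truncated horizon
    total = 0.0
    for _ in range(n_rollouts):
        s = rng.standard_normal()  # draw s0 from P[s0]
        ret = 0.0
        for k in range(horizon):
            a = policy(s)
            ret += gamma**k * stage_cost(s, a)
            s = step(s, a)
        total += ret
    return total / n_rollouts

print(estimate_J())
```

The truncation at a finite horizon introduces a bias of order γ^N, which is negligible here since the closed loop is stable and γ < 1.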
The (scalar) value function and action-value function associated to a given policy π are given by [4], [6], [3]:

Q_π(s, a) = L(s, a) + γ E[V_π(s⁺) | s, a],  (10a)
V_π(s) = E_{a∼π[·|s]}[Q_π(s, a)],  (10b)

where the expected value in (10a) is taken over the state transitions (1). The advantage function is then defined as:

A_π(s, a) = Q_π(s, a) − V_π(s)  (11)

and provides the value of using input a in a given state s compared to using the policy π. Furthermore,

A_{π⋆}(s, a) ≥ 0, ∀ s, a  (12)

holds at the deterministic optimal policy π⋆.

B. Policy approximation and deterministic policy gradient

In most cases, the optimal policy π⋆ cannot be computed. It is then useful to consider approximations π_θ of the optimal policy, carried by a (possibly large) set of parameters θ. The optimal parameters θ⋆ are then given by:

θ⋆ = arg min_θ J(π_θ)  (13)

The gradient associated to the minimization problem (13) is referred to as the deterministic policy gradient and is given by [17]:

∇_θ J(π_θ) = E_{π_θ}[∇_θ π_θ ∇_a A_{π_θ}],  (14)

where ∇_a A_{π_θ} is the gradient of the advantage function (11). Reinforcement Learning algorithms based on the deterministic policy gradient form estimations of (14) using observed state transitions. The gradient of the advantage function ∇_a A_{π_θ} in (14) is also estimated from the data. One can observe that for any deterministic policy π_θ, the advantage function satisfies

A_{π_θ}(s, π_θ(s)) = 0, ∀ s,  (15)

hence in order to build estimations of the gradient ∇_a A_{π_θ}, one needs to select inputs a that depart from the deterministic policy π_θ, so as to be able to observe variations of A_{π_θ}, see (12), and estimate its gradient. Selecting inputs a ≠ π_θ(s) in order to build the gradient ∇_a A_{π_θ} is referred to as exploration.

C.
Safe set

In the following, we will assume the existence of a (possibly) state-dependent safe set labelled S(s) ⊆ ℝ^{n_a}, a subset of the input space. See [20], [10] for similar discussions. The notion of safe set will be used here in the sense that any input selected such that a ∈ S(s) yields safe trajectories with unitary probability. The construction of the safe set is not the object of this paper. However, we can nonetheless provide some pointers to how such a set is constructed in practice. Let us consider the constraints h_s(s, a) ≤ 0 describing the subset of the state-input space deemed feasible and safe. Constraints h_s can include pure state constraints, describing the safe states, pure input constraints, typically describing actuator limitations, and mixed constraints, where the states and inputs are coupled. For the sake of simplicity, we will assume in the following that h_s is convex.

A common approach to build practical inner approximations of the safe set S(s) is to verify the safety of an input a explicitly over a finite horizon via predictive control techniques. This verification is based on forming the support of the Markov Process distribution over time, starting from a given state-input pair s, a. Consider the set X⁺(s, a), support of the state transition (1):

X⁺(s, a) = { s⁺ | P[s⁺ | s, a] > 0 }  (16)

Labelling X_k(s, a, π_s) the support of the state of the Markov Process at time k, starting from s, a and evolving under policy π_s, the set X_k is given by the recursion:

X_k(s, a, π_s) = X⁺(X_{k−1}, π_s(X_{k−1})),  (17)

with the boundary condition X_1 = X⁺(s, a). An input a is in the safe set S(s) if h_s(s, a) ≤ 0 and if there exists a policy π_s such that

h_s(s_k, π_s(s_k)) ≤ 0, ∀ s_k ∈ X_k(s, a, π_s),  (18)

for all k ≥ 1.
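The recursion (17)-(18) can be sketched numerically for a scalar linear system with bounded additive noise, where the supports X_k are intervals that can be propagated exactly. The system data, ancillary policy gain and constraint level below are illustrative assumptions, not taken from the paper:

```python
# hypothetical scalar linear system s+ = A s + B a + w, |w| <= w_bar,
# with ancillary policy pi_s(s) = -K s and state constraint |s| <= s_max
A, B, K, w_bar, s_max = 0.9, 1.0, 0.5, 0.05, 1.0

def safe_over_horizon(s, a, N=50):
    # Check (18) over N steps: propagate an interval outer approximation
    # of the supports X_k(s, a, pi_s) via the recursion (17), and verify
    # the safety constraint at every step.
    lo = hi = A * s + B * a          # nominal successor before the noise
    lo, hi = lo - w_bar, hi + w_bar  # X_1 = X+(s, a)
    for _ in range(N):
        if max(abs(lo), abs(hi)) > s_max:
            return False
        # the closed-loop map (A - B K) is linear, so it maps intervals
        # to intervals; the noise set is then added back
        lo1, hi1 = sorted(((A - B * K) * lo, (A - B * K) * hi))
        lo, hi = lo1 - w_bar, hi1 + w_bar
    return True

print(safe_over_horizon(0.5, 0.0))  # moderate input: supports stay safe
print(safe_over_horizon(0.5, 5.0))  # large input leaves the safe set
```

As noted in the text, this checks membership in S(s) for one a-priori-fixed π_s rather than searching over all policies π_s, which is the common practical compromise.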
This verification is typically performed in practice via scenario trees, tube-based approaches, or direct approximations of the sets X_k via, e.g., ellipsoids or polytopes [14]. In that context, the policy π_s should ideally be identical to π_θ. However, for computational reasons, it is typically selected a priori to stabilize the system dynamics, and possibly optimized to minimize the size of the sets X_k. Due to the safety requirement, both the policy π_θ and the exploration performed by the RL algorithm will have to respect a ∈ S(s), and can therefore not be chosen freely.

III. OPTIMIZATION-BASED SAFE POLICY

In this paper, we will consider parametrized deterministic policies π_θ based on parametric Nonlinear Programs (NLPs), and more specifically based on robust NMPC schemes. This approach is formally justified in [9]. More specifically, we will consider a policy approximation

π_θ = u⋆_0(s, θ),  (19)

where u⋆_0(s, θ) is the first n_a entries of u⋆(s, θ) generated by the parametric NLP:

u⋆(s, θ) = arg min_u  Φ(x, u, θ)  (20a)
           s.t.  f(x, u, s, θ) = 0,  (20b)
                 h(x, u, θ) ≤ 0.  (20c)

We will then consider that the safety requirement π_θ(s) ∈ S(s) is imposed via the constraints (20b)-(20c). A special case of (20) is an optimization scheme of the form:

u⋆_0(s, θ) = arg min_{u_0}  Φ(s, u_0, θ)  (21a)
             s.t.  h(s, u_0, θ) ≤ 0,  (21b)

where h ≤ 0 ought to ensure that π_θ(s) = u⋆_0(s, θ) ∈ S(s). While most of the discussions in this paper will take place around the general formulation (20), a natural approach to formulate the constraints (20b)-(20c) such that policy (19) is safe is to build (20) using robust (N)MPC techniques.

A. Policy approximation based on robust NMPC

The imposition of safety constraints can be treated via robust NMPC approaches. Robust NMPC can take different forms [14], all of which can eventually be cast in the form (20).
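A minimal instance of the one-stage scheme (21) is a 1-D quadratic cost with a box safety constraint, for which the parametric minimizer is available in closed form. The cost, constraint and parametrization below are hypothetical, chosen only to show a policy defined implicitly by an optimization problem:

```python
def policy(s, theta, u_max=1.0):
    # hypothetical instance of (21):
    #   minimize   0.5 * (u - theta * s)^2
    #   subject to |u| <= u_max        (safety constraint h <= 0)
    # For this 1-D QP the minimizer is the unconstrained optimum
    # clipped onto the feasible interval.
    u_unc = theta * s
    return max(-u_max, min(u_max, u_unc))

print(policy(0.3, 2.0))   # constraint inactive: interior solution
print(policy(1.0, 2.0))   # constraint active: solution on the boundary
```

Note that the policy is globally Lipschitz but non-smooth exactly where the constraint switches between active and inactive, which is the non-smoothness the interior-point relaxation of Section V is designed to remove.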
One form of robust NMPC scheme is based on scenario trees [16], which take the form:

u⋆(s, θ) = arg min_u  Σ_{j=1}^{N_M} ( V_j(x_{j,N}, θ) + Σ_{k=0}^{N−1} ℓ_j(x_{j,k}, u_{j,k}, θ) )  (22a)
           s.t.  x_{j,k+1} = F_j(x_{j,k}, u_{j,k}, θ),  x_{j,0} = s,  (22b)
                 h_s(x_{j,k}, u_{j,k}, θ) ≤ 0,  (22c)
                 e(x_{j,N}, θ) ≤ 0,  (22d)
                 N(u) = 0,  (22e)

where F_{1,…,N_M} are the N_M different models used to support the uncertainty, while F_0 is a nominal model supporting the NMPC scheme. The trajectories x_{j,k} and u_{j,k} for j = 1, …, N_M are the different model trajectories and the associated inputs. Functions ℓ_{1,…,N_M}, V_{1,…,N_M} are the (possibly different) stage costs and terminal costs applying to the different models. The non-anticipativity constraints (22e) support the scenario-tree structure. For a given state s and parameters θ, the NMPC scheme (22) delivers the input profiles

u⋆_j(s, θ) = u⋆_{j,0}(s, θ), …, u⋆_{j,N}(s, θ),  (23)

with u⋆_{j,i}(s, θ) ∈ ℝ^{n_a}, and (22e) imposes

u⋆_0(s, θ) := u⋆_{i,0}(s, θ) = u⋆_{j,0}(s, θ), ∀ i, j.  (24)

As a result, the NMPC scheme (22) generates a parametrized deterministic policy according to:

π_θ(s) = u⋆_0(s, θ) ∈ ℝ^{n_a}.  (25)

Policy π_s is implicitly deployed in (22) via the scenario tree. If the dispersion set X⁺ is known, the multiple models F_{1,…,N_M} and terminal constraints (22d) can be chosen such that the robust NMPC scheme (22) delivers π_θ(s) ∈ S(s). Unfortunately, this selection can be difficult in general. We turn next to the robust linear MPC case, where this construction is much simpler.

B. Safe robust linear MPC

Exhaustively discussing the construction of the safe scenario tree in (22) for a given dispersion set X⁺(s, a) is beyond the scope of this paper. The process can be fairly involved, and we refer to [16], [2] for detailed discussions.
For the sake of brevity, we will focus on the linear MPC case, whereby the MPC models F_{1,…,N_M} and the policy π_s are linear. Let us consider the following outer approximation of the dispersion set X⁺:

X⁺(s, a) ⊆ F_0(s, a, θ) + W, ∀ s, a  (26)

where we use a linear nominal model F_0 and a polytope W with vertices W_{1,…,N_M} that can be construed as the extrema of a finite-support process noise, and which can be part of (or functions of) the MPC parameters θ. For the sake of simplicity, we assume that W is independent of the state-input pair s, a. The models F_{1,…,N_M} can then be built using:

F_i = F_0 + W_i,  i = 1, …, N_M  (27)

and using the linear policy:

π_s(x_{j,k}, u_{0,k}, x_{0,k}) = u_{0,k} − K(x_{j,k} − x_{0,k})  (28)

where the matrix K can be part of (or a function of) the MPC parameters θ. One can then verify by simple induction that:

X_k(s, a, π_s) ⊆ Conv(x_{1,k}, …, x_{N_M,k}),  (29)

for k = 0, …, N + 1, where Conv is the convex hull of the set of points x_{1,k}, …, x_{N_M,k} solution of the MPC scheme (22). The terminal constraints (22d) ought then to be constructed, e.g., via the Robust Positive Invariant set corresponding to π_s, in order to establish safety beyond the MPC horizon. For h_s convex, the MPC scheme (22) delivers safe inputs [14], [13].

When the dispersion set X⁺(s, a) can only be inferred from data, condition (26) arguably translates to [5]:

s_{k+1} − F_0(s_k, a_k, θ) ∈ W, ∀ (s_{k+1}, s_k, a_k) ∈ D,  (30)

where D is the set of N_D observed state transitions. Condition (30) translates into a sample-based condition on the admissible parameters θ, i.e., it specifies the parameters that are safe with respect to the state transitions observed so far. Condition (30) tests whether the points s_{k+1} − F_0(s_k, a_k, θ) are in the polytope W, which can be easily translated into a set of algebraic constraints imposed on θ.
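For a polytope W given in halfspace form, W = {w : Gw ≤ g}, the sample-based test (30) reduces to a set of linear inequalities per data point. A sketch, assuming a hypothetical 2-D linear nominal model F_0(s, a, θ) = As + Ba, with θ collecting the entries of (A, B), and a box-shaped W:

```python
import numpy as np

# hypothetical true dynamics used only to generate synthetic data
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([0.1, 0.2])

# halfspace description of the noise polytope W = { w : G w <= g },
# here an assumed box |w_i| <= 0.1
G = np.vstack([np.eye(2), -np.eye(2)])
g_vec = 0.1 * np.ones(4)

def theta_consistent(theta, data):
    # Condition (30): every observed residual s_plus - F0(s, a, theta)
    # must lie in W, with the nominal model F0(s, a, theta) = A s + B a
    A, B = theta
    return all(
        np.all(G @ (s_plus - (A @ s + B * a)) <= g_vec + 1e-12)
        for s, a, s_plus in data
    )

# a few noiseless transitions generated by the true model
states = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
          np.array([1.0, 1.0]), np.array([-1.0, 0.5])]
data = [(s, 0.5, A_true @ s + B_true * 0.5) for s in states]

print(theta_consistent((A_true, B_true), data))           # consistent
print(theta_consistent((np.zeros((2, 2)), B_true), data))  # inconsistent
```

Each data point thus contributes n_G linear-in-(A, B) inequalities, which is the "set of algebraic constraints imposed on θ" mentioned above.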
This observation will be used in Section VI to build a safe RL-based learning process. We ought to underline here that building F_0, W based on (30) ensures the safety of the robust MPC scheme (22) only for an infinitely large, and sufficiently informative, data set D. In practice, using a finite data set entails that safety is ensured with a probability less than 1. The quantification of the probability of having a safe policy for a given, finite data set D is beyond the scope of this paper, and is arguably best treated by means of Information Field Theory [8]. The extension of the construction of a safe MPC presented in this section to the general NMPC case is theoretically feasible, but can be computationally intensive in practice. This aspect of the problem is beyond the scope of this paper.

IV. SAFE EXPLORATION

In this section, we investigate the deployment of the deterministic policy gradient method [17] when the input space is continuous and restricted by some safety constraints. We will show that in that case the classic tools used in the deterministic policy gradient method need some corrections. In order to estimate the gradient of the advantage function ∇_a A_{π_θ}, the inputs a applied to the real system must differ from the actual policy π_θ(s), such that the advantage function Â^w_{π_θ}(s, a) is not trivially zero on the system trajectories, see (15). The exploration

e := a − π_θ(s),  (31)

is typically generated by selecting the inputs a using a stochastic policy π^σ_θ[a | s], where σ relates to its covariance, having most of its mass in a neighborhood of π_θ(s). In the following, we will need the conditional mean and covariance of e:

η_e(s) = E[e | s],  (32a)
Σ_e(s) = E[(e − η_e)(e − η_e)⊤ | s].
(32b)

The restriction of the exploration e to yield inputs a in the safe set S(s) can cause the exploration e to not be centred, i.e., η_e = 0 may not hold, and the covariance Σ_e(s) can be restricted by the safe set. This observation is illustrated in Fig. 1, where the trivial static problem:

π_θ = arg min_u  ½ ‖u − (θ_1, θ_2)⊤‖²  (33a)
      s.t.  ‖u‖₂ ≤ θ_3  (33b)

was used, and the exploration was generated via (46)-(47) detailed below. The fact that η_e and Σ_e cannot be fully chosen in the presence of constraints ought to be accounted for when forming estimations of the gradient of the advantage function ∇_a A_{π_θ}, in order to avoid biasing the estimation of the policy gradient (14). We develop next conditions on the exploration such that ∇_a A_{π_θ} can be estimated correctly.

A. Estimation of ∇_a A_{π_θ}

A difficulty arises here when forming estimations of ∇_a A_{π_θ} using restricted explorations, which we detail hereafter.

Fig. 1: Illustration of the mean (+ symbol) and covariance (ellipsoids) (32) of the exploration (31) subject to safety constraints (solid red line). The deterministic policy is depicted as the × symbol, for the trivial problem (33), and different values of θ_{1,2} and θ_3 = 1. Here the exploration is generated via (46)-(47) detailed below. One can observe that η_e = 0 is not achievable when the policy is on the constraints, and that the covariances are impacted by the presence of the constraint. On the right-side graph, one can see how the covariance collapses when π_θ strongly activates the constraint.

It is well known that estimating ∇_a A_{π_θ} directly is very difficult, hence one typically considers estimating the advantage function A_{π_θ} instead, from which the gradient ∇_a A_{π_θ} is evaluated. The estimation of the advantage function is carried by the function approximator Â^w_{π_θ}, parametrized by w.
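The distortion of η_e and Σ_e illustrated in Fig. 1 can be reproduced by brute-force sampling of the toy problem (33), whose solution is the Euclidean projection of the (disturbed) target onto the ball. The parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def pi_ball(t, r=1.0):
    # solution of (33): Euclidean projection of the target t onto the
    # ball ||u||_2 <= r
    n = np.linalg.norm(t)
    return t if n <= r else (r / n) * t

def exploration_stats(theta12, sigma=0.1, n=20000, r=1.0):
    # exploration in the style of (46)-(48): the gradient disturbance
    # d'u shifts the target to theta12 - d, with d ~ N(0, sigma * I)
    pi = pi_ball(theta12, r)
    d = rng.normal(scale=np.sqrt(sigma), size=(n, 2))
    e = np.array([pi_ball(theta12 - di, r) for di in d]) - pi
    return e.mean(axis=0), np.cov(e.T)   # empirical eta_e, Sigma_e (32)

eta_int, _ = exploration_stats(np.array([0.0, 0.0]))  # interior target
eta_act, _ = exploration_stats(np.array([1.5, 0.0]))  # active constraint
print(eta_int, eta_act)
```

With the target well inside the ball the empirical mean is (numerically) zero, while with the constraint strongly active the mean is pushed inward along the constraint normal, as in Fig. 1.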
One then seeks a solution to the least-squares problem [17]:

w = arg min_w  ½ E_{π^σ_θ}[ (Q_{π_θ} − V̂_{π_θ} − Â^w_{π_θ})² ],  (34)

where the value function estimation V̂_{π_θ} ≈ V_{π_θ} is a baseline supporting the evaluation of w. Temporal-Difference or Monte Carlo techniques [18] are typically used to tackle (34). The policy gradient is then evaluated as:

∇̂_θ J(π_θ) = E_{π^σ_θ}[ ∇_θ π_θ ∇_a Â^w_{π_θ} ].  (35)

A compatible linear approximator Â^w_{π_θ} is typically preferred, in order for (35) to match (14). In this paper, we propose to use the following function approximator inspired from [17]:

Â^w_{π_θ}(s, a) = w⊤ ∇_θ π_θ M (a − π_θ − c),  (36)

where M and c are a (possibly) state-dependent symmetric matrix in ℝ^{n_a×n_a} and a vector in ℝ^{n_a}, respectively. In [17], M = I and c = 0 are used, but we will show in the following that we need in principle to make a different choice when the input is restricted to S(s). The following proposition delivers general conditions for (35) to match (14) when the exploration is restricted. Before delivering the Proposition, let us establish some key assumptions.

Assumption 1: The following hold:
a. Q_{π_θ}(s, a) is almost everywhere at least twice differentiable in a on S(s) for all feasible s, and its gradient with respect to a is polynomially bounded.
b. Stability assumption (7) holds.
c. The MDP state probability density is bounded, i.e., it cannot hold Dirac-like densities.
d. e results from the transformation of a Normal distribution via a polynomially bounded function.
e. The following limits:

lim_{σ→0} (1/σ)(η_e − c) = 0  (37a)
lim_{σ→0} (1/σ) M Σ_e = I  (37b)

hold for all feasible s.

Remark: We ought to underline here that Assumptions 1a-1d are typically needed in RL algorithms, though often not explicitly stated. Assumption 1e is not standard, and relates specifically to the problem of having a restricted exploration.
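Since the compatible approximator (36) is linear in w, the least-squares problem (34) can be solved directly once samples of the exploration e and of the targets Q − V̂ are available. The sketch below uses synthetic 1-D data with an assumed true advantage slope and assumed values of ∇_θπ_θ, M, c, only to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(2)

# assumed quantities for a 1-D input, two policy parameters, fixed state:
grad_theta_pi = np.array([1.0, 0.5])  # gradient of pi_theta w.r.t. theta
M, c = 2.0, 0.01                      # corrections from Assumption 1e
grad_a_A_true = -0.7                  # true advantage slope (unknown in
                                      # practice; used here to make data)

# samples of the exploration e and targets Q - V_hat, per (40)
e = 0.1 * rng.standard_normal(500)
y = grad_a_A_true * e + 0.01 * rng.standard_normal(500)

# compatible approximator (36): A_hat = w' grad_theta_pi M (e - c),
# so (34) becomes a linear least-squares problem in w
Phi = np.outer(M * (e - c), grad_theta_pi)   # 500 x 2 regressor matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # minimum-norm solution

# gradient of the approximator w.r.t. the input, entering (35)
grad_a_A_hat = float(grad_theta_pi @ w * M)
print(grad_a_A_hat)
```

The regressor matrix is rank-one here, so w itself is not unique; lstsq returns the minimum-norm solution, and the quantity entering (35), ∇_aÂ^w = w⊤∇_θπ_θ M, is nonetheless well determined and recovers the assumed slope.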
It essentially requires c to be an asymptotic estimation of the exploration mean η_e, and M to be a (scaled) asymptotic estimation of the inverse of the exploration covariance Σ_e. We should additionally observe that when the exploration is centred and isotropically distributed (i.e., η_e = 0, Σ_e = σI), then M = I, c = 0 satisfy (37), and the results found in [17] hold.

Proposition 1: Under Assumption 1, the deterministic policy gradient estimation (35) is asymptotically exact, i.e.,

lim_{σ→0} ∇̂_θ J(π_θ) = ∇_θ J(π_θ).  (38)

Proof: Using (36), the solution of the least-squares problem (34) satisfies the stationarity condition:

E_{π^σ_θ}[ ∇_θ π_θ M (e − c) (Q_{π_θ} − V̂_{π_θ} − Â^w_{π_θ}) ] = 0.  (39)

Since Q_{π_θ} is at least twice differentiable almost everywhere, its second-order expansion in a at e = 0 is valid almost everywhere, i.e.,

Q_{π_θ}(s, a) = V_{π_θ}(s) + ∇_a Q_{π_θ}(s, π_θ)⊤ (a − π_θ) + ξ = V_{π_θ}(s) + ∇_a A_{π_θ}(s, π_θ)⊤ e + ξ,  (40)

where ξ is the second-order remainder of the Taylor expansion of Q_{π_θ}, and where we use ∇_a Q_{π_θ} = ∇_a A_{π_θ}, see (11). Because Q_{π_θ} is twice differentiable almost everywhere and using 1c, ξ is of order O(‖e‖²) almost everywhere. We then observe that (39) becomes:

E_{π^σ_θ}[ ∇_θ π_θ M (e − c) e⊤ (∇_a A_{π_θ} − ∇_a Â^w_{π_θ}) ]
+ E_{π^σ_θ}[ ∇_θ π_θ M (e − c) ξ ]
+ E_{π^σ_θ}[ ∇_θ π_θ M (e − c) (V_{π_θ} − V̂_{π_θ}) ] = 0.  (41)

The function ξ is of second order or more in e and, using Assumption 1a, it is polynomially bounded. Moreover, using Assumption 1d and arguments from the Delta method [11], we can conclude that:

lim_{σ→0} (1/σ) E[ ∇_θ π_θ M (e − c) ξ | s ] = 0, ∀ s  (42)

holds. It follows that the second term of (41) asymptotically vanishes faster than σ. Moreover, (37a) guarantees that the third term of (41) also asymptotically vanishes faster than σ. It follows that

lim_{σ→0} (1/σ) E_{π^σ_θ}[ ∇_θ π_θ M (e − c) e⊤ (∇_a A_{π_θ} − ∇_a Â^w_{π_θ}) ] = 0.
(43)

Using (37a), we observe that:

lim_{σ→0} (1/σ) E[ M (e − c) e⊤ ] = lim_{σ→0} (1/σ) M (Σ_e + η_e η_e⊤ − c η_e⊤) = lim_{σ→0} (1/σ) M Σ_e.  (44)

We finally conclude that (39) and (43) with (35) entail that

lim_{σ→0} (1/σ) E_{π^σ_θ}[ ∇_θ π_θ M Σ_e (∇_a A_{π_θ} − ∇_a Â^w_{π_θ}) ] = E_{π_θ}[ ∇_θ π_θ (∇_a A_{π_θ} − ∇_a Â^w_{π_θ}) ] = 0,  (45)

where assumption (7) asymptotically yields the equivalence between E_{π^σ_θ}[·] and E_{π_θ}[·]. Equation (38) follows.

We now turn to proposing a computationally effective method to generate a safe exploration and to compute mean and covariance estimations for e, i.e., a matrix M(s) and vector c(s) that satisfy conditions (37).

V. OPTIMIZATION-BASED SAFE EXPLORATION

In this section we propose a modification of (20) allowing one to build a stochastic policy π^σ_θ that produces safe inputs for exploration, and for which the corrections M and c satisfying Assumption 1e are cheap to compute. The proposed approach will use the primal-dual interior-point method and techniques from parametric Nonlinear Programming. To that end we will consider inputs a = u^d_0(s, θ, d) generated from:

u^d(s, θ, d) = arg min_u  Φ_d(x, u, θ, d)  (46a)
               s.t.  f(x, u, s, θ) = 0,  (46b)
                     h(x, u, θ) ≤ 0,  (46c)

where Φ_d(x, u, θ, d) is a modified version of the cost function Φ in (20), and d ∈ ℝ^{n_a} is drawn from a Normal, centred probability distribution of density:

d ∼ N(0, σΣ(s))  (47)

with covariance σΣ(s), where Σ (possibly) depends on s. The stochastic policy π^σ_θ then results from (46)-(47). A simple choice for Φ_d(x, u, θ, d) is a gradient disturbance:

Φ_d(x, u, θ, d) = Φ(x, u, θ) + d⊤u_0.  (48)

One can verify that a = u^d_0(s, θ, d) ∈ S(s) by construction, such that the exploration is safe.
Deploying the principles detailed in Section II-B and Proposition 1 requires one to form, at each time step, asymptotically accurate estimations c, M of the mean η_e and covariance Σ_e. For e restricted to generate inputs in a non-trivial safe set S(s), estimating η_e and Σ_e requires in general sampling the distribution of e generated by (46)-(47), which is unfortunately computationally expensive, as a large number of samples is required and each sample requires solving (46). An alternative to estimating η_e and Σ_e via sampling is to form these estimations via a Taylor expansion of u^d_0(s, θ, d) in d. Unfortunately, u^d_0(s, θ, d) is in general non-smooth due to the presence of the inequality constraints in (46). To alleviate this problem, in this section, we propose to cast (46) in a primal-dual interior-point formulation, i.e., we consider that the solutions of (46) are obtained from solving the relaxed Karush-Kuhn-Tucker (KKT) conditions [7]:

r_τ(z, θ, d) = [ ∇_w Φ_d + ∇_w h μ + ∇_w f λ ; f ; diag(μ) h + τ ] = 0,  (49)

for τ > 0, and under the conditions h < 0 and μ > 0. Here μ and λ are the multipliers associated to the inequality and equality constraints in (46), respectively, and we label w = {x, u} and z = {w, λ, μ} the primal-dual variables of (49). We will label u^τ(s, θ, d) the parametric primal solution of (49). The error between the true solution of (46) and the one delivered by solving (49) is of the order of the relaxation parameter τ, and the solution u^τ(s, θ, d) is guaranteed to satisfy the constraints of (46) for all τ ≥ 0, hence (49) delivers safe policies if (46) does. The relaxed KKT conditions (49) yield a smooth function u^τ(s, θ, d), such that its Taylor expansion is well-defined everywhere. The relaxed KKT conditions (49) will be used next to generate the safe exploration, and the deterministic policy:

π^τ_θ = u^τ_0(s, θ, 0).  (50)

A.
Covariance and mean estimators

For the sake of clarity, let us use the short notation g(s, θ, d) := u^τ_0(s, θ, d). Hence g is evaluated by solving (49) and extracting the first control input u^τ_0 ∈ ℝ^{n_a}. This function will be instrumental in building cheap mean and covariance estimators satisfying (37). Let us provide these estimators in the following Proposition.

Proposition 2: If u^d(s, θ, d) arising from the NLP (46) is polynomially bounded in d, then the following mean and covariance estimators:

c = (σ/2) Σ_{i,j=1}^{n_a} (∂²g/∂d_i∂d_j) Σ_ij,  (51a)
M = ( (∂g/∂d) Σ (∂g/∂d)⊤ )⁻¹ |_{d=0},  (51b)

satisfy conditions (37), where Σ is as used in (47).

Proof: We observe that:

η_e = E[ g(s, θ, d) − g(s, θ, 0) ] = E[ (∂g/∂d)|_{d=0} d + ½ Σ_{i,j=1}^{n_a} (∂²g/∂d_i∂d_j) d_i d_j + ς ],  (52)

where ς is the third-order remainder of the expansion of g. We observe that, since d has zero mean, (52) becomes:

η_e = (σ/2) Σ_{i,j=1}^{n_a} (∂²g/∂d_i∂d_j) Σ_ij + E[ς].  (53)

We also observe that E[ς] = O(σ²) holds using arguments from the Delta method [11]. It follows that (51a) satisfies (37a). Furthermore, we observe that

Σ_e = E[ee⊤] − η_e η_e⊤,  (54)

and that:

E[ee⊤] = σ (∂g/∂d) Σ (∂g/∂d)⊤ |_{d=0} + O(σ²)  (55)

holds using similar arguments as for (52)-(53). It follows that:

lim_{σ→0} (1/σ) M Σ_e = ( (∂g/∂d) Σ (∂g/∂d)⊤ )⁻¹ ( (∂g/∂d) Σ (∂g/∂d)⊤ ) = I,  (56)

where the Jacobians are evaluated at d = 0.

Fig. 2: Illustration of the mean and covariance (32) of the exploration (31) subject to safety constraints (solid red line), generated by the interior-point approach (49) for two values of the relaxation parameter τ. The policy is generated by (33), and the exploration by (46)-(48), with Σ = I. The mean estimator c (+ symbol) and covariance estimator M⁻¹ (dashed ellipsoid) (51) are compared to the ones estimated by sampling (o symbol, solid-line ellipsoid).

Note that deploying (51b) requires the Jacobian ∂g/∂d ∈ ℝ^{n_a×n_a} to be full rank.
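For a 1-D instance, the smooth map g(d) delivered by the relaxed KKT conditions (49) admits a closed form, and the estimators (51) can be formed by finite differences and checked against sampled statistics of e, per (37). The problem data below are assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
t, tau, sigma, Sig = 0.9, 1e-2, 1e-3, 1.0

def g(d):
    # primal solution u_tau(d) of the relaxed KKT system (49) for the
    # hypothetical 1-D problem  min_u 0.5 (u - t)^2 + d u  s.t.  u <= 1:
    #   u - t + d + mu = 0,  mu (u - 1) + tau = 0,  mu > 0, u < 1.
    # Eliminating mu gives a quadratic in u; the root with u < 1 is:
    s = t - d
    return 0.5 * ((1 + s) - np.sqrt((1 - s)**2 + 4 * tau))

# finite-difference Jacobian and Hessian of the smooth map g at d = 0
h = 1e-5
dg = (g(h) - g(-h)) / (2 * h)
d2g = (g(h) - 2 * g(0.0) + g(-h)) / h**2

# estimators (51): c tracks the exploration mean, M the scaled inverse
# covariance
c = 0.5 * sigma * d2g * Sig
M = 1.0 / (dg * Sig * dg)

# compare against sampled statistics of e = g(d) - g(0), d ~ N(0, sigma*Sig)
d = rng.normal(scale=np.sqrt(sigma * Sig), size=200000)
e = g(d) - g(0.0)
print(c, e.mean())          # both small and close, per (37a)
print(M * e.var() / sigma)  # close to 1, per (37b)
```

The example also shows why τ > 0 matters: the curvature d2g is nonzero only because the relaxed solution bends smoothly near the constraint, whereas the exact (τ = 0) solution min(t − d, 1) has a kink and no well-defined expansion there.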
This Jacobian also appears in [10], which develops the stochastic policy gradient counterpart of this paper and investigates its rank. We will not repeat this analysis in detail here, but let us recall its conclusion: for the choice of cost function (48), the Jacobian $\frac{\partial g}{\partial d}$ is full rank for any $\tau > 0$ if the NLP (46) satisfies the Linear Independence Constraint Qualification (LICQ) and the Second-Order Sufficient Condition (SOSC). However, $\frac{\partial g}{\partial d}$ can tend to a rank-deficient matrix for $\tau \to 0$ if $u_0^d$ delivered by (46) activates some of the inequality constraints (46c). Similarly to what has been reported in [10], this issue disappears in some specific cases, which are discussed in the following proposition and illustrated in Fig. 3 below.

Proposition 3: For the choice of cost function (48), and if the MPC model dynamics and constraints do not depend on the parameters, i.e. $\nabla_\theta h = 0$, $\nabla_\theta f = 0$, then the choice of $M$, $c$ proposed by (51) yields (38) for $\tau \to 0$ if problem (46) fulfils LICQ and SOSC.

Proof: We will prove this statement in an active-set setting deployed on (46), with (51b) evaluated via a pseudo-inverse. The statement of the proposition then holds from the convergence of the interior-point solution to the active-set one. We will then investigate (44)-(45) in that context. Using (51b) with a pseudo-inverse, and using similar developments as in Proposition 1, one can verify that:

$$ \lim_{\sigma\to 0}\frac{1}{\sigma}\, \mathbb{E}\!\left[\nabla_\theta \pi_\theta\, M (e-c)\, e^\top\right] = \nabla_\theta \pi_\theta \left(\frac{\partial g}{\partial d}\, \Sigma\, \frac{\partial g}{\partial d}^{\!\top}\right)^{\!+} \left(\frac{\partial g}{\partial d}\, \Sigma\, \frac{\partial g}{\partial d}^{\!\top}\right). \tag{57} $$

We will then prove that under the assumptions of this proposition, $\frac{\partial g}{\partial \theta}$ is in the range space of $\frac{\partial g}{\partial d}\Sigma\frac{\partial g}{\partial d}^{\!\top}$, such that

$$ \left(\frac{\partial g}{\partial d}\, \Sigma\, \frac{\partial g}{\partial d}^{\!\top}\right)^{\!+} \left(\frac{\partial g}{\partial d}\, \Sigma\, \frac{\partial g}{\partial d}^{\!\top}\right) \frac{\partial g}{\partial \theta} = \frac{\partial g}{\partial \theta} \tag{58} $$

holds. To that end, consider $\mathbb{A}$ the (strictly) active set of (46), i.e., the set of indices $i$ such that $h_i = 0$, $\mu_i > 0$ at the solution.
We observe that

$$ \begin{bmatrix} H & \nabla_w q \\ \nabla_w q^\top & 0 \end{bmatrix} \begin{bmatrix} \frac{\partial w}{\partial d} \\ \frac{\partial \nu}{\partial d} \end{bmatrix} = - \begin{bmatrix} \nabla_{wd}\Phi_d \\ 0 \end{bmatrix}, \tag{59} $$

where $H$ is the Hessian of the Lagrange function associated with (46) and

$$ q = \begin{bmatrix} f \\ h_{\mathbb{A}} \end{bmatrix}, \qquad \nu = \begin{bmatrix} \lambda \\ \mu_{\mathbb{A}} \end{bmatrix}. \tag{60} $$

Defining $N_{\mathbb{A}}$ as the null space of $\nabla_w q^\top$, i.e. $\nabla_w q^\top N_{\mathbb{A}} = 0$, we observe that:

$$ \frac{\partial g}{\partial d} = - N_{\mathbb{A}}^0 \left( N_{\mathbb{A}}^\top H N_{\mathbb{A}} \right)^{-1} N_{\mathbb{A}}^{0\,\top}, \tag{61} $$

where $N_{\mathbb{A}}^0 = \begin{bmatrix} I_{n_a \times n_a} & 0 & \cdots & 0 \end{bmatrix} N_{\mathbb{A}}$. Using a similar reasoning, and since $\frac{\partial \nabla_w q}{\partial \theta} = 0$, we observe that:

$$ \frac{\partial g}{\partial \theta} = - N_{\mathbb{A}}^0 \left( N_{\mathbb{A}}^\top H N_{\mathbb{A}} \right)^{-1} N_{\mathbb{A}}^\top \nabla_{w\theta} \Phi, \tag{62} $$

such that $\frac{\partial g}{\partial \theta}$ is in the range space of $\frac{\partial g}{\partial d}$. As a result, for $\Sigma$ full rank, $\frac{\partial g}{\partial \theta}$ is in the range space of $\frac{\partial g}{\partial d}\Sigma\frac{\partial g}{\partial d}^{\!\top}$, such that (58) holds. Using (51b) defined via the pseudo-inverse, and (44), (53)-(55), one can observe that:

$$ \lim_{\sigma\to 0}\frac{1}{\sigma}\, \mathbb{E}\!\left[\nabla_\theta \pi_\theta\, M (e-c)\, e^\top\right] = \mathbb{E}\!\left[\nabla_\theta \pi_\theta\, M \left(\frac{\partial g}{\partial d}\, \Sigma\, \frac{\partial g}{\partial d}^{\!\top}\right)\right] \tag{63} $$

holds. Using (58), we finally observe that

$$ \lim_{\sigma\to 0}\frac{1}{\sigma}\, \mathbb{E}\!\left[\nabla_\theta \pi_\theta\, M (e-c)\, e^\top\right] = \nabla_\theta \pi_\theta. \tag{64} $$

We can then conclude that (45) holds and that the policy gradient estimator is exact, i.e. (38) holds.

We ought to caveat here the practical implications of Proposition 3. First, the results hold for $\tau \to 0$, with $\sigma \to 0$. If using the matrix $M$ defined via the classic inverse (51b), the results of Proposition 3 hold in the sense that for any $\tau$, (64) holds asymptotically for $\sigma$ sufficiently small. Hence reducing $\tau$ may require reducing $\sigma$ for (64) to hold. Alternatively, $M$ ought to be systematically defined via a pseudo-inverse. Unfortunately, the definition of $M$ then becomes somewhat arbitrary and non-smooth.

Fig. 3: Illustration of Proposition 3 and the following discussion for the small problem (33). The solid black arrows represent the directions spanned by $\nabla_\theta \pi_\theta$. The red, dashed-line arrows report the corresponding terms in $\frac{1}{\sigma}\mathbb{E}\!\left[\nabla_\theta \pi_\theta M (e-c) e^\top\right]$ appearing in (64). The dotted-line blue arrows report the directions spanned by $\frac{\partial g}{\partial d}$. One can see that (64) holds for $\tau > 0$ (left graph), and holds for parameters $\theta_{1,2}$ for $\tau \to 0$ (right graph), as they satisfy the assumptions of Proposition 3. However, (64) does not hold for $\theta_3$, as it influences the constraint (33b) and therefore violates the assumptions of Proposition 3 (right graph). One can construe the problem as a lack of exploration (blue dotted-line arrows) in the direction $\nabla_{\theta_3}\pi_\theta$ due to the active constraint.

Additionally, one ought to observe that the assumptions of Proposition 3 are fairly restrictive, as they do not allow one to adjust the model or the constraints in the robust MPC scheme, which leaves only the cost function as subject to adaptation. While [9] shows that it is theoretically enough to adapt only the MPC cost function to generate the optimal control policy from the MPC scheme, this result requires a rich parametrization of the cost, which may be undesirable.

When the model and/or constraints of the NMPC scheme are meant to be adjusted by the RL algorithm, such that the assumptions of Proposition 3 are not satisfied, the policy gradient can be incorrect. The issue is associated with parameters $\theta$ that can (locally) move the policy in directions orthogonal to the strictly active constraints. Indeed, for $\tau \to 0$, (46)-(48) yield samples that are (for $\sigma \to 0$) in the span of $\frac{\partial g}{\partial d}$, which is rank-deficient when $\pi_\theta$ strictly activates some constraints. However, if the assumptions of Proposition 3 are not satisfied, $\frac{\partial g}{\partial \theta}$ can span directions that are in the null space of $\frac{\partial g}{\partial d}$, and which are therefore not explored. It follows that the policy gradient can be wrong in these directions. These observations are illustrated in Fig. 3.

For cases that do not satisfy the assumptions of Proposition 3, working with $\tau > 0$ (although possibly small) appears to be the best option.
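To make the role of the relaxation parameter concrete, the sketch below solves relaxed KKT conditions in the style of (49) for a deliberately tiny, hypothetical problem (a scalar quadratic objective with one inequality constraint, not the robust MPC scheme itself) by a damped Newton iteration that keeps the iterate strictly interior:

```python
import numpy as np

def solve_relaxed_kkt(a=1.0, b=0.5, tau=1e-2, iters=50):
    """Newton solve of relaxed KKT conditions in the style of (49) for
    min_u 0.5*(u - a)^2  s.t.  h(u) = u - b <= 0:
        u - a + mu       = 0   (stationarity)
        mu*(u - b) + tau = 0   (relaxed complementarity)
    while keeping the iterate strictly interior (h < 0, mu > 0)."""
    u, mu = b - 1.0, 1.0                          # strictly feasible start
    for _ in range(iters):
        r = np.array([u - a + mu, mu * (u - b) + tau])
        J = np.array([[1.0, 1.0], [mu, u - b]])   # Jacobian of the residual
        du, dmu = np.linalg.solve(J, -r)
        t = 1.0                                   # damping: stay interior
        while u + t * du >= b or mu + t * dmu <= 0:
            t *= 0.5
        u, mu = u + t * du, mu + t * dmu
    return u, mu
```

For $a > b$, the unconstrained minimizer violates the constraint; the relaxed solution satisfies $(a-u)(b-u) = \tau$, stays strictly feasible for any $\tau > 0$, and tends to the constrained optimum $u = b$ as $\tau \to 0$, while remaining a smooth function of the problem data, which is the property exploited by the estimators above.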
We ought to underline here that, while the corrections $M$ and $c$ are in theory needed in order to build a correct policy gradient estimate (38), the error in the policy gradient estimate resulting from not using these corrections is yet to be investigated in detail.

While $\Sigma$ in (47) can in principle be chosen freely, a reasonable option is to adopt $\Sigma = I$, i.e., an isotropic gradient disturbance, in which case

$$ M = \left( \frac{\partial g}{\partial d}\, \frac{\partial g}{\partial d}^{\!\top} \right)^{-1} \Bigg|_{d=0}. \tag{65} $$

We now turn to detailing how (51) can be evaluated at low computational expense.

B. Implementation & Sensitivity computation

In order to evaluate $c$ and $M$ in (51), the first- and second-order sensitivities of $g$ are required. In turn, this requires one to evaluate the sensitivities of the relaxed KKT conditions (49). In this section we detail how this can be done. We first observe that if LICQ and SOSC hold [15] for the NLP (46), then $\frac{\partial r_\tau}{\partial z}$ is full rank, and the Implicit Function Theorem (IFT) guarantees that one can evaluate the first-order sensitivities of (49) by solving the linear equations:

$$ \frac{\partial r_\tau}{\partial z}\frac{\partial z}{\partial d} + \frac{\partial r_\tau}{\partial d} = 0, \qquad \frac{\partial r_\tau}{\partial z}\frac{\partial z}{\partial \theta} + \frac{\partial r_\tau}{\partial \theta} = 0, \tag{66} $$

for $\frac{\partial z}{\partial d}$ and $\frac{\partial z}{\partial \theta}$. One can then readily obtain $\frac{\partial g}{\partial d}$ and $\frac{\partial g}{\partial \theta}$ by extracting the first $n_a$ rows of $\frac{\partial z}{\partial d}$ and $\frac{\partial z}{\partial \theta}$. The Jacobian $\frac{\partial g}{\partial \theta}$ then delivers

$$ \nabla_\theta \pi_\theta^\tau = \frac{\partial g}{\partial \theta}^{\!\top}, \tag{67} $$

required in (35), while $\frac{\partial g}{\partial d}$ is required in (51b). The second-order term $\frac{\partial^2 g}{\partial d_i \partial d_j}$ needed in (51a) can be obtained by solving the second-order sensitivity equation of the NLP:

$$ \frac{\partial r_\tau}{\partial z}\frac{\partial^2 z}{\partial d_i \partial d_j} + \left( \frac{\partial^2 r_\tau}{\partial d_i \partial z} + \sum_k \frac{\partial^2 r_\tau}{\partial z \partial z_k}\frac{\partial z_k}{\partial d_i} \right)\frac{\partial z}{\partial d_j} + \frac{\partial^2 r_\tau}{\partial d_i \partial d_j} + \sum_k \frac{\partial^2 r_\tau}{\partial d_j \partial z_k}\frac{\partial z_k}{\partial d_i} = 0, \tag{68} $$

for $\frac{\partial^2 z}{\partial d_i \partial d_j}$. The sensitivity $\frac{\partial^2 g}{\partial d_i \partial d_j}$ is then obtained by extracting the first $n_a$ rows of $\frac{\partial^2 z}{\partial d_i \partial d_j}$. Note that, for computational efficiency, (68) is best treated as a tensor equation.
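The first-order system (66) amounts to a set of linear solves. As a minimal sketch (not the implementation used in the paper), the helper below recovers $\partial z / \partial d$ for a generic root $r(z,d) = 0$, building the Jacobians by finite differences; in practice they would be formed analytically from (49), and the first $n_a$ rows of the result give $\partial g / \partial d$:

```python
import numpy as np

def ift_sensitivity(r, z, d, eps=1e-6):
    """First-order sensitivity dz/dd of a root r(z, d) = 0 via the IFT, as
    in (66): solve (dr/dz)(dz/dd) = -(dr/dd), with Jacobians approximated
    here by central finite differences for illustration."""
    nz, nd = len(z), len(d)
    drdz = np.zeros((nz, nz))
    drdd = np.zeros((nz, nd))
    for j in range(nz):
        e = np.zeros(nz); e[j] = eps
        drdz[:, j] = (r(z + e, d) - r(z - e, d)) / (2 * eps)
    for j in range(nd):
        e = np.zeros(nd); e[j] = eps
        drdd[:, j] = (r(z, d + e) - r(z, d - e)) / (2 * eps)
    # IFT requires dr/dz full rank, guaranteed here under LICQ and SOSC
    return np.linalg.solve(drdz, -drdd)
```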
We should underline here that computing the sensitivities is typically fairly inexpensive if adequate algorithms are used.

VI. SAFE RL STEPS FOR ROBUST LINEAR MPC

The methodology described so far allows one to deploy a safe policy and safe exploration using a robust NMPC scheme in order to compute the deterministic policy gradient, and to determine directions in the parameter space $\theta$ that improve the closed-loop performance of the NMPC scheme. However, taking a step in $\theta$ can arguably jeopardize the safety of the NMPC scheme itself, e.g., by modifying the constraints, or the models underlying the scenario tree. The problem of modifying the NMPC parameters while maintaining safety is arguably a complex one, and beyond the scope of this paper. However, in this section, we propose a practical approach to handle this problem in a data-driven context. In this paper, we propose an approach readily applicable to the linear robust MPC case; see Section III-B.

When the dispersion set $\mathbb{X}^+(s,a)$ can only be inferred from data, condition (26) arguably translates to (30). Condition (30) translates into a condition on the admissible parameters $\theta$, i.e., it specifies the parameters that are safe with respect to the data observed so far. Condition (30) tests whether the points $s_{k+1} - F_0(s_k, a_k, \theta)$ are in the polytope $\mathbb{W}$, which can easily be translated into a set of algebraic constraints imposed on $\theta$. We observe that a classic gradient step of step-size $\alpha > 0$ reads as:

$$ \theta = \theta_- - \alpha\, \widehat{\nabla_\theta J}(\pi_\theta), \tag{69} $$

where $\theta_-$ is the previous vector of parameters. One can observe that the gradient step can be construed as the solution of the optimization problem:

$$ \min_{\theta}\; \frac{1}{2}\|\theta - \theta_-\|^2 + \alpha\, \widehat{\nabla_\theta J}(\pi_\theta)^\top (\theta - \theta_-). \tag{70} $$

Imposing (30) on the gradient step generating the new parameters can then be cast as the following constrained optimization problem:

$$ \min_{\theta,\,\vartheta}\; \frac{1}{2}\|\theta - \theta_-\|^2 + \alpha\, \widehat{\nabla_\theta J}(\pi_\theta)^\top (\theta - \theta_-) \tag{71a} $$

$$ \text{s.t.}\quad s_{k+1} - F_0(s_k, a_k, \theta) - \sum_{i=1}^{V} \vartheta_{i,k} W_i = 0, \quad \forall\, k = 0,\ldots,N_D, \tag{71b} $$

$$ \sum_{i=1}^{V} \vartheta_{i,k} = 1, \quad \forall\, k = 0,\ldots,N_D, \tag{71c} $$

$$ \vartheta_{i,k} \geq 0, \quad \forall\, k = 0,\ldots,N_D, \;\; i = 1,\ldots,V, \tag{71d} $$

where (71b)-(71d) are the algebraic conditions testing (30). We observe that, unfortunately, the complexity of (71) grows with the amount of data $N_D$ in use. In practice, the data set $\mathbb{D}$ should arguably be limited to incorporate the relevant state transitions. A data compression technique has been proposed in [20] to alleviate this issue in the case where the nominal model $F_0$ is fixed. Future work will improve on this baseline.

VII. IMPLEMENTATION & ILLUSTRATIVE EXAMPLE

In this section, we provide some details on how the principles presented in this paper can be implemented, and provide an illustrative example of this implementation. At each time instant $k$, for a given state $s_k$, the deterministic policy $\pi_\theta$ is computed according to (49) with $d = 0$. The solution is used to build $M$ and $c$. The exploration is then generated according to (49) with $d$ drawn from (47). We ought to underline here that, unfortunately, the NLP has to be solved twice. The data are then collected to perform the estimations (34) and (35), either on the fly or in a batch fashion. The policy gradient estimate (35) is then used to compute the safe parameter update according to (71).

A. RL approach

In the example below, a batch RL method has been used. The policy gradient was evaluated using batch Least-Squares Temporal-Difference (LSTD) techniques, whereby for each evaluation the closed-loop system is run $S$ times for $N_t$ time steps, generating $S$ trajectory samples of duration $N_t$.
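The safe update (71) is a small QP in $(\theta, \vartheta)$. As a sketch, the code below solves it with a generic NLP solver for the hypothetical simplification where $\theta$ contains only the nominal-model offset $b_0$, so that $s_{k+1} - F_0(s_k, a_k, \theta)$ is an affine function of $\theta$; all names here are assumptions of this sketch:

```python
import numpy as np
from scipy.optimize import minimize

def safe_update(theta_prev, grad, alpha, residuals, W_vertices):
    """Safe gradient step (71), sketched for theta = b0 only, so that
    s_{k+1} - F0(s_k, a_k, theta) = residuals[k] - theta must stay inside
    the polytope whose vertices are the rows of W_vertices."""
    V = W_vertices.shape[0]
    ND = residuals.shape[0]
    n = theta_prev.size

    def cost(x):  # (71a)
        th = x[:n]
        return 0.5 * np.sum((th - theta_prev) ** 2) + alpha * grad @ (th - theta_prev)

    cons = []
    for k in range(ND):
        # (71b): residuals[k] - theta is a convex combination of the vertices
        cons.append({'type': 'eq',
                     'fun': lambda x, k=k: residuals[k] - x[:n]
                            - W_vertices.T @ x[n + k * V: n + (k + 1) * V]})
        # (71c): combination weights sum to one
        cons.append({'type': 'eq',
                     'fun': lambda x, k=k: np.sum(x[n + k * V: n + (k + 1) * V]) - 1.0})
    bounds = [(None, None)] * n + [(0.0, None)] * (ND * V)   # (71d)
    x0 = np.concatenate([theta_prev, np.full(ND * V, 1.0 / V)])
    res = minimize(cost, x0, constraints=cons, bounds=bounds, method='SLSQP')
    return res.x[:n]
```

Since (71b)-(71d) are linear in $(\theta, \vartheta)$ in this case, a dedicated QP solver would be preferable in practice; the generic solver is used here only for brevity.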
The value function estimate is constructed using:

$$ \sum_{k=0}^{N_t}\sum_{i=1}^{S} \delta_V(s_{k,i}, a_{k,i}, s_{k+1,i})\, \nabla_v \hat V_v^{\pi_\theta}(s_{k,i}) = 0, \tag{72a} $$

$$ \delta_V := L(s_{k,i}, a_{k,i}) + \gamma \hat V_v^{\pi_\theta}(s_{k+1,i}) - \hat V_v^{\pi_\theta}(s_{k,i}), \tag{72b} $$

and based on a linear value function approximation

$$ \hat V_v^{\pi_\theta}(s) = \varrho(s)^\top v. \tag{73} $$

A simple, fully parametrized quadratic function in $s$ is used to build $\hat V_v^{\pi_\theta}$ in the example below. Using the parameters $v$ obtained from (72), the advantage function estimate is given by:

$$ \sum_{k=0}^{N_t}\sum_{i=1}^{S} \delta_Q(s_{k,i}, a_{k,i}, s_{k+1,i})\, \nabla_w \hat Q_w^{\pi_\theta}(s_{k,i}, a_{k,i}) = 0, \tag{74a} $$

$$ \delta_Q := L(s_{k,i}, a_{k,i}) + \gamma \hat V_v^{\pi_\theta}(s_{k+1,i}) - \hat Q_w^{\pi_\theta}(s_{k,i}, a_{k,i}), \tag{74b} $$

$$ \hat Q_w^{\pi_\theta}(s_{k,i}, a_{k,i}) = \hat V_v^{\pi_\theta}(s_{k,i}) + \hat A_w^{\pi_\theta}(s_{k,i}, a_{k,i}), \tag{74c} $$

where $\hat A_w^{\pi_\theta}$ is based on (36). We observe that both (72) and (74) are linear in the parameters $v$ and $w$, and therefore straightforward to solve. However, they can be ill-posed on some data sets, and they ought to be solved using, e.g., a Moore-Penrose pseudo-inverse, preferably with a reasonably large saturation of the lowest singular values. The policy gradient estimate is then obtained from (35), using:

$$ \widehat{\nabla_\theta J}(\pi_\theta) = \sum_{k=0}^{N_t}\sum_{i=1}^{S} \nabla_\theta \pi_\theta(s_{k,i})\, M(s_{k,i})\, \nabla_\theta \pi_\theta(s_{k,i})^\top w. \tag{75} $$

B. Robust linear MPC scheme

While the proposed theory is not limited to linear problems, for the sake of clarity we propose to use a fairly simple robust linear MPC example using multiple models and process noise. We will consider the policy as delivered by the following robust MPC scheme based on multiple models and a linear feedback policy:

$$ \min_{u,\,x}\; \sum_{j=0}^{N_M} \left( \|x_{j,N} - \bar x\|^2 + \sum_{k=0}^{N-1} \left\| \begin{bmatrix} x_{j,k} - \bar x \\ u_{j,k} - \bar u \end{bmatrix} \right\|^2 \right) \tag{76a} $$

$$ \text{s.t.}\quad x_{j,k+1} = A_0 x_{j,k} + B_0 u_{j,k} + b_0 + W_j, \tag{76b} $$

$$ \|x_{j,k}\|^2 \leq 1, \quad \forall\, j = 0,\ldots,N_M, \;\; k = 1,\ldots,N, \tag{76c} $$

$$ x_{j,0} = s, \quad \forall\, j = 1,\ldots,N_M, \tag{76d} $$

$$ u_{j,0} = u_{k,0}, \quad \forall\, k,\, j = 0,\ldots,N_M, \tag{76e} $$

$$ u_{j,k} = u_{0,k} - K(x_{j,k} - x_{0,k}), \quad j = 1,\ldots,N_M, \tag{76f} $$

where $A_0$, $B_0$, $b_0$ yield the MPC nominal model corresponding to $F_0$, with $W_0 = 0$, and $W_{1,\ldots,N_M}$ capture the vertices of the dispersion set outer approximation. Hence model $j = 0$ serves as the nominal model, and models $j = 1,\ldots,N_M$ capture the state dispersion over time. The linear feedback matrix $K$ is possibly part of the MPC parameters $\theta$, and is a (rudimentary) structure providing a feedback $\pi_s$ as described in Section II-C. In practice, (76) is equivalent to a tube-based MPC scheme.

C. Simulation setup & results

The simulations proposed here use the same setup as the companion paper [10] treating the stochastic policy gradient case, so as to make comparisons straightforward. The experimental parameters are summarized in Tab. I, and the real system is given by:

$$ x_{k+1} = A_{\mathrm{real}}\, x_k + B_{\mathrm{real}}\, u_k + n, \tag{77} $$

where the process noise $n$ is selected normal, centred, and clipped to a ball. The real system was selected as:

$$ A_{\mathrm{real}} = \kappa \begin{bmatrix} \cos\beta & -\sin\beta \\ \sin\beta & \cos\beta \end{bmatrix}, \qquad B_{\mathrm{real}} = \begin{bmatrix} 1.1 & 0 \\ 0 & 0.9 \end{bmatrix}. \tag{78} $$

The real process noise $n$ is chosen normal, centred, of covariance $\frac{1}{3}\cdot 10^{-2}\, I$, and restricted to a ball of radius $\frac{1}{2}\cdot 10^{-2}$. The initial nominal MPC model is chosen as:

$$ A_0 = \begin{bmatrix} \cos\hat\beta & -\sin\hat\beta \\ \sin\hat\beta & \cos\hat\beta \end{bmatrix}, \qquad B_0 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad b_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \tag{79} $$

and $N_M = 4$ with:

$$ W_1 = \frac{1}{10}\begin{bmatrix} -1 \\ -1 \end{bmatrix}, \qquad W_2 = \frac{1}{10}\begin{bmatrix} +1 \\ -1 \end{bmatrix}, \tag{80a} $$

$$ W_3 = \frac{1}{10}\begin{bmatrix} +1 \\ +1 \end{bmatrix}, \qquad W_4 = \frac{1}{10}\begin{bmatrix} -1 \\ +1 \end{bmatrix}. \tag{80b} $$

TABLE I: Simulation parameters

  Parameter | Value     | Description
  $\gamma$    | 0.99      | Discount factor
  $\Sigma$    | $I$       | Exploration shape
  $\sigma$    | $10^{-3}$ | Exploration covariance
  $\tau$      | $10^{-2}$ | Relaxation parameter
  $\beta$     | 22°       | Real system parameter
  $\hat\beta$ | 20°       | Model parameter
  $N_t$       | 20        | Sample length
  $S$         | 30        | Number of samples per batch
  $N$         | 10        | MPC prediction horizon

The baseline stage cost is selected as:

$$ L = \frac{1}{20}\|x - x_{\mathrm{ref}}\|^2 + \frac{1}{2}\|u - u_{\mathrm{ref}}\|^2, \tag{81} $$

and serves as the baseline performance criterion to evaluate the closed-loop performance of the MPC scheme. We considered two cases, using deterministic initial conditions $s_0 = \begin{bmatrix} \cos 60^\circ & \sin 60^\circ \end{bmatrix}^\top$.
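The real system (77)-(78) and its clipped process noise can be reproduced in a few lines. The sketch below assumes a rotation-type structure for $A_{\mathrm{real}}$ and applies a zero input, for illustration only:

```python
import numpy as np

def simulate_real_system(kappa=0.95, beta_deg=22.0, steps=20, seed=0):
    """Simulate (77)-(78): A_real a scaled rotation (assumed structure),
    B_real = diag(1.1, 0.9), and centred normal process noise of covariance
    (1/3)*1e-2*I clipped to a ball of radius (1/2)*1e-2."""
    beta = np.deg2rad(beta_deg)
    A = kappa * np.array([[np.cos(beta), -np.sin(beta)],
                          [np.sin(beta),  np.cos(beta)]])
    B = np.diag([1.1, 0.9])
    rng = np.random.default_rng(seed)
    x = np.array([np.cos(np.deg2rad(60.0)), np.sin(np.deg2rad(60.0))])  # s_0
    u = np.zeros(2)                       # zero input, for illustration only
    traj = [x]
    for _ in range(steps):
        n = rng.multivariate_normal(np.zeros(2), (1.0 / 3.0) * 1e-2 * np.eye(2))
        r = np.linalg.norm(n)
        if r > 0.5e-2:                    # clip the noise to the ball
            n *= 0.5e-2 / r
        x = A @ x + B @ u + n
        traj.append(x)
    return np.array(traj)
```

With $\kappa = 0.95$ (case 1) the uncontrolled system contracts toward the origin, so the trajectory remains inside the state constraint $\|x\|^2 \leq 1$ even without feedback; with $\kappa = 1.05$ (case 2) the MPC feedback is essential.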
Both cases consider the parameters $\theta = \{\bar x, \bar u, A_0, B_0, b_0, K, W\}$. The first case considers a stable real system with $\kappa = 0.95$; the second case considers an unstable real system with $\kappa = 1.05$. In both cases, the target reference $\bar x$ was provided, together with the input reference $\bar u$ delivering a steady state for the nominal MPC model. The feedback matrix $K$ was chosen as the LQR controller associated with the MPC nominal model. Table I reports the algorithmic parameters. Case 1 used a step size $\alpha = 0.05$; the second case used a step size $\alpha = 0.01$.

The results for the first case are reported in Figures 4-8. One can observe in Fig. 4 that the closed-loop performance is improving over the RL steps. Fig. 5 shows that the improvement takes place via driving the closed-loop trajectories of the real system closer to the reference, without jeopardising the system safety. Fig. 6 shows how the RL algorithm uses the MPC nominal model to improve the closed-loop performance. One can readily see from Fig. 6 that RL is not simply performing system identification, as the nominal MPC model developed by the RL algorithm does not tend to the real system dynamics. Fig. 7 shows how the RL algorithm reshapes the dispersion set. The upper-left corner of the set is the most critical in terms of performance, as it activates the state constraint $\|x\|^2 \leq 1$, and is moved inward to gain performance. The constrained RL step (71) ensures that the RL algorithm cannot jeopardize the system safety. In Fig. 8, one can see that the RL algorithm makes little use of the degrees of freedom provided by adapting the MPC feedback matrix $K$.

Fig. 4: Case 1. Evolution of the closed-loop performance $J$ over the RL steps. The solid line represents the estimate of $J$ based on the samples obtained in the batch. The dashed lines represent the standard deviation due to the stochasticity of the system dynamics and policy disturbances.

Fig. 5: Case 1. Closed-loop system trajectories. The initial conditions $s_0$ are reported, as well as the target state reference $x_{\mathrm{ref}}$ (circle), and the MPC reference $\bar x$ at the first and last RL steps (grey and black + symbols, respectively). The trajectories at the first and last RL steps are reported as the light and dark grey polytopes. The solid black curve represents the state constraint $\|x\|^2 \leq 1$.

Fig. 6: Case 1. Evolution of the nominal MPC model over the RL steps. We report here the difference between the nominal model used in the MPC scheme and the real system.

Fig. 7: Case 1. Evolution of the MPC model biases $W_{1,\ldots,N_M}$ over the RL steps. The light grey polytope depicts the biases at the first RL step, and the points show $s_{k+1} - F_0(s_k, a_k, \theta)$ for all the samples of the first batch of data. The + symbol reports the initial nominal model offset $b_0$. The cloud of points is inside the grey quadrilateral thanks to the constrained RL step (71). The same objects are represented in black for the last step of the learning process.

Fig. 8: Case 1. Evolution of the MPC feedback matrix $K$ from its initial value. The feedback is only marginally adjusted by the RL algorithm. After 100 RL steps, the adaptation of the feedback gain $K$ has not yet reached its steady-state value.

The results for case 2 are reported in Figures 9-13. Similar comments hold for case 2 as for case 1. The instability of the real system does not challenge the proposed algorithm, even though a smaller step size $\alpha$ had to be used, as the RL algorithm appears to be more sensitive to noise.

Fig. 9: Case 2, similar to Fig. 4.
Fig. 10: Case 2, similar to Fig. 5.
Fig. 11: Case 2, similar to Fig. 6.
Fig. 12: Case 2, similar to Fig. 7.
Fig. 13: Case 2, similar to Fig. 8.

VIII. CONCLUSION

This paper proposed a technique to deploy deterministic policy gradient methods using a constrained parametric optimization problem as a support for the optimal policy approximation.
This approach allows one to impose strict safety constraints on the resulting policy. In particular, robust Nonlinear Model Predictive Control, where safety requirements can be treated explicitly, can be selected as the parametric optimization problem. Imposing restrictions on the policy approximation creates some technical challenges when generating the exploration required to form the policy gradient. Computationally inexpensive methods are proposed here to tackle these challenges, using interior-point techniques when solving the parametric optimization problem. The specific case of robust Model Predictive Control, where the prediction model is linear, is further developed, and a methodology to impose safety requirements throughout the learning process is proposed. The proposed techniques are illustrated in simple simulations, showing their behavior. This paper has a companion paper [10] investigating the stochastic policy gradient approach in the same context as in this paper. In the simulations performed here, the stochastic policy gradient approach of [10] appears to be computationally more expensive than the approach proposed here.

REFERENCES AND NOTES

[1] P. Abbeel, A. Coates, M. Quigley, and A.Y. Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems 19. MIT Press, 2007.
[2] D. Bernardini and A. Bemporad. Scenario-based model predictive control of stochastic constrained linear systems. In Proceedings of the 48th IEEE Conference on Decision and Control (CDC), held jointly with the 2009 28th Chinese Control Conference, pages 6333-6338, Dec 2009.
[3] D. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition, 2007.
[4] D.P. Bertsekas. Dynamic Programming and Optimal Control, volumes 1 and 2. Athena Scientific, Belmont, MA, 1995.
[5] D.P. Bertsekas and I.B. Rhodes.
Recursive state estimation for a set-membership description of uncertainty. IEEE Transactions on Automatic Control, 16:117-128, 1971.
[6] D.P. Bertsekas and S.E. Shreve. Stochastic Optimal Control: The Discrete Time Case. Athena Scientific, Belmont, MA, 1996.
[7] L.T. Biegler. Nonlinear Programming. MOS-SIAM Series on Optimization. SIAM, 2010.
[8] T. Ensslin. Information Field Theory. arXiv:1301.2556 [astro-ph.IM], 2013.
[9] S. Gros and M. Zanon. Data-Driven Economic NMPC using Reinforcement Learning. IEEE Transactions on Automatic Control, 2018 (in press).
[10] S. Gros and M. Zanon. Towards Safe Reinforcement Learning Using NMPC and Policy Gradients: Part I - Stochastic Case. IEEE Transactions on Automatic Control, 2019 (submitted).
[11] G.W. Oehlert. A note on the delta method. The American Statistician, 46(1), 1992.
[12] J. Garcia and F. Fernandez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16:1437-1480, 2015.
[13] I. Kolmanovsky and E.G. Gilbert. Theory and computation of disturbance invariant sets for discrete-time linear systems. Mathematical Problems in Engineering, 4(4):317-367, 1998.
[14] D.Q. Mayne. Model predictive control: Recent developments and future promise. Automatica, 50(12):2967-2986, 2014.
[15] J. Nocedal and S.J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2nd edition, 2006.
[16] P.O.M. Scokaert and D.Q. Mayne. Min-max feedback model predictive control for constrained linear systems. IEEE Transactions on Automatic Control, 43:1136-1142, 1998.
[17] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 387-395, 2014.
[18] R.S. Sutton, D. McAllester, S. Singh, and Y. Mansour.
Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS), pages 1057-1063, Cambridge, MA, USA, 1999. MIT Press.
[19] S. Wang, W. Chaovalitwongse, and R. Babuska. Machine learning algorithms in bipedal robot control. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 42(5):728-743, September 2012.
[20] M. Zanon and S. Gros. Safe Reinforcement Learning Using Robust MPC. IEEE Transactions on Automatic Control, arXiv, 2019 (submitted).

Sébastien Gros received his Ph.D. degree from EPFL, Switzerland, in 2007. After a journey by bicycle from Switzerland to the Everest base camp in full autonomy, he joined an R&D group hosted at Strathclyde University focusing on wind turbine control. In 2011, he joined KU Leuven, where his main research focus was on optimal control and fast NMPC for complex mechanical systems. He joined the Department of Signals and Systems at Chalmers University of Technology, Göteborg, in 2013, where he became associate professor in 2017. He is now full professor at NTNU, Norway, and guest professor at Chalmers. His main research interests include numerical methods, real-time optimal control, reinforcement learning, and the optimal control of energy-related applications.

Mario Zanon received the Master's degree in Mechatronics from the University of Trento and the Diplôme d'Ingénieur from the Ecole Centrale Paris in 2010. After research stays at KU Leuven, the University of Bayreuth, Chalmers University, and the University of Freiburg, he received the Ph.D. degree in Electrical Engineering from KU Leuven in November 2015. He held a Post-Doc researcher position at Chalmers University until the end of 2017 and is now Assistant Professor at the IMT School for Advanced Studies Lucca.
His research interests include numerical methods for optimization, economic MPC, optimal control and estimation of nonlinear dynamic systems, in particular for aerospace and automotive applications.