Towards Safe Reinforcement Learning Using NMPC and Policy Gradients: Part I - Stochastic case
Authors: Sebastien Gros, Mario Zanon
Abstract—We present a methodology to deploy the stochastic policy gradient method, using actor-critic techniques, when the optimal policy is approximated using a parametric optimization problem, allowing one to enforce safety via hard constraints. For continuous input spaces, imposing safety restrictions on the stochastic policy can make the sampling and evaluation of its density difficult. This paper proposes a computationally effective approach to solve that issue. We will focus on policy approximations based on robust Nonlinear Model Predictive Control (NMPC), where safety can be treated explicitly. For the sake of brevity, we will detail safe policies in the robust linear MPC context only. The extension to the nonlinear case is possible but more complex. We will additionally present a technique to maintain the system safety throughout the learning process in the context of robust linear MPC. This paper has a companion paper treating the deterministic policy gradient case.

Index Terms—Safe Reinforcement Learning, robust Model Predictive Control, stochastic policy gradient, interior-point method.

I. INTRODUCTION

Reinforcement Learning (RL) is a powerful tool for tackling Markov Decision Processes (MDP) without depending on a detailed model of the probability distributions underlying the state transitions. Indeed, most RL methods rely purely on observed state transitions, and realizations of the stage cost $L(s,a) \in \mathbb{R}$ assigning a performance to each state-input pair $s, a$ (the inputs are often labelled actions in the RL community). RL methods seek to increase the closed-loop performance of the control policy deployed on the MDP as observations are collected.
RL has drawn increasingly large attention thanks to its accomplishments, such as, e.g., making it possible for robots to learn to walk or fly without supervision [20], [1]. Most RL methods are based on learning the optimal control policy for the real system either directly or indirectly. Indirect methods typically rely on learning a good approximation of the optimal action-value function underlying the MDP. The optimal policy is then indirectly obtained as the minimizer of the value-function approximation over the inputs $a$. Direct RL methods, based on policy gradients, seek to adjust the parameters $\theta$ of a given policy $\pi_\theta$ such that it yields the best closed-loop performance when deployed on the real system. An attractive advantage of direct RL methods over indirect ones is that they are based on formal necessary conditions of optimality for the closed-loop performance of $\pi_\theta$, and therefore asymptotically (for a large enough data set) guarantee the (possibly local) optimality of the parameters $\theta$ [19], [17].

Sébastien Gros is with the Department of Cybernetics, NTNU, Norway. Mario Zanon is with the IMT School for Advanced Studies Lucca, Lucca 55100, Italy.

RL methods often rely on Deep Neural Networks (DNN) to carry the policy approximation $\pi_\theta$. While effective in practice, control policies based on DNNs provide limited opportunities for formal verification of the resulting closed-loop behavior, and for imposing hard constraints on the evolution of the state of the real system. The development of safe RL methods, which aims at tackling this issue, is currently an open field of research [12].

In this paper, we investigate the use of constrained parametric optimization problems to carry the policy approximation. The aim is to impose safety by means of hard constraints in the optimization problem.
In that context, we investigate some straightforward options to build a safe stochastic policy, and discuss their shortcomings when using the stochastic policy gradient method. We then present an alternative approach, and propose tools to make its deployment computationally efficient, using the primal-dual interior-point method and techniques from parametric Nonlinear Programming.

Robust Nonlinear Model Predictive Control (NMPC) is arguably an ideal candidate for forming the constrained optimization problem supporting the policy approximation. Robust NMPC techniques provide safety guarantees on the closed-loop behavior of the system by explicitly accounting for the presence of (possibly stochastic) disturbances and model inaccuracies. A rich theoretical framework is available on the topic [14]. The policy parameters $\theta$ then appear as parameters in the NMPC model(s), cost function and constraints. Updates in the policy parameters $\theta$ are driven by the stochastic policy gradient method, increasing the NMPC closed-loop performance, and constrained by the requirement that the NMPC model inaccuracies are adequately accounted for in forming the robust NMPC scheme. For the sake of brevity and simplicity, we will detail these questions in the specific linear robust MPC case. The extension to the nonlinear case is arguably possible, but more complex.

This paper has a companion paper [11] treating the same problem in the context of the deterministic policy gradient approach. The two papers share some material and use similar techniques, but present very different theories. Additionally, [21] discusses the management of safety in RL using tube-based techniques.

The paper is structured as follows. Section II provides some background material. Section III details the safe deterministic policy we use to build up the stochastic policy.
Section IV investigates several options to build safe stochastic policies from the deterministic policy approximation, discusses their shortcomings and proposes a computationally efficient alternative based on disturbed parametric NLPs. Section V presents numerical tools for an efficient deployment of the stochastic policy gradient approach for the latter approach, using the primal-dual interior-point method and tools from parametric Nonlinear Programming. Section VI discusses a technique to ensure safety throughout the learning process, in the context of robust linear MPC. Section VII proposes an example of simulation using the principles developed in this paper.

II. BACKGROUND ON MARKOV DECISION PROCESSES

In the following, we will consider that the dynamics of the real system are described as a Markov Chain, with state transitions having the underlying conditional probability density:
$$\mathbb{P}[s_+\,|\,s,a] \tag{1}$$
denoting the probability density of the state transition from the state-input pair $s \in \mathbb{R}^n$, $a \in \mathbb{R}^{n_a}$ to a new state $s_+ \in \mathbb{R}^n$. We will furthermore consider (possibly) stochastic policies $\pi$, taking the form of probability densities:
$$\pi[a\,|\,s] \tag{2}$$
denoting the probability density of selecting a given input $a$ when the system is in a given state $s$. We should note here that a deterministic policy
$$a = \pi(s) \tag{3}$$
can always be cast as a stochastic policy (2) by defining:
$$\pi[a\,|\,s] = \delta(a - \pi(s)) \tag{4}$$
where $\delta$ is the Dirac function. Let us then consider the distribution of the Markov Chain resulting from the state transition (1) and policy (2):
$$\mathbb{P}[s_k\,|\,\pi] = \int \prod_{i=0}^{k-1} \mathbb{P}[s_{i+1}\,|\,s_i,a_i]\, \mathbb{P}[s_0]\, \pi[a_i\,|\,s_i]\ \mathrm{d}s_{0,\ldots,k-1}\,\mathrm{d}a_{0,\ldots,k-1} \tag{5}$$
where $\mathbb{P}[s_0]$ denotes the probability distribution of the initial conditions $s_0$ of the MDP.
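The distribution (5) never needs to be formed explicitly: expectations under it can be estimated by plain Monte-Carlo simulation of the chain. A minimal sketch on a hypothetical two-state, two-input MDP (all transition probabilities, costs and the policy below are illustrative assumptions), estimating a discounted closed-loop cost:

```python
import numpy as np

# Hypothetical 2-state, 2-input MDP used only to illustrate how expectations
# under the closed-loop distribution (5) are estimated by simulating the
# Markov chain; here we estimate the discounted cost E[sum_k gamma^k L].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s+ | s=0, a=0..1]
              [[0.5, 0.5], [0.1, 0.9]]])  # P[s+ | s=1, a=0..1]
L = np.array([[1.0, 2.0], [0.0, 3.0]])    # stage cost L(s, a)
pi = np.array([[0.7, 0.3], [0.5, 0.5]])   # stochastic policy pi[a | s]
gamma = 0.9

def estimate_J(n_rollouts=500, horizon=100, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s = rng.integers(2)                # draw s0 ~ P[s0] (uniform here)
        for k in range(horizon):
            a = rng.choice(2, p=pi[s])     # a_k ~ pi[a | s_k]
            total += gamma**k * L[s, a]
            s = rng.choice(2, p=P[s, a])   # s_{k+1} ~ P[s+ | s_k, a_k]
    return total / n_rollouts

print(estimate_J())
```

This rollout-based estimation is exactly the mechanism RL relies on when the densities in (1) are unknown: only sampled transitions and stage-cost realizations are used.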
We can then define the discounted expected value under policy $\pi$, which reads as:
$$\mathbb{E}_\pi[\zeta(s,a)] = \sum_{k=0}^{\infty} \gamma^k \int \zeta(s_k,a_k)\, \mathbb{P}[s_k\,|\,\pi]\, \pi[a_k\,|\,s_k]\ \mathrm{d}s_k\,\mathrm{d}a_k \tag{6}$$
for any function $\zeta$. This definition can be easily extended to functions over state transitions, i.e. $\zeta(s_+,s,a)$. In the following, we will assume the local stability of the MDP under the selected policies. More specifically, we assume that $\pi$ is such that:
$$\lim_{\tilde\pi \to \pi} \mathbb{E}_{\tilde\pi}[\zeta] = \mathbb{E}_\pi[\zeta], \tag{7}$$
where the limit is taken in the sense of almost everywhere, and for any bounded function $\zeta$ such that both sides of the equality are finite. Assumption (7) will allow us to draw equivalences between a policy and disturbances of that policy, which will be required in the RL context.

For a given stage cost function $L(s,a)$ and a discount factor $\gamma \in [0,1]$, the performance of policy $\pi$ is given by the discounted cost:
$$J(\pi) = \mathbb{E}_\pi[L] \tag{8}$$
The optimal policy associated to the MDP defined by the state transition (1), the stage cost $L$ and the discount factor $\gamma$ is then given by:
$$\pi^\star = \arg\min_\pi\, J(\pi) \tag{9}$$
It is useful to underline here that, while (9) may have several (global) solutions, any fully observable MDP admits a deterministic policy $\pi^\star$ among its solutions. The value function associated to a given policy $\pi$ is given by [4], [6], [3]:
$$V_\pi(s) = \mathbb{E}_{a\sim\pi[\cdot|s]}\left[\, L(s,a) + \gamma\, \mathbb{E}\left[V_\pi(s_+)\,|\,s,a\right]\,\right], \tag{10}$$
where the internal expected value in (10) is taken over the state transitions (1).

A. Stochastic policy gradient

In most cases, the optimal policy $\pi^\star$ cannot be computed. It is then useful to consider a stochastic approximation $\pi_\theta$ of the optimal policy, carried by a (possibly large) set of parameters $\theta$. The optimal parameters $\theta^\star$ are then given by:
$$\theta^\star = \arg\min_\theta\, J(\pi_\theta) \tag{11}$$
The policy gradient $\nabla_\theta J(\pi_\theta)$ associated to the stochastic policy $\pi_\theta$ can be obtained using various actor-critic methods, such as, e.g.,
[18], [19]:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta\ \delta_{V_{\pi_\theta}}\right], \tag{12}$$
where
$$\delta_{V_{\pi_\theta}} = L(s,a) + \gamma V_{\pi_\theta}(s_+) - V_{\pi_\theta}(s) \tag{13}$$
The value function $V_{\pi_\theta}$ in (13) is formally given by (10), but is typically approximated via a parametrized value-function approximation and computed via Temporal-Difference (TD) techniques or Monte-Carlo techniques [18]. Note that it is fairly common in RL to generate the stochastic policy $\pi_\theta$ as a disturbed version of a deterministic policy $\pi_\theta$ by, e.g., adding a simple stochastic disturbance to $\pi_\theta$. In order to deploy the stochastic policy, $\pi_\theta$ needs to be sampled, to produce realizations of the inputs $a$ to be deployed on the system. Moreover, in order to compute the policy gradient, evaluations of the gradient of the policy score function:
$$\nabla_\theta \log \pi_\theta = \pi_\theta^{-1} \nabla_\theta \pi_\theta \tag{14}$$
are required in computing (12).

B. Safe set

In the following, we will assume the existence of a (possibly) state-dependent safe set labelled $S(s) \subseteq \mathbb{R}^{n_a}$, subset of the input space. The notion of safe set will be used here in the sense that any input selected such that $a \in S(s)$ yields safe future trajectories with a unitary probability. The construction of the safe set is not the object of this paper. However, we can nonetheless propose some pointers to how such a set is constructed in practice.

Let us consider the constraints $h_s(s,a) \le 0$ describing at any given future time $i$ the subset of the state-input space deemed feasible and safe. Constraints $h_s$ can include pure state constraints, describing the safe states, pure input constraints, typically describing actuator limitations, and mixed constraints, where the states and inputs are mixed. For the sake of simplicity, we will assume in the following that $h_s$ is convex.

A common approach to build practical inner approximations of the safe set $S(s)$ is to verify the safety of an input $a$ explicitly over a finite horizon via predictive control techniques.
This verification is based on forming the support of the Markov Process distribution over time, starting from a given state-input pair $s, a$. Consider the set $\mathbb{X}_+(s,a)$, support of the state transition (1):
$$\mathbb{X}_+(s,a) = \left\{\, s_+ \ |\ \mathbb{P}[s_+\,|\,s,a] > 0 \,\right\} \tag{15}$$
Labelling $\mathbb{X}_k(s,a,\pi_{\mathrm{s}})$ the support of the state of the Markov Process at time $k$, starting from $s, a$ and evolving under policy $\pi_{\mathrm{s}}$, the set $\mathbb{X}_k$ is then given by the recursion:
$$\mathbb{X}_k(s,a,\pi_{\mathrm{s}}) = \mathbb{X}_+\left(\mathbb{X}_{k-1}, \pi_{\mathrm{s}}(\mathbb{X}_{k-1})\right), \tag{16}$$
with the boundary condition $\mathbb{X}_1 = \mathbb{X}_+(s,a)$. An input $a$ is in the safe set $S(s)$ if $h_s(s,a) \le 0$ and if there exists a deterministic policy $\pi_{\mathrm{s}}$ such that
$$h_s\left(s_k, \pi_{\mathrm{s}}(s_k)\right) \le 0, \quad \forall\, s_k \in \mathbb{X}_k(s,a,\pi_{\mathrm{s}}), \tag{17}$$
for all $k \ge 1$. This verification is typically performed in practice via tube-based approaches, polynomial chaos, or direct approximations of the sets $\mathbb{X}_k$ via e.g. ellipsoids or polytopes. In that context, policy $\pi_{\mathrm{s}}$ is typically selected a priori to stabilize the system dynamics, and possibly optimized to minimize the size of the sets $\mathbb{X}_k$.

C. Safe stochastic policy

In this paper, we will consider safe, stochastic policies $\pi_\theta$, which we will label
$$\pi_\theta[a\,|\,s] \tag{18}$$
We will build $\pi_\theta$ from a deterministic policy $\pi_\theta$ based on a constrained optimization scheme, such that the support of $\pi_\theta$ is limited to the safe set $S(s)$, i.e. such that
$$\mathbb{P}\left[\, a \notin S(s) \ |\ a \sim \pi_\theta[a\,|\,s]\, \right] = 0 \tag{19}$$
Unfortunately, when the support of the stochastic policy $\pi_\theta$ must be restricted to a given, possibly non-trivial safe set $S(s)$, it is not always straightforward to build a stochastic policy $\pi_\theta$ that is at the same time inexpensive to sample from and to evaluate. There are clearly several approaches to generate random inputs that are in the safe set $S(s)$ with unitary probability, but we will focus here on techniques that require a limited amount of computations, so as to make them real-time feasible.
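For a scalar linear system $s_+ = As + Ba + w$ with $|w| \le \bar w$ under an ancillary policy $\pi_{\mathrm{s}}(s) = -Ks$, the recursion (16) can be propagated exactly on intervals. A minimal sketch (all numbers are illustrative assumptions, with a contractive closed-loop gain):

```python
import numpy as np

# Interval version of the support recursion (16) for a scalar linear system
# s+ = A s + B a + w, |w| <= w_bar, under the ancillary policy pi_s(s) = -K s.
# All numbers are illustrative assumptions; Acl must be contractive.
A, B, K, w_bar = 1.2, 1.0, 0.7, 0.1
Acl = A - B * K                      # closed-loop gain, |Acl| = 0.5 < 1

lo, hi = -0.5, 0.5                   # X_1 = X_+(s, a): support after one step
for k in range(2, 40):               # X_k = Acl * X_{k-1} + [-w_bar, w_bar]
    lo, hi = Acl * lo - w_bar, Acl * hi + w_bar

# the supports contract towards the robust invariant interval
print(round(lo, 6), round(hi, 6))    # approaches +/- w_bar / (1 - Acl)
```

Checking (17) then amounts to verifying $h_s$ on the interval endpoints at each $k$; the convergence of the supports to a robust invariant set is what terminal constraints exploit to certify safety beyond a finite horizon.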
III. OPTIMIZATION-BASED SAFE POLICY

In this paper, we will consider parametrized deterministic policies $\pi_\theta \approx \pi^\star$ based on parametric optimization problems subject to safe stochastic disturbances. Before detailing the stochastic aspect, let us first detail the constrained optimization problems. We will consider parametrized deterministic policies $\pi_\theta$ based on parametric Nonlinear Programs (NLPs), and more specifically based on robust NMPC schemes. This approach is formally justified in [10]. More specifically, we will consider a policy approximation
$$\pi_\theta = u_0^\star(s,\theta), \tag{20}$$
where $u_0^\star(s,\theta)$ is the first $n_a$ entries of $u^\star(s,\theta)$ generated by the parametric NLP:
$$\begin{aligned} u^\star(s,\theta) = \arg\min_{u}\ & \Phi(x,u,\theta) & \text{(21a)} \\ \mathrm{s.t.}\ & f(x,u,s,\theta) = 0, & \text{(21b)} \\ & h(x,u,\theta) \le 0. & \text{(21c)} \end{aligned}$$
We will then consider that the safety requirement $\pi_\theta(s) \in S(s)$ is imposed via the constraints (21b)-(21c). A special case of (21) is an optimization scheme of the form:
$$\begin{aligned} u_0^\star(s,\theta) = \arg\min_{u_0}\ & \Phi(s,u_0,\theta) & \text{(22a)} \\ \mathrm{s.t.}\ & h(s,u_0,\theta) \le 0, & \text{(22b)} \end{aligned}$$
where $h \le 0$ ought to ensure that $\pi_\theta(s) = u_0^\star(s,\theta) \in S(s)$. While most of the discussions in this paper will take place around the general formulation (21), a natural approach to formulate the constraints (21b)-(21c) such that policy (20) is safe is to build (21) using robust (N)MPC techniques.

A. Policy approximation based on robust NMPC

The imposition of safety constraints can be treated via robust NMPC approaches. Robust NMPC can take different forms [14], all of which can eventually be cast in the form (21). One form of robust NMPC scheme is based on scenario trees [16], which takes the form:
$$u^\star(s,\theta) = \arg\min_{u}\ \sum_{j=1}^{N_M}\left( V_j(x_{j,N},\theta) + \sum_{k=0}^{N-1} \ell_j(x_{j,k},u_{j,k},\theta) \right) \tag{23a}$$
$$\mathrm{s.t.}$$
$$\begin{aligned} & x_{j,k+1} = F_j(x_{j,k},u_{j,k},\theta), \quad x_{j,0} = s, & \text{(23b)} \\ & h_s(x_{j,k},u_{j,k},\theta) \le 0, & \text{(23c)} \\ & e(x_{j,N},\theta) \le 0, & \text{(23d)} \\ & N(u) = 0, & \text{(23e)} \end{aligned}$$
where $F_{1,\ldots,N_M}$ are the $N_M$ different models used to support the uncertainty, while $F_0$ is a nominal model supporting the NMPC scheme. Trajectories $x_{j,k}$ and $u_{j,k}$ for $j = 1,\ldots,N_M$ are the different model trajectories and the associated inputs. Functions $\ell_{1,\ldots,N_M}$, $V_{1,\ldots,N_M}$ are the (possibly different) stage costs and terminal costs applying to the different models. The non-anticipativity constraints (23e) support the scenario-tree structure. For a given state $s$ and parameters $\theta$, the NMPC scheme (23) delivers the input profiles
$$u_j^\star(s,\theta) = \left\{\, u_{j,0}^\star(s,\theta), \ldots, u_{j,N}^\star(s,\theta)\, \right\}, \tag{24}$$
with $u_{j,i}^\star(s,\theta) \in \mathbb{R}^{n_a}$, and (23e) imposes
$$u_0^\star(s,\theta) := u_{i,0}^\star(s,\theta) = u_{j,0}^\star(s,\theta), \quad \forall\, i,j. \tag{25}$$
As a result, the NMPC scheme (23) generates a parametrized deterministic policy according to:
$$\pi_\theta(s) = u_0^\star(s,\theta) \in \mathbb{R}^{n_a}. \tag{26}$$
Policy $\pi_{\mathrm{s}}$ is implicitly deployed in (23) via the scenario tree. If the dispersion set $\mathbb{X}_+$ is known, the multiple models $F_{1,\ldots,N_M}$ and terminal constraints (23d) can be chosen such that the robust NMPC scheme (23) delivers $\pi_\theta(s) \in S(s)$. Unfortunately, this selection can be difficult in general. We turn next to the robust linear MPC case, where this construction is much simpler.

B. Safe robust linear MPC

Exhaustively discussing the construction of the safe scenario tree in (23) for a given dispersion set $\mathbb{X}_+(s,a)$ is beyond the scope of this paper. The process can be fairly involved, and we refer to [16], [2] for detailed discussions. For the sake of brevity, we will focus on the linear MPC case, whereby the MPC models $F_{1,\ldots,N_M}$ and policy $\pi_{\mathrm{s}}$ are linear.
Let us consider the following outer approximation of the dispersion set $\mathbb{X}_+$:
$$\mathbb{X}_+(s,a) \subseteq F_0(s,a,\theta) + \mathbb{W}, \quad \forall\, s,a \tag{27}$$
where we use a linear nominal model $F_0$ and a polytope $\mathbb{W}$ of vertices $W_{1,\ldots,N_M}$ that can be construed as the extrema of a finite-support process noise, and which can be part of (or functions of) the MPC parameters $\theta$. For the sake of simplicity, we assume that $\mathbb{W}$ is independent of the state-input pair $s,a$. The models $F_{1,\ldots,N_M}$ can then be built using:
$$F_i = F_0 + W_i, \quad i = 1,\ldots,N_M \tag{28}$$
and using the linear policy:
$$\pi_{\mathrm{s}}(x_{j,k}, u_{0,k}, x_{0,k}) = u_{0,k} - K\left(x_{j,k} - x_{0,k}\right) \tag{29}$$
where matrix $K$ can be part of (or a function of) the MPC parameters $\theta$. One can then verify by simple induction that:
$$\mathbb{X}_k(s,a,\pi_{\mathrm{s}}) \subseteq \mathrm{Conv}\left(x_{1,k}, \ldots, x_{N_M,k}\right), \tag{30}$$
for $k = 0,\ldots,N+1$, where $\mathrm{Conv}$ is the convex hull of the set of points $x_{1,k},\ldots,x_{N_M,k}$ solution of the MPC scheme (23). The terminal constraints (23d) ought then to be constructed, e.g., via the Robust Positive Invariant set corresponding to $\pi_{\mathrm{s}}$, in order to establish safety beyond the MPC horizon. For $h_s$ convex, the MPC scheme (23) delivers safe inputs [14], [13].

Algorithm 1: Resampling
  Input: State $s$, conditional density $\varrho(\cdot\,|\,\cdot)$
  Set sample = true
  while sample do
    Draw $a \sim \varrho(\,\cdot\,|\,\pi_\theta(s))$
    if $a \in S(s)$ then sample = false
  return $a$

When the dispersion set $\mathbb{X}_+(s,a)$ can only be inferred from data, condition (27) arguably translates to [5]:
$$s_{k+1} - F_0(s_k,a_k,\theta) \in \mathbb{W}, \quad \forall\, (s_{k+1},s_k,a_k) \in \mathbb{D}, \tag{31}$$
where $\mathbb{D}$ is the set of $N_{\mathbb{D}}$ observed state transitions. Condition (31) translates into a sample-based condition on the admissible parameters $\theta$, i.e., it specifies the parameters that are safe with respect to the state transitions observed so far. Condition (31) tests whether the points $s_{k+1} - F_0(s_k,a_k,\theta)$ are in the polytope $\mathbb{W}$, which can be easily translated into a set of algebraic constraints imposed on $\theta$.
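For a polytope described by linear inequalities, the sample-based check behind (31) reduces to evaluating those inequalities on each observed residual. A minimal sketch (the nominal model, polytope and data are illustrative assumptions):

```python
import numpy as np

# Sample-based check of condition (31): every observed residual
# s_{k+1} - F0(s_k, a_k, theta) must lie in the polytope W = {w | G w <= b}.
# The nominal model, polytope and data are illustrative assumptions.
A_mod = np.array([[1.0, 0.1], [0.0, 0.9]])    # nominal F0: s+ = A s + B a
B_mod = np.array([[0.0], [0.1]])
G = np.vstack([np.eye(2), -np.eye(2)])        # box |w_i| <= 0.05 as G w <= b
b = 0.05 * np.ones(4)

def residuals_in_W(transitions):
    """transitions: iterable of (s_next, s, a) observations, cf. the set D."""
    for s_next, s, a in transitions:
        w = s_next - (A_mod @ s + B_mod @ a)  # residual w.r.t. nominal model
        if np.any(G @ w > b + 1e-12):
            return False
    return True

ok  = [(np.array([1.02, 0.46]), np.array([1.0, 0.5]), np.array([0.0]))]
bad = [(np.array([1.15, 0.45]), np.array([1.0, 0.5]), np.array([0.0]))]
print(residuals_in_W(ok), residuals_in_W(bad))   # True False
```

Since the residual is affine in the model parameters, the same inequalities read as linear constraints on $\theta$ when the model is being learned.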
This observation will be used in Section VI to build a safe RL-based learning. We ought to underline here that building $F_0, \mathbb{W}$ based on (31) ensures the safety of the robust MPC scheme (23) only for an infinitely large, and sufficiently informative, data set $\mathbb{D}$. In practice, using a finite data set entails that safety is ensured with a probability less than 1. The quantification of the probability of having a safe policy for a given, finite data set $\mathbb{D}$ is beyond the scope of this paper, and is arguably best treated by means of Information Field Theory [9].

The extension of the construction of a safe MPC presented in this section to the general NMPC case is theoretically feasible, but can be computationally intensive in practice. This aspect of the problem is beyond the scope of this paper.

IV. SAFE STOCHASTIC POLICIES

In this section, we first discuss two intuitively appealing methods to generate safe stochastic policies from $\pi_\theta$, see Sections IV-A and IV-B, and detail their computational shortcomings in the context of the NMPC-based RL discussed in this paper. We then present an alternative approach in Section IV-C.

A. Safe stochastic policy via resampling

Let us first discuss a very natural approach to generating a safe stochastic policy, based on re-generating a random input $a$ until it is in the safe set $S(s)$. This can, e.g., be achieved by the trivial re-sampling Algorithm 1, where $\varrho(\,\cdot\,|\,\pi_\theta(s))$ is a probability density centred at $\pi_\theta(s)$. One can, e.g., choose for $\varrho(\cdot\,|\,\cdot)$ a Normal distribution centred at $\pi_\theta(s)$, i.e.,
$$a \sim \mathcal{N}\left(\pi_\theta(s), \Sigma\right). \tag{32}$$
Verifying the condition $a \in S(s)$ can then be done via classic optimization techniques, where one verifies the feasibility of the constraints (21c) in (21) when selecting $u_0 = a$ according to the proposed $a$.
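A direct transcription of Algorithm 1 with the Normal proposal (32) is straightforward; the stand-in membership test below replaces the NLP feasibility check, and all numbers are illustrative assumptions:

```python
import numpy as np

# Transcription of Algorithm 1 (resampling): draw from a Normal density
# centred at pi_theta(s), Eq. (32), and reject until a lands in S(s).
# The safe set, policy value and covariance are illustrative assumptions.
rng = np.random.default_rng(1)

def in_safe_set(a):
    return -1.0 <= a <= 1.0            # stand-in for the NLP feasibility test

def resample(pi_theta_s, sigma=0.5):
    while True:
        a = rng.normal(pi_theta_s, sigma)   # a ~ rho(. | pi_theta(s))
        if in_safe_set(a):
            return a

samples = [resample(0.8) for _ in range(1000)]
print(all(in_safe_set(a) for a in samples))   # True: unit-probability safety
```

The loop makes the safety guarantee trivial, but each rejected draw costs one feasibility test, which is exactly what becomes expensive when that test is an NLP.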
One can verify that the resulting stochastic policy $\pi_\theta[a\,|\,s]$ takes the probability density:
$$\pi_\theta[a\,|\,s] = \frac{\varrho\left(a\,|\,\pi_\theta(s)\right)}{\mu_\varrho(S(s))}, \tag{33}$$
where $\mu_\varrho(S(s))$ is the measure of density $\varrho$ over $S(s)$, i.e.,
$$\mu_\varrho(S(s)) = \int_{S(s)} \varrho\left(a\,|\,\pi_\theta(s)\right)\,\mathrm{d}a. \tag{34}$$
We then observe that the gradient of the score function required in calculating (12) reads as:
$$\nabla_\theta \log \pi_\theta[a\,|\,s] = \nabla_\theta \log \varrho\left(a\,|\,\pi_\theta(s)\right) - \nabla_\theta \log \mu_\varrho(S(s)), \tag{35}$$
where
$$\nabla_\theta \log \mu_\varrho(S(s)) = \frac{\nabla_\theta\, \mu_\varrho(S(s))}{\mu_\varrho(S(s))}. \tag{36}$$
A difficulty arising in the re-sampling approach is that if the safe set $S(s)$ is not trivial, evaluating (36) can only be done via sampling techniques. Sampling techniques are computationally efficient only if verifying the condition $a \in S(s)$ is inexpensive. This is unfortunately not the case in the NMPC context, where verifying $a \in S(s)$ requires solving an NLP, or at least a feasibility problem. Furthermore, evaluating the gradient of the measure $\nabla_\theta\, \mu_\varrho(S(s))$ via sampling is in general even more difficult.

B. Safe stochastic policy via softmax

We consider next a classic approach in RL to generate stochastic policies, based on the softmax approach, but adapted to the optimization-based policy approximation. Consider the following modification of (21), based on the primal interior-point method [8], using a logarithmic barrier:
$$\begin{aligned} \Phi_\tau^\star(a,s,\theta) = \min_{u}\ & \Phi(x,u,\theta) - \tau \sum_i \log\left(-h_i(x,u,\theta)\right) & \text{(37a)} \\ \mathrm{s.t.}\ & f(x,u,s,\theta) = 0, & \text{(37b)} \\ & u_0 = a. & \text{(37c)} \end{aligned}$$
Infeasible inputs $a$ ought to be treated by assigning an infinite value to $\Phi_\tau^\star$. A stochastic policy can then be defined as a softmax [18] over the cost of (37), i.e.:
$$\pi_\theta[a\,|\,s] \propto e^{-\Phi_\tau^\star(a,s,\theta)}. \tag{38}$$
We then observe that the gradient of the score function can be obtained via NLP sensitivity techniques [15] and reads as
$$\nabla_\theta \log \pi_\theta[a\,|\,s] = -\nabla_\theta \Phi_\tau^\star(a,s,\theta) = -\nabla_\theta \mathcal{L}, \tag{39}$$
where $\mathcal{L}$ is the Lagrange function associated to (37), i.e.,
$$\mathcal{L}(x,u,\lambda,\mu,\theta) = \Phi - \tau \sum_i \log(-h_i) + \lambda_0^\top (u_0 - a) + \lambda^\top f, \tag{40}$$
and $\lambda, \lambda_0$ are the multipliers associated to constraints (37b) and (37c), respectively.

Sampling the softmax policy (38) requires, in general, Importance Sampling techniques like the Metropolis-Hastings Algorithm (MHA), allowing one to sample an arbitrary continuous distribution. The difficulty with such techniques is that they typically require a large number of evaluations of (37) for generating each sample of $\pi_\theta[a\,|\,s]$. Hence, while the simplicity of this approach is appealing, it presents the significant drawback that sampling policy (38) can be very expensive. This difficulty is arguably alleviated in the static case (22), where (37) simplifies to:
$$\Phi_\tau^\star(a,s,\theta) = \Phi(s,a,\theta) - \tau \sum_i \log\left(-h_i(s,a,\theta)\right), \tag{41}$$
and the evaluation of (38) reduces to an evaluation of the cost and constraints in (41). We now turn to another option for building safe policies, which is more adequate for deployment in the NMPC case.

C. Optimization-based safe stochastic policy

We will consider a stochastic policy that generates control inputs
$$a \sim \pi_\theta[a\,|\,s] \tag{42}$$
computed from $a = u_0^d(s,\theta,d)$, where $u_0^d$ is generated by the randomly disturbed NLP
$$\begin{aligned} u^d(s,\theta,d) = \arg\min_{u}\ & \Phi_d(x,u,\theta,d) & \text{(43a)} \\ \mathrm{s.t.}\ & f(x,u,s,\theta) = 0, & \text{(43b)} \\ & h(x,u,\theta) \le 0, & \text{(43c)} \end{aligned}$$
for an arbitrary cost function $\Phi_d(u,s,\theta,d)$, and where the parameter $d \in \mathbb{R}^{n_a}$ is drawn from an arbitrary probability distribution of density $\varrho(d,\Sigma)$, which can, e.g., be a simple Gaussian distribution. One can readily observe that any realization of the inputs
$$a = u_0^d(s,\theta,d) \tag{44}$$
stemming from (43) is in $S(s)$ by construction. A simple choice for the cost function $\Phi_d(u,s,\theta,d)$ is via a gradient disturbance:
$$\Phi_d(u,s,\theta,d) = \Phi(u,s,\theta) + d^\top u_0. \tag{45}$$
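In a static, box-constrained instance, the gradient-disturbed scheme (43)-(45) admits a closed-form solution, which makes the sampling mechanism transparent. A minimal sketch (the quadratic cost, bounds and noise level are illustrative assumptions):

```python
import numpy as np

# Sketch of the gradient-disturbed policy (43)-(45) in a static,
# box-constrained instance: Phi_d(u0) = 0.5*(u0 - ubar)^2 + d*u0, whose
# minimizer over [u_min, u_max] is clip(ubar - d, u_min, u_max). Each
# sample requires one (here trivial) NLP solve and is safe by construction.
# The cost, bounds and noise level are illustrative assumptions.
rng = np.random.default_rng(2)
u_min, u_max, ubar, sigma = -1.0, 1.0, 0.7, 0.4

def sample_action():
    d = sigma * rng.standard_normal()        # d ~ rho(d, Sigma)
    return np.clip(ubar - d, u_min, u_max)   # solve the disturbed problem (43)

a = np.array([sample_action() for _ in range(2000)])
print(a.min() >= u_min and a.max() <= u_max)    # all realizations are safe
print(float((a == u_max).mean()))               # mass accumulated on the border
```

The last line hints at the non-bijectivity issue discussed below: the clipping projects all disturbances beyond the bound onto the boundary, so the resulting density carries a Dirac-like mass there.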
The choice of cost (45) entails that the random variable $d$ yields a gradient disturbance in the original problem (21), and introduces stochasticity in the generated inputs $a$. One can readily observe that generating a sample from (42) only requires one to generate a sample from the chosen density $\varrho(d,\Sigma)$ and to solve the disturbed NMPC problem (43). It is therefore dramatically less expensive than the resampling and softmax approaches of Sections IV-A and IV-B. We will show next that computing the gradient of the score function of (42) does not require any sampling, provided that adequate algorithmic tools are adopted.

V. POLICY GRADIENT FOR OPTIMIZATION-BASED SAFE STOCHASTIC POLICY

We develop next the gradient of the score function associated to (42)-(43). Unfortunately, a technical difficulty must first be alleviated here. Indeed, evaluating the stochastic policy $\pi_\theta[a\,|\,s]$ resulting from (42)-(43) is in general very difficult, because it cannot be simply expressed as a function of the probability density $\varrho(d,\Sigma)$. The mapping from $d$ to $u_0^d$ generated by the NLP (43) is in general not bijective, as it acts as a (possibly nonlinear) projection operator of the distribution $\varrho$ onto the safe set $S(s)$. A practical outcome of $u_0^d$ being non-bijective is that the resulting stochastic policy becomes Dirac-like on the boundary of the safe set $S(s)$, see Fig. 1 for an illustration.

Fig. 1: Illustration of the stochastic policy resulting from (42)-(44) for different values of $\tau$ for a fixed $s$, and $u_0^d$ restricted within a set $S(s)$ depicted as the solid line. The resulting probability density $\pi_\theta^\tau[a\,|\,s]$ is constrained to remain within $S(s)$. For very low values of $\tau$, the density tends to a Dirac-like distribution on the border of the set (right-side graph).

In order to alleviate this difficulty, similarly to the developments of Sec.
IV-B, we will cast (43) in an interior-point context. For computational reasons, we will consider the primal-dual interior-point formulation of (43) [8], which has the First-Order Necessary Conditions (FONC):
$$r_\tau(z,\theta,d) = \begin{bmatrix} \nabla_w \Phi_d + \nabla_w h\, \mu + \nabla_w f\, \lambda_f \\ \mathrm{diag}(\mu)\, h + \tau \end{bmatrix} = 0 \tag{46}$$
for $\tau > 0$ and under the conditions $h < 0$, $\mu > 0$. Here we label $w = \{u, x\}$ and $z = \{w, \lambda, \mu\}$ the primal-dual variables of (46). We will label $u^\tau(s,\theta,d)$ the parametric primal solution of (46), and $\pi_\theta^\tau[a\,|\,s]$ the stochastic policy resulting from using $a = u_0^\tau(s,\theta,d)$. Under standard regularity assumptions [] on (43), the algebraic conditions (46) admit a primal-dual solution that matches the solution of (43) with an accuracy of the order of the relaxation parameter $\tau$. Moreover, the solution $u^\tau(s,\theta,d)$ is guaranteed to satisfy the constraints of (43), hence it delivers safe policies. Additionally, (46) is smooth, and the mapping from $d$ to $u_0^d$ becomes bijective under some mild conditions. We therefore propose to use (46) as a smooth surrogate for (43). We will then use the sensitivities of (46) to compute the gradient of the score function of $\pi_\theta^\tau$. Fig. 1 provides an illustration of the stochastic policy delivered by (46) for different values of $\tau$, and of how the stochastic policy adopts a Dirac-like shape on the border of the safety set when $\tau \to 0$.

In the following, we will use the notation $g$ for the first block of $m$ inputs resulting from (46), i.e.:
$$g(s,\theta,d) = u_0^\tau(s,\theta,d) \approx u_0^d(s,\theta,d), \tag{47}$$
delivering $a$ from (46). The stochastic policy $\pi_\theta$ then results from the transformation of the probability density $d \sim \varrho(d,\Sigma)$ via $g$, and can be evaluated using [7]
$$\pi_\theta[a\,|\,s] = \varrho\left(g^{-1},\Sigma\right)\, \left|\det\left(\frac{\partial g^{-1}}{\partial a}\right)\right|\, \Bigg|_{\,a,\theta,s}, \tag{48}$$
where the function $g^{-1}$ is such that
$$d = g^{-1}(a,\theta,s) \tag{49}$$
for any $d$ and associated $a$ delivered by (47).
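The change-of-variables formula (48) can be illustrated on a smooth, bijective one-dimensional surrogate map, an illustrative stand-in for the interior-point smoothed NLP map (the choice $g(d) = \tanh(d)$ and all numbers are assumptions):

```python
import numpy as np

# Change-of-variables evaluation (48) for a smooth bijective surrogate map
# a = g(d) = tanh(d), an illustrative stand-in for the interior-point
# smoothed NLP map, with d ~ N(0, 1): pi[a] = rho(g^-1(a)) * |d g^-1 / d a|.
rng = np.random.default_rng(3)

def policy_density(a):
    d = np.arctanh(a)                               # g^-1
    rho = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)  # rho(d, Sigma = 1)
    return rho / (1.0 - a**2)                       # times |det dg^-1/da|

# empirical check: histogram of a = tanh(d) against the density (48)
a = np.tanh(rng.standard_normal(200000))
counts, edges = np.histogram(a, bins=50, range=(-0.9, 0.9))
width = edges[1] - edges[0]
centers = 0.5 * (edges[:-1] + edges[1:])
empirical = counts / (a.size * width)
print(np.max(np.abs(empirical - policy_density(centers))) < 0.05)
```

As in the NLP setting, the support of the transformed density is confined to the image of the map, and the Jacobian term inflates the density near the boundary, the smooth analogue of the Dirac-like behavior as $\tau \to 0$.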
The (local) existence of $g^{-1}$ is guaranteed by the implicit function theorem if $\frac{\partial g}{\partial d}$ is full rank. We will use (48) to compute the gradient of the score function of $\pi_\theta$. For the sake of completeness, we provide hereafter a Lemma establishing the rank of $\frac{\partial g}{\partial d}$ for the gradient disturbance strategy (45).

Lemma 1: For the choice of cost function (45), and if (43) satisfies LICQ and SOSC, the Jacobian $\frac{\partial g}{\partial d}$ of the function $g$ implicitly defined by (46)-(47) is full rank for any $\tau > 0$.

Proof: For the sake of simplicity, we will prove the result using the primal interior-point conditions corresponding to (46). The Lemma will then hold from the equivalence between the primal-dual and primal interior-point problems [15]. The primal interior-point conditions read as [15]:
$$\nabla_w \Phi_d - \tau\, \nabla_w h\, \mathrm{diag}(h)^{-1}\mathbf{1} + \nabla_w f\, \lambda_f = 0. \tag{50}$$
The Implicit Function Theorem (IFT) then guarantees that:
$$\begin{bmatrix} H & \nabla_w f \\ \nabla_w f^\top & 0 \end{bmatrix} \begin{bmatrix} \frac{\partial w}{\partial d} \\[2pt] \frac{\partial \lambda}{\partial d} \end{bmatrix} = -\begin{bmatrix} \nabla_{wd}\Phi_d \\ 0 \end{bmatrix}, \tag{51}$$
where $H$ is the Jacobian of the first row in (50). Defining $N$ as the null space of $\nabla_w f^\top$, i.e., $\nabla_w f^\top N = 0$, and using $\nabla_{u_0 d}\Phi_d = I_{n_a\times n_a}$ from (45), one can verify that:
$$\frac{\partial w}{\partial d} = -N\left(N^\top H N\right)^{-1} N^\top\, \nabla_{wd}\Phi_d \tag{52}$$
$$\phantom{\frac{\partial w}{\partial d}} = -N\left(N^\top H N\right)^{-1} N_0^\top, \tag{53}$$
where $N_0 = \begin{bmatrix} I_{n_a\times n_a} & 0 & \cdots & 0 \end{bmatrix} N$. The invertibility of $N^\top H N$ is guaranteed if (43) satisfies LICQ and SOSC. It follows that
$$\frac{\partial g}{\partial d} = -N_0 \left(N^\top H N\right)^{-1} N_0^\top. \tag{54}$$
Since the dynamics $f$ cannot restrict the input $u$ in (43), $N$ spans the full space of $u$, and therefore must span the full input space for $u_0$, such that $N_0$ is full rank. As a result, (54) is full rank.

One can observe that Lemma 1 is also trivially valid in the static case. We ought to caveat Lemma 1 by observing that, while the matrix $\frac{\partial g}{\partial d}$ is full rank for any $\tau > 0$, it can nonetheless tend to a rank-deficient matrix for $\tau \to 0$. This issue will be discussed in Proposition 2 and in the following remarks.
The following Lemma provides the sensitivity of the function (49), which will be required to compute the stochastic policy gradient.

Lemma 2: If (43) satisfies SOSC and LICQ, then the following equalities hold:
$$\frac{\partial g^{-1}}{\partial \theta} = -\left(\frac{\partial g}{\partial d}\right)^{-1} \frac{\partial g}{\partial \theta}, \tag{55a}$$
$$\frac{\partial g^{-1}}{\partial a} = \left(\frac{\partial g}{\partial d}\right)^{-1}, \tag{55b}$$
for any $a, \theta, s$ and $d = g^{-1}(a,\theta,s)$.

Proof: We observe that
$$g\left(s,\theta,g^{-1}(a,\theta,s)\right) = a, \quad \forall\, a,\theta,s. \tag{56}$$
It follows that
$$\frac{\mathrm{d}}{\mathrm{d}\theta}\, g\left(s,\theta,g^{-1}(a,\theta,s)\right) = \frac{\partial g}{\partial \theta} + \frac{\partial g}{\partial d}\frac{\partial g^{-1}}{\partial \theta} = 0, \tag{57}$$
which establishes (55a). Moreover, we observe that
$$\frac{\mathrm{d}}{\mathrm{d}a}\, g\left(s,\theta,g^{-1}(a,\theta,s)\right) = \frac{\partial g}{\partial d}\frac{\partial g^{-1}}{\partial a} = I, \tag{58}$$
which establishes (55b).

A. Gradient of the score function

We can then use (48) to develop expressions for computing the gradient of the policy score function $\nabla_\theta \log \pi_\theta$. This is detailed in the following Proposition.

Proposition 1: The gradient of the score function for a given realization of $a$, obtained from a realization of $d$ via solving (43), reads as:
$$\nabla_\theta \log \pi_\theta[a\,|\,s] = m - \left(\varrho^{-1}\, \frac{\partial \varrho}{\partial d}\, \left(\frac{\partial g}{\partial d}\right)^{-1} \frac{\partial g}{\partial \theta}\right)^\top, \tag{59}$$
evaluated at $s, \theta, d$, and where
$$m_i = \mathrm{Tr}\left(\frac{\partial g}{\partial d}\, \frac{\mathrm{d}}{\mathrm{d}\theta_i} \frac{\partial g^{-1}}{\partial a}\right)\Bigg|_{\,s,\theta,d,a}. \tag{60}$$
Computational techniques to evaluate (59)-(60) are provided in Section V-B.

Proof: Using (48), the score function of $\pi_\theta[a\,|\,s]$ is given by:
$$\log \pi_\theta[a\,|\,s] = \log \varrho\left(g^{-1},\Sigma\right) + \log \det\left(\frac{\partial g^{-1}}{\partial a}\right). \tag{61}$$
Using (55a), we observe that:
$$\nabla_\theta \log \varrho\left(g^{-1},\Sigma\right) = \left(\varrho^{-1}\, \frac{\partial \varrho}{\partial d}\, \frac{\partial g^{-1}}{\partial \theta}\right)^\top\Bigg|_{\,s,\theta,d} = -\left(\varrho^{-1}\, \frac{\partial \varrho}{\partial d}\, \left(\frac{\partial g}{\partial d}\right)^{-1} \frac{\partial g}{\partial \theta}\right)^\top\Bigg|_{\,s,\theta,d}, \tag{62}$$
hence providing the second term in (59). From calculus and using (55b), we get:
$$\frac{\mathrm{d}}{\mathrm{d}\theta_i} \log \det\left(\frac{\partial g^{-1}}{\partial a}\right) = \mathrm{Tr}\left(\left(\frac{\partial g^{-1}}{\partial a}\right)^{-1} \frac{\mathrm{d}}{\mathrm{d}\theta_i} \frac{\partial g^{-1}}{\partial a}\right) = \mathrm{Tr}\left(\frac{\partial g}{\partial d}\, \frac{\mathrm{d}}{\mathrm{d}\theta_i} \frac{\partial g^{-1}}{\partial a}\right) = m_i, \tag{63}$$
hence providing (60) component-wise.

We now turn to detailing how the sensitivities of the functions $g$ and $g^{-1}$ can be computed at limited computational cost.

B. Sensitivity computation

We provide hereafter some expressions allowing one to evaluate the terms in (59)-(60).
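The trace identity underpinning (63), $\frac{\mathrm{d}}{\mathrm{d}\theta}\log\det A(\theta) = \mathrm{Tr}\left(A(\theta)^{-1}\frac{\mathrm{d}A}{\mathrm{d}\theta}\right)$, is easy to check numerically on a small parametric matrix (chosen purely for illustration):

```python
import numpy as np

# Finite-difference check of the trace identity used in (63):
# d/dtheta log det A(theta) = Tr(A(theta)^-1 dA/dtheta),
# on a small parametric matrix chosen purely for illustration.
def A(t):
    return np.array([[2.0 + t, 0.3], [0.1, 1.5 + 0.5 * t]])

dA_dt = np.array([[1.0, 0.0], [0.0, 0.5]])   # dA/dtheta (here constant)

t, eps = 0.2, 1e-6
lhs = (np.log(np.linalg.det(A(t + eps)))
       - np.log(np.linalg.det(A(t - eps)))) / (2 * eps)
rhs = np.trace(np.linalg.inv(A(t)) @ dA_dt)
print(abs(lhs - rhs) < 1e-6)   # True
```

In the paper's setting the role of $A$ is played by $\frac{\partial g^{-1}}{\partial a}$, whose inverse is available via (55b) without forming it explicitly.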
First, it is useful to provide the sensitivities of function g = z_0, where z_0 collects the first m elements of z, the solution of (46). If LICQ and SOSC hold [15] for the NLP (43), one can verify that the Implicit Function Theorem (IFT) guarantees that:
\[
\frac{\partial r_\tau}{\partial z}\frac{\partial z}{\partial d} + \frac{\partial r_\tau}{\partial d} = 0, \qquad \frac{\partial r_\tau}{\partial z}\frac{\partial z}{\partial \theta} + \frac{\partial r_\tau}{\partial \theta} = 0, \tag{64}
\]
and therefore ∂g/∂d and ∂g/∂θ, required in the second term of (59), can be extracted from the first m rows of ∂z/∂d and ∂z/∂θ, obtained by solving the linear systems (64).

Obtaining the second-order term ∂²g^{-1}/∂θ_i∂a in (63) can be fairly involved. In order to simplify its computation, we propose the following approach. Let us define:
\[
\tilde z = \{d, u_1, \ldots, u_{N-1}, x, \lambda, \mu\}, \tag{65}
\]
given implicitly by (46) as a function of s, θ, a. One can then construe \tilde z, and therefore d, as an implicit function of s, θ, u_0, with u_0 = a, defined by (46). It follows that g^{-1} = \tilde z_0, i.e., function g^{-1} is given by the first m entries of \tilde z, implicitly defined by (46). The IFT then naturally applies and delivers:
\[
\frac{\partial r_\tau}{\partial \tilde z}\frac{\partial \tilde z}{\partial a} + \frac{\partial r_\tau}{\partial u_0} = 0, \qquad \frac{\partial r_\tau}{\partial \tilde z}\frac{\partial \tilde z}{\partial \theta} + \frac{\partial r_\tau}{\partial \theta} = 0, \tag{66}
\]
such that ∂g^{-1}/∂a and ∂g^{-1}/∂θ can be extracted from the first m rows of ∂\tilde z/∂a and ∂\tilde z/∂θ, obtained by solving the linear systems (66). The second-order term ∂²g^{-1}/∂θ_i∂a in (63) can then be obtained by solving the second-order sensitivity equation of the NLP:
\[
\frac{\partial r_\tau}{\partial \tilde z}\frac{\partial^2 \tilde z}{\partial \theta_i \partial a} + \left( \frac{\partial^2 r_\tau}{\partial \theta_i \partial \tilde z} + \sum_j \frac{\partial^2 r_\tau}{\partial \tilde z\, \partial \tilde z_j}\frac{\partial \tilde z_j}{\partial \theta_i} \right)\frac{\partial \tilde z}{\partial a} + \frac{\partial^2 r_\tau}{\partial \theta_i \partial a} + \sum_j \frac{\partial^2 r_\tau}{\partial a\, \partial \tilde z_j}\frac{\partial \tilde z_j}{\partial \theta_i} = 0. \tag{67}
\]
One can solve the linear system (67) for ∂²\tilde z/∂θ_i∂a, and therefore obtain ∂²g^{-1}/∂θ_i∂a. We ought to underline here that the linear systems (64), (66) and (67) are large but very sparse when they stem from an NMPC scheme. Their sparsity ought to be exploited for computational efficiency, both when forming and when solving the systems.
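As a concrete illustration of forming and solving first-order systems of the type (64), the following sketch builds the KKT matrix of a hypothetical equality-constrained quadratic toy problem (standing in for (43); all names, dimensions and data are illustrative only), extracts the parametric sensitivities via the IFT, and cross-checks them against finite differences of the parametric solution:

```python
import numpy as np

# Hypothetical toy problem: min_w 0.5 w'Hw + (theta + S d)'w  s.t.  A w = b.
# KKT residual r(z; d) = [Hw + theta + S d + A'lam ; A w - b], z = (w, lam).
rng = np.random.default_rng(0)
n, m_eq, n_d = 4, 2, 2
M = rng.standard_normal((n, n)); H = M @ M.T + n * np.eye(n)   # H > 0 (SOSC)
A = rng.standard_normal((m_eq, n)); b = rng.standard_normal(m_eq)  # full rank (LICQ)
S = rng.standard_normal((n, n_d)); theta = rng.standard_normal(n)

K = np.block([[H, A.T], [A, np.zeros((m_eq, m_eq))]])          # dr/dz (KKT matrix)

def solve_kkt(d):
    rhs = np.concatenate([-(theta + S @ d), b])
    return np.linalg.solve(K, rhs)                              # z = (w, lam)

# IFT: (dr/dz) dz/dd = -dr/dd, with dr/dd = [S; 0]
dr_dd = np.vstack([S, np.zeros((m_eq, n_d))])
dz_dd = -np.linalg.solve(K, dr_dd)

# Cross-check against finite differences of the parametric solution
d0, eps = np.zeros(n_d), 1e-6
fd = np.column_stack([(solve_kkt(d0 + eps*e) - solve_kkt(d0 - eps*e)) / (2*eps)
                      for e in np.eye(n_d)])
assert np.allclose(dz_dd, fd, atol=1e-6)

# dg/dd: first m rows of dz/dd (here the first input block, hypothetically m = 2)
dg_dd = dz_dd[:2, :]
```

In an actual NMPC scheme the matrix K is large and sparse, and a sparse factorization (e.g., `scipy.sparse.linalg.splu`) should replace the dense solve above.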
Unfortunately, the computational complexity of evaluating the linear system (67) grows with the number of parameters θ, which is unfavorable for rich parametrizations of the policy.

C. Limit case of the gradient of the score function

One ought to observe that matrix ∂g/∂d becomes asymptotically (τ → 0) rank-deficient if some constraints are active at the first stage k = 0, hence restricting a to some manifold of R^{n_a}. As a result, it is not obvious that the terms involved in (59) and (63) are asymptotically well defined for τ → 0. The following Proposition alleviates this concern in some cases; the remaining cases are discussed after the Proposition.

Proposition 2: For the choice of cost function (45), if the MPC model dynamics and constraints do not depend on the parameters, i.e., ∇_θ h = 0, ∇_θ f = 0, and if ∂∇²_w Φ_d/∂θ = 0, then the expressions
\[
\frac{\partial g}{\partial d}^{-1}\frac{\partial g}{\partial \theta}, \qquad \frac{\partial g}{\partial d}^{-1}\nabla_{\theta_i}\frac{\partial g}{\partial \theta}, \tag{68}
\]
are well defined for τ → 0 if Problem (43) fulfils LICQ and SOSC.

Proof: We will proceed by proving that the expressions (68) are well defined in the sense of the pseudo-inverse in an active-set setting deployed on (43). The asymptotic result (68) will then hold from the convergence of the Interior-Point solution to the active-set one. Consider A, the (strictly) active set of (43), i.e., the set of indices i such that h_i = 0, µ_i > 0 at the solution. We observe that
\[
\begin{bmatrix} H & \nabla_w q \\ \nabla_w q^\top & 0 \end{bmatrix}
\begin{bmatrix} \frac{\partial w}{\partial d} \\[2pt] \frac{\partial \nu}{\partial d} \end{bmatrix}
= -\begin{bmatrix} \nabla_{wd}\Phi_d \\ 0 \end{bmatrix}, \tag{69}
\]
where H is the Hessian of the Lagrange function associated to (43), and
\[
q = \begin{bmatrix} f \\ h_A \end{bmatrix}, \qquad \nu = \begin{bmatrix} \lambda \\ \mu_A \end{bmatrix}. \tag{70}
\]
Defining N_A as a basis of the null space of ∇_w q^⊤, i.e., ∇_w q^⊤ N_A = 0, and following the same line of argument as in Lemma 1, we observe that:
\[
\frac{\partial g}{\partial d} = -N_{A,0}\left(N_A^\top H N_A\right)^{-1} N_{A,0}^\top, \tag{71}
\]
where
\[
N_{A,0} = \begin{bmatrix} I_{m\times m} & 0 & \cdots & 0 \end{bmatrix} N_A.
\]
Using a similar reasoning, since ∂∇_w q/∂θ = 0 and ∂H/∂θ = 0, we observe that:
\[
\partial_\theta^k g = -N_{A,0}\left(N_A^\top H N_A\right)^{-1} N_A^\top\, \partial_\theta^k \nabla_w \Phi_d, \tag{72}
\]
where ∂_θ^k denotes the differential operator with respect to θ associated to any multi-index k. This entails that ∂_θ^k g can be expressed as ∂_θ^k g = N_{A,0} Q for some matrix Q. Consider then the linear system in the unknown matrix X:
\[
\frac{\partial g}{\partial d} X + \partial_\theta^k g = 0. \tag{73}
\]
Since N_A^⊤ H N_A is full rank, N_{A,0} Q is in the span of the matrix N_{A,0}(N_A^⊤ H N_A)^{-1} N_{A,0}^⊤. It follows that (73) is consistent, such that it can be solved for X using, e.g., a pseudo-inverse. As a result, by continuity of the solution manifold defined by (46), the expressions (68) have a well-defined limit for τ → 0, given by the solution of the linear system (73).

It is important to note here that Proposition 2 relies on the safety constraints h being independent of the parameters θ. This requirement is not an artificial effect of the approach, but rather a fundamental limitation of deploying the stochastic policy gradient approach on safety sets. Indeed, one can observe that the probability density π_θ[a|s] can be discontinuous at the border ∂S(s) of the set S(s) defined by h, as the probability density is (possibly) non-zero at ∂S(s) and zero outside. As a result, the gradient ∇_θ log π_θ[a|s] can be ill-defined for a ∈ ∂S(s) if changes in the parameters θ can move the set border ∂S(s). The Interior-Point approach proposed in Section IV-C alleviates this problem, at the expense of keeping τ finite rather than taking τ → 0: the problem is then avoided by smoothing the transition from a non-zero density inside S(s) to a zero density outside. These observations are illustrated in Figures 1-2.

Fig. 2: Illustration of the gradient of the stochastic policy resulting from (42)-(44) for different values of τ, with s fixed and u_0^d restricted to a set S(s), depicted as the solid circle. The first row of graphs depicts the gradient with respect to parameter θ_1, for which ∇_{θ_1} h = ∇_{θ_1} f = 0, while the second row depicts the gradient with respect to parameter θ_2, for which ∇_{θ_2} h ≠ 0. The last row depicts the conditioning of matrix ∂g/∂d. As predicted by Proposition 2 for τ → 0, ∂g/∂d tends to a rank-deficient matrix for a = u_0^d → ∂S(s), and the gradient ∇_{θ_2} log π_θ degenerates while ∇_{θ_1} log π_θ does not.

VI. SAFE RL STEPS

The methodology described so far allows one to deploy a safe policy and safe exploration using a robust NMPC scheme, in order to compute the stochastic policy gradient and determine directions in the parameter space θ that improve the closed-loop performance of the resulting control policy. However, taking a step in θ can arguably jeopardise the safety of the policy, e.g., by modifying the constraints or the models underlying the robust NMPC scheme. The problem of modifying the NMPC parameters while maintaining safety is arguably a complex one, and beyond the scope of this paper. However, in the robust linear MPC context detailed in Section III-B, there is a simple approach to handle this problem, which we detail here. We observe that a classic gradient step of step-size α > 0 reads as:
\[
\theta = \theta^- - \alpha \nabla_\theta J, \tag{74}
\]
where θ^- is the previous vector of parameters. One can trivially observe that the gradient step can be construed as the solution of the optimization problem:
\[
\min_\theta\ \tfrac{1}{2}\left\| \theta - \theta^- \right\|^2 + \alpha \nabla_\theta J^\top (\theta - \theta^-). \tag{75}
\]
Imposing the data-driven safe-design constraints (31) on the gradient step generating the new parameters can then simply be cast as the following constrained optimization problem:
\[
\min_{\theta, \vartheta}\ \tfrac{1}{2}\left\| \theta - \theta^- \right\|^2 + \alpha \nabla_\theta J^\top (\theta - \theta^-) \tag{76a}
\]
\[
\text{s.t.}\quad s_{k+1} - F_0(s_k, a_k, \theta) - \sum_{i=1}^{V} \vartheta_{i,k} W_i = 0, \quad \forall\, k = 0, \ldots, N_D, \tag{76b}
\]
\[
\sum_{i=1}^{V} \vartheta_{i,k} = 1, \quad \forall\, k = 0, \ldots, N_D, \tag{76c}
\]
\[
\vartheta_{i,k} \geq 0, \quad \forall\, k = 0, \ldots, N_D, \quad i = 1, \ldots, V, \tag{76d}
\]
where (76b)-(76d) are the algebraic conditions testing (31). We observe that, unfortunately, the complexity of (76) grows with the amount of data N_D in use. In practice, the data set D should arguably be limited to incorporate only the relevant state transitions. A data compression technique has been proposed in [21] to alleviate this issue in the case where the nominal model F_0 is fixed. Future work will improve on this baseline.

VII. IMPLEMENTATION & ILLUSTRATIVE EXAMPLE

In this section, we provide some details on how the principles presented in this paper can be implemented, and provide an illustrative example of this implementation. At each time instant k, for a given state s_k, a sample is drawn from the stochastic policy π_θ, computed according to (46) with d ∼ ρ(·, Σ). The gradient of the score function is then computed using (59). The data are collected to compute (12)-(13), either on-the-fly or in a batch fashion. The policy gradient estimation (12) is then used to compute the safe parameter update according to (76).

A. RL approach

In this example, the policy gradient was evaluated using batch Least-Squares Temporal-Difference (LSTD) techniques, whereby for each evaluation the closed-loop system is run S times for N_t time steps, hence generating S trajectory samples of duration N_t. The value function estimation is constructed using:
\[
\sum_{k=0}^{N_t} \sum_{i=1}^{S} \delta_V(s_{k,i}, a_{k,i}, s_{k+1,i})\, \nabla_v \hat V_v^{\pi_\theta}(s_{k,i}) = 0, \tag{77a}
\]
\[
\delta_V := L(s_{k,i}, a_{k,i}) + \gamma \hat V_v^{\pi_\theta}(s_{k+1,i}) - \hat V_v^{\pi_\theta}(s_{k,i}), \tag{77b}
\]
using a linear value function approximation
\[
\hat V_v(s) = \rho(s)^\top v. \tag{78}
\]
In this example, (78) uses a simple quadratic function in ρ(s) to parametrize \hat V_v. We observe that (77) is linear in the parameters v, and therefore straightforward to solve. However, it can be ill-posed on some data sets, and ought then to be solved using, e.g., a Moore-Penrose pseudo-inverse.
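As an illustration, the batch LSTD conditions (77)-(78) reduce to a small linear system in v, from which score-weighted TD errors then deliver the policy-gradient estimate. The sketch below uses a scalar toy system and a Gaussian policy as hypothetical stand-ins for the NMPC-based policy (all models and numbers are illustrative):

```python
import numpy as np

# Batch LSTD evaluation on a hypothetical scalar system s' = 0.9 s + a + noise,
# with Gaussian policy a ~ N(-theta*s, sigma^2) standing in for the MPC policy.
rng = np.random.default_rng(3)
gamma, sigma, theta = 0.99, 0.1, 0.6
S_traj, N_t = 30, 20

phi = lambda s: np.array([s**2, s, 1.0])       # quadratic features, cf. (78)

M = np.zeros((3, 3)); c = np.zeros(3)
data = []                                       # (s, a, L, s_next) transitions
for _ in range(S_traj):
    s = rng.uniform(-1, 1)
    for _ in range(N_t):
        a = -theta * s + sigma * rng.standard_normal()
        L = 0.05 * s**2 + 0.5 * a**2            # hypothetical stage cost
        s_next = 0.9 * s + a + 0.01 * rng.standard_normal()
        # (77): sum_k delta_V * grad_v Vhat(s) = 0  <=>  M v = c
        M += np.outer(phi(s), phi(s) - gamma * phi(s_next))
        c += L * phi(s)
        data.append((s, a, L, s_next))
        s = s_next

v = np.linalg.lstsq(M, c, rcond=None)[0]        # pseudo-inverse solve of (77)

# Policy-gradient estimate: sum of score-weighted TD errors over the batch
grad = 0.0
for s, a, L, s_next in data:
    delta = L + gamma * phi(s_next) @ v - phi(s) @ v
    score = (a + theta * s) / sigma**2 * (-s)   # d/dtheta log N(a; -theta*s, sigma^2)
    grad += score * delta
```

In the paper's setting, the analytic score of the Gaussian policy above is replaced by the NMPC-based score gradient (59).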
The policy gradient estimation is then obtained using (12):
\[
\widehat{\nabla_\theta J}(\pi_\theta) = \sum_{k=0}^{N_t} \sum_{i=1}^{S} \nabla_\theta \log \pi_\theta(s_{k,i})\, \delta_V. \tag{79}
\]

B. Robust linear MPC scheme

While the proposed theory is not limited to linear problems, for the sake of clarity we propose a fairly simple robust linear MPC example using multiple models and process noise. We will consider the policy as delivered by the following robust MPC scheme, based on multiple models and a linear feedback policy:
\[
\min_{u, x}\ \sum_{j=0}^{N_M} \left( \left\| x_{j,N} - \bar x \right\|^2 + \sum_{k=0}^{N-1} \left\| \begin{bmatrix} x_{j,k} - \bar x \\ u_{j,k} - \bar u \end{bmatrix} \right\|^2 \right) \tag{80a}
\]
\[
\text{s.t.}\quad x_{j,k+1} = A_0 x_{j,k} + B_0 u_{j,k} + b_0 + W_j, \tag{80b}
\]
\[
\left\| x_{j,k} \right\|^2 \leq 1, \quad \forall\, j = 0, \ldots, N_M,\ k = 1, \ldots, N, \tag{80c}
\]
\[
x_{j,0} = s, \quad \forall\, j = 1, \ldots, N_M, \tag{80d}
\]
\[
u_{j,0} = u_{k,0}, \quad \forall\, k,\ j = 0, \ldots, N_M, \tag{80e}
\]
\[
u_{j,k} = u_{0,k} - K\left( x_{j,k} - x_{0,k} \right), \quad j = 1, \ldots, N_M, \tag{80f}
\]
where A_0, B_0, b_0 yield the MPC nominal model corresponding to F_0, with W_0 = 0, and W_{1,…,N_M} capture the vertices of the outer approximation of the dispersion set. Hence model j = 0 serves as the nominal model, and models j = 1, …, N_M capture the state dispersion over time. The linear feedback matrix K is possibly part of the MPC parameters θ, and is a (rudimentary) structure providing a feedback π_s as described in Section II-B. In practice, (80) is equivalent to a tube-based MPC.

C. Simulation setup & results

The simulations proposed here use the same setup as the companion paper [11] treating the deterministic policy gradient case, so as to make comparisons straightforward. The experimental parameters are summarized in Table I, and the real system reads as:
\[
x_{k+1} = A_{\mathrm{real}} x_k + B_{\mathrm{real}} u_k + n, \tag{81}
\]
where the process noise n is selected normal, centred, and clipped to a ball. The real system was selected as:
\[
A_{\mathrm{real}} = \kappa \begin{bmatrix} \cos\beta & \sin\beta \\ -\sin\beta & \cos\beta \end{bmatrix}, \qquad B_{\mathrm{real}} = \begin{bmatrix} 1.1 & 0 \\ 0 & 0.9 \end{bmatrix}. \tag{82}
\]
The real process noise n is chosen normal and centred, with covariance \tfrac{1}{3}\, 10^{-2} I, and restricted to a ball of radius \tfrac{1}{2}\, 10^{-2}.
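In such a setup, the safe update (76) is a small constrained QP in (θ, ϑ). The following sketch solves it with SciPy's SLSQP routine, learning only the nominal-model bias b_0 as a hypothetical simplification, with illustrative data, fixed dispersion vertices, and a hypothetical policy-gradient estimate:

```python
import numpy as np
from scipy.optimize import minimize

# Safe gradient step (76) on a toy problem: theta = b0 only (hypothetical),
# dispersion vertices W_i fixed, convex hull = square of half-width 0.1.
rng = np.random.default_rng(2)
A0, B0 = np.array([[0.9, 0.1], [0.0, 0.8]]), np.eye(2)
Wv = 0.1 * np.array([[-1, -1], [1, -1], [1, 1], [-1, 1]]).T    # 2 x V vertices
N_D, V = 8, 4
S  = rng.standard_normal((2, N_D)); Ain = rng.standard_normal((2, N_D))
Sn = A0 @ S + B0 @ Ain + 0.02 * rng.standard_normal((2, N_D))  # observed s_{k+1}

theta_prev = np.zeros(2)                  # previous b0
grad_J = np.array([0.3, -0.2])            # hypothetical policy-gradient estimate
alpha = 0.05

def unpack(x):                            # x = [theta (2), vartheta (V*N_D)]
    return x[:2], x[2:].reshape(V, N_D)

def objective(x):                         # (76a)
    th, _ = unpack(x)
    return 0.5 * np.sum((th - theta_prev)**2) + alpha * grad_J @ (th - theta_prev)

def eq_con(x):                            # (76b) and (76c), stacked
    th, vt = unpack(x)
    resid = Sn - (A0 @ S + B0 @ Ain + th[:, None]) - Wv @ vt   # 2 x N_D
    return np.concatenate([resid.ravel(), vt.sum(axis=0) - 1.0])

x0 = np.concatenate([theta_prev, np.full(V * N_D, 1.0 / V)])
bounds = [(None, None)] * 2 + [(0.0, None)] * (V * N_D)        # (76d)
sol = minimize(objective, x0, bounds=bounds,
               constraints=[{"type": "eq", "fun": eq_con}], method="SLSQP")
theta_new, _ = unpack(sol.x)
```

When the observed model residuals remain strictly inside the convex hull of the W_i, the safety constraints are inactive and theta_new coincides with the plain gradient step theta_prev - alpha * grad_J; otherwise the step is projected onto the safe parameter set.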
The initial nominal MPC model is chosen as:
\[
A_0 = \begin{bmatrix} \cos\hat\beta & \sin\hat\beta \\ -\sin\hat\beta & \cos\hat\beta \end{bmatrix}, \qquad B_0 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad b_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \tag{83}
\]
and N_M = 4, with:
\[
W_1 = \tfrac{1}{10}\begin{bmatrix} -1 \\ -1 \end{bmatrix}, \quad W_2 = \tfrac{1}{10}\begin{bmatrix} +1 \\ -1 \end{bmatrix}, \tag{84a}
\]
\[
W_3 = \tfrac{1}{10}\begin{bmatrix} +1 \\ +1 \end{bmatrix}, \quad W_4 = \tfrac{1}{10}\begin{bmatrix} -1 \\ +1 \end{bmatrix}. \tag{84b}
\]

TABLE I: Simulation parameters

Parameter | Value | Description
γ | 0.99 | Discount factor
Σ | I | Exploration shape
σ | 10^{-3} | Exploration covariance
τ | 10^{-2} | Relaxation parameter
β | 22° | Real system parameter
β̂ | 20° | Model parameter
N_t | 20 | Sample length
S | 30 | Number of samples per batch
N | 10 | MPC prediction horizon

The baseline stage cost is selected as:
\[
L = \tfrac{1}{20}\left\| x - x_{\mathrm{ref}} \right\|^2 + \tfrac{1}{2}\left\| u - u_{\mathrm{ref}} \right\|^2, \tag{85}
\]
and serves as the baseline performance criterion to evaluate the closed-loop performance of the MPC scheme.

We considered two cases, using the deterministic initial condition s_0 = [cos 60°, sin 60°]^⊤. Both cases consider the parameters θ = {x̄, ū, A_0, B_0, b_0, K, W}. The first case considers a stable real system with κ = 0.95; the second case considers an unstable real system with κ = 1.05. In both cases, the target reference x̄ was provided, together with the input reference ū delivering a steady state for the nominal MPC model. The feedback matrix K was chosen as the LQR controller associated to the MPC nominal model. Table I reports the algorithmic parameters. Case 1 used a step size α = 0.05; case 2 used a step size α = 0.01.

The results for the first case are reported in Figures 3-7. One can observe in Fig. 3 that the closed-loop performance is improving over the RL steps. Figure 4 shows that the improvement takes place via driving the closed-loop trajectories of the real system closer to the reference, without jeopardising the system safety. Figure 5 shows how the RL algorithm uses the MPC nominal model to improve the closed-loop performance.
One can readily see from Figure 5 that RL is not simply performing system identification, as the nominal MPC model developed by the RL algorithm does not tend to the real system dynamics. Figure 6 shows how the RL algorithm reshapes the dispersion set. The upper-left corner of the set is the most critical in terms of performance, as it activates the state constraint ‖x‖² ≤ 1, and is moved inward to gain performance. The constrained RL step (76) ensures that the RL algorithm cannot jeopardize the system safety. In Figure 7, one can see that the RL algorithm makes little use of the degrees of freedom provided by adapting the MPC feedback matrix K.

The results for case 2 are reported in Figures 8-12, and similar comments hold for case 2 as for case 1. The instability of the real system does not challenge the proposed algorithm, even though a smaller step size α had to be used, as the RL algorithm appears to be more sensitive to noise.

Fig. 3: Case 1. Evolution of the closed-loop performance J over the RL steps. The solid line represents the estimation of J based on the samples obtained in the batch. The dashed lines represent the standard deviation due to the stochasticity of the system dynamics and policy disturbances.

Fig. 4: Case 1. Closed-loop system trajectories. The initial condition s_0 is reported, as well as the target state reference x_ref (circle) and the MPC reference x̄ at the first and last RL steps (grey and black + symbols, respectively). The trajectories at the first and last RL steps are reported as the light and dark grey polytopes. The solid black curve represents the state constraint ‖x‖² ≤ 1.

Fig. 5: Case 1. Evolution of the nominal MPC model over the RL steps. We report here the difference between the nominal model used in the MPC scheme and the real system.

Fig. 6: Case 1. Evolution of the MPC model biases W_{1,…,N_M} over the RL steps. The light grey polytope depicts the biases at the first RL step, and the points show s_{k+1} − F_0(s_k, a_k, θ) for all the samples of the first batch of data. The cloud of points is inside the thick black quadrilateral thanks to the constrained RL step (76).

Fig. 7: Case 1. Evolution of the MPC feedback matrix K from its initial value. The feedback is only marginally adjusted by the RL algorithm; after 100 RL steps, the adaptation of the feedback gain K has not yet reached its steady-state value.

Fig. 8: Case 2, similar to Fig. 3. Fig. 9: Case 2, similar to Fig. 4. Fig. 10: Case 2, similar to Fig. 5. Fig. 11: Case 2, similar to Fig. 6. Fig. 12: Case 2, similar to Fig. 7.

VIII. CONCLUSION

This paper proposed a technique to deploy stochastic policy gradient methods where the stochastic policy is supported by a stochastically disturbed, constrained parametric optimization problem. This approach allows one to restrict the support of the stochastic policy to a safe set described via constraints. In particular, robust Nonlinear Model Predictive Control, where safety requirements can be imposed explicitly, can be selected as the parametric optimization problem. Imposing restrictions on the support of the stochastic policy creates some technical challenges when computing the gradient of the policy score function, required in the computation of the stochastic policy gradient. Computationally inexpensive methods are proposed here to tackle these challenges, using interior-point methods and techniques from parametric Nonlinear Programming. The specific case of robust linear Model Predictive Control, where the prediction model is linear, is further developed, and a methodology to impose safety requirements throughout the learning is proposed. The proposed techniques are illustrated in simple simulations, showing their behavior. This paper has a companion paper [11] investigating the deterministic policy gradient approach in the same context as in this paper.
The stochastic policy gradient approach is theoretically simple and appealing, and requires fewer assumptions than its deterministic counterpart presented in [11]. It also requires solving a single NLP per time instant, as opposed to two in the deterministic case. However, the computational complexity of the stochastic approach presented here can be significantly higher than that of the deterministic approach of [11] when the number of parameters θ is larger than the input size n_a. This effect is a direct result of the computational complexity of evaluating the second-order sensitivities required to form the gradient of the stochastic policy score function, see (67).

REFERENCES AND NOTES

[1] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems 19. MIT Press, 2007.
[2] D. Bernardini and A. Bemporad. Scenario-based model predictive control of stochastic constrained linear systems. In Proceedings of the 48th IEEE Conference on Decision and Control (CDC), held jointly with the 2009 28th Chinese Control Conference, pages 6333-6338, Dec. 2009.
[3] D. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition, 2007.
[4] D. P. Bertsekas. Dynamic Programming and Optimal Control, volumes 1 and 2. Athena Scientific, Belmont, MA, 1995.
[5] D. P. Bertsekas and I. B. Rhodes. Recursive state estimation for a set-membership description of uncertainty. IEEE Transactions on Automatic Control, 16:117-128, 1971.
[6] D. P. Bertsekas and S. E. Shreve. Stochastic Optimal Control: The Discrete Time Case. Athena Scientific, Belmont, MA, 1996.
[7] D. P. Bertsekas and J. N. Tsitsiklis. Optimum Experimental Designs. Athena Scientific, Belmont, MA, 2002.
[8] L. T. Biegler. Nonlinear Programming. MOS-SIAM Series on Optimization. SIAM, 2010.
[9] T. Ensslin. Information Field Theory. arXiv:1301.2556 [astro-ph.IM], 2013.
[10] S. Gros and M. Zanon. Data-Driven Economic NMPC using Reinforcement Learning. IEEE Transactions on Automatic Control, 2018 (in press).
[11] S. Gros and M. Zanon. Towards Safe Reinforcement Learning Using NMPC and Policy Gradients: Part II - Deterministic case. IEEE Transactions on Automatic Control, 2019 (submitted).
[12] J. García and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16:1437-1480, 2015.
[13] I. Kolmanovsky and E. G. Gilbert. Theory and computation of disturbance invariant sets for discrete-time linear systems. Mathematical Problems in Engineering, 4(4):317-367, 1998.
[14] D. Q. Mayne. Model predictive control: Recent developments and future promise. Automatica, 50(12):2967-2986, 2014.
[15] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2nd edition, 2006.
[16] P. O. M. Scokaert and D. Q. Mayne. Min-max feedback model predictive control for constrained linear systems. IEEE Transactions on Automatic Control, 43:1136-1142, 1998.
[17] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 387-395, 2014.
[18] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, 1st edition, 1998.
[19] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS), pages 1057-1063. MIT Press, 1999.
[20] S. Wang, W. Chaovalitwongse, and R. Babuska. Machine learning algorithms in bipedal robot control. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 42(5):728-743, September 2012.
[21] M. Zanon and S. Gros. Safe Reinforcement Learning Using Robust MPC. IEEE Transactions on Automatic Control, 2019 (submitted).

Sébastien Gros received his Ph.D. degree from EPFL, Switzerland, in 2007. After a journey by bicycle from Switzerland to the Everest base camp in full autonomy, he joined an R&D group hosted at Strathclyde University, focusing on wind turbine control. In 2011, he joined KU Leuven, where his main research focus was on optimal control and fast NMPC for complex mechanical systems. He joined the Department of Signals and Systems at Chalmers University of Technology, Göteborg, in 2013, where he became associate professor in 2017. He is now full professor at NTNU, Norway, and guest professor at Chalmers. His main research interests include numerical methods, real-time optimal control, reinforcement learning, and energy-related applications.

Mario Zanon received the Master's degree in Mechatronics from the University of Trento, and the Diplôme d'Ingénieur from the Ecole Centrale Paris, in 2010. After research stays at KU Leuven, the University of Bayreuth, Chalmers University, and the University of Freiburg, he received the Ph.D. degree in Electrical Engineering from KU Leuven in November 2015. He held a Post-Doc researcher position at Chalmers University until the end of 2017 and is now Assistant Professor at the IMT School for Advanced Studies Lucca. His research interests include numerical methods for optimization, economic MPC, optimal control and estimation of nonlinear dynamic systems, in particular for aerospace and automotive applications.