Probabilistic model predictive safety certification for learning-based control
Kim P. Wabersich, Lukas Hewing, Andrea Carron, Melanie N. Zeilinger

Abstract—Reinforcement learning (RL) methods have demonstrated their efficiency in simulation. However, many of the applications for which RL offers great potential, such as autonomous driving, are also safety critical and require a certified closed-loop behavior in order to meet safety specifications in the presence of physical constraints. This paper introduces a concept called probabilistic model predictive safety certification (PMPSC), which can be combined with any RL algorithm and provides provable safety certificates in terms of state and input chance constraints for potentially large-scale systems. The certificate is realized through a stochastic tube that safely connects the current system state with a terminal set of states that is known to be safe. A novel formulation allows a recursively feasible real-time computation of such probabilistic tubes, despite the presence of possibly unbounded disturbances. A design procedure for PMPSC relying on Bayesian inference and recent advances in probabilistic set invariance is presented. Using a numerical car simulation, the method and its design procedure are illustrated by enhancing an RL algorithm with safety certificates.

Index Terms—Reinforcement learning (RL), stochastic systems, predictive control, safety

I. INTRODUCTION

While the field of reinforcement learning has demonstrated various classes of learning-based control methods in research-driven applications [1], [2], very few results have been successfully transferred to industrial applications that are safety-critical, i.e. applications that are subject to physical and safety constraints.
In industrial applications, successful control methods are often of simple structure, such as the Proportional-Integral-Derivative (PID) controller [3] or the linear state feedback controller [4], which require an expert to cautiously tune them manually. Manual tuning is generally time consuming and therefore expensive, especially in the presence of safety specifications. Modern control methods, such as model predictive control (MPC), tackle this problem by providing safety guarantees with respect to adequate system and disturbance models by design, reducing manual tuning requirements. The various successful applications of MPC to safety critical systems reflect these capabilities, see e.g. [5], [6] for an overview. While provable safety of control methods facilitates the overall design procedure, the tuning of various parameters, such as the cost function, in order to achieve a desired closed-loop behavior still needs to be done manually and often requires significant experience. In contrast, RL methods using trial-and-error procedures are often more intuitive to design and are capable of iteratively computing an improved policy. The downside of many RL algorithms, however, is that explicit consideration of physical system limitations and safety requirements at each time step cannot be addressed, often due to the complicated inner workings, and this limits their applicability in many industrial applications [7].

The authors are members of the Institute for Dynamic Systems and Control, ETH Zürich, Zürich CH-8092, Switzerland (e-mail: [wkim|lhewing|carrona|mzeilinger]@ethz.ch). This work was supported by the Swiss National Science Foundation under grant no. PP00P2 157601/1. Andrea Carron's research was supported by the Swiss National Centre of Competence in Research NCCR Digital Fabrication (Agreement #51NF40_141853).
This paper aims to address this problem by introducing a probabilistic model predictive safety certification (PMPSC) scheme for learning-based controllers, which can equip any controller with probabilistic constraint satisfaction guarantees. The scheme is motivated by the following observation. Often, an MPC controller with a short prediction horizon is sufficient to provide safety for a system during closed-loop operation, even though the same horizon would not be enough to achieve a desired performance. For example, in the case of autonomous driving, checking if it is possible to transition the car into a safe set of states (e.g. brake down to low velocity) can be done efficiently by solving an open-loop optimal control problem with a relatively small planning horizon (e.g. using maximum deceleration). At the same time, a much longer planning horizon for an MPC controller, or even another class of control policies, would be required in order to provide a comfortable and foresightful driving experience. This motivates the combination of ideas from MPC with RL methods in order to achieve a safe and high-performance closed-loop system operation requiring a small amount of manual tuning. More precisely, a learning-based input action is certified as safe if it leads to a safe state, i.e., a state for which a potentially low-performance, but online computable and safe backup controller exists for all future times. By repeatedly computing such a backup controller for the state predicted one step ahead after application of the learning input, the learning input is either certified as safe and applied, or it is overwritten by the previous safe backup controller. The resulting concept can be seen as a safety filter that only filters proposed learning signals for which we cannot guarantee constraint satisfaction in the future.
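The braking observation above can be made concrete with a short-horizon feasibility check: simulate maximum deceleration and test whether a low "safe" velocity is reached before a position limit. The helper and all numbers below are illustrative choices of ours, not taken from the paper.

```python
def brake_is_safe(pos, vel, pos_limit, v_safe=2.0, a_max=8.0, dt=0.1):
    """Short-horizon backup check: does full braking reach the safe
    velocity v_safe before crossing pos_limit?  (Euler discretization.)"""
    while vel > v_safe:
        pos += vel * dt           # advance position at the current speed
        vel -= a_max * dt         # apply maximum deceleration
        if pos > pos_limit:
            return False          # braking distance too long: no safe backup
    return True

brake_is_safe(0.0, 20.0, 30.0)    # True: a braking backup exists
brake_is_safe(0.0, 20.0, 20.0)    # False: a learning input leading here is unsafe
```

An input proposed by the learning algorithm would be certified only if the one-step-ahead state still passes such a check; otherwise the braking backup is applied instead.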
Contributions: We provide a safety certification framework which allows for enhancing arbitrary learning-based control methods with safety guarantees¹ and which is suitable for possibly large-scale systems with continuous and chance constrained input and state spaces. In order to enable efficient implementation and scalability, we provide an online algorithm together with a data-driven synthesis method to compute backup solutions that can be realized by real-time capable and established model predictive control (MPC) solvers, e.g. [8], [9], [10]. Compared to previously presented safety frameworks for learning-based control, e.g. [11], the set of safe state and action pairs is implicitly represented through an online optimization problem, enabling us to circumvent its explicit offline computation, which generally suffers from the curse of dimensionality. Unlike related concepts, such as those presented in [12], [13], [14], [15], we consider possibly nonlinear stochastic systems that can be represented as linear systems with bounded model uncertainties and possibly unbounded additive noise. For this class of systems, we present an automated, parametrization-free and data-driven design procedure that is tailored to the context of learning the system dynamics. Due to our specific formulation, we can maintain recursive feasibility of the underlying optimization problem, which is a distinctive difference to related stochastic MPC and safety filter schemes. Using the example of safely learning to track a trajectory with a car, we show how to construct a safe reinforcement learning algorithm using our framework in combination with a basic policy search algorithm.

¹Inputs provided by a human can be similarly enhanced by the safety certification scheme, which relates e.g. to the concept of electronic stabilization control from automotive engineering.

II. RELATED WORK

Driven by rapid progress in reinforcement learning, there is also a growing awareness regarding safety aspects of machine learning systems [7], see e.g. [16] for a comprehensive overview. As opposed to most methods developed in the context of safe RL, the approach presented in this paper keeps the system safe at all times, including exploration, and considers continuous state and action spaces. This is possible through the use of models and corresponding uncertainty estimates of the system, which can be sequentially improved by, e.g., an RL algorithm to allow greater exploration. In model-free safe reinforcement learning, policy search algorithms have been proposed, e.g. [17], which provide expected safety guarantees by solving a constrained policy optimization using a modified trust-region policy gradient method [18]. Efficient policy tuning with respect to best worst-case performance (also worst-case stability under physical constraints) can be achieved using Bayesian min-max optimization, see e.g. [19], or by safety-constrained Bayesian optimization as e.g. in [20], [21]. These techniques share the limitation that they need to be tailored to a task-specific class of policies. Furthermore, most techniques require repeated execution of experiments, which prohibits fully autonomous safe learning in 'closed-loop'. In [22], a method was developed that allows for the analysis of a given closed-loop system (under an arbitrary RL policy) with respect to safety, based on a probabilistic system model. An extension of this method is presented in [23], where the problem of updating the policy is investigated and practical implementation techniques are provided. These techniques require an a-priori known Lyapunov function and Lipschitz continuity of the closed-loop learning system. In the context of model-based safe reinforcement learning, several learning-based model predictive control approaches are available.
The method proposed in [24] conceptually provides deterministic guarantees on robustness, while statistical identification tools are used to identify the system in order to improve performance. In [25], this scheme was tested and validated onboard a quadcopter. In [26], a robust constrained learning-based model predictive control algorithm for path-tracking in off-road terrain is studied. The experimental evaluation shows that the scheme is safe and conservative during initial trials, when model uncertainty is high, and very performant once the model uncertainty is reduced. Regarding safety, [27] presents a learning model predictive control method that provides theoretical guarantees in the case of Gaussian process model estimates. For iterative tasks, [28] proposes a learning model predictive control scheme that can be applied to linear system models with bounded disturbances. Instead of using model predictive control techniques, PILCO [29] allows the calculation of analytic policy gradients and achieves good data efficiency, based on non-parametric Gaussian process regression. The previously discussed literature provides specific reinforcement learning algorithms that are tied to a specific, mostly model predictive control based, policy. In contrast, the proposed concept uses MPC-based ideas in order to establish safety independently of a specific reinforcement learning policy. This offers the opportunity to apply RL for learning more complex tasks than, for example, steady-state stabilization, which is usually considered in model predictive control. Many reinforcement learning algorithms are able to maximize rewards from a black-box function, i.e. rewards that are only available through measurements, which would not be possible using a model predictive controller, where the cost enters the corresponding online optimization problem explicitly.
Closely related to the approach proposed in this paper, the concept of a safety framework for learning-based control emerged from robust reachability analysis, robust invariance, as well as classical Lyapunov-based methods [30], [11], [31], [32]. The concept consists of a safe set in the state space and a safety controller, as originally proposed in [33] for the case of perfectly known system dynamics in the context of safety barrier functions. While the system state is contained in the safe set, any feasible input (including learning-based controllers) can be applied to the system. However, if such an input would cause the system to leave the safe set, the safety controller interferes. Since this strategy is compatible with any learning-based control algorithm, it serves as a universal safety certification concept. Previously proposed concepts are limited to a robust treatment of the uncertainty in order to provide rigorous safety guarantees. This potentially results in a conservative system behavior, or even the ill-posedness of the overall safety requirement, e.g. in the case of the frequently considered Gaussian distributed additive system noise, which has unbounded support. Compared to previous research using similar model predictive control-based safety mechanisms, such as [12], [13], [14], [15], we introduce a probabilistic formulation of the safe set and consider safety in probability for all future times, allowing one to prescribe a desired degree of conservatism and to address disturbance distributions with unbounded support. The proposed method only requires an implicit description of the safe set as opposed to an explicit representation, which enables scalability with respect to the state dimension, while being independent of a particular RL algorithm.

III. PRELIMINARIES AND PROBLEM STATEMENT

A. Notation

The set of symmetric matrices of dimension n is denoted by S^n, the set of positive definite (positive semi-definite) matrices by S^n_{++} (S^n_+), the set of integers in the interval [a, b] ⊂ R by I_{[a,b]}, and the set of integers in the interval [a, ∞) ⊂ R by I_{≥a}. The Minkowski sum of two sets A_1, A_2 ⊂ R^n is denoted by A_1 ⊕ A_2 := {a_1 + a_2 | a_1 ∈ A_1, a_2 ∈ A_2} and the Pontryagin set difference by A_1 ⊖ A_2 := {a_1 ∈ R^n | a_1 + a_2 ∈ A_1, ∀a_2 ∈ A_2}. The affine image of a set A_1 ⊆ R^n under x ↦ Kx is defined as KA_1 := {Kx | x ∈ A_1}. The i-th row and i-th column of a matrix A ∈ R^{n×m} are denoted by row_i(A) and col_i(A), respectively. The expression x ∼ Q_x means that a random variable x is distributed according to the distribution Q_x, and N(μ, Σ) is a multivariate Gaussian distribution with mean μ ∈ R^n and covariance Σ ∈ S^n_+. The probability of an event E is denoted by Pr(E). For a random variable x, E(x) and var(x) denote the expected value and the variance.

B. Problem statement

We consider a-priori unknown nonlinear, time-invariant discrete-time dynamical systems of the form

x(k+1) = f_θ(x(k), u(k)) + w_s(k), ∀k ∈ I_{≥0},    (1)

subject to polytopic state and input constraints x(k) ∈ X, u(k) ∈ U, and i.i.d. stochastic disturbances w_s(k) ∼ Q_{w_s}. The uncertainties in the function f_θ are characterized by the random parameter vector θ ∼ Q_θ. For controller design, we consider an approximate model description of the following form

x(k+1) = Ax(k) + Bu(k) + w_θ(x(k), u(k)) + w_s(k),    (2)

where A ∈ R^{n×n}, B ∈ R^{n×m} are typically obtained from linear system identification techniques, see e.g. [34], and w_θ(x(k), u(k)) accounts for model errors.
In order to provide safety certificates, we require that the model error w_θ(x(k), u(k)) is contained with a certain probability in a model error set W_θ ⊂ R^n, where W_θ is chosen based on available data D = {(x_i, u_i, f_θ(x_i, u_i) + w_{s,i})}_{i=1}^{N_D}, where N_D denotes the number of available data points.

Assumption III.1 (Bounded model error). The deviation between the true system (1) and the corresponding model (2) is bounded in probability, i.e.

Pr(w_θ(x(k), u(k)) ∈ W_θ, ∀k ∈ I_{≥0}, x(k) ∈ R^n, u(k) ∈ R^m) ≥ p_θ,    (3)

where p_θ > 0 denotes the probability level and the probability is taken with respect to the random parameter θ.

A principled way to infer a system model of the form (2) for linear systems from available data, such that Assumption III.1 is satisfied on a compact domain, is discussed in Section V. For nonlinear systems, the computation of W_θ typically involves the solution of a non-convex optimization problem, which can be approximated e.g. by gridding, similarly as proposed in [31, Remark IV.1].

Remark III.2 (Gaussian processes). Instead of parametric uncertainty, one could also use non-parametric Gaussian process regression for the dynamics function f_θ. The model error set W_θ can then be derived by assuming that f_θ has a bounded norm in a reproducing kernel Hilbert space, using the bound presented in [35, Theorem 2]. For function samples of a Gaussian process on compact domains, one can similarly apply, e.g., the bound presented in [36].

In this paper, system safety is defined as a required degree of constraint satisfaction in the form of probabilistic state and input constraints, i.e., as chance constraints of the form

Pr(x(k) ∈ X) ≥ p_x,  Pr(u(k) ∈ U) ≥ p_u,    (4)

for all k ∈ I_{≥0} with probabilities p_x, p_u ≥ 0, where the probability is with respect to all uncertain elements, i.e. parameters θ and noise realizations {w_s(i)}_{i=0}^{k-1}.
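As a rough illustration of how a set W_θ satisfying a bound like (3) can be extracted from the data D, the sketch below fits (A, B) by least squares and takes an empirical quantile of the residual norms as the radius of a ball-shaped W_θ. This is a simplification of our own (the paper's actual design procedure in Section V is based on Bayesian inference), and the residual here also contains the noise w_s, so the quantile is conservative for w_θ alone. All system matrices are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N_D = 2, 1, 500
A_true = np.array([[1.0, 0.1], [0.0, 0.9]])
B_true = np.array([[0.0], [0.1]])

# data D: transitions (x_i, u_i, x_i^+) from the "unknown" system,
# with a small nonlinear model error and additive noise w_s
X = rng.uniform(-1, 1, (N_D, n))
U = rng.uniform(-1, 1, (N_D, m))
Xp = (X @ A_true.T + U @ B_true.T
      + 0.02 * np.tanh(X)                       # model error w_theta(x, u)
      + 0.01 * rng.standard_normal((N_D, n)))   # noise w_s

# least-squares estimate of (A, B) for the nominal model (2)
Z = np.hstack([X, U])
Theta, *_ = np.linalg.lstsq(Z, Xp, rcond=None)
A_hat, B_hat = Theta[:n].T, Theta[n:].T

# ball-shaped model error set: radius = empirical 0.95-quantile of residuals,
# giving W_theta = {w : ||w|| <= r_theta} at level p_theta = 0.95 (on the data)
residuals = np.linalg.norm(Xp - Z @ Theta, axis=1)
r_theta = np.quantile(residuals, 0.95)
```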
The overall goal is to certify safety of arbitrary control signals u_L(k) ∈ R^m, e.g. provided by an RL algorithm. This is achieved by means of a safety policy, which is computed in real time based on the current system state x(k) and the proposed input u_L(k). A safety policy consists of a safe input u_S(k) at time k and a safe backup trajectory that guarantees safety with respect to the constraints (4) when applied at future time instances. The safety policy is updated at every time step, such that the first input equals u_L(k) if that is safe, and otherwise implements a minimal safe modification. More formally:

Definition III.3. Consider any given control signal u_L(k) ∈ R^m for time steps k ∈ I_{≥0}. We call a control input u_L(k̄) certified as safe for system (1) at time step k̄ and state x(k̄) with respect to a safety policy π_S: R^n × R^m → R^m, if π_S(x(k̄), u_L(k̄)) = u_L(k̄) and u(k) = π_S(x(k), u_L(k)) keeps the system safe, i.e. (4) is satisfied for all k ≥ 0.

By assuming that a safety policy can be found for the initial system state, Definition III.3 implies the following safety algorithm. At every time step, safety of a proposed input u_L(k) is verified using the safety policy according to Definition III.3. If safety cannot be verified, the proposed input is modified and u(k̄) = π_S(x(k̄), u_L(k̄)) is applied to the system instead, ensuring safety until the next state and learning input pair can be certified as safe again. The set of initial states for which π_S ensures safety can thus be interpreted as a safe set of system states and represents a probabilistic variant of the safe set definition in [12].
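The algorithm implied by Definition III.3 can be sketched as a filter wrapped around any learner. Here solve_backup stands in for the optimization-based safety policy (it returns None on infeasibility); both helper names and the toy values are our own.

```python
def safety_filter_step(x, u_learn, solve_backup, stored_seq):
    """One certification step in the spirit of Definition III.3.

    solve_backup(x, u_learn) returns a safe backup input sequence from
    state x whose first element is u_learn or a minimal modification of
    it, or None if no backup trajectory can be found.  stored_seq is the
    shifted tail of the last successful backup sequence.
    """
    seq = solve_backup(x, u_learn)
    if seq is None:
        seq = stored_seq           # fall back on the previous safe backup
    return seq[0], list(seq[1:])   # apply first input, store shifted tail

# certified case: the backup solver accepts the proposed input unmodified
u, tail = safety_filter_step(0.0, 1.0, lambda x, u: [u, 0.0], [9.0, 9.0])
# u == 1.0: the learning input is applied
```

When the solver fails, the first element of the stored sequence is applied instead, which is exactly the "overwritten by the previous safe backup controller" branch described above.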
In the following, we present a method to compute a safety policy π_S for uncertain models of the form (2), making use of model predictive control (MPC) techniques, which provide real-time feasibility and scalability of the approach while aiming at a large safe set implicitly defined by the safety policy.

IV. PROBABILISTIC MODEL PREDICTIVE SAFETY CERTIFICATION

The fundamental idea of model predictive safety certification, which was introduced for linear deterministic systems in [12], is the on-the-fly computation of a safety policy π_S that ensures constraint satisfaction at all times in the future. The safety policy is specified using MPC methods, i.e., an input sequence is computed that safely steers the system to a terminal safe set X_f, which can be done efficiently in real time. The first input is selected as the learning input if possible, in which case it is certified as safe, or selected as 'close' as possible to the learning input otherwise. A specific choice of the terminal safe set X_f allows us to show that a previous solution at time k−1 implies the existence of a feasible solution at time k, ensuring safety for all future times. Such terminal sets X_f can, e.g., be a neighborhood of a locally stable steady-state of the system (1), or a possibly conservative set of states for which a safe controller is known.

A. Nominal model predictive safety certification scheme

In order to introduce the basic idea of the presented approach, we introduce a nominal model predictive safety certification (NMPSC) scheme under the simplifying assumption that the system dynamics (1) are perfectly known, time-independent, and without noise, i.e. x(k+1) = f(x(k), u(k)), ∀k ∈ I_{≥0}.
The mechanism to construct the safety policy for certifying a given control input is based on concepts from model predictive control [37], as illustrated in Figure 1 (left), with the difference that we aim at certifying an external learning signal u_L instead of performing, e.g., safe steady-state or trajectory tracking. The safety policy π_S is defined implicitly through optimization problem (5), unifying certification and computation of the safety policy. Thereby, problem (5) does not only describe the computation of a safety policy based on the current state x(k) and according to Definition III.3, but also provides a mechanism to modify the learning-based control input u_L(k) as little as necessary in order to find a safe backup policy for the predicted state at time k+1. In (5), x_{i|k} is the state predicted i time steps ahead, computed at time k, i.e. x_{0|k} = x(k). Problem (5) computes an N-step input sequence {u*_{i|k}} satisfying the input constraints U, such that the predicted states satisfy the constraints X and reach the terminal safe set X_f after N steps, where N ∈ I_{≥1} is the prediction horizon. The safety controller π_S is defined as the first input u*_{0|k} of the computed optimal input sequence driving the system to the terminal set, which guarantees safety for all future times via an invariance property.

Assumption IV.1 (Nominal invariant terminal set). There exists a nominal invariant terminal set X_f ⊆ X and a corresponding control law κ_f: X_f → U, such that for all x ∈ X_f it holds that κ_f(x) ∈ U and f(x, κ_f(x)) ∈ X_f.

Assumption IV.1 provides recursive feasibility of optimization problem (5) and therefore infinite-time constraint satisfaction, i.e., if a feasible solution exists at time k, one exists at k+1 and therefore at all future times, see e.g. [37]. The safety certification scheme then works as follows.
Consider a measured system state x(k−1), for which (5) is feasible and the input trajectory {u*_{i|k−1}} is computed. After applying the first input u(k−1) = u*_{0|k−1} to the system, the resulting state x(k) is measured. Because it holds in the nominal case that x(k) = x*_{1|k−1}, a valid input sequence {u*_{1|k−1}, ..., u*_{N−1|k−1}, κ_f(x*_{N|k−1})} is known from the previous time step, which satisfies the constraints and steers the state to the safe terminal set X_f, as indicated by the brown trajectory in Figure 1 (left). The safety of a proposed learning input u_L is certified by solving optimization problem (5), which, if feasible for u_{0|k} = u_L, provides the green trajectory in Figure 1 (left) such that u_L can be safely applied to the system. Should problem (5) not be feasible for u_{0|k} = u_L, it returns an alternative input sequence that safely guides the system towards the safe set X_f. The first element of this sequence, u*_{0|k}, is chosen to be as close as possible to u_L and is applied to system (1) instead of u_L. Due to recursive feasibility, i.e., knowledge of the brown trajectory in Figure 1 (left), such a solution always exists, ensuring safety.

In the context of learning-based control, the true system dynamics are rarely known accurately. In order to derive a probabilistic version of the NMPSC scheme that accounts for uncertainty in the system model (2), we leverage in the following advances in stochastic model predictive control [38], based on so-called probabilistic reachable sets.

B. Probabilistic model predictive safety certification scheme

In the case of uncertain system dynamics, the safety policy consists of two components, following a tube-based MPC concept [37].
The first component considers a nominal state z(k) of the system, driven by linear dynamics, and computes a nominal safe trajectory {z*_{i|k}, v*_{i|k}} through optimization problem (6), which is similar to the case of perfectly known dynamics introduced in the previous section, defining the nominal input v(k) = v*_{0|k}. The second component consists of an auxiliary controller, which acts on the deviation e(k) of the true system state from the nominal one and ensures that the true state x(k) remains close to the nominal trajectory. Specifically, it guarantees that e(k) lies within a set R, often called the 'tube', with probability of at least p_x at each time step. Together, the resulting safety policy is able to steer the system state x(k) within the probabilistic tube along the nominal trajectory towards the safe terminal set.

We first define the main components and assumptions, in order to then introduce the probabilistic model predictive safety certification (PMPSC) problem together with the proposed safety controller. Define with z(k) ∈ R^n and v(k) ∈ R^m the nominal system states and inputs, as well as the nominal dynamics according to model (2) as

z(k+1) = Az(k) + Bv(k), k ∈ I_{≥0},    (7)

with the initial condition z(0) = x(0). For example, one might choose the matrices (A, B) in the context of learning time-invariant linear systems based on the maximum likelihood estimate of the true system dynamics. Denote by e(k) := x(k) − z(k) the error (deviation) between the true system state, evolving according to (1), and the nominal system state following (7).

The two certification problems read as follows. For the nominal scheme of Section IV-A:

min_{u_{i|k}, x_{i|k}}  ‖u_L(k) − u_{0|k}‖
s.t. ∀i ∈ I_{[0,N−1]}:
     x_{i+1|k} = f(x_{i|k}, u_{i|k}),    (5a)
     x_{i|k} ∈ X,                        (5b)
     u_{i|k} ∈ U,                        (5c)
     x_{N|k} ∈ X_f,                      (5d)
     x_{0|k} = x(k).                     (5e)

For the probabilistic scheme:

min_{v_{i|k}, z_{i|k}}  ‖u_L(k) − v_{0|k} − K_R(x(k) − z_{0|k})‖
s.t. ∀i ∈ I_{[0,N−1]}:
     z_{i+1|k} = Az_{i|k} + Bv_{i|k},    (6a)
     z_{i|k} ∈ X ⊖ R_x,                  (6b)
     v_{i|k} ∈ U ⊖ K_R R_u,              (6c)
     z_{N|k} ∈ Z_f,                      (6d)
     z_{0|k} = z(k).                     (6e)

Fig. 1: Mechanism to construct a safety policy 'on-the-fly': The system is depicted at time k, with the current backup solution in brown. A proposed learning input u_L is certified by constructing a safe solution for the following time step, shown in green. The existence of a safe trajectory is ensured by extending the brown trajectory using Assumptions IV.1 and IV.4, respectively. Left (NMPSC): Safe solutions are computed with respect to the true state dynamics, and the constraints x ∈ X are guaranteed to be satisfied. Right (PMPSC): Safe solutions are computed with respect to the nominal state z. The true state lies within the tube around the nominal state with probability p_x. By enforcing z ∈ X ⊖ R_x, constraint satisfaction holds with at least the same probability.

The controller is then defined by augmenting the nominal input with an auxiliary feedback on the error, in the case of the linear nominal system (7) a linear state feedback controller K_R:

u(k) = v(k) + K_R(x(k) − z(k)),    (8)

which keeps the real system state x(k) close to the nominal system state z(k), i.e. keeps the error e(k) small, if K_R ∈ R^{m×n} is chosen such that it stabilizes system (7). By Assumption III.1, the model error w_θ(x(k), u(k)) is contained in W_θ for all time steps with probability p_θ.
Therefore, we drop the state and input dependencies in the following and simply refer to w_θ(k) as the model mismatch at time k, such that the error dynamics can be expressed as

e(k+1) = x(k+1) − z(k+1)
       = f_θ(x(k), u(k)) + w_s(k) − Az(k) − Bv(k)
       = f_θ(x(k), u(k)) − Ax(k) − Bu(k) + Ax(k) + Bu(k) + w_s(k) − Az(k) − Bv(k)
       = (A + BK_R)e(k) + w_θ(k) + w_s(k),    (9)

where the last step uses the model (2) and the control law (8). By setting the initial nominal state to the real state, i.e., z(0) = x(0) ⇒ e(0) = 0, the goal is to keep the evolving error e(k), i.e. the deviation from the nominal reference trajectory, small in probability, with levels p_x and p_u for the state and input constraints (4), respectively. This requirement can be formalized using the concept of probabilistic reachable sets introduced in [39], [40], [38].

Definition IV.2. A set R is a probabilistic reachable set (PRS) at probability level p for system (9) if

e(0) = 0 ⇒ Pr(e(k) ∈ R) ≥ p,    (10)

for all k ∈ I_{≥0}.

In Section V-A we show how to compute PRS sets R_x, R_u, corresponding to the state and input chance constraints (4), in order to fulfill the following assumption.

Assumption IV.3 (Probabilistic tube). There exists a linear state feedback matrix K_R ∈ R^{m×n} that stabilizes system (7). The corresponding PRS sets for the error dynamics (9) with probability levels p_x and p_u are denoted by R_x, R_u ⊆ R^n.

Based on Assumption IV.3, it is possible to define deterministic constraints on the nominal system (7) that capture the chance constraints (4), by choosing X ⊖ R_x as tightened state constraints, as depicted in Figure 1 (right), in which the grey circles illustrate the PRS centered around the predicted nominal state z, such that they contain the true state x with probability p_x.
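Under a linear-Gaussian simplification (lumping w_θ + w_s into one Gaussian disturbance, an assumption of ours; the paper's Section V-A treats the general case via probabilistic set invariance), a PRS as in Definition IV.2 can be computed from the steady-state covariance of the error dynamics (9) and a chi-squared quantile. All matrices are illustrative.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov
from scipy.stats import chi2

A = np.array([[1.0, 0.1], [0.0, 1.0]])      # illustrative (A, B, K_R)
B = np.array([[0.005], [0.1]])
K_R = np.array([[-1.0, -1.5]])
A_cl = A + B @ K_R                          # closed-loop matrix of (9)
Sigma_w = 0.01 * np.eye(2)                  # covariance of the lumped disturbance

# steady-state error covariance: Sigma = A_cl Sigma A_cl^T + Sigma_w
Sigma = solve_discrete_lyapunov(A_cl, Sigma_w)

# ellipsoidal PRS at level p_x:  R_x = {e : e^T Sigma^{-1} e <= chi2.ppf(p_x, n)}
p_x = 0.9
r2 = chi2.ppf(p_x, df=2)
```

Since e(0) = 0, the error covariance grows monotonically towards Sigma, so the steady-state ellipsoid contains e(k) with probability at least p_x for every k in this Gaussian setting; R_u is obtained analogously at level p_u.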
An appropriate tightening of the input constraints is obtained by linearly transforming the error set R_u at probability level p_u through the linear error feedback K_R, resulting in U ⊖ K_R R_u. Through calculation of the nominal safety policy towards a safe terminal set within the tightened constraints, and by application of (8), finite-time chance-constraint satisfaction over the planning horizon k+N follows directly by Definition IV.2 and Assumption IV.3. In order to provide 'infinite-horizon' safety through recursive feasibility of (6), we require a terminal invariant set Z_f for the nominal system state, similar to Assumption IV.1, which is contained in the tightened constraints.

Fig. 2: Illustration of the idea underlying the proof of Theorem IV.5, without stochastic noise, i.e. w_s(k) = 0, and with W_θ polytopic. Starting from x(k), the set of states reachable for x(k+1) under the safety policy u(k) = v*_{1|k−1} + K_R(x(k) − z*_{1|k−1}) from the previous time step is indicated by the three dotted black arrows. The corresponding predicted error set with respect to the nominal system is given by {(A + BK_R)e(k)} ⊕ W_θ, as shown in red. Solving (6) yields the optimal input u(k) = v*_{0|k} + K_R(x(k) − z*_{0|k}), which preserves the predicted error set, enabling us to probabilistically bound the error within the PRS R_x, R_u from Assumption IV.3.

Assumption IV.4 (Nominal terminal set). There exists a terminal invariant set Z_f ⊆ X ⊖ R_x and a corresponding control law κ_f: Z_f → U ⊖ K_R R_u such that for all z ∈ Z_f it holds that κ_f(z) ∈ U ⊖ K_R R_u and Az + Bκ_f(z) ∈ Z_f.
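For polytopic constraints {x : Hx ≤ b} and an ellipsoidal PRS {e : eᵀΣ⁻¹e ≤ r²}, the Pontryagin differences X ⊖ R_x and U ⊖ K_R R_u used above reduce to tightening each halfspace by the ellipsoid's support value. A sketch with illustrative numbers of our own:

```python
import numpy as np

def tighten_halfspaces(H, b, Sigma, r2):
    """Pontryagin difference of {x : Hx <= b} and the ellipsoid
    {e : e^T Sigma^{-1} e <= r2}: shrink each offset b_i by the
    support value sqrt(r2 * h_i Sigma h_i^T) along the row h_i."""
    support = np.sqrt(r2 * np.einsum('ij,jk,ik->i', H, Sigma, H))
    return b - support

# box |x_i| <= 1 tightened by a spherical PRS with Sigma = 0.01*I, r2 = 4
H = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)
b_tight = tighten_halfspaces(H, b, 0.01 * np.eye(2), 4.0)
# b_tight = [0.8, 0.8, 0.8, 0.8]: the nominal state must satisfy |z_i| <= 0.8
```

The input tightening works identically with the transformed set K_R R_u, whose ellipsoid shape matrix is K_R Σ K_Rᵀ.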
The generation of the terminal set Z_f based on collected measurement data is discussed in Section V-C, allowing for a successive improvement of the overall PMPSC performance.

In classical tube-based and related stochastic MPC methods, the nominal system (7) is re-initialized at each time step in order to minimize the nominal objective, which causes a reset of the corresponding error system. While this works in a robust setting, it prohibits a direct probabilistic analysis using PRS according to Definition IV.2, which only provides statements about the autonomous error system, starting from time k = 0 and evolving linearly for all future times. Consequently, and in contrast to classical formulations, we compute (7) via (6e), which leads to the error dynamics (9) despite online replanning of the nominal trajectory at each time step; compare also Figure 2 accompanying the proof of Theorem IV.5. Building on the tube-based controller structure, an input is certified as safe if it can be represented in the form of (8) by selecting v*_0 accordingly. Otherwise, an alternative input is provided, ensuring that Pr(e(k) ∈ R_x) ≥ p_x and Pr(e(k) ∈ R_u) ≥ p_u for all k ∈ I≥0. Combining this mechanism with the assumptions above yields the main result of the paper.

Theorem IV.5. Let Assumptions IV.3 and IV.4 hold. If (6) is feasible for z(0) = x(0), then system (1) under the control law (8) with v(k) = v*_{0|k} resulting from the PMPSC problem (6) is safe for all u_L(k) and for all times, i.e., the chance constraints (4) are satisfied for all k ≥ 0.

Proof. We begin by investigating the error dynamics under (8). By (6e) it follows that e(k) evolves according to (9) for all k ∈ I≥0, despite re-optimizing v_{i|k}, z_{i|k} based on u_L(k) according to (6) at every time step; see also Figure 2. Therefore Pr(e(k) ∈ R_x) ≥ p_x and Pr(e(k) ∈ R_u) ≥ p_u for all k ∈ I≥0 by Assumption IV.3 and z(0) = x(0).
Next, Assumption IV.4 provides recursive feasibility of optimization problem (6), i.e., if a feasible solution exists at time k, one will always exist at time k + 1; specifically, {v*_{1|k}, ..., v*_{N−1|k}, κ_f(z*_{N|k})} is a feasible solution, which implies feasibility of (6) for all k ≥ 0 by induction. Finally, by recursive feasibility it follows that z(k) ∈ X ⊖ R_x and v(k) ∈ U ⊖ K_R R_u for all k ∈ I≥0, implying in combination with Pr(e(k) ∈ R_x) ≥ p_x and Pr(e(k) ∈ R_u) ≥ p_u for all k ∈ I≥0 that Pr(x(k) = z(k) + e(k) ∈ X) ≥ p_x and Pr(u(k) = v(k) + K_R e(k) ∈ U) ≥ p_u for all k ∈ I≥0. We have therefore proven that if (6) is feasible for z(0) = x(0), then (8) will always provide a control input such that the constraints (4) are satisfied.

Remark IV.6 (Recursive feasibility despite unbounded disturbances). Various recent stochastic model predictive control approaches, which consider chance constraints in the presence of unbounded additive noise, are also based on constraint tightening (see [41], [42], [43], [38]). The online MPC problem in these formulations can become infeasible due to the unbounded noise, which is compensated for by employing a recovery mechanism. In contrast, the proposed formulation offers inherent recursive feasibility, even for unbounded disturbance realizations w_s(k), by simulating the nominal system and therefore not optimizing over z_0. While optimization over z_0 usually enables state feedback in tube-based stochastic model predictive control methods, we incorporate state feedback through the cost function of (6). A similar strategy can also be used for recursively feasible stochastic MPC schemes, as presented in [44].

C.
Discussion: Safe RL with predictive safety certification

While the application of RL algorithms together with the presented PMPSC scheme for safety can be viewed as learning to control an inherently safe system of the form

x(k+1) = f_θ(x(k), π_S(k, x(k), ũ(k))) + w_s(k), for all k ∈ I≥0,

with virtual inputs ũ(k) ∈ R^m, the underlying safety mechanism π_S can introduce significant additional nonlinearities into the system to be optimized, which may cause slower learning convergence. In the following, we therefore outline practical and theoretical aspects to retain learning convergence when using RL in conjunction with the presented PMPSC scheme to ensure safety.

A simple mechanism to alleviate this limitation has been introduced in the context of general safety frameworks [45] and is based on adding a term of the form (u(k) − u_L(k))^⊤ R_S (u(k) − u_L(k)), with R_S ≻ 0, to the overall learning objective, which accounts for deviations between the proposed and the applied input to system (1), implying a larger cost for unsafe policies. While this can encourage learning convergence in practice, many successful RL algorithms, such as Trust-Region Policy Optimization (TRPO) [46], deterministic actor-critic policy search [47], or Bayesian optimization [48], [49], that provide theoretical properties in terms of approximate gradient steps or cumulative regret, require continuity of the applied control policy. More precisely, continuity of the applied control policy is often a prerequisite to adjust the learning-based policy via approximate gradient directions of the overall learning objective and is typically required to establish regularization properties of the corresponding value function. Using results from explicit model predictive control, it is indeed possible to establish that the PMPSC problem, and therefore the resulting safe learning-based policy π_S, is continuous in x(k) and u_L(k); see, e.g.,
[50, Theorem 1.4]. Furthermore, by using the results from [51], it is even possible to compute the partial derivatives of (6) with respect to the learning input, i.e., ∂π_S(k, x, u_L)/∂u_L, allowing explicit consideration of the effects of safety-ensuring actions during closed-loop control for efficient policy parameter updates. In brief, due to the convexity of (6), continuity properties of the policy and value function are typically preserved, and it is possible to obtain comparable convergence properties of RL algorithms in combination with PMPSC, as illustrated in Figure 6 in the numerical example section. Note that even under these considerations, the combination of a potentially unsafe RL algorithm with the presented PMPSC scheme can cause performance to deteriorate compared to the plain application of a potentially unsafe RL algorithm, which is, however, conceptually inevitable when restricting the space of possible inputs in favor of safety guarantees.

V. DATA-BASED DESIGN

In order to employ the PMPSC scheme, the following components must be provided: the model description (2) of the true system (1) (Section V-A), the probabilistic error tubes R_x and R_u based on the model (2) according to Assumption IV.3 (Section V-B), and the nominal terminal set Z_f, which provides recursive feasibility according to Assumption IV.4 (Section V-C). In this section, we present efficient techniques for computing those components that are tailored to the learning-based control setup by relying solely on the available data collected from the system.

A.
Model design

For simplicity, we focus on systems described by linear Bayesian regression [52], [53] of the form

x(k+1) = θ^⊤ (x(k), u(k)) + w_s(k)   (11)

with an unknown parameter matrix θ ∈ R^{(n+m)×n}, which is inferred from noisy measurements D = {((x_i, u_i), y_i)}_{i=1}^{N_D} with y_k = f_θ(x(k), u(k)) + w_s(k), w_s(k) ∼ Q_{w_s}, using a prior distribution Q_θ on the parameters θ. Note that distribution pairs Q_{w_s} and Q_θ that allow for efficient posterior computation, e.g., Gaussian distributions, usually exhibit infinite support, i.e., w_s(k) ∈ R^n, which can generally not be treated robustly using, e.g., the related method presented in [12]. The correct selection of the parameter prior and process noise distributions Q_θ and Q_{w_s} requires careful model selection techniques that are beyond the scope of this paper; see, e.g., [52, Section 2.3] and [53, Chapter 3]. In addition, we assume that the measurements D are sufficiently information-rich, e.g., that they have been generated using excitation signals as described in [34].

In the following, we present one way of obtaining the required model error set W_θ using confidence sets based on the posterior distribution Q_{θ|D}. We start by describing the set of all realizations of (11) that contains the true system with probability p_θ. To this end, let the confidence region at probability level p_θ of the random matrix θ ∼ Q_{θ|D}, denoted by E_{p_θ}(Q_{θ|D}), be defined such that

Pr(θ ∈ E_{p_θ}(Q_{θ|D})) ≥ p_θ   (12)

and compare the corresponding set of system dynamics with the expected system dynamics, which in the considered case is given by

E(θ)^⊤ (x(k), u(k)), where E(θ)^⊤ =: [A, B].   (13)

Note that the model error between (11) and (13) is unbounded by definition if we consider an unbounded domain as required by (3), since lim_{‖(x,u)‖₂→∞} ‖((A, B) − θ̃^⊤)(x, u)‖₂ = ∞ for any θ̃ ∈ E_{p_θ}(Q_{θ|D}) such that θ̃^⊤ ≠ (A, B).
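As a concrete illustration of the inference step, the following is a minimal scalar analogue of the Bayesian linear regression update used in Section V-A (the paper's posterior (14) is the matrix-valued, column-wise version of this); the data and hyperparameters in the usage example are toy values of our own.

```python
def blr_posterior(xs, ys, sigma_s, sigma_theta):
    """Posterior N(mean, var) of a scalar parameter theta in
    y = theta*x + w, w ~ N(0, sigma_s^2), under the prior
    theta ~ N(0, sigma_theta^2). Scalar analogue of the
    column-wise Gaussian posterior used in Section V-A."""
    # Posterior precision: C = sigma_s^{-2} * sum(x_i^2) + sigma_theta^{-2}
    C = sum(x * x for x in xs) / sigma_s ** 2 + 1.0 / sigma_theta ** 2
    # Posterior mean: sigma_s^{-2} * C^{-1} * sum(x_i * y_i)
    mean = (sum(x * y for x, y in zip(xs, ys)) / sigma_s ** 2) / C
    return mean, 1.0 / C
```

With informative, near-noiseless data the posterior mean concentrates on the true parameter, e.g., `blr_posterior([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 0.1, 10.0)` recovers a mean close to 2 with small variance.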
We therefore make the practical assumption that the model error is bounded outside a sufficiently large 'outer' state and input space X_o × U_o ⊇ X × U, as illustrated in Figure 3, which relates to Assumption III.1 as follows.

Assumption V.1 (Bounded model error). The set

W̃_θ := {w_m ∈ R^n | ∃ (x, u, θ) ∈ X_o × U_o × E_{p_θ}(Q_{θ|D}) : A x + B u + w_m = θ^⊤ (x, u)}

is an overbound of W_θ according to Assumption III.1, i.e., W_θ ⊆ W̃_θ.

A simple but efficient computation scheme for overapproximating W̃_θ using E_{p_θ}(Q_{θ|D}) can be developed for the special case of a Gaussian prior distribution col_i(θ) ∼ N(0, Σ_θ^i) and Gaussian distributed process noise w_s(k) ∼ N(0, I_n σ_s²). We begin with the posterior distribution Q_{θ|D} of θ conditioned on the data D, given by

p(col_i(θ) | D) = N(σ_s^{−2} C_i^{−1} X^⊤ col_i(y), C_i^{−1}),   (14)

where row_i(X) = φ(x_i, u_i)^⊤, row_i(y) = y_i^⊤, and C_i = σ_s^{−2} X^⊤ X + (Σ_θ^i)^{−1}; see, e.g., [54], [52]. Using the posterior distribution Q_{θ|D} according to (14), we compute a polytopic outer approximation of E_{p_θ}(Q_{θ|D}) in a second step, which can be used to finally obtain an approximation of W̃_θ, and therefore of W_θ, since W_θ ⊆ W̃_θ by Assumption V.1. To this end, we consider the vectorized model parameters vec(θ) and their confidence set

E_{p_θ}(Q_{vec(θ)}) = {vec(θ) ∈ R^{n²+mn} | vec(θ)^⊤ C vec(θ) ≤ χ²_{n²+mn}(p_θ)},

where χ²_{n²+mn} is the quantile function of the chi-squared distribution with n² + mn degrees of freedom and

C := blkdiag(C_1, C_2, …, C_n)   (15)

is the block-diagonal posterior precision according to (14). A computationally cheap outer approximation of E_{p_θ}(Q_{vec(θ)}) can be obtained by picking its major axes {θ̃_i}_{i=1}^{n²+mn} using a singular value decomposition of C, which provide the vertices of an inner polytopic approximation co({θ̃_i}_{i=1}^{n²+mn}) of E_{p_θ}(Q_{vec(θ)}).
Scaling this inner approximation by √(n²+mn) [55] yields vertices of an outer polytopic approximation of E_{p_θ}(Q_{vec(θ)}), given by the convex hull co({θ_i}_{i=1}^{n²+mn}) with θ_i = √(n²+mn) θ̃_i. Based on this outer approximation of E_{p_θ}(Q_{vec(θ)}), it is possible to compute a corresponding outer approximation of W̃_θ as follows. Due to the convexity of co({θ_i}_{i=1}^{n²+mn}), it is sufficient to impose

θ_i^⊤ (x, u) − (A x + B u) ∈ W̃_θ   (16)

for all i ∈ I[1, n²+mn], x ∈ X_o, u ∈ U_o, since by the definition of E_{p_θ}(Q_{vec(θ)}) and Assumption V.1 we have, with probability at least p_θ, that multipliers λ_i(x, u) ≥ 0 with Σ_{i=1}^{n²+mn} λ_i(x, u) = 1 exist such that

f_θ(x, u) = θ^⊤ (x, u) = Σ_{i=1}^{n²+mn} λ_i(x, u) θ_i^⊤ (x, u) ∈ Σ_{i=1}^{n²+mn} λ_i(x, u) {A x + B u} ⊕ W̃_θ = {A x + B u} ⊕ W̃_θ.

Fig. 3: Bounded error (dashed red lines) between the nominal model Ax (blue line) in (2) and a sample of the true dynamics f_θ(x) (gray line) beyond the outer bound ∂X_o of the state space X, according to Assumption V.1.

Therefore, (16) can be used to construct an outer approximation W̃_θ = {w ∈ R^n | ‖w‖₂ ≤ w_max}, where

w_max := max_{i ∈ I[1, n²+mn]} max_{x ∈ X_o, u ∈ U_o} ‖(θ_i^⊤ − (A, B)) (x, u)‖₂   (17)

with (A, B) := E(θ)^⊤, {θ_i}_{i=1}^{n²+mn} = √(n²+mn) {θ̃_i}_{i=1}^{n²+mn}, and θ̃_i the major axes of E_{p_θ}(Q_{vec(θ)}).

B. Calculation of R for uncertain linear dynamics and unbounded disturbances

In this subsection, we provide a method to compute a PRS R with pre-specified probability level p according to Assumption IV.3, which can be used for obtaining both a PRS R_x at probability level p_x corresponding to the state constraints (4) and a PRS R_u at probability level p_u for the input constraints. As is common in the context of related MPC methods, the computations are based on choosing a stabilizing tube controller K_R in (8) first, e.g.,
using LQR design, and subsequently computing the PRS R_x and R_u efficiently. The proposed PRS computation distinguishes between an error resulting from the model mismatch between (1) and (2) and an error caused by the possibly unbounded process noise w_s(k) ∼ Q_{w_s}. The error system (9) admits a decomposition e(k) = e_m(k) + e_s(k) with e(0) = e_m(0) = e_s(0) = 0 and

e_m(k+1) = (A + B K_R) e_m(k) + w_θ(k),   (18)
e_s(k+1) = (A + B K_R) e_s(k) + w_s(k).   (19)

While the fact that Q_{w_s} is known allows us to compute a PRS with respect to e_s(k), as described in Section V-B a), the model error e_m(k) is state- and input-dependent with unknown distribution. Therefore, we bound e_m robustly in probability using the concept of robust invariance, i.e., a robustly positive invariant set accounts deterministically for all possible model mismatches w_θ(k) ∈ W_θ at probability level p_θ, according to the following definition.

Definition V.2. A set E is said to be a robustly positive invariant set (RIS) for system (18) if e_m(0) ∈ E implies that e_m(k) ∈ E for all k ∈ I≥0. It is called a RIS at probability level p_θ if Pr(E is a RIS) ≥ p_θ.

This enables us to state the following lemma for the cumulated error e(k) according to (9).

Lemma V.3. If E is a RIS at probability level p_θ for the model error system (18) and R_s is a PRS for the disturbance error system (19) at probability level p_s, then R = E ⊕ R_s is a PRS for the cumulated error system (9) at probability level p_θ p_s.

Proof. By the definition of the Minkowski sum, e_m(k) ∈ E and e_s(k) ∈ R_s imply e(k) ∈ R for any k ∈ I≥0. Choosing e_s(0) = e_m(0) = 0 yields for all k ∈ I≥0

Pr(e(k) ∈ R) ≥ Pr(e_m(k) ∈ E ∧ e_s(k) ∈ R_s) = Pr(e_m(k) ∈ E) Pr(e_s(k) ∈ R_s) ≥ p_θ p_s,

due to independence, which proves the result.
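Lemma V.3 can be illustrated with scalar interval sets, where the Minkowski sum of two intervals is again an interval. The following Monte Carlo sketch (all sets, radii, and distributions are our own toy choices, not values from the paper) checks that the empirical probability of the summed error lying in E ⊕ R_s dominates the product of the individual probabilities:

```python
import random

def check_prs_sum(n_samples=5000, seed=1):
    """Monte Carlo illustration of Lemma V.3 with scalar interval sets:
    E = [-r_m, r_m], R_s = [-r_s, r_s], so E (+) R_s = [-(r_m+r_s), r_m+r_s].
    e_m plays the role of a bounded model error, e_s of an unbounded
    stochastic error; the two are sampled independently."""
    rng = random.Random(seed)
    r_m, r_s = 0.5, 0.3
    joint = m_ok = s_ok = 0
    for _ in range(n_samples):
        e_m = rng.uniform(-0.6, 0.6)          # occasionally leaves E
        e_s = rng.gauss(0.0, 0.15)            # occasionally leaves R_s
        m_ok += abs(e_m) <= r_m
        s_ok += abs(e_s) <= r_s
        joint += abs(e_m + e_s) <= r_m + r_s  # e in the Minkowski sum
    n = n_samples
    return joint / n, (m_ok / n) * (s_ok / n)

p_sum, p_product = check_prs_sum()
```

Since e_m ∈ E and e_s ∈ R_s together imply e ∈ E ⊕ R_s, the estimate p_sum exceeds the product bound p_product, mirroring Pr(e(k) ∈ R) ≥ p_θ p_s.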
Lemma V.3 allows computation of the PRS R_s that accounts for the stochastic disturbances (19) independently of the RIS E, which deals with the model uncertainty W̃_θ. In the following, we present one option for determining R_s and refer to [38] for further computation methods. Afterwards, an optimization problem for the synthesis of E is given, based on the model obtained from Section V-A.

a) PRS R_s for stochastic errors: Using the variance var(Q_{w_s}) and the Chebyshev bound, a popular way to compute R_s is given by solving the Lyapunov equation A_cl^⊤ Σ_∞ A_cl − Σ_∞ = −var(Q_{w_s}) for Σ_∞, which yields the PRS

R_s = {e_s ∈ R^n | e_s^⊤ Σ_∞^{−1} e_s ≤ p̃}   (20)

with probability level p = 1 − n/p̃; see, e.g., [38]. Furthermore, if Q_{w_s} is a normal distribution, R_s with p̃ = χ²_n(p) is a PRS of probability level p, where χ²_n(p) is the quantile function of the chi-squared distribution with n degrees of freedom.

b) RIS E at probability level p_θ for ellipsoidal model errors: Given a bound on the model error according to Assumption III.1 of the form W_θ = {w ∈ R^n | w^⊤ Q^{−1} w ≤ 1} with Q ∈ S^n_{++}, e.g., Q^{−1} := I_n w_max^{−2} using (17), we can make use of methods from robust control [56] in order to construct a possibly small set E = {e | e^⊤ P e ≤ α} at probability level p_θ by solving

max_{α^{−1}, τ_0, τ_1}  α^{−1}   (21a)
s.t.  [A_cl^⊤ P A_cl − τ_0 P,  A_cl^⊤ P;  P A_cl,  P − τ_1 Q^{−1}] ⪯ 0,   (21b)
      1 − τ_0 − p̄ τ_1 α^{−1} ≥ 0,   (21c)
      τ_0, τ_1 > 0,   (21d)

where P ∈ S^n_{++} has to be pre-selected using, e.g., the infinite-horizon LQR cost x(k)^⊤ P x(k) corresponding to the LQR feedback u(k) = K_R x(k). As pointed out in [57], optimization problem (21) has a monotonicity property in the bilinearity τ_1 α^{−1}, such that it can be efficiently solved using a bisection on the variable α^{−1}. A more advanced design procedure, yielding less conservative robust invariant sets, can be found, e.g., in [58].
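In the scalar case, the Lyapunov equation in a) and the Chebyshev level can be written out directly. The sketch below is our own minimal analogue (a_cl, the noise variance, and p are toy values): it solves a_cl² Σ − Σ = −var(Q_{w_s}) by fixed-point iteration and returns the interval radius of R_s for n = 1, where (20) gives p̃ = 1/(1 − p).

```python
def stationary_variance(a_cl, var_w, tol=1e-12, max_iter=10000):
    """Fixed-point iteration Sigma <- a_cl^2 * Sigma + var_w for the
    scalar Lyapunov equation a_cl^2*Sigma - Sigma = -var_w; converges
    geometrically for a stable gain |a_cl| < 1."""
    sigma = 0.0
    for _ in range(max_iter):
        nxt = a_cl * sigma * a_cl + var_w
        if abs(nxt - sigma) < tol:
            return nxt
        sigma = nxt
    raise RuntimeError("Lyapunov iteration did not converge")

def chebyshev_prs_radius(sigma_inf, p):
    """Radius of R_s = {e : e^2 / sigma_inf <= p_tilde} via the Chebyshev
    bound p = 1 - n/p_tilde with n = 1, i.e. p_tilde = 1/(1-p)."""
    p_tilde = 1.0 / (1.0 - p)
    return (sigma_inf * p_tilde) ** 0.5
```

For example, a_cl = 0.5 and var_w = 0.75 give Σ_∞ = 0.75/(1 − 0.25) = 1; the Chebyshev radius is conservative compared to the Gaussian quantile alternative mentioned above, which is the price of making no distributional assumption beyond the variance.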
In summary, based on the uncertainty of the system parameters with respect to their true values inferred from data, and by solving (21), we obtain a RIS E at probability level p_θ, which contains the model error (18). Together with the PRS R_s from Section V-B a), Lemma V.3 provides the overall PRS for the error system (9), which is given by R = R_s ⊕ E at probability level p_s p_θ. Note that the ratio between p_θ and p_s can be freely chosen in order to obtain overall tubes R_x, R_u at probability levels p_x = p_s^{R_x} p_θ^{R_x} and p_u = p_s^{R_u} p_θ^{R_u} according to the chance constraints (4).

C. Iterative construction of the terminal safe set Z_f

While the terminal constraint (6d) in combination with Assumption IV.4 is key to providing a safe backup control policy π_S for all future times, it can restrict the feasible set of (6). The goal is therefore to provide a large terminal set Z_f, yielding potentially less conservative modifications of the proposed learning-based control input u_L according to (6). This can be iteratively achieved by recycling previously calculated solutions to (6), starting from a potentially conservative initial terminal set Z_f according to Assumption IV.4. Such an initialization can be computed using standard invariant set methods for linear systems; see, e.g., [59] and references therein. Note that the underlying idea of iteratively enlarging the terminal set is related to the concepts presented, e.g., in [60], [61].

Let the set of nominal predicted states obtained from successfully solved instances of (6) be denoted by z*(k) = {z*_j(x(i)), i ∈ I[1,k], j ∈ I[0,N]}.

Proposition V.4. If Assumption IV.4 holds for Z_f and (6) is convex, then the set

Z_f^k := co(z*(k)) ∪ Z_f   (22)

satisfies Assumption IV.4.

Proof. We proceed in a manner similar to the proof of [12, Theorem IV.2].
Let z ∈ Z_f^k and note that if (6) is convex, then its feasible set is a convex set (see, e.g., [55]), and therefore co(z*(k)) is a subset of the feasible set. From this, together with the fact that the system dynamics are linear, it follows for z ∈ co(z*(k)) that multipliers λ_ij ≥ 0 with Σ_{i,j} λ_ij = 1 exist such that z = Σ_{i,j} λ_ij z*_j(x(i)), with corresponding input Σ_{i,j} λ_ij v*_j(x(i)) that satisfies the state and input constraints due to the convexity of these sets. We can therefore explicitly state

κ_f^{z*(k)}(z) = Σ_{i,j} λ_ij v*_j(x(i)) if z ∈ co(z*(k)), and κ_f^{z*(k)}(z) = κ_f(z) otherwise,

as the required nominal terminal control law according to Assumption IV.4, since A z + B κ_f^{z*(k)}(z) ∈ Z_f^k follows from convexity of co(z*(k)) and invariance of Z_f. Noting that co(z*(k)) ⊆ X ⊖ R_x and that the corresponding inputs v*_j(x(i)) lie in U ⊖ K_R R_u by (6b), (6c) shows that for all z ∈ Z_f^k a control law κ̄_f exists according to Assumption IV.4, which completes the proof.

D. Overall PMPSC design procedure

Given the methods presented in this section, the PMPSC problem synthesis from data can be summarized as follows.

Step 1: Compute a linear dynamical model of the form (2) based on available measurements D by estimating θ in (11) using (14).
Step 2: Compute a polytopic confidence set of θ using a singular value decomposition of (15) to obtain the model uncertainty set according to (17).
Step 3: Compute the PRS R_{s,x}, R_{s,u} corresponding to the additive stochastic uncertainty according to (20).
Step 4: Compute the RIS E_x, E_u at the desired probability levels based on the model uncertainty by solving (21).
Step 5: Perform the state and input constraint tightening with respect to the Minkowski sums R_x = R_{s,x} ⊕ E_x and K_R R_u = K_R {R_{s,u} ⊕ E_u} to obtain (6b) and (6c).
Step 6: Initialize Z_f = {0} (principled ways to calculate less restrictive Z_f can be found, for example, in [59]).
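The set update (22) can be visualized in one dimension, where the convex hull of visited nominal states is simply an interval. The sketch below is our own 1-D illustration, not the paper's implementation; it returns the interval hull of the union, which coincides with (22) only when the two intervals overlap (as is the case when trajectories end inside Z_f).

```python
def enlarge_terminal_set(z_f_interval, nominal_states):
    """1-D illustration of (22): enlarge the initial interval Z_f with the
    convex hull (here: interval hull) of previously computed nominal
    states z*_j(x(i)) from solved instances of (6)."""
    hull = (min(nominal_states), max(nominal_states))
    lo = min(z_f_interval[0], hull[0])
    hi = max(z_f_interval[1], hull[1])
    # In 1-D the union of two overlapping intervals is again an interval;
    # in higher dimensions the union in (22) need not be convex, and the
    # piecewise terminal control law from the proof handles the two parts.
    return (lo, hi)
```

For instance, starting from Z_f = [−0.1, 0.1] and observing nominal states {0.05, 0.5, 0.3} enlarges the terminal set to [−0.1, 0.5].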
Using these ingredients, any potentially unsafe learning-based control law u_L(k) can be safeguarded by solving (6) and applying (8) at every time step k. When the constraint tightening (6b), (6c) or the nominal terminal set constraint (6d) is overly conservative, it is possible to make use of the system trajectories during closed-loop operation to improve performance. Collected state measurements can be used to reduce model uncertainty, allowing tighter bounds on w_m and a recomputation of R_x, R_u that enables greater exploration of the system in the future, as demonstrated in Figure 8. In addition, nominal trajectories can be used to enlarge Z_f according to (22).

VI. NUMERICAL EXAMPLE: SAFELY LEARNING TO CONTROL A CAR

In this section, we apply the proposed PMPSC scheme in order to safely learn how to drive a simulated autonomous car along a desired trajectory without leaving a narrow road. For the car simulation we consider the dynamics

ẋ = v cos(ψ),   ψ̇ = (v/L) tan(δ) · 1/(1 + (v/v_CH)²),
ẏ = v sin(ψ),   δ̇ = (1/T_δ)(u_δ − δ),
v̇ = a,          ȧ = (1/T_a)(u_a − a),   (23)

with position (x, y) in world coordinates, orientation ψ, velocity v, acceleration a, and steering angle δ, where the acceleration is modeled by a first-order lag with respect to the desired acceleration (system input) u_a, and the steering angle is likewise modeled by a first-order lag with respect to the desired steering angle (system input) u_δ. The system is subject to the state and input constraints |δ| ≤ 0.7 rad, |v| ≤ 19.8 m s⁻¹, −6 ≤ a ≤ 2 m s⁻², |u_δ| ≤ 1.39 rad, and −6 ≤ u_a ≤ 2 m s⁻², for which the true car dynamics can be approximately represented by (23) (see, e.g., [62]) with parameters T_δ = 0.08 s, T_a = 0.3 s, L = 2.9 m, and v_CH = 20 m s⁻¹. The system is discretized with a sampling time of 0.1 s.
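For reproduction purposes, the continuous-time model (23) can be stepped forward numerically. The sketch below uses a plain Euler discretization with the stated sampling time; note that the yaw-rate factor 1/(1 + (v/v_CH)²) is our reading of the garbled understeer term in (23) (a standard characteristic-velocity model), so it should be treated as an assumption rather than the paper's exact equation.

```python
import math

def car_step(state, u_delta, u_a, dt=0.1,
             T_delta=0.08, T_a=0.3, L=2.9, v_ch=20.0):
    """One Euler step of the car model (23).
    state = (x, y, psi, v, delta, a); inputs are the desired steering
    angle u_delta and desired acceleration u_a."""
    x, y, psi, v, delta, a = state
    dx = v * math.cos(psi)
    dy = v * math.sin(psi)
    # Kinematic yaw rate with a characteristic-velocity understeer
    # correction (our assumed reading of the term in (23)).
    dpsi = (v / L) * math.tan(delta) / (1.0 + (v / v_ch) ** 2)
    dv = a
    ddelta = (u_delta - delta) / T_delta  # first-order steering lag
    da = (u_a - a) / T_a                  # first-order acceleration lag
    return (x + dt * dx, y + dt * dy, psi + dt * dpsi,
            v + dt * dv, delta + dt * ddelta, a + dt * da)
```

Driving straight at constant speed (zero steering, zero acceleration) advances only the x position, which provides a quick consistency check of the discretization.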
The learning task is to find a control law that tracks a periodic reference trajectory on a narrow road, which translates into an additional safety constraint |y| ≤ 1.

Fig. 4: Learning to track a periodic trajectory subject to constraints: shown are the first 30 learning episodes without (red) and with (green) the proposed PMPSC framework, as well as the safety constraints (black).

The terminal set according to Assumption IV.4 is defined as the road center with angles ψ = δ = 0 and acceleration a = 0, which is a safe set for (23) with κ_f = 0. The planning horizon is selected as N = 30, and the model (2) as well as the PRS set R_x = R_u with probability level 98% is computed based on a 30-second state and input trajectory according to Section V; see the supplementary material for further details. We use Bayesian optimization as described in [63] for learning a linear control law, implementing a policy search method that automatically trades off exploration of the parameter space and exploitation of promising subsets, and which does not provide inherent safety guarantees. As the cost function for each episode, we penalize the deviation from the reference trajectory quadratically, i.e.,

Σ_{i=1}^{60} (x_ref(i) − x(i))^⊤ Q (x_ref(i) − x(i)) + u_L(i)^⊤ R u_L(i),

where Q = diag(1, 1.5, 1, 1, 100, 100) and R = diag(1, 1). While the resulting learning episodes without the PMPSC framework would leave the safety constraints, i.e. the road, in a significant number of samples, as shown in Figure 4, the safety framework enables safe learning in every episode. In Figure 5, two example learning episodes from Figure 4 are shown, where the size of the input modification through the safety framework is indicated with different circle radii along the trajectories.
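The episode cost above is a plain quadratic tracking cost and can be evaluated as follows (a minimal sketch; the state and input containers and the diagonal representation of Q and R are our own conventions, not the paper's implementation):

```python
def episode_cost(states, refs, inputs, Q_diag, R_diag):
    """Quadratic tracking cost of one episode:
    sum_i (x_ref(i) - x(i))^T Q (x_ref(i) - x(i)) + u_L(i)^T R u_L(i),
    with diagonal Q and R given by their diagonal entries."""
    cost = 0.0
    for x, xr, u in zip(states, refs, inputs):
        cost += sum(q * (a - b) ** 2 for q, a, b in zip(Q_diag, xr, x))
        cost += sum(r * ui ** 2 for r, ui in zip(R_diag, u))
    return cost
```

With the weights from the example, large penalties on the last two states (100 each) prioritize tracking those components over the others.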
While in the first episode the safety framework intervenes in the learning-based policy in order to ensure safety, the algorithm safely begins to converge after 30 episodes with significantly fewer safety interventions. In Figure 6 we compare the performance of the learning-based control policy when applied directly against the performance of the safety-enhanced policy using PMPSC, and observe that the safety-ensuring actions yield a slightly slower convergence and slightly worse performance after learning convergence on average, compared to direct application of the unsafe algorithm.

VII. CONCLUSION

This paper has introduced a methodology to enhance arbitrary RL algorithms with safety guarantees during the process of learning. The scheme is based on a data-driven, linear belief approximation of the system dynamics that is used in order to compute safety policies for the learning-based controller 'on-the-fly'. By proving the existence of a safety policy at all time steps, safety of the closed-loop system is established. Principled design steps for the scheme are introduced, based on Bayesian inference and convex optimization, which require little expert system knowledge in order to realize safe RL applications.

Fig. 5: Resulting safe closed-loop trajectories during learning with initial policy parameters (blue) and final policy parameters (green). The circle radii indicate the relative magnitude of safety-ensuring modifications of the learning-based controller.

Fig. 6: Closed-loop cost for 100 different experiments. Thin lines depict experiment samples and thick lines show the corresponding mean. Red lines indicate direct application of the learning-based controller and green lines illustrate the combination with the proposed PMPSC scheme.

APPENDIX

A. Details of numerical example

The model is computed according to Section V-A based on measurements of system (23), as depicted in Figure 7, with sensor noise σ_s = 0.
01 and prior distribution Σ_θ^i = 10 I_n. The state feedback

K_R = [−1.25  −0.05  −2.34  −0.75  −0.19  0.02;
        0.02  −0.69  −0.03   0.02  −5.04  −2.85]

according to Assumption IV.3 is computed for the mean dynamics (13) using LQR design. Applying the procedure described in Section V-B with different numbers of measurements, as shown in Figure 7, yields different constraint tightenings of the interval state and input constraints, as depicted in Figure 8. The final tightened input and state constraints used in the numerical example in Section VI are given by |u_δ| ≤ 1.39 rad, −5.4 ≤ u_a ≤ 1.3 m s⁻², |y| ≤ 0.87 m, |δ| ≤ 0.64 rad, |v| ≤ 19.97 m s⁻¹, and −4.97 ≤ a ≤ 0.44 m s⁻², computed

Fig. 7: Measurements of system (23), which are used for computing the PRS set R.

Fig. 8: Tightening of state and input interval constraints for different numbers of measurements used to design the PMPSC scheme.

using the YALMIP toolbox [64] together with MOSEK [65] to solve the resulting semi-definite program. Starting from a terminal set Z_f = {0}, we illustrate in Figure 9 how the volume of Z_f can be iteratively enlarged based on previously calculated nominal state trajectories at each time step by following Proposition V.4, reducing the conservatism of the terminal constraint (6d).

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[2] J. Merel, Y. Tassa, S. Srinivasan, J. Lemmon, Z. Wang, G. Wayne, and N. Heess, "Learning human behaviors from motion capture by adversarial imitation," arXiv preprint, 2017.
[3] K. H. Ang, G. Chong, and Y. Li, "PID control system analysis, design, and technology," IEEE Transactions on Control Systems Technology, vol.
13, no. 4, pp. 559–576, 2005.
[4] S. Skogestad and I. Postlethwaite, Multivariable Feedback Control: Analysis and Design. Wiley New York, 2007, vol. 2.
[5] S. J. Qin and T. A. Badgwell, "An overview of nonlinear model predictive control applications," in Nonlinear Model Predictive Control. Springer, 2000, pp. 369–392.

Fig. 9: Volume of the terminal set Z_f according to (22) over learning episodes.

[6] J. H. Lee, "Model predictive control: Review of the three decades of development," International Journal of Control, Automation and Systems, vol. 9, no. 3, p. 415, 2011.
[7] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, "Concrete problems in AI safety," arXiv preprint arXiv:1606.06565, 2016.
[8] Y. Wang and S. Boyd, "Fast model predictive control using online optimization," IEEE Transactions on Control Systems Technology, vol. 18, no. 2, pp. 267–278, March 2010.
[9] A. Domahidi, A. U. Zgraggen, M. N. Zeilinger, M. Morari, and C. N. Jones, "Efficient interior point methods for multistage problems arising in receding horizon control," Dec 2012, pp. 668–674.
[10] B. Houska, H. Ferreau, and M. Diehl, "ACADO Toolkit – An open source framework for automatic control and dynamic optimization," Optimal Control Applications and Methods, vol. 32, no. 3, pp. 298–312, 2011.
[11] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin, "A general safety framework for learning-based control in uncertain robotic systems," IEEE Transactions on Automatic Control, 2018.
[12] K. P. Wabersich and M. N. Zeilinger, "Linear model predictive safety certification for learning-based control," in 2018 IEEE Conference on Decision and Control (CDC), 2018, pp. 7130–7135.
[13] T. Gurriet, M. Mote, A. D. Ames, and E. Feron, "An online approach to active set invariance," in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 3592–3599.
[14] T. Mannucci, E. J.
van Kampen, C. de Visser, and Q. Chu, "Safe exploration algorithms for reinforcement learning controllers," IEEE Transactions on Neural Networks and Learning Systems, vol. PP, pp. 1–13, 2018.
[15] O. Bastani, "Safe reinforcement learning via online shielding," arXiv preprint arXiv:1905.10691, 2019.
[16] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.
[17] J. Achiam, D. Held, A. Tamar, and P. Abbeel, "Constrained policy optimization," in International Conference on Machine Learning, 2017, pp. 22–31.
[18] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[19] K. P. Wabersich and M. Toussaint, "Automatic testing and minimax optimization of system parameters for best worst-case performance," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept 2015, pp. 5533–5539.
[20] F. Berkenkamp and A. P. Schoellig, "Safe and robust learning control with Gaussian processes," in 2015 European Control Conference (ECC), 2015, pp. 2496–2501.
[21] J. Schreiter, D. Nguyen-Tuong, M. Eberts, B. Bischoff, H. Markert, and M. Toussaint, "Safe exploration for active learning with Gaussian processes," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2015, pp. 133–149.
[22] F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, "Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes," Dec 2016, pp. 4661–4666.
[23] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, "Safe model-based reinforcement learning with stability guarantees," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds.
Curran Associates, Inc., 2017, pp. 908–918. [24] A. Aswani, H. Gonzalez, S. S. Sastry , and C. T omlin, “Provably safe and robust learning-based model predictiv e control, ” Automatica , vol. 49, no. 5, pp. 1216 – 1226, 2013. [25] P . Bouffard, A. Aswani, and C. T omlin, “Learning-based model predictive control on a quadrotor: Onboard implementation and e xperimental results, ” in 2012 IEEE International Conference on Robotics and Automation , May 2012, pp. 279–284. [26] C. J. Ostafew , A. P . Schoellig, and T . D. Barfoot, “Robust constrained learning-based nmpc enabling reliable mobile robot path tracking, ” The International Journal of Robotics Researc h , vol. 35, no. 13, pp. 1547– 1563, 2016. [27] T . K oller , F . Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration, ” in 2018 IEEE Conference on Decision and Control (CDC) . IEEE, 2018, pp. 6059–6066. [28] U. Rosolia, X. Zhang, and F . Borrelli, “Robust learning model predictive control for iterative tasks: Learning from experience, ” in 2017 IEEE Confer ence on Decision and Contr ol (CDC) , 2017, pp. 1157–1162. [29] M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search, ” in Pr oceedings of the 28th International Conference on machine learning (ICML) , 2011, pp. 465– 472. [30] J. H. Gillula and C. J. T omlin, “Guaranteed safe online learning of a bounded system, ” in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on , 2011, pp. 2979–2984. [31] K. P . W abersich and M. N. Zeilinger , “Scalable synthesis of safety certificates from data with application to learning-based control, ” in 2018 Eur opean Control Conference (ECC) , 2018, pp. 1691–1697. [32] R. B. Larsen, A. Carron, and M. N. Zeilinger, “Safe learning for distributed systems with bounded uncertainties, ” 20th IF AC W orld Congr ess , vol. 50, no. 1, pp. 2536 – 2542, 2017. [33] P . Wieland and F . 
Allgöwer, “Constructive safety using control barrier functions, ” IF A C Pr oceedings V olumes , vol. 40, no. 12, pp. 462–467, 2007. [34] L. Ljung, “System identification, ” in Signal analysis and prediction . Springer , 1998, pp. 163–173. [35] S. R. Chowdhury and A. Gopalan, “On kernelized multi-armed bandits, ” in International Confer ence on Machine Learning (ICML) , 2017, pp. 844–853. [36] A. Lederer , J. Umlauft, and S. Hirche, “Uniform error bounds for gaussian process regression with application to safe control, ” in Advances in Neural Information Pr ocessing Systems 32 , 2019, pp. 659–669. [37] J. B. Rawlings and D. Q. Mayne, Model predictive contr ol: Theory and design . Nob Hill Pub., 2009. [38] L. Hewing and M. N. Zeilinger , “Stochastic model predictiv e control for linear systems using probabilistic reachable sets, ” in 2018 IEEE Confer ence on Decision and Contr ol (CDC) , 2018, pp. 5182–5188. [39] G. Pola, J. L ygeros, and M. D. Di Benedetto, “In variance in stochastic dynamical control systems, ” in International Symposium on Mathematical Theory of Networks and Systems , 2006. [40] A. Abate, M. Prandini, J. L ygeros, and S. Sastry , “Probabilistic reacha- bility and safety for controlled discrete time stochastic hybrid systems, ” Automatica , vol. 44, no. 11, pp. 2724 – 2734, 2008. [41] M. Farina, L. Giulioni, L. Magni, and R. Scattolini, “A probabilistic approach to Model Predictiv e Control, ” 2013 IEEE Confer ence on Decision and Control (CDC) , pp. 7734–7739, 2013. [42] M. Farina, L. Giulioni, and R. Scattolini, “Stochastic linear Model Predictiv e Control with chance constraints - A revie w, ” J ournal of Pr ocess Contr ol , vol. 44, pp. 53–67, 2016. [43] J. A. Paulson, E. A. Buehler, R. D. Braatz, and A. Mesbah, “Stochastic model predictive control with joint chance constraints, ” International Journal of Contr ol , vol. 0, no. 0, pp. 1–14, 2017. [44] L. Hewing, K. P . W abersich, and M. N. 
Zeilinger , “Recursiv ely feasible stochastic model predictive control using indirect feedback, ” Automatica , vol. 119, 2020. [45] A. K. Akametalu, J. F . Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. T omlin, “Reachability-based safe learning with gaussian processes, ” Dec 2014, pp. 1424–1431. [46] J. Schulman, S. Levine, P . Moritz, M. Jordan, and P . Abbeel, “Trust region polic y optimization, ” in 32nd International Confer ence on Machine Learning, ICML 2015 , vol. 3, 2015, pp. 1889–1897. [47] D. Silver , G. Lev er , N. Heess, T . Degris, D. Wierstra, and M. Riedmiller , “Deterministic policy gradient algorithms, ” in Pr oceedings of the 31st International Conference on International Conference on Machine Learning - V olume 32 , ser . ICML 2014. JMLR.org, 2014, p. I–387–I–395. [48] R. Martinez-Cantin, N. de Freitas, A. Doucet, and J. A. Castellanos, “ Ac- tiv e policy learning for robot planning and exploration under uncertainty . ” in Robotics: Science and Systems , vol. 3, 2007, pp. 321–328. [49] L. P . Fröhlich, E. D. Klenske, C. G. Daniel, and M. N. Zeilinger , “Bayesian optimization for policy search in high-dimensional systems via automatic domain selection, ” arXiv preprint , 2020. [50] F . Borrelli, Constrained optimal control of linear and hybrid systems . Springer , 2003, vol. 290. 13 [51] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. K olter , “Differentiable conv ex optimization layers, ” in Advances in neural information pr ocessing systems , 2019, pp. 9562–9574. [52] C. Rasmussen and C. W illiams, Gaussian Pr ocesses for Machine Learning , ser . Adapti ve Computation and Machine Learning. MIT Press, 2006. [53] C. M. Bishop, P attern Recognition and Machine Learning (Information Science and Statistics) . Berlin, Heidelberg: Springer-V erlag, 2006. [54] J. Friedman, T . Hastie, and R. Tibshirani, The elements of statistical learning . Springer series in statistics New Y ork, NY , USA:, 2001, vol. 1, no. 10. [55] S. 
Boyd and L. V andenberghe, Conve x optimization . Cambridge univ ersity press, 2004. [56] S. Boyd, L. El Ghaoui, E. Feron, and V . Balakrishnan, Linear matrix inequalities in system and contr ol theory . SIAM, 1994. [57] L. He wing, A. Carron, K. P . W abersich, and M. N. Zeilinger, “On a correspondence between probabilistic and robust inv ariant sets for linear systems, ” in 2018 European Contr ol Conference (ECC) , 2018. [58] D. Limon, I. Alvarado, T . Alamo, and E. Camacho, “On the design of robust tube-based mpc for tracking, ” IF A C Pr oceedings V olumes , vol. 41, no. 2, pp. 15 333–15 338, 2008. [59] F . Blanchini, “Set in variance in control, ” Automatica , vol. 35, no. 11, pp. 1747 – 1767, 1999. [60] F . D. Brunner, M. Lazar, and F . Allgöwer, “Stabilizing linear model predictiv e control: On the enlargement of the terminal set, ” in 2013 Eur opean Control Conference (ECC) , 2013, pp. 511–517. [61] U. Rosolia and F . Borrelli, “Learning model predictive control for iterati ve tasks: a computationally efficient approach for linear system, ” IF AC- P apersOnLine , vol. 50, no. 1, pp. 3142–3147, 2017. [62] Y . Kuwata, J. T eo, G. Fiore, S. Karaman, E. Frazzoli, and J. P . Ho w , “Real- time motion planning with applications to autonomous urban driving, ” IEEE Tr ansactions on Contr ol Systems T echnology , vol. 17, no. 5, pp. 1105–1118, 2009. [63] M. Neumann-Brosig, A. Marco, D. Schwarzmann, and S. T rimpe, “Data- Efficient Autotuning With Bayesian Optimization: An Industrial Control Study, ” IEEE Tr ansactions on Contr ol Systems T echnology , pp. 1–11, 2019. [64] J. Löfberg, “Y almip : A toolbox for modeling and optimization in matlab, ” in In Proceedings of the CA CSD Conference , T aipei, T aiwan, 2004. [65] M. ApS, The MOSEK optimization toolbox for MATLAB manual. V ersion 9.0. , 2019. [Online]. A vailable: http://docs.mosek.com/9.0/toolbox/index. html Kim P . W abersich receiv ed his BSc. and MSc. 
in Engineering Cybernetics from the University of Stuttgart, Germany, in 2015 and 2017, respectively. He is currently working towards a PhD degree at the Institute for Dynamic Systems and Control (IDSC) at ETH Zurich. During his studies he was a research assistant at the Machine Learning and Robotics Lab (University of Stuttgart) and at the Daimler Autonomous Driving Research Center (Böblingen, Germany, and Sunnyvale, CA). His research interests include safe learning-based control and model predictive control.

Lukas Hewing is a postdoctoral researcher at the Institute for Dynamic Systems and Control (IDSC) at ETH Zurich, from where he received his Ph.D. in 2020. Prior to this, he received his M.Sc. in Automation Engineering (with distinction) and B.Sc. in Mechanical Engineering from Aachen University in 2015 and 2013, respectively. He was a student research assistant at the Institute of Automatic Control (IRT) and the Chair for Medical Information Technology (MedIT) in Aachen, Germany, and conducted a research stay at Tsinghua University, Beijing, China, in 2015. His research interests include safe learning-based and stochastic model predictive control.

Andrea Carron received the Dr. Eng. Bachelor's and Master's degrees in control engineering from the University of Padova, Padova, Italy, in 2010 and 2012, respectively. He received his Ph.D. degree in 2016 from the University of Padova. He is currently a Postdoctoral Fellow with the Department of Mechanical and Process Engineering at ETH Zürich. His interests include safe learning, learning-based control, and nonparametric estimation.

Melanie N. Zeilinger is an Assistant Professor at ETH Zurich, Switzerland. She received the Diploma degree in engineering cybernetics from the University of Stuttgart, Germany, in 2006, and the Ph.D. degree with honors in electrical engineering from ETH Zurich, Switzerland, in 2011. From 2011 to 2012 she was a Postdoctoral Fellow with the Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland. She was a Marie Curie Fellow and Postdoctoral Researcher with the Max Planck Institute for Intelligent Systems, Tübingen, Germany, until 2015, and with the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, CA, USA, from 2012 to 2014. From 2018 to 2019 she was a professor at the University of Freiburg, Germany. Her current research interests include safe learning-based control, as well as distributed control and optimization, with applications to robotics and human-in-the-loop control.