Safety-Aware Performance Boosting for Constrained Nonlinear Systems


Authors: Danilo Saccani, Haoming Shen, Luca Furieri, Giancarlo Ferrari-Trecate

Abstract — We study a control architecture for nonlinear constrained systems that integrates a performance-boosting (PB) controller with a scheduled Predictive Safety Filter (PSF). The PSF acts as a pre-stabilizing base controller that enforces state and input constraints. The PB controller, parameterized as a causal operator, influences the PSF in two ways: it proposes a performance input to be filtered, and it provides a scheduling signal to adjust the filter's Lyapunov-decrease rate. We prove two main results: (i) Stability by design: any controller adhering to this parametrization maintains closed-loop stability of the pre-stabilized system and inherits PSF safety. (ii) Trajectory-set expansion: the architecture strictly expands the set of safe, stable trajectories achievable by controllers combined with conventional PSFs, which rely on a pre-defined Lyapunov-decrease rate to ensure stability. This scheduling allows the PB controller to safely execute complex behaviors, such as transient detours, that are provably unattainable by standard PSF formulations. We demonstrate this expanded capability on a constrained inverted-pendulum task with a moving obstacle.

I. INTRODUCTION

To operate effectively in complex environments, autonomous systems must go beyond simple stabilization. The main goal is to execute advanced tasks that improve performance while guaranteeing safety at all times. Reconciling these performance goals with formal proofs of closed-loop stability and constraint satisfaction remains an open problem. Model Predictive Control (MPC) enforces safety through hard constraints while optimizing a finite-horizon cost [1]. However, coupling safety, stability, and performance in a single online optimization can be conservative [2].
Although multi-trajectory MPC schemes partially mitigate this issue by separating "safe" and "exploitation" trajectories [2], [3], they still induce an implicit policy based on a finite-horizon approximation of the underlying nonlinear optimal control problem, which may remain conservative and computationally demanding. This motivates the search for richer, explicit policy classes. Learning-based approaches, such as Reinforcement Learning (RL), offer a promising alternative by optimizing expressive, explicit policies from data [4]. However, generic RL pipelines often lack formal guarantees for safety and stability [5]. Recent works exploit the Internal Model Control (IMC) structure and Youla-style parameterizations for nonlinear systems to design neural network controllers with closed-loop stability guarantees [6]–[10]. While these methods offer a clean parametrization of all and only the controllers preserving the ℓ_p-stability properties of a given, possibly pre-stabilized, nonlinear system, they typically do not natively handle the hard state and input constraints required for safety. A model-based strategy to provide these guarantees is to filter unsafe inputs before application [11].

This work was supported by the Swiss National Science Foundation (SNSF) through the NCCR Automation, a National Centre of Competence in Research (grant number 51NF40_225155). Luca Furieri is grateful to the SNSF for the Ambizione grant (grant number PZ00P2_208951). D. Saccani and G. Ferrari-Trecate are with the Institute of Mechanical Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland ({danilo.saccani, giancarlo.ferraritrecate}@epfl.ch). H. Shen is with the IMT School for Advanced Studies, Lucca, Italy (haoming.shen@imtlucca.it). L. Furieri is with the Department of Engineering Science, University of Oxford, United Kingdom (luca.furieri@eng.ox.ac.uk).
While methods like Control Barrier Functions (CBFs) can enforce safety via online Quadratic Programs (QPs), they often require handcrafted certificates and can face infeasibility issues [12]. Predictive Safety Filters (PSFs) [13] are MPC-based filters that penalize deviations from a desired input. Recent variants add an explicit Lyapunov-decrease constraint to also enforce stability [14], [15]. While effective, these PSFs have two major limitations. First, the fixed, monotonic decrease constraint permanently confines the system within shrinking Lyapunov level sets. This can be overly conservative and fundamentally limits the set of achievable closed-loop trajectories, rendering complex behaviors, such as safe detours to navigate around an obstacle, impossible. Second, training a parametrized controller wrapped around a PSF with a gradient-descent approach is practically difficult, as it requires differentiating through a non-smooth optimization problem, whose solution map can be non-differentiable when constraints become active [16]. To address these limitations, we propose a control scheme decoupling safety, stability, and performance by integrating a parameterized PB controller with a scheduled PSF (Figure 1). Our contributions are: (i) a PSF whose Lyapunov-decrease rate is scheduled by the PB controller's ℓ_2 input to reduce conservativeness while preserving recursive feasibility; (ii) theoretical guarantees of closed-loop ℓ_2-stability and safety, proving that this architecture strictly expands the set of achievable trajectories compared to fixed-rate PSFs; and (iii) a data-driven actor-critic training procedure that bypasses the need to differentiate through the PSF optimization problem.

Notation. The set of all sequences x = (x_0, x_1, ...) with x_t ∈ R^n is denoted by ℓ^n. The t-th element x_t of a sequence x may also be denoted as (x)_t.
For p ∈ [1, ∞], define ℓ^n_p := { x ∈ ℓ^n | ∥x∥_p := (Σ_{t=0}^∞ |x_t|^p)^{1/p} < ∞ for p < ∞, ∥x∥_∞ := sup_t |x_t| < ∞ }, for any vector norm |·|. We simplify ∥·∥_2 as ∥·∥. An operator A : ℓ^n → ℓ^m is said to be ℓ_2-stable if it is causal and A(w) ∈ ℓ^m_2 for all w ∈ ℓ^n_2; equivalently, A ∈ L_2. We denote by N the set of positive integers and, for some a < b, N_{[a,b]} = { n ∈ N | a ≤ n ≤ b }. The predictions of a variable x over a horizon of length N ≥ 1 at time t are denoted by x_{i|t} with i ∈ N_{[0,N−1]}. The entire predicted trajectory is denoted by x_{0:N|t} or, when clear from the context, by x_{·|t}. Optimal quantities are marked with the superscript ∗. The operator E_{x∼D}[·] denotes expectation with respect to the random variable x distributed according to D.

Fig. 1: The proposed framework. The PB operator M_θ(·) generates u_L, which feeds both the scheduler ψ(·) and the PSF. Using the resulting rate ρ and the state x, the PSF filters u_L into the safe applied input u.

II. PROBLEM FORMULATION

Consider the noise-free nonlinear discrete-time dynamical system

x_{t+1} = f(x_t, u_t), x_0 ∈ R^n, (1)

where f : R^n × R^m → R^n, and x_t ∈ R^n, u_t ∈ R^m are the state and control input, respectively. The system (1) induces a unique causal transition map

x = F(u, x(0)), (2)

where F : ℓ^m × ℓ^n → ℓ^n, u = (u_0, u_1, ...), x = (x_0, x_1, ...), and x(0) := (x_0, 0, 0, ...) denotes the sequence that injects the initial condition. For the system (2) equipped with a causal controller u = K(x), we denote the closed-loop maps x(0) ↦ x and x(0) ↦ u by Φ_x(F, K) and Φ_u(F, K), respectively. Ideally, we wish the closed-loop system to be stable according to the following definition.
Definition 1: For a given set X_s ⊆ R^n, the closed-loop system composed of (2) and the causal policy u = K(x) is ℓ_2-stable if Φ_u(F, K) and Φ_x(F, K) belong to L_2 for all x_0 ∈ X_s.

Moreover, the system (1) is subject to state and input constraints, representing physical limitations and safe operating regions, expressed as x_t ∈ X, u_t ∈ U for all t ≥ 0, where X ⊆ R^n and U ⊆ R^m are nonempty, compact sets containing the origin. Beyond closed-loop stability and constraint satisfaction, we aim at optimizing performance. We quantify performance by the infinite-horizon discounted loss

L(x, u) = E_{x_0∼D} [ Σ_{t=0}^∞ γ^t l(x_t, u_t) ], (3)

where l : R^n × R^m → R is a piecewise differentiable stage cost and 0 < γ < 1 is the discount factor. We assume l is continuous on the compact set X × U, which ensures the total loss L(x, u) is bounded [17]. The initial state is modeled as a random variable drawn from a distribution D with support X_s. Consequently, the system evolves deterministically once x_0 is sampled, and the only source of uncertainty in closed-loop performance is the initial condition. To guarantee stability, safety, and performance, our goal is to solve the following problem.

Problem 1: Find a nonlinear, time-varying state-feedback controller K(x) = (K_0(x_0), ..., K_t(x_{0:t}), ...) solving the nonlinear optimal control (NOC) problem

min_{K(·)} L(x, u) (4a)
s.t. x_{t+1} = f(x_t, u_t), (4b)
u_t = K_t(x_{0:t}), (4c)
(Φ_x(F, K), Φ_u(F, K)) ∈ L_2, (4d)
x_t ∈ X, u_t ∈ U, ∀ t = 0, 1, ..., ∞. (4e)

Without the state and input constraints (4e), Problem 1 is addressed in [18] via disturbance-feedback control. This approach provides a complete parametrization of all and only the stability-preserving controllers for systems subject to an additive disturbance w ∈ ℓ_p, and it recasts the problem as learning an operator M ∈ L_p.
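To make the objective (3) concrete, the following minimal sketch evaluates the discounted loss for one sampled initial state by rolling out the closed loop. The scalar linear system, the gain, and the quadratic stage cost below are placeholder choices for illustration only, not the paper's plant or policy:

```python
import numpy as np

def discounted_loss(f, policy, x0, stage_cost, gamma=0.95, T=500):
    """Evaluates (3) for one initial state: rolls out x_{t+1} = f(x_t, u_t)
    with u_t = policy(x_t) and accumulates gamma^t * l(x_t, u_t).
    The infinite sum is truncated at T steps."""
    x, total = np.asarray(x0, dtype=float), 0.0
    for t in range(T):
        u = policy(x)
        total += gamma**t * stage_cost(x, u)
        x = f(x, u)
    return total

# Placeholder example: scalar linear system with a stabilizing static gain.
f = lambda x, u: 0.9 * x + u          # hypothetical dynamics
policy = lambda x: -0.5 * x           # hypothetical stabilizing policy
l = lambda x, u: float(x**2 + 0.1 * u**2)

loss = discounted_loss(f, policy, np.array(1.0), l, gamma=0.9)
```

For this closed loop, x_{t+1} = 0.4 x_t, so the sum has the closed form 1.025 / (1 − 0.9·0.16), which the truncated rollout matches to machine precision.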
In our noise-free setting, the role of w is played by the injected initial condition x(0). However, that framework assumes X = R^n and U = R^m and, further, that a baseline (globally) stabilizing controller is already in place, making the closed-loop map from disturbances to states ℓ_p-stable. As a result, it does not directly handle hard state and input constraints or plants that are only locally stabilized.

III. SAFE PERFORMANCE BOOSTING

Our architecture (Fig. 1) decouples safety (PSF-enforced state and input constraints), performance (the PB controller generating u_L), and stability. Stability is achieved synergistically: the PSF provides the Lyapunov certificate J, while the PB controller is parameterized to ensure u_L ∈ ℓ_2, eventually triggering a strictly stabilizing decrease in J. Below, we introduce the scheduled PSF and prove that this ℓ_2 property guarantees closed-loop stability as per Definition 1. Throughout the paper, we work in the noise-free ℓ_2 setting. Once the inner loop, composed of the system and the PSF, is ℓ_2-stable, an ℓ_2 performance input u_L (with a fixed initial condition x(0)) implies (x, u) ∈ ℓ_2; hence trajectories are bounded (because ℓ_2 ⊂ ℓ_∞). Extensions to other p-norms follow with minor modifications.

A. PSF with scheduled Lyapunov decrease

We first introduce the PSF block in Figure 1, which connects the operator M(·) and the plant F. Given a prediction horizon N ≥ 1, we define the certificate function J : R^{n·(N+1)} × R^{m·N} → R,

J(x_{0:N}, u_{0:N−1}) = Σ_{i=0}^{N−1} s(x_i, u_i) + m(x_N), (5)

to serve as a Lyapunov function for closed-loop stability, where s : R^n × R^m → R_{≥0} and m : R^n → R_{≥0} are the stage and terminal functions, respectively. The PSF aims to minimize the deviation between the filtered input and the performance input while ensuring constraint satisfaction and stability.
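In code, the certificate (5) is simply a sum of stage terms plus a terminal term along a predicted trajectory. The sketch below uses placeholder quadratic ingredients s(x,u) = x'Qx + u'Ru and m(x) = x'Px; the specific matrices are illustrative choices, not the paper's:

```python
import numpy as np

def certificate_J(xs, us, stage, terminal):
    """Certificate (5): J(x_{0:N}, u_{0:N-1}) = sum_i s(x_i, u_i) + m(x_N).
    xs holds N+1 predicted states, us holds N predicted inputs."""
    assert len(xs) == len(us) + 1
    return sum(stage(x, u) for x, u in zip(xs[:-1], us)) + terminal(xs[-1])

# Placeholder quadratic stage and terminal functions (nonnegative by construction).
Q, R, P = np.eye(2), 0.1 * np.eye(1), 2.0 * np.eye(2)
s = lambda x, u: float(x @ Q @ x + u @ R @ u)
m = lambda x: float(x @ P @ x)

xs = [np.array([1.0, 0.0]), np.array([0.5, 0.0]), np.array([0.0, 0.0])]
us = [np.array([-0.2]), np.array([-0.1])]
J = certificate_J(xs, us, s, m)
```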
Consider the following PSF with stability certificate [14], [15] at time t:

min_{u_{·|t}} ∥u_{0|t} − u_{L,t}∥² (6a)
s.t. x_{0|t} = x_t, (6b)
x_{i+1|t} = f(x_{i|t}, u_{i|t}), ∀ i ∈ N_{[0,N−1]}, (6c)
x_{i|t} ∈ X, ∀ i ∈ N_{[0,N−1]}, (6d)
u_{i|t} ∈ U, ∀ i ∈ N_{[0,N−1]}, (6e)
x_{N|t} ∈ Z_f, (6f)
J(x_{·|t}, u_{·|t}) ≤ J(x*_{·|t−1}, u*_{·|t−1}) − (1 − ρ) · s(x*_{0|t−1}, u*_{0|t−1}), (6g)

where (x*_{·|t}, u*_{·|t}) denotes an optimizer of (6), Z_f ⊆ X is the terminal set, and u_{L,t} ∈ R^m is the performance input from the PB controller. The first optimal predicted control input u*_{0|t} is applied to the system. Constraint (6g) enforces a minimal decrease of J at each time step. The parameter ρ ∈ [0, 1) sets the required decrease, guaranteeing a uniform stability margin 1 − ρ > 0; a smaller ρ implies a stronger decrease [15]. This yields a Lyapunov-like certificate. However, using a fixed decrease rate ρ can shrink the admissible input set over time. We therefore modify (6g) to reduce conservativeness during performance optimization. Specifically, we relax the constraint ρ ∈ [0, 1) and allow a time-varying ρ_t ∈ R_{≥0} governed by a scalar function ψ.

Definition 2 (Tightening schedule): Fix ρ̄ ∈ [0, 1), ε > 0, and a ceiling value ρ_max ≥ 1. The map ψ : [0, ∞) → [0, ∞) is a tightening schedule if it is nondecreasing and continuous, and satisfies
(P1) ψ(r) = ρ̄ for all r ∈ [0, ε],
(P2) ψ(r) ≥ ρ̄ for all r ≥ 0,
(P3) ψ(r) ≤ ρ_max for all r ≥ 0.

Given a performance input u_{L,t}, we set ρ_t := ψ(∥u_{L,t}∥). This defines the causal operator ψ : ℓ^m → ℓ that maps the signal u_L to the scheduling signal ρ. The proposed tightening couples the performance input and the decrease rate. Indeed, if u_L ∈ ℓ_2, then ∥u_{L,t}∥ → 0. By (P1), there exists T such that for all t ≥ T we have ∥u_{L,t}∥ ≤ ε and ρ_t = ψ(∥u_{L,t}∥) = ρ̄.
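A schedule satisfying (P1)–(P3) of Definition 2 takes only a few lines; the sketch below implements the piecewise-linear ramp of the form used in (7), with the parameter values ρ̄ = 0.5, ε = 0.05, ρ_max = 10 adopted later in the numerical example:

```python
def make_schedule(rho_bar=0.5, eps=0.05, rho_max=10.0):
    """Returns psi: [0, inf) -> [rho_bar, rho_max], continuous and nondecreasing,
    with psi(r) = rho_bar for r <= eps (P1), psi(r) >= rho_bar (P2),
    and psi(r) <= rho_max (P3)."""
    def psi(r):
        if r <= eps:
            return rho_bar            # dead-zone: fixed stabilizing rate
        # linear ramp above eps, saturating at the ceiling rho_max
        return rho_bar + (rho_max - rho_bar) * min((r - eps) / eps, 1.0)
    return psi

psi = make_schedule()
rho_small = psi(0.01)   # small performance input: the fixed rate rho_bar
rho_large = psi(1.0)    # large performance input: relaxed up to rho_max
```

Note that ψ(r) > 1 for large r, which is exactly the transient relaxation forbidden by fixed-rate PSFs.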
Thus, the schedule is automatically stabilizing for t ≥ T with uniform stability margin 1 − ρ̄ > 0, while it can be more permissive in the transient [0, T) (when ∥u_L∥ is larger), reducing conservativeness. A minimal schedule illustrating the "if ∥u_{L,t}∥ > ε then relax" strategy is

ψ(r) = { ρ̄, r ≤ ε; ρ̄ + (ρ_max − ρ̄) min((r − ε)/ε, 1), r > ε }, (7)

where r := ∥u_{L,t}∥. Note that prior stability-enhanced PSFs already allow time-varying Lyapunov-decrease rates (e.g., ζ_t in [14]), but these rates are decoupled from the external controller providing u_L and are effectively bounded by 1. Here, instead, we couple the schedule ρ_t to u_{L,t} and even permit ρ_t > 1 transiently. This allows the PB controller to trade a transient Lyapunov decrease for improved performance while guaranteeing closed-loop stability. With the proposed scheduler, at each time t we solve (6) with (6g) replaced by

J(x_{·|t}, u_{·|t}) ≤ J(x*_{·|t−1}, u*_{·|t−1}) − (1 − ρ_t) · s(x*_{0|t−1}, u*_{0|t−1}). (8a)

Let (x*_{·|t}, u*_{·|t}) be an optimizer of (6) and define the optimal certificate value J*_t := J(x*_{·|t}, u*_{·|t}). We apply the first input in receding-horizon fashion, u_t := u*_{0|t}, inducing the PSF policy u_t = κ_PSF(x_t, u_{L,t}, J*_{t−1}). To initialize the PSF at t = 0, we must set the initial value of the certificate cost J*_{−1}. This is done by using any feasible warm start (x*_{·|−1}, u*_{·|−1}) and setting J*_{−1} := J(x*_{·|−1}, u*_{·|−1}); alternatively, one can set J*_{−1} = +∞ to deactivate (8a) at the first step.

We introduce the assumptions needed to guarantee recursive feasibility and ℓ_2-stability for the PSF (6).

Assumption 1: The stage cost s : R^n × R^m → R_{≥0} is continuous with s(0, 0) = 0 and is positive definite on X × U, i.e., s(x, u) > 0 for all (x, u) ∈ (X × U) \ {(0, 0)}.
Moreover, there exist constants q_x, q_u > 0 such that s(x, u) ≥ q_x ∥x∥² + q_u ∥u∥² for all (x, u) ∈ X × U.

Assumption 2: There exist a terminal set Z_f ⊆ X, a terminal cost m : Z_f → R_{≥0}, and an auxiliary control law κ : Z_f → U such that, for all x ∈ Z_f: (i) κ(x) ∈ U and f(x, κ(x)) ∈ Z_f; (ii) m(f(x, κ(x))) − m(x) ≤ − s(x, κ(x)).

These assumptions are standard in MPC settings. Assumption 1 provides a quadratic lower bound that allows one to prove that states and inputs are in ℓ_2, and is met, e.g., by a quadratic stage cost in (5). Assumption 2 requires a pre-stabilizing controller in the terminal set Z_f. A classical way to satisfy it is to use linearized terminal ingredients (see, e.g., [19]); a simpler alternative, although more restrictive in terms of region of attraction [1], is to impose a terminal equality in place of a terminal set, i.e., Z_f = {0} and m ≡ 0.

Before stating the main theorem, we highlight its key nontrivial aspect. Standard PSF stability proofs [14], [15] rely on a uniform Lyapunov decrease, requiring the decrease rate ρ to be strictly less than 1 at all times. Our analysis is fundamentally different: the decrease rate ρ_t = ψ(∥u_{L,t}∥) is signal-dependent and explicitly allowed to be ≥ 1 transiently. This invalidates any proof based on a per-step decrease. Instead, our theorem proves ℓ_2-stability by showing that the signal property u_L ∈ ℓ_2 suffices to eventually enforce the uniform decrease rate ρ̄, ensuring closed-loop ℓ_2-stability.

Theorem 1: Suppose Assumptions 1–2 hold and let the PSF (6) use the scheduled stability constraint (8a), where ρ_t = ψ(∥u_{L,t}∥) is generated by a tightening schedule ψ satisfying Definition 2. If this problem is feasible at time t = 0, then it is feasible at all t ≥ 1. Moreover, if u_L ∈ ℓ_2, then the induced closed loop is ℓ_2-stable, i.e., x, u ∈ ℓ_2, and x_t ∈ X, u_t ∈ U for all t ≥ 0.
The proof can be found in Appendix A.

B. Behavioral characterization and strict dominance in ℓ_2

We characterize the closed-loop trajectories generated by the PSF when coupled with the PB controller. By Theorem 1, once (6) with (8a) is feasible at t = 0, the PSF remains feasible and the closed loop is ℓ_2-stable for any performance input u_L ∈ ℓ_2. Hence, the key question is how the set of ℓ_2 closed-loop trajectories realizable by our scheduled architecture compares to that of a fixed-decrease PSF (6). We prove that our architecture's set (i) contains and (ii) strictly enlarges the fixed-decrease set. We define a tightening profile as any sequence ρ = {ρ_t}_{t≥0} with ρ_t ∈ [0, ρ_max). Let P be a set of tightening profiles. The set of all ℓ_2 closed-loop trajectories induced by P is

B(P) := { (x, u) ∈ ℓ^n_2 × ℓ^m_2 | ∃ x_0 ∈ X, ∃ u_L ∈ ℓ_2, ∃ ρ ∈ P : (6) with (8a) is feasible at t = 0 and yields (x, u) }.

We compare the sets of achievable trajectories induced by the following profile sets:

P_fix(ρ̄) := { ρ : ρ_t ≡ ρ̄ }, (9)
P_sch(ψ) := { ρ : ∃ u_L ∈ ℓ_2 s.t. ρ_t = ψ(∥u_{L,t}∥) }. (10)

The overall trajectory set B is determined by the admissible input set the PSF allows at each time step, which in turn depends on the tightening profile ρ. We define this admissible input set and the induced one-step reachable set as follows.

Definition 3: At time t, given x_t, the certificate J*_{t−1} from the previous time step, and ρ ∈ [0, ρ_max), define

U_t(ρ, J) := { v ∈ U | ∃ (x_{1:N|t}, u_{1:N−1|t}) s.t. x_{0|t} = x_t, u_{0|t} = v, (6c)–(6f), J(x_{·|t}, u_{·|t}) ≤ J − (1 − ρ) s(x*_{0|t−1}, u*_{0|t−1}) },

where J := J*_{t−1} and (x*_{·|t−1}, u*_{·|t−1}) is the previous optimizer. The one-step reachable set is X⁺_t(ρ, J) := { f(x_t, v) : v ∈ U_t(ρ, J) }.

Lemma 1: Fix x_t and (x*_{·|t−1}, u*_{·|t−1}).
If ρ_1 ≤ ρ_2 and J_1 ≤ J_2, then U_t(ρ_1, J_1) ⊆ U_t(ρ_2, J_2) and X⁺_t(ρ_1, J_1) ⊆ X⁺_t(ρ_2, J_2). Moreover, if s(x*_{0|t−1}, u*_{0|t−1}) > 0 and the constraint (6g) is active at (ρ_1, J_1), then U_t(ρ_1, J_1) ⊊ U_t(ρ_2, J_2) whenever either ρ_2 > ρ_1 or J_2 > J_1.

The proof can be found in Appendix B. We now lift the per-step result of Lemma 1 to compare the full sets of achievable trajectories. We show that the set from the scheduled architecture strictly contains all trajectories achievable by the fixed-ρ architecture.

Theorem 2: Fix an initial value of the certificate function J*_{−1}. Let ρ̄ ∈ [0, 1) and let ψ be a tightening schedule. Then

B(P_fix(ρ̄)) ⊆ B(P_sch(ψ)). (11)

Moreover, suppose there exists a closed-loop trajectory under the scheduled architecture from some x_0 ∈ X and a time t̃ such that ρ^sch_{t̃} > ρ̄ and

U_{t̃}(ρ^sch_{t̃}, J*_{t̃−1}) ⊋ U_{t̃}(ρ̄, J*_{t̃−1}). (12)

Then there exists a performance input ũ_L ∈ ℓ_2 for which the scheduled PSF produces a trajectory

(x, u) ∈ B(P_sch(ψ)) \ B(P_fix(ρ̄)). (13)

The proof can be found in Appendix C. Theorem 2 shows that a fixed-rate PSF constrains the applied control action u_t to the set U_t(ρ̄, J*_{t−1}). The proposed scheduled architecture only enforces this stability constraint (ρ_t = ρ̄ < 1) after the transient (i.e., once ∥u_{L,t}∥ ≤ ε) and relaxes the bound otherwise. Consequently, whenever ψ(∥u_{L,t}∥) > ρ̄ and the Lyapunov inequality is active, the admissible input set U_t strictly enlarges. The proof of Theorem 2 shows that applying any input from this enlarged set difference at even a single time step results in a closed-loop trajectory that is unattainable by the fixed-rate architecture.

C. Parameterization of the PB controller

By Theorem 1, closed-loop ℓ_2-stability and constraint satisfaction are guaranteed provided the performance input u_L ∈ ℓ_2 for any x_0 ∈ X_s, where X_s = { x ∈ X : Problem (6) with (8a) is feasible at t = 0 }. This section describes a parametrization of the PB controller M that ensures this ℓ_2 condition on u_L by construction. We adopt the Magnitude-and-Direction (MAD) policy of [18] and parameterize the operator M_θ(x, x(0)) as

u_L = M_θ(x, x(0)) = |A(x(0))| ⊙ D(x), (14)

where ⊙ denotes the elementwise product, A ∈ L_2, and D(x) ∈ ℓ^m_∞ for all x ∈ ℓ^n. For instance, D can be induced by the bounded static map d(x_t) = tanh(NN_{θ_1}(x_t)) ∈ [−1, 1]^m. Since ∥D(x)∥_∞ ≤ 1 and A ∈ L_2, the resulting input u_L ∈ ℓ_2 by construction for all θ [18]. To parameterize the nonlinear operator A ∈ L_2, we use a Linear Recurrent Unit (LRU) [20], one of several learning architectures that guarantee A(·, θ_2) ∈ L_2 for all θ_2 ∈ R^{n_{θ_2}} [20]–[23]. Consequently, the full MAD policy guarantees u_L ∈ ℓ_2 for all θ = [θ_1^⊤, θ_2^⊤]^⊤ ∈ R^{n_{θ_1}+n_{θ_2}}.

D. Training procedure

Direct end-to-end differentiation through the PSF is challenging because the PSF is defined by a parametric optimization problem whose solution map may become non-differentiable when active sets change [16]. We therefore treat the PSF-plant interconnection as a black-box augmented system and train M_θ with an off-policy actor-critic method, e.g., DDPG [24], following [18]. Specifically, we optimize the discounted infinite-horizon performance loss with the actor μ_θ defining the learning input u_{L,t} = μ_θ(ζ_t).

Fig. 2: Angle trajectories with obstacle avoidance. Lines with varying opacity correspond to multiple initial conditions. Blue lines: the proposed approach. Orange lines: the PSF with ρ_t ≡ ρ̄ and u_L ≡ 0.
The translucent red region shows the moving obstacle in the (t, θ) plane; the dotted black lines mark the lower and upper bounds, while the dashed black line marks the unstable equilibrium (θ = 0).

Using deterministic policy gradients (DPG) [25], the actor gradient is

∇_θ L(θ) = E_{ζ_t∼ρ^μ} [ ∇_θ μ_θ(ζ_t) ∇_{u_L} Q^μ(x_t, u_L) |_{u_L = μ_θ(ζ_t)} ], (15)

where ζ_t is the input to the actor¹ and the expectation E_{ζ_t∼ρ^μ} is taken over the distribution of actor inputs ζ_t visited under the policy μ. The critic Q^μ(x, u_L) estimates the value of applying the learning input u_L when the system state is x. It is trained to minimize the error in satisfying the Bellman equation. In our noise-free setting, the PSF-plant dynamics are deterministic: a given (x, u_L) pair produces a single, unique next state x⁺ = f(x, κ_PSF(x, u_L, J*)). This simplifies the Bellman equation to

Q^μ(x, u_L) = l(x, κ_PSF(x, u_L, J*)) + γ Q^μ(x⁺, μ_θ(ζ⁺)), (16)

where ζ⁺ is the next actor input corresponding to x⁺. This deterministic form is simpler than the standard Bellman equation for stochastic systems, as it does not require an expectation over the next state x⁺ [24], [25]. Crucially, the gradient calculation in (15) does not require the derivatives of the PSF solution map κ_PSF. Instead, the gradient of the Q-function with respect to u_L (which is learned by the critic) serves as the necessary gradient information, bypassing the problematic end-to-end differentiation. During training, exploration noise is added to u_{L,t}, and the PSF filters unsafe proposals, preserving safety and stability throughout data collection.

IV. NUMERICAL EXAMPLE

In this section, we demonstrate the proposed scheme on a pendulum stabilization task chosen to highlight a setting where alternative methods can fail.
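The deterministic Bellman equation (16) turns critic training into a simple regression against a bootstrapped target; a schematic sketch with placeholder scalar values (the function names and numbers are illustrative, not the paper's implementation) is:

```python
def bellman_target(l_t, q_next, gamma=0.99):
    """Deterministic Bellman target from (16): y = l(x, u) + gamma * Q(x', mu(zeta')).
    No expectation over the next state is needed, because the PSF-plant loop
    maps a given (x, u_L) to a unique x'."""
    return l_t + gamma * q_next

def critic_td_error(q_pred, l_t, q_next, gamma=0.99):
    """Temporal-difference residual the critic is trained to drive to zero;
    its gradient w.r.t. u_L is what (15) uses instead of d(kappa_PSF)/d(u_L)."""
    return q_pred - bellman_target(l_t, q_next, gamma)

# Placeholder numbers: stage cost 1.2 at the current step, critic estimate
# 10.0 at the next state-action pair.
err = critic_td_error(q_pred=11.0, l_t=1.2, q_next=10.0, gamma=0.99)
```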
The goal is to stabilize the pendulum at its unstable upright equilibrium, which is not directly compatible with IMC-based approaches requiring a stable or pre-stabilized plant [6], [18]. The task also includes a moving obstacle, forcing a detour in the angular trajectory. As we show, this detour requires a transient increase of the Lyapunov certificate J*, which is unattainable under a fixed-rate PSF but enabled by our scheduling mechanism.

¹ The definition of ζ_t is problem-specific. For the recurrent operator in Section III-C, it includes the physical state x_t, the element of the initial-condition sequence (x(0))_t, the internal state of the recurrent network, and the previous value of the certificate J*_{t−1}.

We consider the pendulum dynamics

θ̇ = ω, ω̇ = −(g/l) sin θ − (b/(m l²)) θ̇ + (1/(m l²)) u,

where θ and ω denote the angle and angular velocity, respectively; m = 0.2 kg is the mass; l = 0.5 m is the pendulum length; and g and b = 0.02 N·m·s/rad are the gravitational acceleration and the damping coefficient. We obtain a discrete-time model using a fourth-order Runge-Kutta (RK4) step with sampling time T_s = 0.05 s. The safe sets and bounds are θ ∈ [−2.6, 2.6] rad and u ∈ [−3, 3] N·m, and the task is to stabilize the pendulum upright at the unstable equilibrium (θ = 0) while avoiding a moving obstacle with center p^obs_t that sweeps at a fixed speed v^obs = 0.2 m/s from right to left along the x axis. In Figure 2, the translucent red region shows the space-time area occupied by the moving obstacle. As in Section III, the inner loop is the PSF with horizon N = 20. For this example, we use a terminal equality constraint (x_{N|t} = 0) and define the Lyapunov certificate J in (5) using the stage cost s(x_k, u_k) = ∥x_k∥²_Q + ∥u_k∥²_R (Q ⪰ 0, R ≻ 0) and a terminal cost of zero (m ≡ 0). The stability constraint uses the tightening schedule ρ_t = ψ(∥u_{L,t}∥) with parameters ρ̄ = 0.5, ε = 0.05, and ρ_max = 10 (see Definition 2), where ψ is a smooth version of (7). The MAD operator u_L = M_θ(x, x(0)) parametrizes A ∈ L_2 using an LRU.

Since the task involves a moving obstacle, the performance objective (3) becomes the time-varying discounted loss L_tv(θ) = E_{x_0∼D} [ Σ_{t≥0} γ^t l_t(x_t, u_t) ], where the time-varying stage cost l_t is the weighted sum

l_t(x_t, u_t, u_{L,t}) = β_1 l_traj(x_t, u_t) + β_2 l_psf(u_t, u_{L,t}) + β_3 l_{obs,t}(x_t),

with positive weights β_i > 0. Here, l_traj is a quadratic tracking and control-effort penalty, l_{obs,t} penalizes collisions with the time-varying obstacle, and l_psf adds a small penalty on PSF intervention. The moving obstacle induces a time-varying stage cost, which we handle through a standard augmented-state reformulation x̃_t := [x_t^⊤, c_t^⊤]^⊤, where c_t := [p^{obs⊤}_t, v^{obs⊤}] contains the obstacle's state.

We implemented the approach in PyTorch². We compare the scheduled architecture against a fixed-rate PSF with ρ_t = ρ̄ and u_L ≡ 0. This baseline enforces a monotonic decrease in J, confining trajectories to shrinking level sets and thus forbidding transient detours. By Theorem 2, our architecture recovers all fixed-rate trajectories when u_L ≡ 0 and can strictly enlarge this set. For a fair comparison, both methods use the same certificate J, horizon N, and stability margin ρ̄. Unsafe RL baselines are omitted, as they provide no safety guarantees.

² Code to reproduce the experiments and figures, as well as an animated visualization of the task, is available at https://github.com/DecodEPFL/PSF_PB.git.

Fig. 3: Time evolution of the PSF certificate J*_t for different initial conditions. Blue lines: the proposed approach. Orange lines: the PSF with ρ_t ≡ ρ̄ and u_L ≡ 0.

Figures 2 and 3 show the angle θ_t and the Lyapunov certificate J*_t trajectories. Both methods respect the safety bounds (Fig. 2, dotted lines). However, the fixed-ρ̄ baseline (orange) fails the task, as its path would collide with the obstacle (red region). This failure is fundamental: Fig. 3 (orange) shows that J*_t is forced to decrease monotonically, trapping the system in shrinking level sets and making the detour provably impossible. Our scheduled architecture (blue) succeeds by breaking this limitation. As shown in Fig. 3 (blue), J*_t is allowed to increase transiently (ρ_t > ρ̄), providing the freedom to execute the safe detour (Fig. 2, blue). After bypassing the obstacle, u_L → 0, ρ_t tightens to ρ̄, and J*_t converges to zero, ensuring eventual stability. This behavior illustrates the per-step set enlargement of U_t(ρ, J) and the strict separation shown in Theorem 2.

V. CONCLUSIONS

This paper introduced a control architecture integrating a PB controller with a scheduled PSF to decouple safety and performance. We proved that scheduling the Lyapunov-decrease rate strictly expands the set of achievable safe trajectories, enabling transient detours, as demonstrated in a pendulum obstacle-avoidance task. The proposed actor-critic training method avoids differentiating through the PSF's optimization map. Future work includes extending the stability analysis to ℓ_p-gains, incorporating state estimation, and generalizing the approach to multi-agent coordination.

REFERENCES

[1] J. B. Rawlings, D. Q. Mayne, and M. Diehl, Model Predictive Control: Theory, Computation, and Design. Nob Hill Publishing, Madison, WI, 2020, vol. 2.
[2] D. Saccani, L. Cecchin, and L. Fagiano, "Multitrajectory model predictive control for safe UAV navigation in an unknown environment," IEEE Transactions on Control Systems Technology, vol. 31, no. 5, pp. 1982–1997, 2022.
[3] D. Saccani, L. Fagiano, M. N. Zeilinger, and A.
Carron, "Model predictive control for multi-agent systems under limited communication and time-varying network topology," in 2023 62nd IEEE Conference on Decision and Control (CDC). IEEE, 2023, pp. 3764–3769.
[4] C. Tang, B. Abbatematteo, J. Hu, R. Chandra, R. Martín-Martín, and P. Stone, "Deep reinforcement learning for robotics: A survey of real-world successes," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 27, 2025, pp. 28694–28698.
[5] S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang, and A. Knoll, "A review of safe reinforcement learning: Methods, theories and applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[6] L. Furieri, C. L. Galimberti, and G. Ferrari-Trecate, "Learning to boost the performance of stable nonlinear systems," IEEE Open Journal of Control Systems, 2024.
[7] D. Saccani, L. Massai, L. Furieri, and G. Ferrari-Trecate, "Optimal distributed control with stability guarantees by training a network of neural closed-loop maps," in 2024 IEEE 63rd Conference on Decision and Control (CDC). IEEE, 2024, pp. 3776–3781.
[8] L. Furieri, C. L. Galimberti, and G. Ferrari-Trecate, "Neural system level synthesis: Learning over all stabilizing policies for nonlinear systems," in 2022 IEEE 61st Conference on Decision and Control (CDC). IEEE, 2022, pp. 2765–2770.
[9] R. Wang and I. R. Manchester, "Youla-REN: Learning nonlinear feedback policies with robust stability guarantees," in 2022 American Control Conference (ACC). IEEE, 2022, pp. 2116–2123.
[10] C. L. Galimberti, L. Furieri, and G. Ferrari-Trecate, "Parametrizations of all stable closed-loop responses: From theory to neural network control design," Annual Reviews in Control, vol. 60, p. 101012, 2025.
[11] K. P. Wabersich, A. J. Taylor, J. J. Choi, K. Sreenath, C. J. Tomlin, A. D. Ames, and M. N.
Zeilinger, "Data-driven safety filters: Hamilton-Jacobi reachability, control barrier functions, and predictive methods for uncertain systems," IEEE Control Systems Magazine, vol. 43, no. 5, pp. 137–177, Oct. 2023.
[12] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada, "Control barrier functions: Theory and applications," in 2019 18th European Control Conference (ECC), 2019.
[13] K. P. Wabersich and M. N. Zeilinger, "A predictive safety filter for learning-based control of constrained nonlinear dynamical systems," Automatica, vol. 129, p. 109597, 2021.
[14] E. Milios, K. P. Wabersich, F. Berkel, and L. Schwenkel, "Stability mechanisms for predictive safety filters," in 2024 IEEE 63rd Conference on Decision and Control (CDC). IEEE, 2024, pp. 2409–2416.
[15] A. Didier, A. Zanelli, K. P. Wabersich, and M. N. Zeilinger, "Predictive stability filters for nonlinear dynamical systems affected by disturbances," IFAC-PapersOnLine, vol. 58, no. 18, pp. 200–207, 2024.
[16] J. A. Andersson and J. B. Rawlings, "Sensitivity analysis for nonlinear programming in CasADi," IFAC-PapersOnLine, vol. 51, no. 20, 2018.
[17] D. Bertsekas, Dynamic Programming and Optimal Control: Volume I. Athena Scientific, 2012, vol. 4.
[18] L. Furieri, S. Shenoy, D. Saccani, A. Martin, and G. Ferrari-Trecate, "MAD: A magnitude and direction policy parametrization for stability constrained reinforcement learning," in 2025 IEEE 64th Conference on Decision and Control (CDC). IEEE, 2025, pp. 942–947.
[19] H. Chen and F. Allgöwer, "A quasi-infinite horizon nonlinear model predictive control scheme with guaranteed stability," Automatica, vol. 34, no. 10, pp. 1205–1217, 1998.
[20] A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, and S. De, "Resurrecting recurrent neural networks for long sequences," in International Conference on Machine Learning. PMLR, 2023, pp. 26670–26698.
[21] M. Revay, R.
Wang, and I. R. Manchester, "Recurrent equilibrium networks: Flexible dynamic models with guaranteed stability and robustness," IEEE Transactions on Automatic Control, vol. 69, no. 5, pp. 2855–2870, 2023.
[22] L. Massai and G. Ferrari-Trecate, "Free parametrization of L2-bounded state space models," arXiv preprint, 2025.
[23] M. Zakwan and G. Ferrari-Trecate, "Neural port-Hamiltonian models for nonlinear distributed control: An unconstrained parametrization approach," arXiv preprint, 2024.
[24] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint, 2015.
[25] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in International Conference on Machine Learning. PMLR, 2014, pp. 387–395.

APPENDIX

A. Proof of Theorem 1

We start by proving recursive feasibility of the PSF. Assume that problem (6) with (8a) is feasible at time $t-1$. We define the candidate solution at time $t$ using standard shifting arguments [1]: for $i = 0, \dots, N-2$, set $\hat{x}_{i|t} = x^*_{i+1|t-1}$ and $\hat{u}_{i|t} = u^*_{i+1|t-1}$; furthermore, set $\hat{u}_{N-1|t} = \kappa(x^*_{N|t-1})$ and $\hat{x}_{N|t} = f\big(x^*_{N|t-1}, \kappa(x^*_{N|t-1})\big)$. By Assumption 2(i), the state and input constraints and the terminal set condition hold. For the certificate function, consider the candidate
\[
J(\hat{x}_{\cdot|t}, \hat{u}_{\cdot|t}) = \sum_{i=1}^{N-1} s(x^*_{i|t-1}, u^*_{i|t-1}) + s\big(x^*_{N|t-1}, \kappa(x^*_{N|t-1})\big) + m\Big(f\big(x^*_{N|t-1}, \kappa(x^*_{N|t-1})\big)\Big).
\]
Using Assumption 2(ii), we have
\[
J(\hat{x}_{\cdot|t}, \hat{u}_{\cdot|t}) \le \sum_{i=1}^{N-1} s(x^*_{i|t-1}, u^*_{i|t-1}) + m(x^*_{N|t-1}) = J(x^*_{\cdot|t-1}, u^*_{\cdot|t-1}) - s(x^*_{0|t-1}, u^*_{0|t-1}) \le J(x^*_{\cdot|t-1}, u^*_{\cdot|t-1}) - (1-\rho_t)\, s(x^*_{0|t-1}, u^*_{0|t-1}).
\]
Since $s(\cdot) \ge 0$, the inequality holds for $0 \le \rho_t < 1$ (and is trivially looser for $\rho_t \ge 1$); hence the shifted candidate is feasible at time $t$. Since the PSF (6) with (8a) is recursively feasible and the optimizer satisfies (8a), we have $J(x^*_{\cdot|t}, u^*_{\cdot|t}) \le J(\hat{x}_{\cdot|t}, \hat{u}_{\cdot|t})$, obtaining
\[
J_t \le J_{t-1} - (1-\rho_t)\, s_{t-1}, \tag{17}
\]
with $J_t = J(x^*_{\cdot|t}, u^*_{\cdot|t})$ and $s_{t-1} = s(x^*_{0|t-1}, u^*_{0|t-1})$. Because $u_L \in \ell_2$, we have $\|u_{L,t}\| \to 0$. By Definition 2(P1), there exists $T$ such that $\rho_t = \psi(\|u_{L,t}\|) = \bar\rho$ for all $t \ge T$. Define the uniform margin $\eta := 1 - \bar\rho > 0$. Summing (17) from $t = T+1$ to $k$ yields
\[
\sum_{t=T+1}^{k} (J_{t-1} - J_t) \ge \sum_{t=T+1}^{k} (1-\rho_t)\, s_{t-1}, \qquad J_T - J_k \ge \sum_{t=T+1}^{k} (1-\rho_t)\, s_{t-1} \ge \eta \sum_{t=T+1}^{k} s_{t-1}.
\]
Since $s(\cdot), m(\cdot) \ge 0$, one has $J_k \ge 0$ and hence
\[
\eta \sum_{t=T+1}^{k} s_{t-1} \le J_T - J_k \le J_T.
\]
Using Assumption 1 and letting $k \to \infty$,
\[
\eta \sum_{i=T}^{\infty} \big( q_x \|x^*_{0|i}\|^2 + q_u \|u^*_{0|i}\|^2 \big) \le J(x_{\cdot|T}, u_{\cdot|T}) < \infty,
\]
so that $\sum_{t=0}^{\infty} \|x_t\|^2 < \infty$ and $\sum_{t=0}^{\infty} \|u_t\|^2 < \infty$, i.e., $x, u \in \ell_2$. Constraint satisfaction at all times follows from recursive feasibility. ■

B. Proof of Lemma 1

The pair $(\rho, J)$ enters the definition of $\mathcal{U}_t$ only through
\[
J(x_{\cdot|t}, u_{\cdot|t}) \le J - (1-\rho)\, s(x^*_{0|t-1}, u^*_{0|t-1}).
\]
Increasing $\rho$ or $J$ relaxes the Lyapunov constraint while leaving all other constraints unchanged. Thus, any plan feasible for $(\rho_1, J_1)$ remains feasible for $(\rho_2, J_2)$, proving the inclusions. Moreover, if the Lyapunov inequality is active and $s(\cdot) > 0$, this relaxation strictly enlarges the set: $\mathcal{U}_t(\rho_1, J_1) \subsetneq \mathcal{U}_t(\rho_2, J_2)$. ■

C. Proof of Theorem 2

We first prove the inclusion (11). Any trajectory $(x, u) \in \mathcal{B}(\mathcal{P}_{\mathrm{fix}}(\bar\rho))$ can be reproduced by the scheduled architecture by setting the performance input $u_{L,t} := u_t$. Since $\rho^{\mathrm{sch}}_t = \psi(\|u_t\|) \ge \bar\rho$, Lemma 1 implies that $u_t$ remains a feasible action.
Since $u_{0|t} = u_t$ achieves zero cost for the PSF objective (6a), it is optimal. Proceeding by induction on $t$ yields (11). We now prove the strict separation (13). Pick any $v \in \mathcal{U}_{\tilde t}(\rho^{\mathrm{sch}}_{\tilde t}, J^*_{\tilde t - 1}) \setminus \mathcal{U}_{\tilde t}(\bar\rho, J^*_{\tilde t - 1})$, which exists by (12). Construct a performance input $\tilde u_L \in \ell_2$ with $\tilde u_{L,\tau} = u_\tau$ for $\tau < \tilde t$, $\tilde u_{L,\tilde t} = v$, and $\tilde u_{L,\tau} = 0$ for $\tau > \tilde t$. Up to time $\tilde t - 1$, both architectures can apply the same inputs (by (11)), so they share the same $J^*_{\tilde t - 1}$. The scheduled PSF can apply $u_{\tilde t} = v$ (its optimal action), whereas the fixed-rate PSF cannot, as $v$ is infeasible for it. The resulting trajectory therefore lies in the set difference, proving (13). ■
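The telescoping argument in the proof of Theorem 1 can be illustrated numerically. The Python sketch below iterates the worst case of the certificate recursion (17); the scheduling map `psi`, its parameters, and the shrinking stage-cost model are illustrative assumptions, not values from the paper. During the transient, $\rho_t > 1$ allows the certificate to grow (the detour freedom of Fig. 3); once the performance input vanishes, the certificate contracts with the uniform margin $\eta = 1 - \bar\rho$.

```python
# Toy sketch of the certificate recursion J_t <= J_{t-1} - (1 - rho_t) s_{t-1}
# from the proof of Theorem 1. psi, its parameters, and the stage-cost model
# are illustrative assumptions, not taken from the paper.

def psi(u_norm, rho_bar=0.2, rho_max=1.5, gain=2.0):
    """Scheduling map: psi(0) = rho_bar, nondecreasing, bounded by rho_max.
    Mirrors Definition 2(P1): rho_t returns to rho_bar once ||u_L,t|| = 0."""
    return rho_bar + min(rho_max - rho_bar, gain * u_norm)

# An l2 performance input: a nonzero transient, then identically zero,
# so rho_t = rho_bar for all t >= T with T = 4.
u_L = [1.0, 0.8, 0.5, 0.2] + [0.0] * 46

J = 10.0          # initial certificate value J_0 (illustrative)
J_hist = []
for u in u_L:
    rho_t = psi(abs(u))
    s_prev = min(1.0, J)                       # illustrative stage cost, shrinking with J
    J = max(0.0, J - (1.0 - rho_t) * s_prev)   # worst case of the recursion (17)
    J_hist.append(J)

print(round(max(J_hist[:4]), 2))   # transient growth above J_0 = 10.0 -> 11.2
print(J_hist[-1] < 1e-6)           # eventual contraction toward zero -> True
```

With these choices, the certificate rises to 11.2 while $\rho_t > 1$, then shrinks geometrically once $\rho_t = \bar\rho = 0.2$, matching the qualitative behavior of the blue curves in Fig. 3.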
