Model-Free Output Feedback Stabilization via Policy Gradient Methods
Stabilizing a dynamical system is a fundamental problem that serves as a cornerstone for many complex tasks in control systems. The problem becomes challenging when the system model is unknown. Among the Reinforcement Learning (RL) algorithms that have been successfully applied to problems involving unknown linear dynamical systems, the policy gradient (PG) method stands out due to its ease of implementation and its ability to solve the problem in a model-free manner. However, most existing works on PG methods for unknown linear dynamical systems assume full-state feedback. In this paper, we take a step towards model-free learning for partially observable linear dynamical systems with output feedback and focus on the fundamental problem of stabilizing the system. We propose an algorithmic framework that extends PG methods to this problem, which lacks global convergence guarantees. We show that by leveraging zeroth-order PG updates based on system trajectories and their convergence to stationary points, the proposed algorithms return a stabilizing output-feedback policy for discrete-time linear dynamical systems. We also explicitly characterize the sample complexity of our algorithm and verify its effectiveness using numerical examples.
💡 Research Summary
The paper tackles the problem of stabilizing discrete‑time linear systems when only output measurements are available and the system model (A, B, C) is completely unknown. While policy‑gradient (PG) methods have been successfully applied in a model‑free fashion to full‑state feedback (SF) LQR problems, extending them to static output‑feedback (SOF) is non‑trivial because the LQR cost is no longer gradient‑dominant and the set of stabilizing policies can be disconnected.
To overcome these difficulties, the authors introduce a two‑level algorithm that combines a discount‑factor mechanism with a zeroth‑order (two‑point) gradient estimator. The discount factor γ∈(0,1) “damps” the original dynamics: by defining a scaled state \(\tilde{x}_t = \gamma^{t/2} x_t\), the closed‑loop dynamics become \(\tilde{x}_{t+1} = \sqrt{\gamma}\,(A - BKC)\,\tilde{x}_t\). For any γ<1, the condition \(\sqrt{\gamma}\,\rho(A - BKC) < 1\) defines a non‑empty set \(S_\gamma\) of stabilizing SOF gains. The discounted infinite‑horizon cost
\(J_\gamma(K) = \mathbb{E}\sum_{t=0}^{\infty} \tilde{x}_t^\top \big(Q + C^\top K^\top R K C\big)\,\tilde{x}_t\)
is finite precisely when \(K \in S_\gamma\).
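As a concrete illustration of the damping mechanism, the sketch below uses a made‑up 2‑state system and gain (none of these matrices come from the paper): the gain K does not stabilize the true dynamics, yet it lies in the damped set \(S_\gamma\), and the discounted cost is finite.

```python
import numpy as np

# Hypothetical 2-state, 1-input, 1-output system (illustrative, not from the paper).
A = np.array([[1.1, 0.3],
              [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
Q = np.eye(2)
R = np.eye(1)

K = np.array([[0.5]])                    # candidate static output-feedback gain
gamma = 0.8                              # discount factor in (0, 1)

Acl = A - B @ K @ C                      # closed-loop matrix A - BKC
rho = max(abs(np.linalg.eigvals(Acl)))   # spectral radius: here > 1 (not stabilizing)
in_S_gamma = np.sqrt(gamma) * rho < 1    # membership in the damped set S_gamma

# Truncated-horizon approximation of J_gamma(K) from one initial state
# (the paper takes an expectation over random initial states).
x = np.array([1.0, 1.0])
W = Q + C.T @ K.T @ R @ K @ C
J = 0.0
for t in range(500):
    J += gamma**t * (x @ W @ x)          # gamma^t x_t' W x_t = xtilde_t' W xtilde_t
    x = Acl @ x

print(rho, bool(in_S_gamma), J)
```

Even though ρ(A − BKC) > 1, the damped condition √γ·ρ < 1 holds, so the discounted sum converges.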
Because the true gradient \(\nabla J_\gamma(K)\) cannot be computed without the model, the algorithm estimates it using a zeroth‑order two‑point method: two rollouts are performed with perturbed gains \(K \pm \delta U\) (U is a random direction, δ a small scalar), and the gradient estimate is
\(\widehat{\nabla J_\gamma}(K) = \frac{J_\gamma(K + \delta U) - J_\gamma(K - \delta U)}{2\delta}\, U.\)
The estimator is unbiased, and averaging over N rollout pairs yields a variance bound that shrinks as N grows.
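A minimal sketch of such a two‑point estimator, assuming a generic cost oracle `J(K)` in place of trajectory rollouts; the sphere‑direction sampling and the dimension factor are standard zeroth‑order choices rather than details quoted from the paper.

```python
import numpy as np

def two_point_grad_estimate(J, K, delta=0.01, n_samples=100, rng=None):
    """Two-point zeroth-order estimate of grad J at the gain K.

    J is a black-box cost oracle J(K) -> float (model-free: in the paper's
    setting it would be evaluated from system trajectories). The estimate
    averages n_samples i.i.d. directions U drawn uniformly on the unit
    sphere in the space of gain matrices."""
    rng = np.random.default_rng() if rng is None else rng
    d = K.size                                   # dimension factor (m * p)
    g = np.zeros_like(K)
    for _ in range(n_samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)                   # uniform direction on the sphere
        g += (J(K + delta * U) - J(K - delta * U)) / (2 * delta) * U
    return d * g / n_samples

# Sanity check on a known quadratic cost J(K) = ||K||_F^2, whose gradient is 2K.
J = lambda K: float(np.sum(K**2))
K = np.array([[1.0, -2.0]])
g_hat = two_point_grad_estimate(J, K, n_samples=5000,
                                rng=np.random.default_rng(0))
print(g_hat)       # should be close to 2K = [[2, -4]]
```

For a quadratic cost the two‑point estimate is exactly unbiased, so the average over many directions concentrates around the true gradient.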
The overall procedure consists of an outer loop that gradually increases the discount factor, \(\gamma_{k+1} = (1 + \zeta\alpha_k)\,\gamma_k\), and an inner loop that applies standard zeroth‑order PG updates
\(K_{j+1} = K_j - \eta\, \widehat{\nabla J_{\gamma_k}}(K_j)\)
until the estimated gradient norm falls below a prescribed threshold (2ε/3). By ensuring the estimation error is ≤ε/3, the true gradient norm is guaranteed to be ≤ε, which, together with the local smoothness and Lipschitz properties of J_γ on a sublevel set, implies that the resulting gain lies in S_γ. As γ approaches 1, S_γ converges to the set of stabilizing gains for the original (undamped) system, thus delivering a stabilizing SOF controller for the true dynamics.
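Putting the pieces together, a toy version of the two‑level scheme might look as follows. Everything here is an illustrative stand‑in: the system matrices, step size, thresholds, and discount schedule are invented, and the rollout cost uses a single fixed initial state with Q = I, R = I instead of an expectation over initial states.

```python
import numpy as np

# Hypothetical 2-state system with unstable open loop (rho(A) = 1.2);
# not the paper's example.
A = np.array([[1.2, 0.5],
              [0.0, 0.3]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])

def rho(M):
    return max(abs(np.linalg.eigvals(M)))

def J_gamma(K, gamma, T=120):
    """Truncated rollout of the discounted cost (Q = I, R = I) from a fixed x0."""
    x, cost = np.array([1.0, 1.0]), 0.0
    Acl = A - B @ K @ C
    for t in range(T):
        y = C @ x
        cost += gamma**t * (x @ x + float(y @ K.T @ K @ y))
        x = Acl @ x
    return cost

def grad_est(K, gamma, delta=0.02, N=2, rng=np.random.default_rng(0)):
    """Two-point zeroth-order gradient estimate averaged over N directions."""
    g = np.zeros_like(K)
    for _ in range(N):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)
        g += (J_gamma(K + delta * U, gamma)
              - J_gamma(K - delta * U, gamma)) / (2 * delta) * U
    return K.size * g / N

# Outer loop: grow gamma toward 1; inner loop: zeroth-order PG updates.
K, gamma, eta, eps = np.zeros((1, 1)), 0.5, 5e-4, 0.3
while True:
    for _ in range(250):                          # inner PG loop
        g = grad_est(K, gamma)
        if np.linalg.norm(g) <= 2 * eps / 3:      # stationarity threshold
            break
        K = K - eta * g
    if gamma >= 0.97:
        break
    gamma = min(0.999, 1.1 * gamma)               # gamma_{k+1} = (1 + zeta*alpha_k) gamma_k

rho_K = rho(A - B @ K @ C)
print(gamma, float(K[0, 0]), rho_K)               # final gain should lie in S_gamma
```

Starting from K = 0, which is only stabilizing for the heavily damped dynamics, the gain is pushed through the growing sets \(S_\gamma\) until it stabilizes the undamped system, mirroring the outer/inner structure described above.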
Technical contributions include:
- Local regularity analysis – Lemmas 1–4 establish that on any sublevel set \(S_\gamma(\nu) = \{K : J_\gamma(K) \le \nu\}\) the cost is locally \((L, D)\)-smooth and \((G, D)\)-Lipschitz, and the gradient norm is uniformly bounded by a constant \(G_0\). These constants are expressed explicitly in terms of system-level bounds (spectral radius, norms of B and C, bounds on Q and R).
- Strong stability link – By adapting results from the full-state case, the authors show that any K with bounded cost is \((\kappa, \varrho)\)-strongly stable, which yields explicit bounds on \(\|KC\|\) and \(\|A - BKC\|\). This property is crucial for guaranteeing that the damped closed-loop matrix remains Schur stable throughout the algorithm.
- Sample-complexity bound – The two-point estimator requires \(N = O\big((L/\epsilon)^2 \log(1/\delta)\big)\) rollouts to achieve estimation error \(\le \epsilon/3\) with probability \(1 - \delta\). The outer discount loop needs \(O(\log(1/(1-\gamma_0)))\) iterations to bring γ close enough to 1. Consequently, the total number of system trajectories scales polynomially with the state dimension n, input dimension m, output dimension p, and inversely with the desired accuracy ε.
- Algorithmic design without global convergence guarantees – Since gradient dominance does not hold for SOF, the method does not aim for global optimality. Instead, convergence to a stationary point with small gradient norm is sufficient to certify stability, a novel perspective for model-free output-feedback control.
Empirical validation is performed on several low‑dimensional linear systems (n=2–4). Starting from random unstable gains and a modest initial discount factor, the algorithm consistently finds a stabilizing SOF gain within a few hundred rollouts. Comparisons with a model‑based “identify‑then‑design” baseline show that the proposed model‑free approach attains stabilization faster and is robust to measurement noise, confirming the theoretical claims.
In summary, the paper delivers the first model‑free PG framework that can learn a stabilizing static output‑feedback controller for unknown linear systems. By cleverly integrating a discount‑based damping scheme with zeroth‑order gradient estimation, it provides explicit sample‑complexity guarantees and practical convergence to a stabilizing policy, despite the lack of gradient dominance. The work opens avenues for extending model‑free output‑feedback methods to higher‑dimensional, possibly nonlinear, and partially observed control problems.