Reinforcement Learning Method for Zero-Sum Linear-Quadratic Stochastic Differential Games in Infinite Horizons
In this work, we propose, for the first time, a reinforcement learning framework specifically designed for zero-sum linear-quadratic stochastic differential games. This approach offers a general solution for scenarios in which accurate system parameters are difficult to obtain, thereby overcoming a key limitation of traditional iterative methods that rely on complete system information. Building on the game-theoretic algebraic Riccati equations associated with the problem, we develop both semi-model-based and model-free reinforcement learning algorithms by combining an iterative solution scheme with dynamic programming principles. Notably, under appropriate rank conditions on the sampled data, the convergence of the proposed algorithms is rigorously established through theoretical analysis. Finally, numerical simulations verify the effectiveness and feasibility of the proposed method.
💡 Research Summary
This paper introduces, for the first time, a reinforcement‑learning (RL) framework tailored to zero‑sum linear‑quadratic stochastic differential games (ZSLQSDG) defined over an infinite horizon. Classical approaches to such games rely on solving the game‑theoretic algebraic Riccati equation (GT‑ARE) and therefore require full knowledge of the system matrices (drift, control, and diffusion terms). In many practical settings—such as finance, power systems, or autonomous robotics—accurate identification of these parameters, especially the diffusion matrices that couple the stochastic disturbances to both players’ controls, is infeasible. The authors bridge this gap by embedding the GT‑ARE solution process into a reinforcement‑learning scheme that can operate with partial or no model information.
The paper first formalizes the ZSLQSDG problem: a controlled stochastic differential equation with two control inputs (one per player) and a single quadratic cost that one player minimizes and the other maximizes. The Nash equilibrium of the game is known to correspond to a stabilizing solution of a coupled GT‑ARE. The authors then develop a “nested iteration” structure: an outer loop updates the feedback gains of both players, thereby redefining a transformed linear stochastic system; an inner loop performs a classic policy‑evaluation / policy‑improvement cycle on this transformed system, solving a standard Riccati equation for a temporary value matrix Z and updating the second player’s gain accordingly. This nested scheme reproduces the well‑known iterative solution of the GT‑ARE when all matrices are known.
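The inner policy‑evaluation / policy‑improvement cycle can be illustrated on the simpler deterministic LQR problem that it generalizes. The sketch below is not the paper’s algorithm (which handles the stochastic two‑player game); it shows only the classical Kleinman iteration on matrices chosen for illustration: each pass evaluates the current gain by solving a Lyapunov equation, then improves the gain, converging to the Riccati solution.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

def kleinman_iteration(A, B, Q, R, K0, iters=30):
    """Policy-evaluation / policy-improvement cycle for deterministic LQR
    (Kleinman's algorithm); K0 must be a stabilizing initial gain."""
    K = K0
    for _ in range(iters):
        Ak = A - B @ K
        # Policy evaluation: solve Ak^T P + P Ak + Q + K^T R K = 0
        P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
        # Policy improvement: K = R^{-1} B^T P
        K = np.linalg.solve(R, B.T @ P)
    return P, K

# Illustrative system (not from the paper)
A = np.array([[0.0, 1.0], [-1.0, 2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K0 = np.array([[0.0, 5.0]])  # A - B @ K0 is Hurwitz

P, K = kleinman_iteration(A, B, Q, R, K0)
P_are = solve_continuous_are(A, B, Q, R)  # P converges to this solution
```

Each outer-loop pass of the paper’s nested scheme runs an analogous cycle on the transformed stochastic system, with the Lyapunov step replaced by its stochastic counterpart.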
The novelty lies in three algorithmic variants that progressively relax the need for model knowledge:
- Fully model‑based nested iteration (Algorithm 1) serves as a baseline and provides the theoretical convergence foundation.
- On‑policy semi‑model‑based RL (Algorithm 2) assumes the drift and control matrices (A, B₁, B₂) are known, but the diffusion matrices (Cₗ, Dₗ,i) are unknown. By collecting state‑control trajectories under a stabilizing policy, the algorithm constructs linear equations (e.g., Eq. 12) that relate the unknown value matrix to observable data. The authors introduce the svec operator to exploit symmetry and reduce dimensionality, and define sampling matrices (δₓₓ, Iₓₓ, etc.) whose full rank guarantees that the least‑squares solution converges to the true GT‑ARE solution.
- Fully model‑free off‑policy RL (Algorithm 3) makes no assumption about any system matrix. It simultaneously learns the value function and both players’ feedback gains from off‑policy data, using a combination of temporal‑difference‑style updates and least‑squares estimation of the Riccati terms. The algorithm alternates between estimating the value matrix (policy evaluation) and updating the gains (policy improvement) without ever requiring explicit knowledge of A, B, C, or D.
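The role of the svec operator and the data rank condition can be sketched in isolation. The snippet below is a toy illustration, not the paper’s Eq. 12: it uses one common svec convention (off‑diagonals scaled by √2, so that svec(A)·svec(B) = tr(AB)) to recover a symmetric value matrix from observed quadratic forms by least squares; the matrices and sample sizes are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def svec(S):
    """Vectorize symmetric S (upper triangle), scaling off-diagonals
    by sqrt(2) so that svec(A) @ svec(B) == trace(A @ B)."""
    i, j = np.triu_indices(S.shape[0])
    w = np.where(i == j, 1.0, np.sqrt(2.0))
    return w * S[i, j]

def unsvec(v, n):
    """Inverse of svec: rebuild the symmetric n x n matrix."""
    S = np.zeros((n, n))
    i, j = np.triu_indices(n)
    w = np.where(i == j, 1.0, np.sqrt(2.0))
    S[i, j] = v / w
    return S + S.T - np.diag(np.diag(S))

n = 3
M = rng.standard_normal((n, n))
P_true = M @ M.T                      # unknown symmetric value matrix

# Sampled states and observed quadratic costs y = x^T P x = tr(P x x^T)
X = rng.standard_normal((20, n))
Phi = np.array([svec(np.outer(x, x)) for x in X])     # data matrix
y = np.einsum('bi,ij,bj->b', X, P_true, X)

# Rank condition: Phi must have full column rank n(n+1)/2
assert np.linalg.matrix_rank(Phi) == n * (n + 1) // 2

p_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
P_hat = unsvec(p_hat, n)             # recovers P_true
```

Exploiting symmetry this way shrinks the number of unknowns from n² to n(n+1)/2, which is what makes the full-rank sampling condition attainable with reasonable amounts of trajectory data.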
A rigorous convergence analysis is provided for all three algorithms. The authors prove that, under a rank condition on the collected data (essentially, that the sampled state‑control pairs span the appropriate subspace), the sequence of estimated matrices {P(k)} converges to the unique stabilizing solution of the GT‑ARE. This result extends existing convergence guarantees for deterministic linear‑quadratic games to the stochastic, continuous‑time, zero‑sum setting.
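The rank condition is easy to check numerically on collected data. The toy check below (with invented dimensions and Gaussian sample states, not the paper’s setup) shows why it can fail: with fewer samples than the n(n+1)/2 unknowns of a symmetric value matrix, the regression is necessarily rank deficient, while generic excitation with enough samples satisfies the condition.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
m = n * (n + 1) // 2   # unknowns in a symmetric n x n value matrix

def basis_row(x):
    # monomials x_i * x_j (i <= j) generated by one sampled state
    i, j = np.triu_indices(n)
    return x[i] * x[j]

few = np.array([basis_row(x) for x in rng.standard_normal((m - 1, n))])
enough = np.array([basis_row(x) for x in rng.standard_normal((3 * m, n))])

print(np.linalg.matrix_rank(few))     # < m: under-sampled, no unique solution
print(np.linalg.matrix_rank(enough))  # == m: rank condition satisfied
```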
Numerical experiments illustrate the practicality of the proposed methods. A two‑dimensional state system with one‑dimensional controls is simulated, and the model parameters are deliberately perturbed (up to 30 % error) to emulate real‑world uncertainty. The semi‑model‑based algorithm achieves convergence to the true Nash equilibrium with as few as 50 sample trajectories, while the fully model‑free algorithm converges within 100 trajectories, matching or surpassing the speed of classic model‑based policy iteration. Moreover, the learned policies yield a lower average cost (5–8 % improvement) compared with a naïve model‑based solution that uses the perturbed parameters.
In summary, the paper makes four key contributions: (i) it pioneers an RL‑based solution methodology for continuous‑time zero‑sum stochastic differential games; (ii) it designs both semi‑model‑based and fully model‑free algorithms that operate under realistic information constraints; (iii) it establishes rigorous convergence guarantees based on data rank conditions; and (iv) it validates the approach through simulations that demonstrate robustness to model uncertainty. These results open new avenues for applying RL to adversarial control problems, H∞ control, and other domains where stochastic dynamics and incomplete models are the norm.