We analyze independent policy-gradient (PG) learning in $N$-player linear-quadratic (LQ) stochastic differential games. Each player employs a distributed policy that depends only on its own state and updates the policy independently using the gradient of its own objective. We establish global linear convergence of these methods to an equilibrium by showing that the LQ game admits an $α$-potential structure, with $α$ determined by the degree of pairwise interaction asymmetry. For pairwise-symmetric interactions, we construct an affine distributed equilibrium by minimizing the potential function and show that independent PG methods converge globally to this equilibrium, with complexity scaling linearly in the population size and logarithmically in the inverse of the desired accuracy. For asymmetric interactions, we prove that independent projected PG algorithms converge linearly to an approximate equilibrium, with suboptimality proportional to the degree of asymmetry. Numerical experiments confirm the theoretical results across both symmetric and asymmetric interaction networks.
Can multi-agent reinforcement learning (MARL) algorithms reliably and efficiently learn Nash equilibria (NEs) in $N$-player non-cooperative stochastic differential games? These games model strategic interactions among multiple players, each controlling a stochastic differential system and optimizing an objective influenced by the actions of all players. They arise naturally in diverse fields, including autonomous driving [8], neuroscience [4], ecology [33], and optimal trading [30]. A central goal is the computation of NEs, i.e., policy profiles from which no player can improve its payoff through unilateral deviation. In many settings, such equilibria are analytically intractable, motivating growing interest in learning-based approaches that approximate NEs from data collected through repeated interactions with the environment.
Despite this promise, MARL algorithms with theoretical performance guarantees for stochastic differential games are still limited, due to three fundamental challenges inherent in multi-agent interactions. First, scalability becomes a critical issue even with a moderate number of players, as the complexity of the joint strategy space grows exponentially with the population size [18,12]. Second, each individual player faces a non-stationary environment, as other players are simultaneously learning and adapting their policies. The analysis of MARL algorithms thus necessitates novel game-theoretic techniques beyond the single-agent setting. Third, differential games typically involve continuous time and continuous state or action spaces, requiring the use of function approximation for policies. The choice of policy parameterization can significantly affect both the efficiency and convergence of MARL algorithms.
As an initial step to tackle the aforementioned challenges, this work investigates the convergence of policy gradient (PG) algorithms for $N$-player linear-quadratic (LQ) games. LQ games play a fundamental role in dynamic game theory and serve as benchmark problems for examining the performance of MARL algorithms [28,17,20,31]. Despite their relative tractability, existing work shows that gradient-based learning algorithms may fail to converge to NEs [28], or converge only under restrictive contraction conditions that do not scale well with the population size or the time horizon [17,31].
Independent learning with distributed policies. To ensure scalability, we adopt distributed (also known as decentralized) policies and focus on independent PG methods. A distributed policy means that each player bases its feedback control solely on its own state, without considering the states of other players. This approach reduces the need for inter-player communication and simplifies the policy parameterization. Independent PG algorithms further assume that each player updates its policy independently and concurrently by following the gradient of its own objective. Combining independent learning with distributed policies ensures that the computational complexity of the algorithm scales linearly with the number of players.
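For concreteness, a schematic form of the independent PG update under this parameterization is the following, where the notation is illustrative rather than taken from the paper: $\theta^i$ denotes player $i$'s policy parameter (e.g., the gain of a linear distributed feedback $u^i_t = \theta^i X^i_t$), $J^i$ its cost, and $\eta$ a step size:
\[
\theta^i_{k+1} \;=\; \theta^i_k \;-\; \eta\, \nabla_{\theta^i} J^i\big(\theta^1_k, \ldots, \theta^N_k\big), \qquad i = 1, \ldots, N .
\]
Each player thus differentiates only its own objective with respect to its own parameter, and all players update concurrently; the projected variant used later for asymmetric interactions additionally projects each iterate onto an admissible parameter set.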
Although distributed policies are widely used in MARL [18], the existence and characterization of NEs in the resulting $N$-player games have not been rigorously studied, and no theoretical guarantees currently exist for the corresponding PG methods. An exception arises in mean field games, where players are assumed to interact symmetrically and weakly through empirical measures. Under this assumption, approximate distributed NEs for $N$-player games can be constructed via a limiting continuum model as $N \to \infty$ (see, e.g., [5,24,31]). However, in many realistic settings, players are not exchangeable and instead interact only with subsets of players specified by an underlying network. Moreover, the influence of each player on others may not vanish as the population size grows [8]. This motivates the study of general interaction structures under which distributed NEs can be characterized and learned.
Our contributions. This work provides non-asymptotic performance guarantees for independent PG algorithms in approximating distributed NEs in a class of finite-horizon $N$-player LQ games, inspired by flocking models [23,14] and opinion formation models [2]. In this game, each player $i$ chooses a distributed policy to linearly control the drift of its state dynamics and minimizes a cost that is quadratic in its own control and in the states of all players, with the product of players $j$'s and $k$'s states weighted by $Q^i_{k,j}$ (see (2.1)-(2.2) for precise definitions). We analyze this LQ game and the associated learning algorithms by extending the α-potential game framework developed in [14,13,15] to closed-loop games with distributed policies. In an α-potential game, when a player unilaterally changes its strategy, the resulting change in its own objective matches the change in a common α-potential function up to an error of at most α.
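Schematically, and in notation not taken from the paper, a game with objectives $J^1, \ldots, J^N$ is an α-potential game if there exists a function $\Phi$ on policy profiles such that, for every player $i$, every profile $\pi = (\pi^i, \pi^{-i})$, and every unilateral deviation $\tilde\pi^i$,
\[
\Big| \big( J^i(\tilde\pi^i, \pi^{-i}) - J^i(\pi^i, \pi^{-i}) \big) \;-\; \big( \Phi(\tilde\pi^i, \pi^{-i}) - \Phi(\pi^i, \pi^{-i}) \big) \Big| \;\le\; \alpha ,
\]
so that α = 0 recovers a classical (closed-loop) potential game; the precise formulation with distributed policies used in this work may differ in its details.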
• We prove that the LQ game is an α-potential game, with an α-potential function and with α determined by the degree of asymmetry of the pairwise interactions among players.