Actor-Critic Algorithms for Learning Nash Equilibria in N-player General-Sum Games

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

We consider the problem of finding stationary Nash equilibria (NE) in a finite discounted general-sum stochastic game. We first generalize a non-linear optimization problem from Filar and Vrieze [2004] to an $N$-player setting and break this problem down into simpler sub-problems, each of which ensures there is no Bellman error for a given state and agent. We then provide a characterization of the solution points of these sub-problems that correspond to Nash equilibria of the underlying game; for this purpose, we derive a set of necessary and sufficient SG-SP (Stochastic Game - Sub-Problem) conditions. Using these conditions, we develop two actor-critic algorithms: OFF-SGSP (model-based) and ON-SGSP (model-free). Both algorithms use a critic that estimates the value function for a fixed policy and an actor that performs descent in the policy space using a descent direction that avoids local minima. We establish that both algorithms converge, in self-play, to the equilibria of a certain ordinary differential equation (ODE), whose stable limit points coincide with stationary NE of the underlying general-sum stochastic game. On a single-state non-generic game (see Hart and Mas-Colell [2005]) as well as on a synthetic two-player game setup with $810{,}000$ states, we establish that ON-SGSP consistently outperforms the NashQ [Hu and Wellman, 2003] and FFQ [Littman, 2001] algorithms.


💡 Research Summary

The paper tackles the problem of computing stationary Nash equilibria (NE) in finite discounted general‑sum stochastic games with an arbitrary number of agents. Building on the nonlinear programming formulation of Filar and Vrieze (2004) for two‑player games, the authors extend the approach to N‑player settings, where the constraints become nonlinear as well. To make the problem tractable they decompose the global optimization into a collection of state‑agent sub‑problems, each enforcing that the Bellman error for a particular state and agent is zero.
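To make the state-agent decomposition concrete, the sketch below computes the Bellman error for one agent at one state under a fixed joint policy, in a tabular setting. This is a minimal illustration, not the paper's formulation: the function name `bellman_error`, the array shapes, and the convention of treating joint actions as a single flat action index are all assumptions for exposition.

```python
import numpy as np

def bellman_error(v_i, pi, r_i, p, s, beta=0.9):
    """Bellman error for agent i at state s under joint policy pi.

    v_i : (S,)      value estimates for agent i
    pi  : (S, A)    probability of each (flattened) joint action per state
    r_i : (S, A)    expected reward to agent i
    p   : (S, A, S) transition kernel p(s' | s, a)
    beta: discount factor
    """
    # Action values for agent i at state s: r_i(s, a) + beta * E[v_i(s')]
    q = r_i[s] + beta * p[s] @ v_i
    # Error between the current value estimate and the policy-weighted backup;
    # the sub-problems enforce that this quantity is zero per state and agent.
    return v_i[s] - pi[s] @ q
```

A solution to the global program drives this quantity to zero simultaneously for every state and every agent, which is what the per-(state, agent) sub-problems enforce locally.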

A central contribution is the derivation of SG‑SP (Stochastic Game – Sub‑Problem) conditions. These are necessary and sufficient criteria that link the solutions of the sub‑problems to Nash equilibria of the original game: a policy profile satisfies SG‑SP if and only if it is a stationary NE. Using these conditions the authors design a special descent direction that avoids spurious local minima and guarantees convergence only to global minima of the original non‑convex program, i.e., to NE.

The algorithmic framework follows an actor‑critic architecture. The critic evaluates the value function for a fixed joint policy. In the model‑based variant (OFF‑SGSP) the critic performs exact policy evaluation via dynamic programming (value iteration) because the transition kernel is assumed known. In the model‑free variant (ON‑SGSP) the critic uses temporal‑difference (TD) learning to estimate values from sampled trajectories. The actor updates the joint policy using the SG‑SP‑derived descent direction. Both updates run on two different time‑scales (fast critic, slow actor) following the multi‑timescale stochastic approximation methodology.
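The two-timescale coupling above can be sketched roughly as follows. This is a schematic of the multi-timescale pattern only, under stated assumptions: the step-size exponents, the simplex projection, and the `descent_dir` placeholder are illustrative choices, and the paper's actual SG-SP descent direction is not reproduced here.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of x onto the probability simplex."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(x) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(x - theta, 0.0)

def two_timescale_step(v, pi, s, r, s_next, n, beta=0.9, descent_dir=None):
    """One schematic update: fast TD(0) critic, slow projected actor.

    Step sizes b_n >> c_n so the critic tracks the value of the current
    policy while the policy moves slowly (multi-timescale stochastic
    approximation). descent_dir stands in for the SG-SP descent direction.
    """
    b_n = 1.0 / n ** 0.6            # fast (critic) step size -- illustrative
    c_n = 1.0 / n                   # slow (actor) step size  -- illustrative
    delta = r + beta * v[s_next] - v[s]     # TD(0) error from the sample
    v[s] += b_n * delta                     # critic update
    if descent_dir is not None:
        # Actor: move along the descent direction, stay on the simplex.
        pi[s] = project_simplex(pi[s] - c_n * descent_dir)
    return v, pi
```

The separation of step sizes (here `n**-0.6` vs. `n**-1`) is what lets the coupled recursion be analyzed as two ODEs, with the critic effectively equilibrated at each actor step.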

Convergence is proved by first showing, via the Kushner‑Clark lemma, that the fast and slow recursions track the limit points of two coupled ordinary differential equations (ODEs). The authors then analyze the policy ODE, providing a simplified representation of its limiting set and proving that every asymptotically stable equilibrium of this ODE satisfies the SG‑SP conditions, and therefore corresponds to a stationary NE of the original game. Consequently, both OFF‑SGSP and ON‑SGSP converge (in self‑play) to NE.

Empirical evaluation is conducted on two benchmarks. The first is a single‑state non‑generic game from Hart and Mas‑Colell (2005) that possesses a pure and a mixed NE. ON‑SGSP consistently converges to an NE, whereas NashQ (Hu & Wellman, 2003) and FFQ (Littman, 2001) frequently fail to converge. The second benchmark is a synthetic “stick‑together” game with 810,000 states and two agents. ON‑SGSP converges in roughly 21 iterations per state, dramatically outperforming NashQ and FFQ in both speed and reliability.

Compared with prior work, the proposed methods have several advantages: (1) they provide a theoretical guarantee of convergence to a global NE without requiring per‑iteration solution of a bimatrix game or a linear program, which is a bottleneck for NashQ, FFQ, and many homotopy‑based approaches; (2) the per‑iteration computational complexity of ON‑SGSP scales linearly with the number of agents, making it suitable for large‑scale multi‑agent systems; (3) the framework applies to general‑sum stochastic games with multiple states and discounted infinite horizons, whereas many earlier gradient‑based methods are limited to repeated (single‑state) games.

In summary, the paper introduces a novel SG‑SP‑based decomposition of the NE computation problem, derives a principled descent direction, and implements two actor‑critic algorithms—one model‑based, one model‑free—that are provably convergent to stationary Nash equilibria in N‑player general‑sum stochastic games. The theoretical analysis, together with extensive experiments on both small and massive state spaces, demonstrates that ON‑SGSP offers a scalable and reliable alternative to existing multi‑agent reinforcement learning methods for equilibrium learning. Future directions include extensions to partially observable settings, non‑stationary opponents, and continuous state‑action spaces.

