Equilibrium Selection for Multi-agent Reinforcement Learning: A Unified Framework
While multi-agent reinforcement learning (MARL) has produced numerous algorithms that converge to Nash or related equilibria, such equilibria are often non-unique and can exhibit widely varying efficiency. This raises a fundamental question: how can one design learning dynamics that not only converge to equilibrium but also select equilibria with desirable performance, such as high social welfare? In contrast to the MARL literature, equilibrium selection has been extensively studied in normal-form games, where decentralized dynamics are known to converge to potential-maximizing or Pareto-optimal Nash equilibria (NEs). Motivated by these results, we study equilibrium selection in finite-horizon stochastic games. We propose a unified actor-critic framework in which a critic learns state-action value functions, and an actor applies a classical equilibrium-selection rule state-wise, treating learned values as stage-game payoffs. We show that, under standard stochastic stability assumptions, the stochastically stable policies of the resulting dynamics inherit the equilibrium-selection properties of the underlying normal-form learning rule. As consequences, we obtain potential-maximizing policies in Markov potential games and Pareto-optimal (Markov perfect) equilibria in general-sum stochastic games, together with a sample-based implementation of the framework.
💡 Research Summary
The paper tackles a central issue in multi‑agent reinforcement learning (MARL): while many algorithms guarantee convergence to a Nash equilibrium (NE) or related solution concepts, the equilibria are often non‑unique and can differ dramatically in terms of social welfare. In contrast, the literature on normal‑form games has long studied equilibrium selection via stochastic stability: when agents occasionally make mistakes (parameterized by a small noise level ε), only a subset of equilibria remains robust in the long run, known as stochastically stable equilibria (SSE). The authors bridge these two strands by proposing a unified actor‑critic framework that lifts equilibrium‑selection dynamics from normal‑form games to finite‑horizon stochastic games (SGs).
Key ingredients of the framework are: (1) a critic that learns state‑action value functions Qᵢ,ₕ(s,a) for each agent i, stage h, and state s under the current joint policy; (2) an actor that, at every state‑stage pair, treats the learned Q‑values as the payoff matrix of a normal‑form game and applies a pre‑specified stochastic‑stability‑based learning rule (e.g., log‑linear learning, the content‑discontent dynamics of
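To make the actor step concrete, the sketch below runs log-linear learning on a single stage game whose payoff matrices stand in for learned Q-values at one fixed state-stage pair. This is an illustrative toy, not the paper's implementation: the two-agent coordination game, the noise level, and the step counts are all assumptions chosen so the potential-maximizing equilibrium (1, 1) is where the long-run distribution should concentrate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-agent, 2-action stage game: payoffs[i][a0, a1] is agent i's
# payoff, standing in for a learned Q-value Q_i,h(s, a) at a fixed (s, h).
# Identical-interest coordination game with two pure NEs; (1, 1) maximizes
# the potential.
payoffs = [
    np.array([[1.0, 0.0], [0.0, 2.0]]),  # agent 0
    np.array([[1.0, 0.0], [0.0, 2.0]]),  # agent 1
]

def log_linear_step(a, eps):
    """One step of log-linear learning: a uniformly chosen agent revises its
    action with a softmax (Gibbs) response at noise level eps, holding the
    other agent's action fixed."""
    i = rng.integers(2)
    other = a[1 - i]
    # Payoff of each candidate own-action against the opponent's current one.
    u = payoffs[0][:, other] if i == 0 else payoffs[1][other, :]
    p = np.exp(u / eps)
    p /= p.sum()
    a[i] = rng.choice(2, p=p)
    return a

# Empirical long-run distribution: for small eps, mass should concentrate on
# the potential-maximizing joint action (1, 1).
a = [0, 0]
counts = np.zeros((2, 2))
for t in range(100_000):
    a = log_linear_step(a, eps=0.25)
    if t > 2_000:  # discard burn-in
        counts[a[0], a[1]] += 1
freq = counts / counts.sum()
print(freq)
```

In the full framework, this revision step would be applied state-wise, with the payoff matrices replaced by the critic's current Q-estimates and the policy at each state read off from the resulting stochastically stable joint actions.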