Convex Markov Games and Beyond: New Proof of Existence, Characterization and Learning Algorithms for Nash Equilibria

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Convex Markov Games (cMGs) were recently introduced as a broad class of multi-agent learning problems that generalize Markov games to settings where strategic agents optimize general utilities beyond additive rewards. While cMGs expand the modeling frontier, their theoretical foundations, particularly the structure of Nash equilibria (NE) and guarantees for learning algorithms, are not yet well understood. In this work, we address these gaps for an extension of cMGs, which we term General Utility Markov Games (GUMGs), capturing new applications requiring coupling between agents’ occupancy measures. We prove that in GUMGs, Nash equilibria coincide with the fixed points of projected pseudo-gradient dynamics (i.e., first-order stationary points), enabled by a novel agent-wise gradient domination property. This insight also yields a simple proof of NE existence using Brouwer’s fixed-point theorem. We further show the existence of Markov perfect equilibria. Building on this characterization, we establish a policy gradient theorem for GUMGs and design a model-free policy gradient algorithm. For potential GUMGs, we establish iteration complexity guarantees for computing approximate-NE under exact gradients and provide sample complexity bounds in both the generative model and on-policy settings. Our results extend beyond prior work restricted to zero-sum cMGs, providing the first theoretical analysis of common-interest cMGs.


💡 Research Summary

This paper introduces General Utility Markov Games (GUMGs), a broad extension of the recently proposed Convex Markov Games (cMGs). While cMGs allow each agent to optimize a convex functional of its own occupancy measure, they forbid direct coupling between agents’ occupancy measures. GUMGs overcome this limitation by defining each agent’s utility F_i as a jointly concave function of the full vector of marginal occupancy measures (λ_1,…,λ_N) together with the other agents’ policies. This formulation subsumes standard Markov games, single‑agent convex RL, and cMGs, and captures new applications such as imitation‑consensus trade‑offs, team‑level aggregates, and diversity penalties.
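To make the notion of an occupancy measure concrete, here is a minimal single-agent sketch, assuming a small tabular environment in which the other agents' policies are held fixed and absorbed into the transition kernel. The function names, the toy kernel, and the choice of state entropy as the concave utility are illustrative assumptions, not the paper's notation or examples.

```python
import numpy as np

def occupancy_measure(P, pi, mu, gamma):
    """Normalized discounted state-action occupancy measure.

    P:  (S, A, S) array, P[s, a, t] = probability of moving from s to t under a
    pi: (S, A) array, pi[s, a] = probability of action a in state s
    mu: (S,) initial state distribution
    """
    S = P.shape[0]
    # State-to-state kernel induced by pi: P_pi[s, t] = sum_a pi[s, a] * P[s, a, t]
    P_pi = np.einsum("sa,sat->st", pi, P)
    # State occupancy d solves d = (1 - gamma) * mu + gamma * P_pi^T d
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1.0 - gamma) * mu)
    # lambda[s, a] = d[s] * pi[s, a]
    return d[:, None] * pi

def state_entropy(lam):
    """A concave (non-additive) utility: entropy of the state marginal of lambda."""
    d = lam.sum(axis=1)
    d = d[d > 0]
    return float(-(d * np.log(d)).sum())

# Hypothetical 2-state, 2-action environment: action 0 stays, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
pi = np.full((2, 2), 0.5)      # uniform policy
mu = np.array([1.0, 0.0])      # start in state 0
lam = occupancy_measure(P, pi, mu, gamma=0.9)
print(lam, state_entropy(lam))
```

An additive-reward objective corresponds to a utility that is linear in `lam`; the entropy above is one example of a utility that is not.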

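The core technical contribution is the “agent-wise gradient domination” property. The authors prove that for any policy profile π, if π_i is first-order stationary for agent i's utility u_i given π_{−i} (no feasible direction within agent i's policy set increases u_i to first order; for an interior policy this reduces to ∇_{π_i} u_i(π)=0), then π_i is a best response to π_{−i}. Consequently, profiles that are first-order stationary for every agent coincide exactly with Nash equilibria (NE). Using this, they define a projected pseudo-gradient map T(π)=Π_Δ(π+α g(π)), where g(π)=(∇_{π_1} u_1(π),…,∇_{π_N} u_N(π)) stacks each agent's gradient of its own utility, α>0 is a step size, and Π_Δ is the Euclidean projection onto the product of policy simplices. The fixed points of T are exactly the first-order stationary profiles. Because T is continuous and maps this compact convex set into itself, Brouwer’s fixed-point theorem yields a simple existence proof for NE in GUMGs, avoiding the more intricate Kakutani or Debreu arguments previously required for cMGs.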
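The correspondence between fixed points of the projected pseudo-gradient map and Nash equilibria can be checked numerically in the simplest possible case. The sketch below assumes a two-player, single-state game with bilinear payoffs, where the correspondence is classical and needs no gradient domination argument; it illustrates only the map T, written as an ascent step because agents maximize utilities. All function names are hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = k[u - css / k > 0][-1]
    tau = css[rho - 1] / rho
    return np.maximum(v - tau, 0.0)

def pseudo_gradient_map(x, y, A, B, alpha):
    """One application of T. Player 1 receives x^T A y, player 2 receives x^T B y."""
    gx = A @ y       # gradient of player 1's payoff w.r.t. its own strategy x
    gy = B.T @ x     # gradient of player 2's payoff w.r.t. its own strategy y
    return project_simplex(x + alpha * gx), project_simplex(y + alpha * gy)

def is_fixed_point(x, y, A, B, alpha=0.5, tol=1e-9):
    tx, ty = pseudo_gradient_map(x, y, A, B, alpha)
    return bool(np.allclose(tx, x, atol=tol) and np.allclose(ty, y, atol=tol))

# Coordination game: both players are paid 1 if they pick the same action.
A = B = np.eye(2)
e0, e1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

print(is_fixed_point(e0, e0, A, B))  # both play action 0: a Nash equilibrium
print(is_fixed_point(e0, e1, A, B))  # miscoordinated profile: not an equilibrium
```

In a Markov game each agent's utility is generally nonconcave in its own policy, which is why the general result needs the gradient domination property rather than this direct argument.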

Beyond existence, the paper establishes the existence of Markov‑perfect equilibria (MPE). By demonstrating that NE do not depend on the full‑support initial state distribution, the authors take a limit where the initial distribution concentrates on a single state, and, using continuity of utilities, they obtain an equilibrium that is state‑wise optimal—i.e., an MPE.

A policy-gradient theorem is derived by exploiting the dynamic programming structure of occupancy measures. The gradient of each agent’s utility with respect to its policy can be expressed in terms of the marginal occupancy, transition kernel, and discount factor. Leveraging this, the authors propose a model-free policy-gradient algorithm (Algorithm 1) that performs projected gradient ascent using only on-policy trajectory samples. Because of the gradient domination property, any fixed point of the exact projected-gradient iteration is a Nash equilibrium; convergence guarantees are established for the potential setting.
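For orientation, the following is the standard chain-rule form of the policy gradient for a general utility in the single-agent case with a directly parameterized tabular policy, assuming the other agents' policies are held fixed and absorbed into the dynamics. The symbols r_λ, Q, and d are notation introduced here for illustration; the paper's theorem for coupled occupancy measures may take a different form.

```latex
% Assumed setup: \lambda^{\pi}(s,a) = (1-\gamma) \sum_{t \ge 0} \gamma^{t}
%   \Pr(s_t = s,\, a_t = a \mid \mu, \pi) is the normalized occupancy measure,
%   and F is a differentiable utility of \lambda.
\begin{align*}
r_{\lambda}(s,a) &:= \frac{\partial F}{\partial \lambda(s,a)}\bigg|_{\lambda = \lambda^{\pi}}
  && \text{(pseudo-reward, held fixed at the current } \lambda^{\pi}\text{)} \\
Q^{\pi}_{r_\lambda}(s,a) &:= \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, r_{\lambda}(s_t, a_t)
  \,\Big|\, s_0 = s,\ a_0 = a,\ \pi\Big] \\
\frac{\partial F(\lambda^{\pi})}{\partial \pi(a \mid s)} &= d^{\pi}(s)\, Q^{\pi}_{r_\lambda}(s,a),
  \qquad d^{\pi}(s) = \sum_{a'} \lambda^{\pi}(s,a')
\end{align*}
```

When F is linear, F(λ) = ⟨λ, r⟩, the pseudo-reward is r itself and the expression reduces to the classical policy gradient for additive rewards.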

The paper then focuses on potential GUMGs, where a single potential function Φ(λ) tracks every agent's gain from a unilateral deviation; common-interest games, in which all agents share one utility, are the canonical example. In this setting, with exact gradients the algorithm finds an ε-approximate NE in O(ε^{-2}) iterations. When the transition model is unknown, two sampling regimes are analyzed. In a generative-model setting, minibatch trajectory sampling yields a total sample complexity of O(ε^{-4}). In an on-policy setting, stochastic projected gradient ascent with minibatches achieves O(ε^{-5}) samples. These bounds improve upon prior work on zero-sum cMGs (which required O(ε^{-6}) samples) and constitute the first sample-efficient learning guarantees for common-interest cMGs.
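The exact-gradient regime can be illustrated with a toy identical-interest game, which is a potential game whose potential is the shared payoff. The sketch below assumes a single state and a bilinear payoff x^T A y, so it is not a full GUMG and is not the paper's Algorithm 1. It shows only the shape of the iteration: simultaneous projected gradient ascent, stopped once no player can gain more than ε by deviating unilaterally. The function names, step size, and starting point are arbitrary choices.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = k[u - css / k > 0][-1]
    tau = css[rho - 1] / rho
    return np.maximum(v - tau, 0.0)

def nash_gap(x, y, A):
    """Largest gain available to either player from a unilateral deviation."""
    value = x @ A @ y
    return max((A @ y).max() - value, (A.T @ x).max() - value)

def projected_gradient_ascent(A, x, y, alpha=0.1, eps=1e-6, max_iters=10_000):
    """Simultaneous projected gradient ascent on the shared payoff x^T A y."""
    for t in range(max_iters):
        if nash_gap(x, y, A) <= eps:
            return x, y, t
        gx, gy = A @ y, A.T @ x   # each player's gradient w.r.t. its own strategy
        x = project_simplex(x + alpha * gx)
        y = project_simplex(y + alpha * gy)
    return x, y, max_iters

A = np.array([[2.0, 0.0], [0.0, 1.0]])   # both players receive x^T A y
x0 = y0 = np.array([0.6, 0.4])
x, y, iters = projected_gradient_ascent(A, x0, y0)
print(x, y, iters, nash_gap(x, y, A))
```

From this starting point the iterates move toward the profile in which both players choose the first action. In the sample-based regimes the exact gradients `gx` and `gy` would be replaced by minibatch estimates.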

Related work is discussed extensively: prior cMG literature proved NE existence via Debreu’s theorem but offered no structural insight; recent work handled only zero‑sum two‑player cMGs. This paper generalizes to N‑player general‑sum games, provides a clear NE characterization, and delivers concrete learning algorithms with provable iteration and sample complexities.

In summary, the authors present a unified theoretical framework for multi‑agent reinforcement learning with general, possibly non‑additive utilities. They introduce a novel gradient domination argument that links first‑order optimality to Nash equilibrium, use Brouwer’s theorem for a clean existence proof, develop a policy‑gradient theorem and a practical model‑free algorithm, and deliver the first rigorous sample‑complexity results for potential (common‑interest) GUMGs. The work significantly expands the theoretical foundations of convex Markov games and opens avenues for future research on function approximation, offline learning, and hybrid cooperative‑adversarial settings.

