Convergence and Connectivity: Dynamics of Multi-Agent Q-Learning in Random Networks
Beyond specific settings, many multi-agent learning algorithms fail to converge to an equilibrium solution, instead displaying complex, non-stationary behaviours such as recurrent or chaotic orbits. In fact, recent literature suggests that such complex behaviours become more likely as the number of agents increases. In this paper, we study Q-learning dynamics in network polymatrix normal-form games where the network structure is drawn from classical random graph models. In particular, we focus on the Erdős–Rényi model, which is widely used to analyze connectivity in distributed systems, and the Stochastic Block model, which generalizes the Erdős–Rényi model by accounting for community structures that naturally arise in multi-agent systems. In each setting, we establish sufficient conditions under which the agents’ joint strategies converge to a unique equilibrium. We investigate how these conditions depend on the exploration rates, the payoff matrices and, crucially, the probabilities of interaction between network agents. We validate our theoretical findings through numerical simulations and demonstrate that convergence can be reliably achieved in many-agent systems, provided interactions in the network are controlled.
💡 Research Summary
The paper investigates the convergence properties of multi‑agent Q‑learning when agents interact through a networked polymatrix game whose underlying graph is drawn from classical random‑graph models. The authors focus on two widely used models: the Erdős–Rényi (ER) graph and the Stochastic Block Model (SBM). In both settings the game is assumed to be “identical‑interest”: every edge hosts the same bimatrix game, which allows the authors to define an “intensity of identical interests” δ_I as the maximum spectral norm of the sum of the two payoff matrices on any edge.
The continuous‑time limit of multi‑agent Q‑learning, called Q‑Learning Dynamics (QLD), is derived from the standard Q‑learning update together with a Boltzmann (soft‑max) action selection rule. QLD can be written as a perturbed replicator equation with an entropy term weighted by each agent’s exploration temperature T_k. Leonardos et al. (2024) showed that the fixed points of QLD coincide with the Quantal Response Equilibria (QRE) of the underlying game; as T_k → 0 the QRE converges to a Nash equilibrium.
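The perturbed-replicator form of QLD can be sketched numerically. The snippet below is a minimal Euler integration for a single edge, assuming the standard Tuyls-style form ẋᵢ = α·xᵢ·[(Ay)ᵢ − xᵀAy] + α·T·xᵢ·[−ln xᵢ + Σⱼ xⱼ ln xⱼ]; the Matching Pennies payoffs, initial conditions, and all parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def qld_rhs(x, payoff_vec, T, alpha):
    """Right-hand side of the Q-learning dynamics for one agent:
    a replicator term plus a temperature-weighted entropy term."""
    fitness = payoff_vec - x @ payoff_vec   # payoff relative to the mixed-strategy average
    entropy = -np.log(x) + x @ np.log(x)    # exploration drive toward the uniform strategy
    return alpha * x * (fitness + T * entropy)

# Two agents on a single edge playing Matching Pennies (illustrative example)
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
B = -A
x = np.array([0.9, 0.1])
y = np.array([0.2, 0.8])
T, alpha, dt = 1.0, 0.1, 0.01

for _ in range(20000):
    x = x + dt * qld_rhs(x, A @ y, T, alpha)
    y = y + dt * qld_rhs(y, B.T @ x, T, alpha)
    x /= x.sum(); y /= y.sum()              # guard against numerical drift off the simplex

print(x, y)  # both strategies approach the uniform QRE [0.5, 0.5]
```

Matching Pennies is zero-sum, so the sum of the two payoff matrices vanishes and δ_I = 0; by the paper's condition this example should converge to the interior QRE for any positive temperature, which the simulation reflects.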
The core theoretical contribution is a sufficient condition for global convergence of QLD to the unique QRE. Lemma 1 states that if all agents share the same exploration temperature T, then convergence is guaranteed whenever
T > (δ_I · ρ(G)) / (2 · min_k α_k),
where ρ(G) is the spectral radius (largest eigenvalue) of the adjacency matrix of the interaction graph and α_k is agent k’s learning rate. This inequality links three key ingredients: (i) the game’s payoff structure (δ_I), (ii) the agents’ exploration–exploitation balance (T), and (iii) the connectivity of the network (ρ(G)).
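The threshold is straightforward to evaluate numerically. The sketch below assumes δ_I = ‖A + Bᵀ‖₂ (the precise transpose convention for the "sum of the two payoff matrices" is an assumption here) and uses a hypothetical 2×2 coordination game played on a 4-cycle; the function name and parameter values are illustrative.

```python
import numpy as np

def convergence_temperature(A, B, adjacency, alphas):
    """Evaluate the sufficient-temperature threshold
        T* = delta_I * rho(G) / (2 * min_k alpha_k)
    from the convergence condition, with delta_I taken as the
    spectral norm of A + B^T (an assumed convention)."""
    delta_I = np.linalg.norm(A + B.T, ord=2)            # intensity of identical interests
    rho = np.max(np.abs(np.linalg.eigvals(adjacency)))  # spectral radius of the graph
    return delta_I * rho / (2 * np.min(alphas))

# Hypothetical coordination game on a 4-cycle, all learning rates 0.1
A = np.array([[2.0, 0.0], [0.0, 1.0]])
B = A.copy()
G = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
T_star = convergence_temperature(A, B, G, alphas=[0.1] * 4)
print(T_star)  # -> 40.0 : delta_I = 4, rho = 2, min alpha = 0.1
```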
To translate this condition into probabilistic statements for random graphs, the authors use known spectral bounds. For an ER graph G(N,p) the spectral radius satisfies ρ(G) ≈ Np (with high probability), i.e., it scales with the expected degree d̄ = (N−1)p. Consequently, the required exploration temperature scales linearly with d̄. For the SBM, the spectral radius is bounded by the larger of the intra‑community and inter‑community expected degrees, d_in and d_out, which are functions of the block probabilities p_in and p_out and the community sizes. In both models the authors prove that, as N grows, the probability that the QLD converges to the unique QRE tends to one, provided the expected degree remains bounded below a threshold that depends on δ_I and the learning rates.
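The ER spectral bound is easy to check empirically. The sketch below (our own illustrative code, not the authors') samples a symmetric G(N, p) adjacency matrix and compares its spectral radius to Np:

```python
import numpy as np

rng = np.random.default_rng(0)

def er_adjacency(N, p, rng):
    """Sample a symmetric Erdős–Rényi G(N, p) adjacency matrix (no self-loops)."""
    upper = np.triu(rng.random((N, N)) < p, k=1)  # independent coin flips above the diagonal
    return (upper + upper.T).astype(float)        # mirror to make the matrix symmetric

# For a moderately large dense-enough graph, rho(G) concentrates near Np
N, p = 500, 0.05
A = er_adjacency(N, p, rng)
rho = np.max(np.abs(np.linalg.eigvalsh(A)))
print(rho, N * p)  # the two values should agree up to small sampling fluctuations
```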
The empirical section validates the theory. Simulations involve up to 200 agents arranged in 5–10 communities, with p_in = 0.1 and p_out = 0.01 (low‑density SBM). Exploration temperatures are varied between 0.05 and 0.2, and learning rates are fixed at 0.1. When the average degree d̄ ≤ 5, convergence to the unique QRE occurs in more than 95% of runs, even with relatively low exploration. As d̄ increases beyond ≈ 15, convergence rates drop sharply, and chaotic orbits appear, especially for low T. These observations corroborate earlier findings that non‑convergent dynamics become more prevalent as the number of agents grows, but they also demonstrate that controlling network sparsity can restore convergence.
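A network in the experimental regime described above can be sketched as follows. This is our own illustrative SBM sampler (not the authors' code), instantiated with 5 equal communities of 40 agents and the reported p_in = 0.1, p_out = 0.01; it reports the realized average degree and spectral radius, the two quantities the convergence condition depends on.

```python
import numpy as np

rng = np.random.default_rng(1)

def sbm_adjacency(sizes, p_in, p_out, rng):
    """Sample a symmetric stochastic block model adjacency matrix."""
    N = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)          # community label per node
    probs = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.random((N, N)) < probs, k=1)          # edge coin flips above diagonal
    return (upper + upper.T).astype(float)

# Parameters matching the low-density SBM experiments described above
sizes = [40] * 5                                              # 200 agents, 5 equal communities
A = sbm_adjacency(sizes, p_in=0.1, p_out=0.01, rng=rng)
avg_degree = A.sum() / A.shape[0]
rho = np.max(np.abs(np.linalg.eigvalsh(A)))
print(avg_degree, rho)
```

Note that ρ(G) always upper-bounds the average degree, so keeping the spectral radius small (the quantity in the convergence condition) also keeps the network sparse.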
To test robustness beyond the homogeneous‑edge assumption, the authors also simulate the “Conflict Network” game where each edge carries a different payoff matrix. The same qualitative dependence on average degree and exploration temperature persists, suggesting that the derived condition captures a fundamental mechanism rather than an artifact of the identical‑interest assumption.
In summary, the paper makes three major contributions: (1) it establishes a clear, analytically tractable link between Q‑learning convergence and the spectral properties of random interaction graphs; (2) it shows that low‑degree random networks (both ER and SBM) enable convergence even with modest exploration rates, thereby offering a design principle for large‑scale multi‑agent systems such as robotic swarms or sensor networks; and (3) it provides extensive simulations that confirm the theoretical predictions and demonstrate that the results extend to more heterogeneous payoff structures. The work opens several avenues for future research, including heterogeneous exploration rates, time‑varying graphs, and extensions to continuous‑action or partially observable settings.