Deep Meta Coordination Graphs for Multi-agent Reinforcement Learning
This paper presents deep meta coordination graphs (DMCG) for learning cooperative policies in multi-agent reinforcement learning (MARL). Coordination graph formulations encode local interactions and accordingly factorize the joint value function of all agents to improve efficiency in MARL. Through DMCG, we dynamically compose what we refer to as meta coordination graphs to learn a more expressive representation of agent interactions, and use them to integrate agent information through graph convolutional networks. The goal is to enable an evolving coordination graph to guide effective coordination in cooperative MARL tasks. The graphs are jointly optimized with agents’ value functions to learn to implicitly reason about joint actions, facilitating end-to-end learning of interaction representations and coordinated policies. We demonstrate that DMCG consistently achieves state-of-the-art coordination performance and sample efficiency on challenging cooperative tasks, outperforming several prior graph-based and non-graph-based MARL baselines. Through several ablations, we also isolate the impact of individual components in DMCG, showing that the observed improvements are due to deliberate design choices in this approach. We also include an analysis of its computational complexity to discuss its practicality in real-world applications. All code can be found here: https://github.com/Nikunj-Gupta/dmcg-marl
💡 Research Summary
The paper introduces Deep Meta Coordination Graphs (DMCG), a novel framework for cooperative multi‑agent reinforcement learning (MARL) that learns both the interaction structure among agents and the policies that exploit this structure in an end‑to‑end fashion. Traditional coordination‑graph (CG) approaches factor the global Q‑function into per‑agent utilities and pairwise terms, but they typically rely on a fixed topology or a single soft adjacency matrix, limiting their ability to capture multiple, time‑varying interaction types (e.g., physical proximity, implicit signaling, strategic influence).
DMCG addresses this limitation by maintaining a set of K base relation graphs, each initialized as a complete graph and representing a distinct latent interaction type. For each of L composition layers and C parallel channels, the model learns attention weights α_k^{(l,c)} that softly mix the K base graphs. The mixed adjacency for channel c at layer l is A^{(l,c)} = ∑_k α_k^{(l,c)} A_k, and the final meta‑coordination graph for channel c is obtained by multiplying the mixed adjacencies across the L layers: A_M^{(c)} = ∏_{l=1}^{L} A^{(l,c)}. This construction is fully differentiable, allowing the graph structure to be optimized jointly with the agents’ value networks.
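The composition step above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the per-layer logit parameterization and the identity initialization of the running product are assumptions, and in practice the attention weights would be produced by trainable modules.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def compose_meta_graph(base_graphs, logits):
    """Compose one channel's meta-coordination graph.

    base_graphs: (K, n, n) array of K latent relation graphs.
    logits: (L, K) unnormalized attention scores, one row per
            composition layer (a hypothetical parameterization).
    Returns the (n, n) meta-adjacency A_M = prod_l (sum_k alpha_k A_k).
    """
    K, n, _ = base_graphs.shape
    A_M = np.eye(n)                              # start from the identity
    for layer_logits in logits:
        alpha = softmax(layer_logits)            # attention over base graphs
        A_layer = np.tensordot(alpha, base_graphs, axes=1)  # sum_k alpha_k A_k
        A_M = A_M @ A_layer                      # sequential composition
    return A_M

# Toy example: 2 base graphs over 3 agents, 2 composition layers.
rng = np.random.default_rng(0)
base = np.stack([np.ones((3, 3)), np.eye(3)])
logits = rng.standard_normal((2, 2))
A_meta = compose_meta_graph(base, logits)
```

Because every operation (softmax, weighted sum, matrix product) is differentiable, gradients from the downstream value loss can flow back into the attention logits, which is what lets the graph structure co-adapt with the value networks.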
The meta‑graphs are then fed into a Graph Convolutional Network (GCN). Starting from the observation matrix X ∈ ℝ^{n×d}, the GCN updates node embeddings as H^{(ℓ+1)} = σ( A_M^{(c)} · H^{(ℓ)} · W^{(ℓ)} ), where σ is a non‑linear activation and the W^{(ℓ)} are trainable weight matrices. After several GCN layers, each agent obtains a rich embedding that captures information propagated through the dynamically composed interaction graph.
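A minimal sketch of this propagation rule, under stated assumptions: ReLU as the nonlinearity σ, and row-normalization of the meta-adjacency so that messages are averaged over neighbors (the summary does not specify either choice).

```python
import numpy as np

def gcn_forward(A_meta, X, weights):
    """Stack of GCN layers over a composed meta-graph (illustrative).

    A_meta: (n, n) meta-adjacency; X: (n, d) agent observations;
    weights: list of (d_in, d_out) matrices, one per GCN layer.
    Implements H^{l+1} = relu(A_hat @ H^{l} @ W^{l}).
    """
    # Row-normalize so each agent averages incoming messages (assumption).
    A_hat = A_meta / np.clip(A_meta.sum(axis=1, keepdims=True), 1e-8, None)
    H = X
    for W in weights:
        H = np.maximum(A_hat @ H @ W, 0.0)  # linear mix, then ReLU
    return H

n, d, h = 4, 8, 16
rng = np.random.default_rng(1)
X = rng.standard_normal((n, d))
Ws = [rng.standard_normal((d, h)) * 0.1, rng.standard_normal((h, h)) * 0.1]
emb = gcn_forward(np.ones((n, n)), X, Ws)  # complete graph as a stand-in
```

Each layer lets information hop one step along the meta-graph, so L GCN layers give each agent an embedding that aggregates observations from agents up to L hops away.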
These embeddings are used to compute individual Q‑values Q_i and pairwise Q_{ij} in the classic CG factorization:
Q_tot = (1/|V|) Σ_i Q_i + (1/|E|) Σ_{(i,j)∈E} Q_{ij}.
Thus DMCG retains the interpretability and credit‑assignment benefits of CGs while leveraging the expressive power of modern GNNs.
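The factorization above reduces to simple averaging once the per-agent utilities and pairwise payoffs are computed. A small sketch (the dict-based edge representation is an illustrative choice, not the paper's data structure):

```python
import numpy as np

def q_total(q_i, q_ij, edges, n_agents):
    """CG value factorization: mean of individual utilities plus mean
    of pairwise payoffs over the edge set.

    q_i: (n,) chosen-action utilities; q_ij: dict (i, j) -> payoff.
    """
    individual = q_i.sum() / n_agents                       # (1/|V|) sum_i Q_i
    pairwise = sum(q_ij[e] for e in edges) / len(edges) if edges else 0.0
    return individual + pairwise                            # + (1/|E|) sum Q_ij

q_i = np.array([1.0, 2.0, 3.0])
edges = [(0, 1), (1, 2)]
q_ij = {(0, 1): 0.5, (1, 2): 1.5}
print(q_total(q_i, q_ij, edges, 3))  # -> 3.0
```

Because Q_tot is a sum of local terms, credit for the joint outcome decomposes over individual agents and interacting pairs, which is the credit-assignment benefit the summary refers to.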
Training follows the centralized‑training‑decentralized‑execution (CTDE) paradigm. A joint TD‑error loss is back‑propagated through both the Q‑networks and the graph‑generation modules, enabling the meta‑graphs to adapt as the task reveals which agents truly influence each other.
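The training signal can be sketched as a standard squared TD-error on the joint value, computed centrally over a batch of transitions. Details such as the target-network update scheme are not given in the summary, so treat this as a generic CTDE loss rather than the paper's exact objective:

```python
import numpy as np

def td_loss(q_tot, rewards, q_tot_next, dones, gamma=0.99):
    """Joint TD-error loss under CTDE (illustrative sketch).

    q_tot: (B,) predicted joint values for the batch;
    q_tot_next: (B,) bootstrapped next-state values (in practice from
    a target network); dones: (B,) episode-termination flags.
    """
    targets = rewards + gamma * (1.0 - dones) * q_tot_next  # r + gamma * Q'
    return np.mean((q_tot - targets) ** 2)                  # mean squared TD error

q = np.array([1.0, 0.5])
r = np.array([1.0, 0.0])
q_next = np.array([0.0, 0.5])
done = np.array([1.0, 0.0])
loss = td_loss(q, r, q_next, done)
```

In DMCG this scalar loss is back-propagated not only into the Q-networks but also through the differentiable graph-composition attention weights, which is what allows the meta-graphs to adapt during training.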
Empirical evaluation spans several challenging cooperative benchmarks: StarCraft II micromanagement maps (e.g., 2s3z, 3m, corridor), Pursuit, Lift, and other robot‑swarm tasks. DMCG consistently outperforms strong baselines such as VDN, QMIX, QTRAN, Deep Coordination Graphs (DCG), Deep Implicit Coordination Graphs (DICG), and recent GNN‑based MARL methods. Gains are especially pronounced in environments where coordination is non‑trivial (e.g., rewards only when multiple agents act simultaneously).
Ablation studies dissect the contributions of (1) the number of base graphs K, (2) the depth of composition L, (3) the number of channels C, and (4) joint optimization of graphs and value functions. Removing any component degrades performance, confirming that dynamic graph composition, multi‑channel mixing, and end‑to‑end training are all essential.
Complexity analysis shows that graph generation costs O(K·C·n²) memory and O(C·n·d·L) compute per GCN pass. The authors mitigate overhead by employing sparse matrix operations and weight sharing, achieving real‑time inference (≈10 Hz) on standard GPU hardware even for n≈30 agents. The method scales linearly with the number of agents, making it suitable for larger swarms, though the authors note that very large n may require hierarchical clustering extensions.
In summary, DMCG advances MARL by (i) learning a flexible, multi‑type interaction graph that evolves with training, (ii) integrating this graph with graph convolutions to produce expressive agent representations, and (iii) jointly optimizing the graph and policy to satisfy the Individual‑Greedy‑Maximization (IGM) principle. The framework delivers superior sample efficiency and final performance across a suite of cooperative tasks, offering a promising direction for real‑world multi‑robot, autonomous‑vehicle, and drone‑swarm applications where coordination dynamics are complex and not known a priori.