Distributed scalable coupled policy algorithm for networked multi-agent reinforcement learning
This paper studies networked multi-agent reinforcement learning (NMARL) with interdependent rewards and coupled policies. In this setting, each agent’s reward depends on its own state-action pair as well as those of its direct neighbors, and each agent’s policy is parameterized by its local parameters together with those of its $\kappa_p$-hop neighbors, with $\kappa_p \geq 1$ denoting the coupling radius. The agents aim to collaboratively optimize their policies to maximize the discounted average cumulative reward. To address the challenge of interdependent policies in collaborative optimization, we introduce a novel concept termed the neighbors’ averaged $Q$-function and derive a new expression for the coupled policy gradient. Based on these theoretical foundations, we develop a distributed scalable coupled policy (DSCP) algorithm, in which each agent relies only on the state-action pairs of its $\kappa_p$-hop neighbors and the rewards of its $(\kappa_p+1)$-hop neighbors. Specifically, the DSCP algorithm employs a geometric 2-horizon sampling method that obtains an unbiased estimate of the coupled policy gradient without storing a full $Q$-table. Moreover, each agent interacts exclusively with its direct neighbors to obtain accurate policy parameters, while maintaining local estimates of other agents’ parameters in order to execute its local policy and collect samples for optimization. These estimates and policy parameters are updated via a push-sum protocol, enabling distributed coordination of policy updates across the network. We prove that the joint policy produced by the proposed algorithm converges to a first-order stationary point of the objective function. Finally, the effectiveness of the DSCP algorithm is demonstrated through simulations in a robot path planning environment, showing clear improvement over state-of-the-art methods.
💡 Research Summary
This paper tackles a novel and challenging problem in Networked Multi-Agent Reinforcement Learning (NMARL): learning with interdependent rewards and, crucially, coupled policies. In this setting, an agent’s reward depends on the state-action pairs of itself and its direct neighbors. More significantly, an agent’s policy is parameterized not only by its own local parameters but also by those of its κ_p-hop neighbors, where κ_p ≥ 1 is the coupling radius. This coupling reflects realistic scenarios where agents’ decisions are inherently interdependent, such as in traffic signal control. The collective goal is to maximize the discounted average cumulative reward through collaborative policy optimization.
The authors identify two key limitations in prior work: 1) Scalable methods using truncated Q-functions often rely on tabular representations, leading to high storage demands and error propagation. 2) Most algorithms assume independent policies, which is impractical for many cooperative tasks. To address these, the paper makes several foundational contributions.
First, it introduces a novel theoretical construct called the “neighbors’ averaged Q-function.” Leveraging the independence of local state transitions, Lemma 1 shows that the global Q-function can be exactly decomposed into a sum of local Q-functions, each dependent only on an agent’s neighborhood. Building on this, Theorem 1 derives a new, exact expression for the coupled policy gradient. This gradient for agent i depends on the policies of its κ_p-hop neighbors and the local Q-functions of its (κ_p+1)-hop neighbors, precisely quantifying the influence of policy coupling.
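The decomposition and gradient described above can be sketched as follows. This is a hedged reconstruction from the summary: the normalization by $n$, the neighborhood notation $\mathcal{N}_i^{\kappa}$ ($\kappa$-hop neighborhood of agent $i$, with $\mathcal{N}_i = \mathcal{N}_i^1$), and the exact argument lists are my assumptions, not the paper's stated equations.

```latex
% Lemma 1 (sketch): the global Q-function decomposes exactly into local
% Q-functions, each depending only on agent i's neighborhood:
Q_{\theta}(s, a) \;=\; \frac{1}{n} \sum_{i=1}^{n}
  Q_{\theta}^{i}\big(s_{\mathcal{N}_i}, a_{\mathcal{N}_i}\big)

% Theorem 1 (sketch): since \theta_i enters the policies of all agents
% within \kappa_p hops of i, the coupled gradient for agent i collects
% score-function terms from those neighbors, each weighted by the
% averaged local Q-functions of that neighbor's direct neighbors --
% hence (\kappa_p + 1)-hop reward information overall:
\nabla_{\theta_i} J(\theta) \;=\;
\mathbb{E}\Big[ \sum_{j \in \mathcal{N}_i^{\kappa_p}}
  \nabla_{\theta_i} \log \pi^{j}\big(a_j \mid s_j;\,
    \theta_{\mathcal{N}_j^{\kappa_p}}\big)\,
  \bar{Q}_{\theta}^{\,j} \Big],
\qquad
\bar{Q}_{\theta}^{\,j} \;=\; \frac{1}{n} \sum_{k \in \mathcal{N}_j}
  Q_{\theta}^{k}\big(s_{\mathcal{N}_k}, a_{\mathcal{N}_k}\big)
```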
The core algorithmic contribution is the Distributed Scalable Coupled Policy (DSCP) algorithm. Its design is guided by the principles of decentralization and scalability. Each agent i requires only information from its κ_p-hop neighborhood (states, actions) and (κ_p+1)-hop neighborhood (rewards), eliminating the need for global knowledge. A key innovation is the use of a geometric 2-horizon sampling technique. Instead of estimating or storing full Q-tables, the algorithm uses two random horizons (T1, T2) drawn from a geometric distribution to obtain an unbiased estimate of the policy gradient directly from sampled trajectories, significantly reducing computational and storage overhead.
For coordination, DSCP employs a distributed consensus mechanism. Each agent maintains local estimates of all other agents’ policy parameters, which it uses to execute its own policy and collect samples. To learn the true parameters, agents communicate only with their direct neighbors using a push-sum protocol. This protocol ensures that both the actual policy parameters and their local estimates are consistently updated and disseminated across the network, enabling coordinated policy improvement without centralization.
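The push-sum primitive itself can be sketched in a few lines. Each node splits a (value, weight) pair equally among its out-neighbors and itself; the running ratio value/weight at every node converges to the network-wide average even on a directed, non-doubly-stochastic graph. The 4-node graph and scalar "parameters" below are illustrative assumptions, not the paper's network or update rule.

```python
import numpy as np

# Strongly connected directed graph: out_neighbors[i] = nodes i pushes to
out_neighbors = {0: [1, 2], 1: [2], 2: [3], 3: [0]}
values = np.array([4.0, 0.0, 2.0, 6.0])  # each node's local quantity
weights = np.ones(4)                     # push-sum correction weights

for _ in range(100):
    new_v, new_w = np.zeros(4), np.zeros(4)
    for i in range(4):
        # split (value, weight) equally among out-neighbors and self
        share = 1.0 / (len(out_neighbors[i]) + 1)
        for j in out_neighbors[i] + [i]:
            new_v[j] += values[i] * share
            new_w[j] += weights[i] * share
    values, weights = new_v, new_w

# Every node's ratio converges to the average of the initial values (3.0),
# while total value and total weight are conserved at each round.
ratios = values / weights
```

The weights correct for the fact that the column-stochastic mixing matrix of a directed graph is generally not row-stochastic; in DSCP the scalar `values` would be replaced by the policy-parameter estimates being disseminated.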
The paper provides rigorous convergence guarantees for DSCP. The analysis proves that: 1) Each agent’s local update uses an unbiased estimate of the gradient of the executed joint policy (Lemma 2). 2) Each agent’s local estimates of others’ parameters converge to their true values (Theorem 3). 3) Most importantly, the joint policy generated by the DSCP algorithm converges to a first-order stationary point of the global objective function J(θ) (Theorem 4). This provides a solid theoretical foundation often missing in prior empirical work on coupled policies.
Finally, the effectiveness of DSCP is validated through simulations in a multi-robot path planning environment within a grid world. Agents must navigate from start to goal locations while avoiding collisions. Experimental results demonstrate that DSCP achieves a higher average reward and faster convergence compared to state-of-the-art baselines, including independent policy-based distributed actor-critic and scalable truncated Q-function learning methods. The results confirm that explicitly modeling and learning with coupled policies, as done in DSCP, leads to superior cooperative performance in networked multi-agent systems.