Co2PO: Coordinated Constrained Policy Optimization for Multi-Agent RL

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Constrained multi-agent reinforcement learning (MARL) faces a fundamental tension between exploration and safety-constrained optimization. Leading existing approaches, such as Lagrangian methods, typically rely on global penalties or centralized critics that react to violations after they occur, suppressing exploration and leading to over-conservatism. We propose Co2PO, a novel communication-augmented MARL framework that enables coordination-driven safety through selective, risk-aware communication. Co2PO introduces a shared blackboard architecture for broadcasting positional intent and yield signals, governed by a learned hazard predictor that proactively forecasts potential violations over an extended temporal horizon. By integrating these forecasts into a constrained optimization objective, Co2PO allows agents to anticipate and navigate collective hazards without the performance trade-offs inherent in traditional reactive constraints. We evaluate Co2PO on a suite of complex multi-agent safety benchmarks, where it achieves higher returns than leading constrained baselines while converging to cost-compliant policies at deployment. Ablation studies further validate the necessity of the risk-triggered communication, adaptive gating, and shared memory components.


💡 Research Summary

Co2PO (Coordinated Constrained Policy Optimization) tackles a core dilemma in constrained multi‑agent reinforcement learning: how to explore efficiently while respecting safety constraints. Traditional constrained MARL methods—most notably Lagrangian‑based approaches—apply penalties or centralized critics only after a constraint violation occurs. This reactive stance suppresses exploration, often leading to overly conservative policies that converge to sub‑optimal returns.

The authors propose a proactive, communication‑augmented framework that leverages risk‑aware coordination. Each agent runs a hazard predictor that, from its local observation, outputs a logit ℓ and a probability p = σ(ℓ) estimating the chance of incurring a cost above a threshold δ within a look‑ahead horizon H. When p exceeds a dynamically adjusted threshold τ, the agent writes a compact message to a shared blackboard. The message consists of: (1) a low‑dimensional state summary x, (2) a learned intent vector u, (3) a yield flag y (indicating willingness to yield), (4) the hazard probability p, and (5) a binary write indicator w. This “risk‑triggered write” ensures that communication occurs only when an agent perceives a heightened near‑term safety risk, dramatically reducing unnecessary bandwidth usage.
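The risk-triggered write rule can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `maybe_write` and the dictionary message format are assumptions, while the logit ℓ, probability p = σ(ℓ), threshold τ, and the (x, u, y, p, w) message fields follow the description above.

```python
import numpy as np

def sigmoid(logit):
    """p = sigma(l): hazard probability from the predictor's logit."""
    return 1.0 / (1.0 + np.exp(-logit))

def maybe_write(obs_summary, intent, yield_flag, logit, tau):
    """Risk-triggered write (hypothetical helper): post a message to the
    shared blackboard only when the hazard probability p exceeds the
    dynamically adjusted threshold tau; otherwise stay silent."""
    p = sigmoid(logit)
    if p > tau:
        # Message fields from the paper: state summary x, intent vector u,
        # yield flag y, hazard probability p, binary write indicator w.
        return {"x": obs_summary, "u": intent, "y": yield_flag, "p": p, "w": 1}
    return None  # no write -> no bandwidth consumed
```

Silent agents thus contribute no blackboard traffic, which is what gives the scheme its bandwidth savings relative to always-on broadcasting.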

Reading from the blackboard is also selective. Each agent forms a query from its own state summary and computes cosine similarity with all active entries written by other agents in the same environment instance. The top‑k most similar entries are retrieved, concatenated into a fixed‑size context vector m, and fed—together with the local observation—into the policy network. Thus, agents condition their actions on the most relevant peers’ intents and yield signals, enabling coordinated maneuvers such as yielding, overtaking, or joint navigation through hazardous zones.
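The selective read step amounts to top-k retrieval by cosine similarity. The sketch below is a simplified illustration under assumed shapes (the function name `read_blackboard` and zero-padding for fewer than k entries are my choices); in the paper the query is formed from the agent's state summary and the retrieved entries are concatenated into the fixed-size context vector m.

```python
import numpy as np

def read_blackboard(query, entries, k=2):
    """Retrieve the top-k blackboard entries most cosine-similar to the
    agent's query and concatenate them into a fixed-size context vector m.
    Zero-pads when fewer than k entries are active (an assumption here)."""
    d = query.shape[0]
    if not entries:
        return np.zeros(k * d)
    sims = [
        float(e @ query / (np.linalg.norm(e) * np.linalg.norm(query) + 1e-8))
        for e in entries
    ]
    order = np.argsort(sims)[::-1][:k]   # indices of the k most similar entries
    ctx = [entries[i] for i in order]
    while len(ctx) < k:
        ctx.append(np.zeros(d))          # pad so m always has shape (k * d,)
    return np.concatenate(ctx)
```

The resulting m is then fed into the policy network alongside the local observation, so each agent attends only to the most relevant peers' intents and yield signals.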

Training follows the centralized‑training‑decentralized‑execution (CTDE) paradigm. A Lagrangian objective L(π, λ) = J_R(π) − λ(J_C(π) − d) is optimized using a hybrid advantage A_hyb = A_R − λA_C. The actor loss incorporates three additional terms: (i) a write penalty α_write · E
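The Lagrangian machinery above can be made concrete with a short sketch. The hybrid advantage follows the formula in the text; the dual update rule (projected gradient ascent on λ with clipping at zero) is the standard treatment for Lagrangian constrained RL and is an assumption here, as the paper excerpt does not spell it out.

```python
def hybrid_advantage(adv_reward, adv_cost, lam):
    """A_hyb = A_R - lambda * A_C: the reward advantage penalized by the
    Lagrange-weighted cost advantage, as in the paper's actor objective."""
    return adv_reward - lam * adv_cost

def dual_update(lam, mean_cost, budget_d, lr=0.01):
    """Assumed dual step: lambda increases when average episode cost J_C
    exceeds the budget d, and is projected back to [0, inf)."""
    return max(0.0, lam + lr * (mean_cost - budget_d))
```

Intuitively, λ rises while the policy violates the cost budget, tilting A_hyb toward cost reduction, and decays back toward zero once the policy is cost-compliant.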

