Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning
The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication-free correlated policies than shared randomness alone. This is motivated by well-known results in quantum physics which show that, for certain single-round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those using only shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single-round games that are treated as black box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi-agent sequential decision-making problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP).
💡 Research Summary
The paper tackles a fundamental obstacle in multi‑agent reinforcement learning (MARL): how to achieve coordinated behavior when agents cannot exchange messages. While prior work has relied on shared randomness—often formalized as a correlation device—to induce correlated policies, such approaches are limited to the convex hull of factorizable policies and cannot capture the richer correlations enabled by quantum mechanics. The authors therefore ask whether shared quantum entanglement can be exploited as a coordination resource that remains communication‑free yet strictly more expressive than shared randomness.
To answer this, the authors first construct a hierarchy of policy classes for decentralized cooperative MARL. At the bottom lies Π_F, the set of factorizable policies where each agent acts independently on its own observation‑action history. Above that is Π_SR, the set of policies that condition on a common random variable (shared randomness). The next level, Π_Q, consists of policies that are realized by agents sharing an entangled quantum state and performing local positive‑operator‑valued measurements (POVMs) on it. Finally, Π_NS denotes the non‑signaling policies, the most general class that respects the no‑communication constraint. The hierarchy satisfies Π_F ⊂ Π_SR ⊂ Π_Q ⊂ Π_NS ⊂ Π_C (the set of all joint policies), establishing that quantum‑entangled policies strictly extend the expressive power of classical shared randomness.
The technical contribution centers on making such quantum‑entangled policies learnable with gradient‑based reinforcement‑learning methods. The authors introduce QuantumSoftmax, a differentiable transformation that maps any square complex matrix to a valid POVM element. By applying QuantumSoftmax to a set of unconstrained matrices, the algorithm produces a full POVM that satisfies positivity and completeness, enabling back‑propagation through the measurement parameters. This makes it possible to optimize the measurement operators jointly with the rest of the policy network.
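The summary does not spell out the exact form of QuantumSoftmax, but one standard construction with the stated properties maps unconstrained complex matrices {A_i} to Gram matrices A_i†A_i (automatically positive semidefinite) and normalizes them by S^{-1/2} with S = Σ_i A_i†A_i, so the results sum to the identity. The sketch below (names and details are illustrative, not taken from the paper) shows this in NumPy; in an autodiff framework the same operations would be differentiable in the A_i:

```python
import numpy as np

def povm_from_unconstrained(mats):
    """Map arbitrary complex square matrices {A_i} to a valid POVM {E_i}.

    E_i = S^{-1/2} (A_i^† A_i) S^{-1/2} with S = Σ_i A_i^† A_i, so each
    E_i is positive semidefinite and Σ_i E_i = I (completeness).
    This is one common construction; the paper's QuantumSoftmax may differ.
    """
    grams = [A.conj().T @ A for A in mats]  # each Gram matrix is PSD
    S = sum(grams)
    # S^{-1/2} via eigendecomposition; S is Hermitian and (generically)
    # positive definite, so the inverse square root is well defined.
    w, V = np.linalg.eigh(S)
    S_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.conj().T
    return [S_inv_sqrt @ G @ S_inv_sqrt for G in grams]

rng = np.random.default_rng(0)
mats = [rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)) for _ in range(3)]
povm = povm_from_unconstrained(mats)
total = sum(povm)
print(np.allclose(total, np.eye(2)))  # completeness holds by construction
```

Because positivity and completeness hold for any input matrices, gradient updates on the unconstrained parameters can never leave the space of valid measurements.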
Building on this, the paper proposes a modular policy architecture: a quantum coordinator and local actors. The coordinator holds the shared quantum state (e.g., a Bell pair) and the parametrized POVMs; when a joint decision is required, it samples a measurement outcome (the “advice”) and broadcasts this classical advice to each agent. Each local actor then conditions its action distribution on both its private observation‑action history and the received advice. Because the advice is generated from a quantum measurement, the resulting joint policy belongs to Π_Q while still being executable in a fully decentralized manner. The authors embed this architecture into a modified version of MAPPO (Multi‑Agent Proximal Policy Optimization), allowing simultaneous learning of the quantum coordinator and the local actor networks.
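Mechanically, the coordinator's advice distribution follows the Born rule: for a shared state ρ and local POVMs {E_a}, {F_b}, the joint outcome probability is p(a, b) = Tr[(E_a ⊗ F_b) ρ]. A minimal sketch for two agents sharing a Bell pair (variable names are illustrative, not from the paper):

```python
import numpy as np

def born_probs(rho, E, F):
    """Joint outcome probabilities p(a, b) = Tr[(E_a ⊗ F_b) ρ] when two
    agents measure their halves of a shared state rho locally."""
    return np.array([[np.real(np.trace(np.kron(Ea, Fb) @ rho))
                      for Fb in F] for Ea in E])

# Shared Bell state |Φ+> = (|00> + |11>)/√2 held by the coordinator
phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho = np.outer(phi, phi)

# Each agent measures in the computational basis (a projective POVM)
P0, P1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])
probs = born_probs(rho, [P0, P1], [P0, P1])
print(probs)  # p(0,0) = p(1,1) = 0.5, p(0,1) = p(1,0) = 0 — perfectly correlated
```

The sampled pair (a, b) is the classical "advice": each actor receives its component and conditions its action distribution on it, so no inter-agent signaling occurs at execution time.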
Empirically, the framework is evaluated in two settings. First, classic single‑round non‑local games (e.g., CHSH, Magic Square) are presented as black‑box oracles. The learning algorithm discovers the optimal entangled strategies (e.g., using the Bell state and appropriate measurement bases) and achieves the known quantum advantage over any classical shared‑randomness policy. This demonstrates that the method can recover analytically known quantum strategies purely from interaction data.
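For context, the CHSH value the learner must rediscover can be computed directly from the Born rule. With the Bell state and the textbook-optimal measurement angles (an assumption here; the paper's learned measurements need only be equivalent up to local rotations), the win probability is cos²(π/8) ≈ 0.854, beating the classical shared-randomness bound of 0.75:

```python
import numpy as np

def basis(theta):
    """Projective measurement in the real basis rotated by angle theta."""
    v0 = np.array([np.cos(theta), np.sin(theta)])
    v1 = np.array([-np.sin(theta), np.cos(theta)])
    return [np.outer(v0, v0), np.outer(v1, v1)]

phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)  # Bell state |Φ+>
rho = np.outer(phi, phi)

# Standard optimal CHSH angles: Alice uses 0 and π/4, Bob uses ±π/8
alice = {0: basis(0.0), 1: basis(np.pi / 4)}
bob = {0: basis(np.pi / 8), 1: basis(-np.pi / 8)}

win = 0.0
for x in (0, 1):          # uniformly random questions x, y
    for y in (0, 1):
        for a in (0, 1):  # answers a, b; win iff a XOR b == x AND y
            for b in (0, 1):
                if a ^ b == x & y:
                    M = np.kron(alice[x][a], bob[y][b])
                    win += 0.25 * np.real(np.trace(M @ rho))
print(round(win, 4))  # 0.8536 = cos²(π/8) > 0.75 classical optimum
```

A black-box learner that reaches a win rate above 0.75 has therefore provably found a strategy outside the shared-randomness class Π_SR.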
Second, the authors apply the approach to a more realistic sequential decision‑making problem: a multi‑router, multi‑server queueing system modeled as a Dec‑POMDP. Each server observes only its local queue length and must decide where to forward incoming packets. The system is designed so that an entangled policy can reduce overall latency compared to the best classical correlated policy. Training with the quantum‑coordinator MAPPO yields policies that lower average waiting time by roughly 12 % and increase throughput by about 8 % relative to the strongest shared‑randomness baseline. Importantly, these gains are achieved without any inter‑agent signaling during execution, confirming that the quantum measurement outcomes provide the necessary coordination.
The discussion acknowledges practical challenges: current quantum hardware suffers from noise and limited qubit counts, which could impede real‑time deployment. The authors suggest a hybrid “offline‑training, online‑execution” scheme where the quantum parameters are learned in simulation and then transferred to a physical quantum device for inference. They also note that the non‑signaling class Π_NS still contains policies that cannot be realized even with entanglement, leaving room for future work on more general quantum resources (e.g., multipartite entanglement, quantum steering).
In summary, this work delivers the first end‑to‑end differentiable framework that lets decentralized agents learn to exploit shared quantum entanglement for coordination. By introducing QuantumSoftmax and a clean coordinator‑actor decomposition, it bridges quantum information theory and modern MARL, showing both theoretical expressiveness and practical performance gains in communication‑constrained environments such as high‑frequency trading or distributed control systems.