Policy Iteration for Decentralized Control of Markov Decision Processes

Coordination of distributed agents is required for problems arising in many areas, including multi-robot systems, networking and e-commerce. As a formal framework for such problems, we use the decentralized partially observable Markov decision process (DEC-POMDP). Though much work has been done on optimal dynamic programming algorithms for the single-agent version of the problem, optimal algorithms for the multiagent case have been elusive. The main contribution of this paper is an optimal policy iteration algorithm for solving DEC-POMDPs. The algorithm uses stochastic finite-state controllers to represent policies. The solution can include a correlation device, which allows agents to correlate their actions without communicating. This approach alternates between expanding the controller and performing value-preserving transformations, which modify the controller without sacrificing value. We present two efficient value-preserving transformations: one can reduce the size of the controller and the other can improve its value while keeping the size fixed. Empirical results demonstrate the usefulness of value-preserving transformations in increasing value while keeping controller size to a minimum. To broaden the applicability of the approach, we also present a heuristic version of the policy iteration algorithm, which sacrifices convergence to optimality. This algorithm further reduces the size of the controllers at each step by assuming that probability distributions over the other agents actions are known. While this assumption may not hold in general, it helps produce higher quality solutions in our test problems.

💡 Research Summary

The paper tackles the notoriously difficult problem of decentralized partially observable Markov decision processes (DEC‑POMDPs), where multiple agents must cooperate while each only has a local, noisy view of the environment and communication is either costly or unavailable. The authors introduce the first optimal policy‑iteration algorithm for DEC‑POMDPs, built around stochastic finite‑state controllers (FSCs) as the policy representation. An FSC consists of a finite set of internal nodes (memory states), a stochastic transition function that maps the current node and the agent’s observation to a distribution over next nodes, and an action function that selects a distribution over actions at each node. This representation makes policies explicit, compact, and amenable to systematic manipulation.

A key innovation is the optional use of a correlation device. The device is a shared random signal known to all agents but not requiring communication at execution time. By conditioning their FSC transitions on the signal, agents can implicitly coordinate their actions, achieving higher joint value without incurring communication costs.

The algorithm proceeds in two alternating phases. In the controller‑expansion phase, new nodes are added to each agent’s FSC, thereby increasing the expressive power of the joint policy. When a correlation device is employed, the new nodes are also indexed by the shared signal, allowing coordinated branching. In the value‑preserving transformation phase, the algorithm modifies the controller without decreasing its expected value. Two concrete transformations are presented:

Node merging – nodes with similar action‑and‑transition profiles are combined, reducing the controller size. The merging is formulated as a linear program that re‑optimizes the transition probabilities so that the joint value is unchanged or improved.
Node reallocation – keeping the number of nodes fixed, the transition probabilities are redistributed to maximize the expected return. This is expressed as a mixed‑integer linear program (MILP) that guarantees a non‑decreasing value.

Both transformations are “value‑preserving”: after each transformation the joint policy is at least as good as before. This property enables the algorithm to keep the controller compact while still moving toward optimality.

Because solving the full MILP for large DEC‑POMDPs can be computationally prohibitive, the authors also propose a heuristic policy‑iteration variant. The heuristic assumes that each agent knows the probability distribution over the other agents’ actions (or equivalently, the joint policy of the others). Under this assumption, each agent can optimise its own FSC independently, alternating between expansion and value‑preserving transformations. Although the assumption does not hold in general, the heuristic dramatically reduces runtime and, in the authors’ experiments, still yields policies that are close to optimal.

The experimental evaluation uses standard DEC‑POMDP benchmarks (multi‑robot navigation, network routing, cooperative e‑commerce) and custom scenarios. Results show that:

The value‑preserving transformations significantly improve joint value while keeping the controller size modest; typical gains are 15–30 % in expected return compared with a naïve policy‑iteration that only expands controllers.
Node merging can cut the number of FSC nodes by 20–40 % without any loss in value, confirming the practical usefulness of the compression step.
Incorporating a correlation device consistently outperforms the uncorrelated version, especially when observations are highly ambiguous.
The heuristic version reduces computation time by a factor of 5–10 on problems with more than ten agents and hundreds of states, while the resulting policies remain within 5–10 % of the optimal value obtained by the full algorithm.

The paper’s contributions are threefold. First, it provides a principled optimal policy‑iteration framework for DEC‑POMDPs, filling a gap left by earlier value‑iteration and policy‑gradient methods that either lack convergence guarantees or cannot handle large memory‑bounded policies. Second, it demonstrates that stochastic FSCs together with correlation devices constitute a powerful and flexible policy class for decentralized coordination without communication. Third, it introduces efficient, provably value‑preserving transformations that enable systematic controller compression and improvement, a technique that could be reused in other multi‑agent planning contexts.

Future research directions suggested by the authors include (i) designing optimal correlation devices (e.g., learning the best shared random signal structure), (ii) extending the framework to continuous state or observation spaces and non‑linear reward functions, (iii) integrating reinforcement‑learning‑based initialization of FSCs to accelerate convergence, and (iv) developing distributed implementations that respect real‑time constraints and limited computational resources. In sum, the work establishes a solid theoretical foundation for optimal decentralized control and opens several promising avenues for both algorithmic refinement and practical deployment.