Scaling Up Decentralized MDPs Through Heuristic Search

Decentralized partially observable Markov decision processes (Dec-POMDPs) are rich models for cooperative decision-making under uncertainty, but are often intractable to solve optimally (NEXP-complete). The transition and observation independent Dec-MDP is a general subclass that has been shown to have complexity in NP, but optimal algorithms for this subclass are still inefficient in practice. In this paper, we first provide an updated proof that an optimal policy does not depend on the histories of the agents, but only the local observations. We then present a new algorithm based on heuristic search that is able to expand search nodes by using constraint optimization. We show experimental results comparing our approach with state-of-the-art Dec-MDP and Dec-POMDP solvers. These results show a reduction in computation time and an increase in scalability by multiple orders of magnitude on a number of benchmarks.


💡 Research Summary

The paper tackles the computational bottleneck of solving decentralized partially observable Markov decision processes (Dec‑POMDPs) by focusing on transition‑ and observation‑independent Dec‑MDPs, a subclass in which no agent's actions influence another agent's observations or state transitions. While general Dec‑POMDPs are NEXP‑complete and thus intractable for optimal planning, this subclass has complexity in NP. Even so, existing optimal algorithms for it still suffer from exponential blow‑up in practice.

The authors begin by revisiting the foundational claim that, in this independent subclass, an optimal joint policy need not condition on each agent's full action‑observation history. They identify a gap in earlier proofs and provide a rigorous, updated proof that, thanks to the independence assumptions, the optimal decision at any time step can be made solely on the basis of each agent's current local observation. This result eliminates the need to track entire action‑observation histories, dramatically shrinking the effective policy space that any algorithm must explore.
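The scale of this shrinkage can be made concrete with a back‑of‑the‑envelope count. The sketch below is our own illustration (the function names and the toy numbers are not from the paper): it compares how many deterministic policies a single agent would have to consider when conditioning on full observation histories versus only the current observation.

```python
# Hypothetical illustration: count the deterministic policies one agent
# faces over horizon T with |O| local observations and |A| actions.
def num_history_policies(num_obs, num_actions, horizon):
    # One action choice per observation *history* of length 1..T;
    # there are num_obs**t distinct histories at step t.
    num_histories = sum(num_obs**t for t in range(1, horizon + 1))
    return num_actions**num_histories

def num_observation_policies(num_obs, num_actions, horizon):
    # One action choice per (step, current observation) pair only.
    return num_actions**(num_obs * horizon)

# Tiny example: 2 observations, 2 actions, horizon 3.
print(num_history_policies(2, 2, 3))      # 2**(2+4+8) = 16384
print(num_observation_policies(2, 2, 3))  # 2**6 = 64
```

Even at this toy scale the gap is stark, and it widens doubly exponentially with the horizon, which is why dropping histories matters so much for any exact solver.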

Building on this theoretical simplification, the paper introduces a novel heuristic‑search algorithm that integrates constraint optimization into the node‑expansion phase. The search follows an A*‑style best‑first strategy: each node represents a partial joint policy defined over the local observations seen so far. When expanding a node, the algorithm casts the choice among admissible joint actions as a constraint optimization problem whose constraints encode the independence of transitions and observations, so any joint action violating them is pruned before it is ever examined. This subproblem is handed to off‑the‑shelf linear or integer programming solvers, which efficiently generate only feasible joint actions.
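A minimal sketch of this expand‑by‑constraints pattern, entirely our own and not the paper's implementation: a feasibility predicate stands in for the constraint‑optimization filter, and an admissible bound of R_max per remaining step orders the frontier. For brevity the sketch searches over open‑loop joint‑action sequences rather than observation‑conditioned policies.

```python
import heapq
from itertools import product

def best_first_search(actions_per_agent, feasible, step_reward,
                      horizon, r_max):
    """Return a best feasible joint-action sequence of length `horizon`.

    actions_per_agent: list of per-agent action sets
    feasible: predicate on a joint action (stand-in for the constraint step)
    step_reward: reward of one joint action
    r_max: maximum per-step reward, used for the admissible upper bound
    """
    # Max-heap via negated f-values; f = g + r_max * steps_left never
    # underestimates the best achievable total, so the first complete
    # plan popped is optimal.
    frontier = [(-(r_max * horizon), ((), 0.0))]
    while frontier:
        _, (plan, g) = heapq.heappop(frontier)
        if len(plan) == horizon:
            return plan, g
        steps_left = horizon - len(plan) - 1
        for joint in product(*actions_per_agent):
            if not feasible(joint):   # prune before the child is scored
                continue
            g2 = g + step_reward(joint)
            f = g2 + r_max * steps_left
            heapq.heappush(frontier, (-f, (plan + (joint,), g2)))
    return None, float("-inf")

# Toy usage: two agents, two actions each; the constraint forbids the
# joint action ('b', 'b'), so the best feasible plan earns 3 per step.
acts = [["a", "b"], ["a", "b"]]
reward = lambda j: {"a": 1, "b": 2}[j[0]] + {"a": 1, "b": 2}[j[1]]
plan, value = best_first_search(acts, lambda j: j != ("b", "b"),
                                reward, horizon=2, r_max=4)
print(plan, value)  # value == 6.0
```

In the paper's setting the feasibility predicate is replaced by a constraint optimization subproblem solved per expansion, which is what lets the method avoid enumerating the full joint-action product in the first place.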

The heuristic function combines two components: (1) the accumulated reward of the partial joint policy (a lower bound) and (2) an optimistic upper bound on the remaining reward, computed from the maximum possible per‑step reward multiplied by the number of steps left. This admissible heuristic guarantees optimality while steering the search toward promising regions of the joint policy space. Because the constraint‑optimization step discards large swaths of infeasible actions early, the number of expanded nodes is orders of magnitude smaller than in naïve exhaustive search.
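In symbols (notation ours, not necessarily the paper's), with $g(n)$ the reward already accrued by the partial joint policy at node $n$, $t(n)$ its depth, $T$ the horizon, and $R_{\max}$ the largest one‑step reward, the evaluation function is:

```latex
f(n) \;=\; \underbrace{g(n)}_{\text{accrued reward}}
\;+\; \underbrace{(T - t(n))\, R_{\max}}_{h(n),\ \text{optimistic remainder}}
```

Since no completion of the partial policy can earn more than $R_{\max}$ on any remaining step, $h(n)$ never underestimates the remaining reward, and the first complete policy popped from the best‑first frontier is therefore optimal.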

Experimental evaluation is conducted on a suite of standard Dec‑MDP benchmarks (grid‑world navigation, fire‑fighting, cooperative target tracking) and on larger synthetic scenarios designed to stress scalability. The proposed method is compared against state‑of‑the‑art Dec‑MDP solvers (e.g., GMAA*, MAA*) and against leading Dec‑POMDP solvers that can be applied to the same problems. Results show that the new algorithm reduces computation time by factors ranging from 10× to over 1,000×, depending on problem size, while using substantially less memory. Importantly, the quality of the policies remains optimal (or within negligible tolerance), confirming that the heuristic is indeed admissible. In large‑scale tests with up to 15 agents and 30 time steps, the algorithm solves instances that were previously unsolvable within reasonable time limits.

The paper concludes with a discussion of future directions. First, the strict independence assumption could be relaxed to handle partially dependent transitions or observations, extending the applicability of the method to more realistic multi‑robot or sensor‑network domains. Second, the authors suggest integrating learned heuristics—such as value functions obtained via deep reinforcement learning—to further guide the search in extremely large spaces. Third, they outline plans for a distributed implementation where each agent performs local constraint solving and shares minimal coordination messages, enabling real‑time policy generation on physical robot platforms.

In summary, this work makes three key contributions: (1) a corrected proof that optimal Dec‑MDP policies depend only on current local observations, (2) a hybrid heuristic‑search and constraint‑optimization algorithm that exploits this property to achieve dramatic speed‑ups, and (3) extensive empirical validation demonstrating orders‑of‑magnitude improvements over existing solvers. The results open the door to applying optimal Dec‑MDP planning to real‑world cooperative decision‑making problems that were previously out of reach.