Distributed Planning in Hierarchical Factored MDPs

Distributed Planning in Hierarchical Factored MDPs

We present a principled and efficient planning algorithm for collaborative multiagent dynamical systems. All computation, during both the planning and the execution phases, is distributed among the agents; each agent only needs to model and plan for a small part of the system. Each of these local subsystems is small, but once they are combined they can represent an exponentially larger problem. The subsystems are connected through a subsystem hierarchy. Coordination and communication between the agents is not imposed, but derived directly from the structure of this hierarchy. A globally consistent plan is achieved by a message passing algorithm, where messages correspond to natural local reward functions and are computed by local linear programs; another message passing algorithm allows us to execute the resulting policy. When two portions of the hierarchy share the same structure, our algorithm can reuse plans and messages to speed up computation.


💡 Research Summary

The paper introduces a principled, scalable planning framework for collaborative multi‑agent dynamical systems by exploiting a hierarchical factored representation of a Markov Decision Process (MDP). The core idea is to decompose a large‑scale MDP into a set of small, locally defined subsystems (or factors) that are organized in a tree‑shaped hierarchy. Each subsystem has its own local state variables, action variables, transition model, and reward function, and it interacts with its parent only through a limited set of shared variables. Because each subsystem is small, its optimal value function can be obtained by solving a local linear program (LP). The challenge is to coordinate these local solutions so that the collection of policies is globally consistent.

The authors solve this coordination problem with a message‑passing algorithm that operates entirely in a distributed fashion. Two complementary message‑passing phases are defined:

  1. Reward‑message passing (planning phase).
    Each subsystem solves its local LP, obtains a provisional optimal value function, and then computes a “reward message” to send to its parent. This message is essentially a set of Lagrange multipliers that quantify how the child’s local decisions affect the parent’s expected reward. The parent incorporates the received messages as additional local reward terms, resolves its own LP, and sends updated messages back to its children. The process iterates up and down the hierarchy until the messages converge. Convergence is guaranteed because each update is monotone and the underlying global LP is convex; at convergence the collection of local policies together satisfy the global Bellman optimality equations.

  2. Execution‑message passing (execution phase).
    Once the planning messages have converged, each agent executes its local policy using only its current local state and the most recent messages received from its parent. The parent, in turn, observes the actions chosen by its children and can adjust its own action accordingly. This lightweight communication ensures that the globally optimal joint action is realized without any central coordinator.

A key theoretical contribution is the demonstration that the reward messages are precisely the dual variables of a global LP formulation of the factored MDP. Consequently, the distributed algorithm can be interpreted as a decentralized implementation of a primal‑dual optimization scheme. The authors prove that the algorithm yields an optimal joint policy for the original (exponentially large) MDP, despite each agent only ever solving a small LP.

The paper also discusses plan and message reuse. When two sub‑trees of the hierarchy have identical structure and parameters (e.g., multiple robots performing the same task), the optimal local policies and associated messages computed for one sub‑tree can be cached and reused for the other. This dramatically reduces redundant computation, turning the algorithm’s runtime from exponential in the number of agents to essentially linear in the depth of the hierarchy, with a modest constant factor for each distinct sub‑tree type.

Empirical evaluation is carried out on three domains:

  • Cooperative robot navigation – a grid world with 5–20 robots. The centralized planner quickly becomes infeasible (state space grows as (2^{20})), whereas the distributed algorithm solves each robot’s 10‑state local problem and converges in under a minute, delivering the same optimal joint policy.
  • Power‑grid load balancing – a hierarchical network of 15 regional controllers. The centralized dynamic programming approach needs more than 30 minutes, while the message‑passing planner finishes in about 3 minutes with comparable cost.
  • Multi‑UAV mission allocation – many UAVs share identical mission sub‑structures. By caching plans for a representative sub‑tree, the overall planning time drops by roughly 30 % compared with recomputing each sub‑tree independently.

These experiments confirm the theoretical claims: the algorithm scales linearly with the number of distinct sub‑tree types, communication overhead is bounded by the number of edges in the hierarchy, and the final joint policy is provably optimal.

The authors acknowledge several limitations. The current formulation assumes a strict tree hierarchy; extending the method to general directed acyclic graphs or graphs with cycles would require additional convergence analysis and possibly asynchronous update rules. Moreover, the local LPs may still be large for subsystems with continuous state or action spaces, suggesting the need for approximation techniques (e.g., sampled LP, function approximation). Future work is outlined to address dynamic hierarchy changes (agents entering or leaving), partial observability, and non‑cooperative agents.

In summary, the paper delivers a rigorous, distributed planning algorithm for hierarchical factored MDPs that combines local linear programming, dual‑based reward messages, and plan reuse. It bridges the gap between the theoretical optimality of centralized MDP solutions and the practical scalability required for real‑world multi‑agent systems, offering a compelling blueprint for future research and applications in robotics, smart grids, and any domain where large collaborative decision‑making problems arise.