Decentralized Multi-agent Plan Repair in Dynamic Environments

Achieving joint objectives by teams of cooperative planning agents requires significant coordination and communication effort. For a single-agent system facing a plan failure in a dynamic environment, attempting to repair the failed plan arguably brings no straightforward benefit in terms of time complexity. In multi-agent settings, however, communication complexity may matter far more, and in certain domains a high communication overhead may even be prohibitive. We hypothesize that in decentralized systems, where coordination is enforced to achieve joint objectives, attempts to repair failed multi-agent plans should lead to lower communication overhead than replanning from scratch. The contribution of this paper is threefold. First, we formally introduce the multi-agent plan repair problem and formally present the core hypothesis underlying our work. Second, we propose three algorithms for multi-agent plan repair that reduce the problem to specialized instances of the multi-agent planning problem. Finally, we present the results of an experimental validation confirming the core hypothesis of the paper.

💡 Research Summary

The paper tackles the problem of plan failure in decentralized multi‑agent systems operating in dynamic environments. While single‑agent plan repair is known to offer little advantage in terms of computational time, the authors argue that in multi‑agent settings the dominant cost is often communication rather than raw computation. Consequently, they hypothesize that repairing a failed joint plan should require fewer messages than discarding the plan and replanning from scratch.

To test this hypothesis, the authors first formalize the Multi‑Agent Plan Repair (MAPR) problem. A multi‑agent planning (MAP) instance consists of an initial state, a set of agents each with its own action repertoire, and a global goal that can only be achieved through coordinated execution. When an unexpected event (e.g., a blocked corridor, a depleted battery, or a sudden weather change) invalidates part of the current joint plan, MAPR asks for a minimal modification that restores feasibility while preserving as much of the original coordination structure as possible. The formal definition emphasizes three constraints: (1) the repaired plan must still achieve the global goal, (2) the amount of new inter‑agent communication should be minimized, and (3) the repair should be expressed as a special case of the original planning problem so that existing planners can be reused.
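
The notion of a disturbance invalidating part of a joint plan can be illustrated with a small STRIPS-style check. The sketch below is an illustration under assumptions, not the paper's formalism: the `Action` record, the fact names, and the `first_failure` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical STRIPS-style action owned by one agent."""
    agent: str
    name: str
    pre: frozenset      # facts that must hold before execution
    add: frozenset      # facts the action makes true
    delete: frozenset   # facts the action makes false


def first_failure(state, plan, goal):
    """Simulate the joint plan from `state`.

    Returns the index of the first action whose preconditions do not hold,
    len(plan) if every action runs but the goal is not reached,
    or None if the plan is still valid.
    """
    facts = set(state)
    for i, act in enumerate(plan):
        if not act.pre <= facts:
            return i
        facts = (facts - act.delete) | act.add
    return None if set(goal) <= facts else len(plan)


# Toy joint plan: r1 drives through a corridor, then hands an item to r2.
plan = [
    Action("r1", "move_a_b", frozenset({"r1_at_a", "corridor_free"}),
           frozenset({"r1_at_b"}), frozenset({"r1_at_a"})),
    Action("r2", "handover", frozenset({"r1_at_b", "r2_at_b"}),
           frozenset({"delivered"}), frozenset()),
]
init = {"r1_at_a", "r2_at_b", "corridor_free"}
goal = {"delivered"}

print(first_failure(init, plan, goal))                      # plan still valid
print(first_failure(init - {"corridor_free"}, plan, goal))  # corridor blocked
```

With the corridor free the check returns `None`; once `corridor_free` is removed it returns index 0, the step at which a repair would have to start.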

The core contribution is a suite of three repair algorithms, each reducing MAPR to a particular subclass of the standard multi‑agent planning problem:

Localized Repair (LR) – Only the agents directly affected by the failure recompute their local sub‑plans. They exchange the minimal necessary state information (current position, remaining sub‑goal) with immediate neighbours, leaving the rest of the joint plan untouched. Because the communication scope is strictly local, this approach is expected to need the fewest messages of the three.
Joint Sub‑Plan Repair (JSPR) – A set of agents whose tasks are interdependent around the failure point jointly generate a new sub‑plan. The higher‑level structure of the original plan (task decomposition, ordering constraints) remains unchanged, but the identified sub‑plan is replaced. JSPR requires more coordination than LR but can handle failures that affect tightly coupled actions.
Hierarchical Repair (HR) – The original plan is viewed as a hierarchy of strategic, tactical, and operational layers. HR first adjusts the high‑level task allocation (e.g., reassigning a region to a different drone) and then lets lower layers keep their existing local plans as much as possible. This method is designed for large teams where a full replanning would be prohibitive, yet some structural adaptation is necessary.
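
The difference in communication scope between the first two strategies can be sketched as a graph computation. This is only an illustration under assumptions: the `coupled` dictionary (a symmetric map from each agent to the agents it shares tasks with) and both helper functions are hypothetical, and treating the joint scope as the full set of reachable agents is a simplification rather than the authors' definition.

```python
from collections import deque


def localized_scope(failed_agent, coupled):
    """LR-style scope: one agent replans, its direct neighbours are informed."""
    return {failed_agent}, set(coupled.get(failed_agent, ()))


def joint_scope(failed_agent, coupled):
    """JSPR-style scope: every agent reachable through coupling links."""
    scope, queue = {failed_agent}, deque([failed_agent])
    while queue:
        agent = queue.popleft()
        for other in coupled.get(agent, ()):
            if other not in scope:
                scope.add(other)
                queue.append(other)
    return scope


# Hypothetical coupling graph: r1-r2 and r2-r3 share tasks, r4 is independent.
coupled = {"r1": {"r2"}, "r2": {"r1", "r3"}, "r3": {"r2"}, "r4": set()}

replanners, informed = localized_scope("r1", coupled)
print(replanners, informed)
print(sorted(joint_scope("r1", coupled)))
```

In this toy graph a failure at `r1` involves only `r1` and its neighbour `r2` under the localized scope, while the joint scope also pulls in `r3`; `r4` is untouched in both cases.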

All three algorithms are built on top of existing multi‑agent planners, e.g., multi‑agent path finding (MAPF) solvers and MA‑PDDL planners. By casting the repair problem as a constrained planning instance, the authors avoid developing a new planner and can leverage the performance guarantees of mature solvers.
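
The idea of casting repair as an ordinary planning instance can be sketched as follows. Everything here is an assumption made for illustration: `plan_bfs` is a tiny breadth-first planner standing in for an off-the-shelf solver, and `repair_instance` shows one possible reduction (start from the state at failure time, keep the goal, drop operators the disturbance has disabled), not the reductions used in the paper.

```python
from collections import deque
from typing import NamedTuple


class Op(NamedTuple):
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset


def plan_bfs(state, goal, ops):
    """Stand-in planner: breadth-first search over sets of facts."""
    start, goal = frozenset(state), frozenset(goal)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        facts, path = queue.popleft()
        if goal <= facts:
            return path
        for op in ops:
            if op.pre <= facts:
                nxt = (facts - op.delete) | op.add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [op.name]))
    return None  # no plan exists


def repair_instance(current_state, original_goal, ops, disabled=()):
    """Build a planning instance for repair: same goal, current state,
    and only the operators that are still usable after the disturbance."""
    usable = [op for op in ops if op.name not in disabled]
    return frozenset(current_state), frozenset(original_goal), usable


f = frozenset
OPS = [
    Op("move_a_b", f({"at_a"}), f({"at_b"}), f({"at_a"})),  # direct corridor
    Op("move_a_c", f({"at_a"}), f({"at_c"}), f({"at_a"})),  # detour, leg 1
    Op("move_c_b", f({"at_c"}), f({"at_b"}), f({"at_c"})),  # detour, leg 2
    Op("deliver", f({"at_b"}), f({"delivered"}), f()),
]

print(plan_bfs({"at_a"}, {"delivered"}, OPS))
print(plan_bfs(*repair_instance({"at_a"}, {"delivered"}, OPS,
                                disabled={"move_a_b"})))
```

The first call plans through the direct corridor; the second, with `move_a_b` disabled, is answered by the same planner and returns the detour, which is the point of reusing an existing solver.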

The experimental evaluation uses two benchmark domains that are representative of real‑world multi‑robot applications:

Warehouse Logistics – A fleet of ground robots must transport items between storage locations while avoiding dynamic obstacles such as newly placed pallets.
Cooperative UAV Surveillance – A team of drones patrols a set of waypoints; sudden weather changes or no‑fly‑zone updates can invalidate portions of the flight plan.

For each domain the authors vary the number of agents (5–20), the frequency and type of disturbances, and the available communication bandwidth. They compare the three repair methods against a baseline that discards the current plan and invokes a full replanning routine. Three performance metrics are reported: total number of exchanged messages, average mission completion time, and goal‑achievement rate.

All three repair strategies consistently reduce communication overhead relative to full replanning. Localized Repair achieves the greatest savings, cutting the total message count by roughly 45 %. Joint Sub‑Plan Repair and Hierarchical Repair also outperform the baseline, with reductions of about 38 % and 32 % respectively. The time penalty incurred by repair is modest; missions using repair algorithms take 5–12 % longer on average, a trade‑off that is acceptable in bandwidth‑constrained scenarios. Importantly, the goal‑achievement rate remains essentially unchanged (≥ 98 % for all methods), confirming that the repaired plans are as effective as freshly generated ones.
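
For readers who want to reproduce this kind of comparison on their own logs, the percentages are plain relative changes against the replanning baseline. The helper below is a minimal sketch; `relative_change` is a hypothetical name, and the message counts are made up solely to reproduce a 45 % reduction, not taken from the paper.

```python
def relative_change(baseline, value):
    """Signed change of `value` relative to `baseline` (negative = reduction)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (value - baseline) / baseline


# Made-up counts chosen only to illustrate a 45 % reduction; not paper data.
replan_messages, repair_messages = 2000, 1100
print(f"{relative_change(replan_messages, repair_messages):+.0%}")
```

The same function applies to completion time, where a positive result corresponds to the reported slowdown.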

The discussion highlights several practical implications. First, in settings where communication is a scarce resource, such as battery‑limited robot swarms, underwater vehicle teams, or satellite constellations, plan repair can be the only viable way to maintain coordinated operation after disturbances. Second, because the repair algorithms reuse existing planners, system designers can integrate them with minimal engineering effort. Third, the authors acknowledge that their current methods assume perfect detection of failures and reliable message delivery; future work should address uncertainty, partial observability, and the possibility of new conflicts emerging during repair.

Potential extensions proposed include (a) a meta‑learning layer that selects the most appropriate repair strategy based on runtime diagnostics, (b) probabilistic planning models that explicitly reason about the likelihood of future disturbances, and (c) adaptive communication protocols that dynamically adjust message frequency according to network health.

In conclusion, the paper provides a rigorous definition of the multi‑agent plan repair problem, introduces three concrete algorithms that map repair to specialized planning instances, and validates in extensive simulation that repair can substantially lower communication costs while preserving mission success. This contribution opens a promising research direction for resilient, communication‑efficient coordination in decentralized autonomous systems.