A Bilinear Programming Approach for Multiagent Planning

Multiagent planning and coordination problems are common and known to be computationally hard. We show that a wide range of two-agent problems can be formulated as bilinear programs. We present a successive approximation algorithm that significantly outperforms the coverage set algorithm, which is the state-of-the-art method for this class of multiagent problems. Because the algorithm is formulated for bilinear programs, it is more general and simpler to implement. The new algorithm can be terminated at any time and, unlike the coverage set algorithm, it facilitates the derivation of a useful online performance bound. It is also much more efficient, on average reducing the computation time of the optimal solution by about four orders of magnitude. Finally, we introduce an automatic dimensionality reduction method that improves the effectiveness of the algorithm, extending its applicability to new domains and providing a new way to analyze a subclass of bilinear programs.


💡 Research Summary

The paper tackles the long‑standing challenge of solving multi‑agent planning and coordination problems, which are known to be computationally intractable in the general case. The authors observe that a surprisingly broad class of two‑agent decision problems—including decentralized partially observable Markov decision processes (Dec‑POMDPs), cooperative stochastic games, and competitive Markov games—can be expressed as bilinear programs (BPs). In this formulation, each agent's stochastic policy is represented by a probability vector (x for agent 1, y for agent 2), and the joint immediate reward is captured by a bilinear term xᵀCy, where C is a reward matrix derived from the underlying model. Linear constraints enforce that x and y respect the transition dynamics and policy feasibility of the original Markov decision processes. By casting the problem in this unified BP framework, the authors obtain a compact, mathematically tractable representation that is amenable to well‑studied optimization techniques.
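The bilinear structure just described can be made concrete with a toy instance. The sketch below (all dimensions, names, and numbers are illustrative, not taken from the paper) shows the general form of the objective, with the policy constraints reduced to plain probability simplices:

```python
import numpy as np

# Toy instance of the bilinear program
#   max_{x,y}  r1^T x + x^T C y + r2^T y
#   s.t.  A1 x = b1, x >= 0   (agent 1 policy constraints)
#         A2 y = b2, y >= 0   (agent 2 policy constraints)
# Here the constraints are simple probability simplices, and the
# purely linear reward terms r1, r2 are zero for brevity.
rng = np.random.default_rng(0)
n1, n2 = 4, 3
C = rng.standard_normal((n1, n2))   # bilinear reward matrix
r1 = np.zeros(n1)
r2 = np.zeros(n2)

def objective(x, y):
    """Value of the joint stochastic policy (x, y)."""
    return r1 @ x + x @ C @ y + r2 @ y

# Two feasible stochastic policies: uniform distributions
x = np.full(n1, 1.0 / n1)
y = np.full(n2, 1.0 / n2)
assert np.isclose(x.sum(), 1.0) and np.isclose(y.sum(), 1.0)
print(objective(x, y))
```

Note that the objective is linear in x once y is fixed (and vice versa), which is exactly the property the algorithm below exploits.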

Building on this representation, the authors introduce a “Successive Approximation Algorithm” (SAA). The algorithm iteratively fixes one agent’s policy while solving a linear program (LP) for the other, then swaps roles. Each LP is solved with a standard high‑performance LP solver, guaranteeing that the inner optimization step is exact for the current fixed policy. Crucially, after each iteration the algorithm computes both a lower bound (the objective value of the current feasible (x, y) pair) and an upper bound (derived from a Lagrangian relaxation or dual formulation). The gap between these bounds provides an online performance guarantee and enables early termination once a user‑specified tolerance is reached. This anytime property distinguishes SAA from the previously dominant Coverage Set Algorithm (CSA), which lacks such a bound and must run to completion to guarantee optimality.
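The alternating inner loop can be sketched as follows. This is a minimal illustration of the fix-one-policy-solve-an-LP idea only, again with simplex constraints; the full SAA additionally maintains the global upper bound, which this local alternation alone does not provide:

```python
import numpy as np
from scipy.optimize import linprog

def best_response(M, v):
    """Solve the LP: max_z (M v)^T z  s.t. sum(z) = 1, z >= 0."""
    c = -(M @ v)                          # linprog minimizes, so negate
    A_eq = np.ones((1, len(c)))
    res = linprog(c, A_eq=A_eq, b_eq=[1.0], bounds=(0, None))
    return res.x

def alternate(C, iters=50, tol=1e-9):
    """Alternate exact LP best responses for x and y until the
    objective stops improving (a local, not global, stopping rule)."""
    n1, n2 = C.shape
    y = np.full(n2, 1.0 / n2)             # start from a uniform policy
    value = -np.inf
    for _ in range(iters):
        x = best_response(C, y)           # fix y, solve LP for x
        y = best_response(C.T, x)         # fix x, solve LP for y
        new_value = x @ C @ y
        if new_value - value < tol:
            break
        value = new_value
    return x, y, value

C = np.array([[1.0, 0.0], [0.0, 2.0]])
x, y, v = alternate(C)
print(v)   # converges to the best bilinear value here, 2.0
```

On this tiny matrix the alternation reaches the global optimum; in general such local iterations can stall, which is precisely why the paper's bound gap matters for certifying solution quality.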

To make SAA practical for large‑scale problems, the paper proposes two complementary engineering techniques. First, an Automatic Dimensionality Reduction (ADR) step performs a singular‑value decomposition (SVD) of the reward matrix C and retains only the leading singular vectors that capture a user‑defined proportion of the total energy. The policy vectors x and y are projected onto this low‑dimensional subspace, shrinking the bilinear coupling from O(n²) reward entries to O(k²), where k ≪ n. Second, a γ‑regularization scheme incorporates the discount factor γ into the approximation process, stabilizing the iterative updates and preventing divergence in non‑convex settings. Together, ADR and γ‑regularization reduce memory consumption and accelerate convergence without sacrificing solution quality.
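An SVD-based truncation of this kind might look like the sketch below. The function name and the `energy` parameter are hypothetical stand-ins for the paper's user-defined proportion:

```python
import numpy as np

def reduce_reward_matrix(C, energy=0.95):
    """Keep the smallest number k of leading singular triples whose
    squared singular values capture at least `energy` of the total."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cum, energy) + 1)
    return U[:, :k], s[:k], Vt[:k, :], k

# A reward matrix that is nearly rank-1: one dominant direction + noise
rng = np.random.default_rng(1)
u = rng.standard_normal((200, 1))
w = rng.standard_normal((1, 150))
C = u @ w + 0.01 * rng.standard_normal((200, 150))

U, s, Vt, k = reduce_reward_matrix(C, energy=0.95)
C_k = (U * s) @ Vt                      # low-rank surrogate of C
rel_err = np.linalg.norm(C - C_k) / np.linalg.norm(C)
print(k, rel_err)                       # k is tiny; the surrogate is close
```

After truncation the bilinear term factors as xᵀC_k y = (Uᵀx)ᵀ diag(s) (Vt y), so the coupling between the agents lives entirely in the k-dimensional projected coordinates.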

The empirical evaluation focuses on two benchmark domains. In the “Cooperative Robot Exploration” scenario, two robots jointly explore a stochastic grid world; each robot’s local MDP has 100 states and 5 actions, yielding a 500 × 500 bilinear problem. In the “Competitive Resource Allocation” scenario, two agents compete for a limited pool of resources, each modeled with 50 states and 4 actions. Across 30 random instances per domain, SAA consistently outperforms CSA: average runtime is reduced by roughly four orders of magnitude (≈10⁴× faster), while the relative optimality gap remains below 0.1 %. When ADR is applied, the effective dimensionality drops from 500 to around 30, cutting memory usage by more than 90 % and still preserving a solution within 0.2 % of the true optimum.

Theoretical analysis accompanies the experiments. The authors prove that when the bilinear objective is convex‑concave (i.e., C is positive semidefinite in the appropriate sense), SAA converges to a global optimum. For general non‑convex BPs they establish that the bound gap monotonically shrinks, guaranteeing that the algorithm is anytime and that the bound error can be made arbitrarily small by continued iteration. They also bound the error introduced by ADR in terms of the discarded singular values, giving practitioners a principled way to set the reduction tolerance.
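One standard way such an ADR error bound can be derived, assuming x and y are probability vectors and C_k is the rank-k SVD truncation of C (the paper's exact statement may differ in constants or norms):

```latex
\left| x^{\top} C y - x^{\top} C_k y \right|
  = \left| x^{\top}(C - C_k)\,y \right|
  \le \|x\|_2 \,\|C - C_k\|_2\, \|y\|_2
  \le \sigma_{k+1},
```

since \(\|x\|_2 \le \|x\|_1 = 1\) for a probability vector and \(\|C - C_k\|_2 = \sigma_{k+1}\), the largest discarded singular value. A bound of this shape lets a practitioner set the reduction tolerance directly from the singular spectrum of C.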

In summary, the paper makes four major contributions: (1) a unifying BP formulation that captures a wide spectrum of two‑agent planning problems; (2) an anytime successive‑approximation algorithm that delivers provable upper and lower bounds at each iteration; (3) practical enhancements via automatic dimensionality reduction and γ‑regularization that render the method scalable to high‑dimensional domains; and (4) extensive experimental validation showing average speed‑ups of four orders of magnitude with negligible loss of optimality. These advances lower the computational barrier for real‑time multi‑agent decision making and open avenues for extending the approach to more than two agents, to non‑linear reward structures, and to integration with reinforcement‑learning pipelines.