Optimal control as a graphical model inference problem
We reformulate a class of non-linear stochastic optimal control problems introduced by Todorov (2007) as a Kullback-Leibler (KL) minimization problem. As a result, the optimal control computation reduces to an inference computation and approximate inference methods can be applied to efficiently compute approximate optimal controls. We show how this KL control theory contains the path integral control method as a special case. We provide an example of a block stacking task and a multi-agent cooperative game where we demonstrate how approximate inference can be successfully applied to instances that are too complex for exact computation. We discuss the relation of the KL control approach to other inference approaches to control.
💡 Research Summary
The paper presents a unifying perspective that casts a broad class of nonlinear stochastic optimal control problems—originally introduced by Todorov (2007)—as a Kullback‑Leibler (KL) divergence minimization problem. By reformulating the control objective as the minimization of the KL divergence between a controlled trajectory distribution and a desired target distribution, the authors show that the computation of an optimal policy is equivalent to performing inference on a probabilistic graphical model. This insight allows the rich toolbox of approximate inference methods—such as variational Bayes, message passing, Laplace approximations, and Monte Carlo sampling—to be directly applied to optimal control, mitigating the curse of dimensionality that plagues traditional dynamic programming and Hamilton‑Jacobi‑Bellman approaches.
The theoretical development proceeds in three main steps. First, the authors model the system as a Markov decision process (MDP) with transition probabilities $p(s'\mid s,a)$ and an instantaneous cost $c(s,a)$. By exponentiating the cost, they obtain a "potential" term $\exp(-c(s,a))$ that can be interpreted as an unnormalized likelihood. Combining this with a prior transition model yields a joint distribution over entire state‑action trajectories. Second, they define a target distribution that encodes the desired control objective (for example, reaching a goal state with minimal effort) and formulate the control problem as minimizing the KL divergence $D_{\mathrm{KL}}(P\,\|\,Q)$ between the trajectory distribution $P$ induced by the policy and the target distribution $Q$. This KL minimization is shown to be mathematically identical to minimizing a variational free‑energy functional, establishing a direct link to Bayesian inference. Third, the trajectory distribution is expressed as a factor graph or Bayesian network: each time slice corresponds to a variable node, while transition dynamics and cost potentials appear as factor nodes. In this representation, the optimal policy $\pi(a\mid s)$ emerges as the marginal distribution over actions given the current state, i.e., an inference result.
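The backward recursion that this reformulation produces can be sketched on a toy linearly‑solvable MDP. The five‑state chain, cost function, and horizon below are illustrative assumptions, not an example from the paper; the structure of the computation—exponentiate the cost, propagate a desirability function backwards, then tilt the passive dynamics by it—is the generic one:

```python
import numpy as np

# Hypothetical toy instance (not from the paper): a 5-state chain whose
# uncontrolled dynamics are a lazy random walk, with state 4 as the goal.
n, T = 5, 10
p = np.zeros((n, n))                     # uncontrolled transition model p(s'|s)
for s in range(n):
    for s2 in (s - 1, s, s + 1):
        if 0 <= s2 < n:
            p[s, s2] = 1.0
    p[s] /= p[s].sum()

cost = np.array([abs(s - 4) for s in range(n)], dtype=float)  # state cost c(s)

# Backward recursion for the desirability z_t(s) = exp(-J_t(s)):
#   z_t = exp(-c) * (p @ z_{t+1}),  with  z_T = exp(-c)
z = np.exp(-cost)
for _ in range(T):
    z = np.exp(-cost) * (p @ z)

# Optimal controlled transition: u*(s'|s) proportional to p(s'|s) * z(s')
u = p * z[None, :]
u /= u.sum(axis=1, keepdims=True)

print(u[0])   # from state 0 the controlled dynamics drift toward the goal
```

Note that the recursion is linear in $z$, which is exactly what makes the control problem equivalent to computing marginals in a chain‑structured graphical model.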
A key contribution of the paper is the demonstration that the well‑known path‑integral control method is a special case of the KL‑control framework. In continuous‑time, continuous‑state settings, exponentiated costs lead to a path‑integral formulation where expectations are estimated by sampling. The authors prove that discretizing time and state yields exactly the same KL‑minimization problem, thereby unifying the two approaches.
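The sampling view can be illustrated with a minimal Monte Carlo sketch: estimate the optimal initial control of a 1‑D diffusion by reweighting uncontrolled rollouts with their exponentiated path costs. The dynamics, cost, and all parameter values here are invented for illustration and are not the paper's example:

```python
import numpy as np

# Path-integral sketch for a 1-D diffusion dx = u dt + sigma dW with
# running cost 0.5 * x^2.  Horizon, noise level, and start state are
# made-up illustration values.
rng = np.random.default_rng(0)
dt, steps, sigma, x0, N = 0.01, 100, 1.0, 2.0, 5000
lam = sigma ** 2                     # KL control ties the temperature to the noise

noise = sigma * rng.normal(0.0, np.sqrt(dt), size=(N, steps))
x = x0 + np.cumsum(noise, axis=1)    # uncontrolled (u = 0) rollouts
S = 0.5 * np.sum(x ** 2, axis=1) * dt  # path cost of each rollout

w = np.exp(-S / lam)                 # importance weights exp(-S / lambda)
w /= w.sum()

u0 = (w @ noise[:, 0]) / dt          # estimated optimal control at t = 0
print(u0)                            # negative: push the state toward zero
```

Since the start state sits at $x_0 = 2$, rollouts whose first noise increment points toward the origin accumulate less cost and receive larger weights, so the weighted average of the first increment recovers a control that drives the state toward zero.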
To validate the practical impact of the theory, the authors present two experimental domains. The first is a block‑stacking task in a two‑dimensional grid world. The robot must place blocks in a prescribed order, and the state space grows quadratically with the grid size. Exact dynamic programming becomes infeasible for modest grid dimensions due to memory and time constraints. By applying a mean‑field variational approximation and belief‑propagation style message passing on the factor graph, the authors obtain an approximate policy that matches the exact optimal policy with over 95 % agreement and achieves a 98 % success rate in stacking trials, all while running in near‑real‑time.
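The mean‑field idea used here can be sketched on a tiny coupled model: approximate a joint distribution by a fully factorized one and update each factor in turn against the expectation of the log‑potentials under the others. The two‑variable model, potentials, and coupling strength below are invented for illustration; the block‑stacking factor graph in the paper is much larger but uses the same coordinate‑ascent update:

```python
import numpy as np

# Mean-field sketch on a hypothetical coupled model (not the paper's task):
# p(x1, x2) proportional to f1(x1) * f2(x2) * exp(theta * [x1 == x2]).
theta = 1.5
f1 = np.array([0.4, 0.6])
f2 = np.array([0.7, 0.3])
pair = np.exp(theta * np.eye(2))          # coupling factor
L = np.log(pair)

joint = f1[:, None] * f2[None, :] * pair
joint /= joint.sum()                      # exact joint, for comparison

q1 = np.full(2, 0.5)                      # factorized approximation q1(x1) q2(x2)
q2 = np.full(2, 0.5)
for _ in range(50):                       # coordinate-ascent mean-field updates:
    q1 = f1 * np.exp(L @ q2)              #   log q1 = log f1 + E_q2[log pair] + const
    q1 /= q1.sum()
    q2 = f2 * np.exp(q1 @ L)              #   log q2 = log f2 + E_q1[log pair] + const
    q2 /= q2.sum()

print(q1, joint.sum(axis=1))              # mean-field marginal vs exact marginal
```

Each update only needs the other variable's current marginal, which is what makes the scheme cheap on large factor graphs where the exact joint is out of reach.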
The second experiment involves a cooperative multi‑agent game where several robots must simultaneously occupy a target cell while avoiding collisions. The joint state space scales exponentially with the number of agents, making exact solutions intractable. The authors decompose the global factor graph into local sub‑graphs for each robot and introduce shared factors to encode cooperation constraints. Distributed message passing across agents yields a coordinated policy that converges three times faster than conventional Q‑learning and attains a 92 % success rate in coordinated capture tasks.
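The coordination-by-message-passing idea can be sketched with two agents connected through one shared factor. The cell layout, costs, and reward/penalty values below are a hypothetical instance, not the paper's game; the point is that each agent only exchanges a small message through the shared factor rather than reasoning over the joint state space:

```python
import numpy as np

# Two agents, each choosing one of 3 cells; cell 2 is the shared target
# (hypothetical instance, not the paper's game).
phi1 = np.exp(-np.array([0.0, 0.5, 1.0]))   # agent 1: exp(-cost) per cell
phi2 = np.exp(-np.array([1.0, 0.5, 0.0]))   # agent 2: exp(-cost) per cell

shared = np.ones((3, 3))                    # shared cooperation factor
shared[2, 2] = np.exp(3.0)                  # both at target: large reward
shared[0, 0] = shared[1, 1] = np.exp(-3.0)  # collision off-target: penalty

# Sum-product messages through the shared factor (exact on this tree):
m12 = phi1 @ shared                         # message agent 1 -> agent 2
m21 = shared @ phi2                         # message agent 2 -> agent 1

b1 = phi1 * m21; b1 /= b1.sum()             # each agent's marginal belief
b2 = phi2 * m12; b2 /= b2.sum()

print(b1.argmax(), b2.argmax())             # both beliefs peak at the target cell
```

With more agents the shared factors form a loopy graph, and the same message updates are iterated until they stabilize, which is the distributed scheme the experiment relies on.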
Finally, the paper situates KL‑control among other inference‑based control paradigms. Expected‑value maximization methods treat the cost directly as an expectation without a probabilistic policy representation, whereas Bayesian reinforcement learning focuses on posterior inference over value functions. Information‑theoretic control methods regularize policies with entropy terms, but KL‑control’s explicit divergence minimization provides a more transparent way to encode task‑specific goals.
In summary, the authors deliver a powerful conceptual bridge between optimal control and probabilistic inference. By recasting control as KL‑divergence minimization, they unlock a suite of scalable approximate inference algorithms that can handle high‑dimensional, multi‑agent, and constrained control problems previously out of reach for exact methods. The work opens avenues for integrating deep learning‑based message networks, refined discretization schemes for continuous domains, and real‑time deployment on robotic platforms.