Approximate Inference and Stochastic Optimal Control
We propose a novel reformulation of the stochastic optimal control problem as an approximate inference problem, demonstrating that such an interpretation leads to new practical methods for the original problem. In particular, we characterise a novel class of iterative solutions to the stochastic optimal control problem based on a natural relaxation of the exact dual formulation. These theoretical insights are applied to the Reinforcement Learning problem, where they lead to new model-free, off-policy methods for discrete and continuous problems.
💡 Research Summary
The paper presents a novel perspective on stochastic optimal control (SOC) by casting it as an approximate inference problem. Starting from the classic formulation—minimizing the expected cumulative cost under system dynamics—the authors reinterpret the control task as a probabilistic graphical model. They define a target trajectory distribution proportional to the product of the dynamics and an exponential of the negative cost, and show that finding an optimal policy is equivalent to minimizing the Kullback‑Leibler (KL) divergence between this target distribution and the trajectory distribution induced by a candidate policy. This exact dual formulation mirrors the free‑energy objective common in variational inference and reveals a deep connection to the Expectation‑Maximization (EM) algorithm.
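The duality described above can be sketched in equations (the notation here is illustrative, chosen to match the summary rather than copied from the paper). With trajectory $\tau = (s_0, a_0, s_1, \dots)$, prior dynamics $p(\tau)$, and cumulative cost $C(\tau) = \sum_t c(s_t, a_t)$, the target distribution is

```latex
p^*(\tau) \;=\; \frac{1}{Z}\, p(\tau)\, \exp\!\big(-C(\tau)\big),
\qquad Z = \int p(\tau)\, e^{-C(\tau)}\, d\tau .
```

Minimising the KL divergence from the distribution $q_\pi(\tau)$ induced by a candidate policy $\pi$ then recovers an entropy-regularised control objective, which is the free-energy form mentioned in the summary:

```latex
\mathrm{KL}\big(q_\pi \,\|\, p^*\big)
= \mathbb{E}_{q_\pi}\!\big[C(\tau)\big]
+ \mathrm{KL}\big(q_\pi \,\|\, p\big)
+ \log Z .
```

Since $\log Z$ does not depend on $\pi$, minimising the KL is equivalent to minimising expected cost plus a KL penalty to the prior dynamics.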
Because directly computing the posterior over trajectories is intractable for high‑dimensional continuous systems, the authors propose a natural relaxation. They replace the exact posterior with a parameterized policy πθ and approximate the KL term using importance‑weighted Monte‑Carlo samples. The resulting iterative scheme, named Iterative Inference Control (IIC), alternates between (1) sampling trajectories under the current policy, weighting each by exp(−∑c(s,a)), and (2) updating the policy parameters via a gradient ascent step on the weighted log‑likelihood. This procedure can be viewed as a variational EM where the E‑step is approximated by importance sampling and the M‑step corresponds to a policy update. The authors prove that, under mild regularity conditions and provided the policy class is expressive enough, the sequence of policies converges to a local optimum of the original SOC problem.
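The alternation between importance-weighted sampling and a weighted log-likelihood update can be illustrated on a toy problem. The sketch below is a minimal, assumption-laden stand-in for the scheme the summary describes (a one-step problem with discrete actions and a softmax policy, not the paper's actual algorithm): the E-step samples actions under the current policy and weights them by exp(−cost); the M-step takes a gradient step on the weighted log-likelihood.

```python
import numpy as np

# Toy sketch of the importance-weighted variational-EM scheme described
# above (illustrative setup, not taken from the paper): one-step problem,
# discrete actions, softmax policy pi_theta.

rng = np.random.default_rng(0)
n_actions = 4
costs = np.array([3.0, 1.0, 0.2, 2.0])   # c(a); action 2 is cheapest

theta = np.zeros(n_actions)               # softmax policy parameters

def policy(theta):
    z = np.exp(theta - theta.max())       # numerically stable softmax
    return z / z.sum()

for it in range(200):
    p = policy(theta)
    # E-step: sample under the current policy, weight by exp(-cost),
    # then self-normalise the importance weights.
    a = rng.choice(n_actions, size=256, p=p)
    w = np.exp(-costs[a])
    w /= w.sum()
    # M-step: gradient ascent on the weighted log-likelihood.
    # For a softmax policy, grad log pi(a) = one_hot(a) - pi.
    grad = np.zeros(n_actions)
    np.add.at(grad, a, w)                 # sum_i w_i * one_hot(a_i)
    grad -= p                             # minus pi (weights sum to 1)
    theta += 0.5 * grad

# The policy mass concentrates on the lowest-cost action.
```

Each iteration multiplies the policy by another factor of exp(−c), so the updates progressively concentrate probability on low-cost actions, mirroring the exploration/exploitation balance the summary attributes to the entropy-like weighting.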
A key advantage of IIC is that it is model‑free and off‑policy. The importance weights allow reuse of data generated by any behavior policy, which dramatically improves sample efficiency in continuous control settings. To validate the approach, the authors conduct experiments on both discrete (GridWorld) and continuous (Pendulum, Hopper, Walker2d in MuJoCo) benchmarks. In the discrete domain, IIC matches the convergence speed of classic policy iteration while automatically balancing exploration and exploitation through its entropy‑like weighting. In the continuous domain, IIC outperforms or matches state‑of‑the‑art model‑free algorithms such as DDPG, SAC, and PPO, especially when leveraging off‑policy logs: it achieves comparable final returns with roughly 30 % fewer environment interactions.
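The off-policy reuse rests on standard importance-sampling ratios: data logged under a behaviour policy b can be reweighted by π/b to estimate expectations under the target policy π. A minimal sketch, with an invented one-step setup (the policies, costs, and sample size here are illustrative, not from the paper's experiments):

```python
import numpy as np

# Sketch of off-policy cost estimation via importance weights
# (illustrative one-step example, not the paper's estimator).

rng = np.random.default_rng(1)
n_actions = 3
pi = np.array([0.1, 0.2, 0.7])      # target policy we want to evaluate
b = np.array([1/3, 1/3, 1/3])       # behaviour policy that generated the log
costs = np.array([1.0, 2.0, 0.5])   # c(a)

a = rng.choice(n_actions, size=100_000, p=b)   # logged actions
w = pi[a] / b[a]                                # importance ratios pi/b
est = np.mean(w * costs[a])                     # off-policy estimate
true = float(pi @ costs)                        # ground-truth expected cost
```

Because the weights correct for the mismatch between b and π, `est` converges to `true` as the log grows, which is what lets the method reuse trajectories from any behaviour policy.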
The paper also situates IIC within the broader “Maximum Entropy Reinforcement Learning” literature. Both frameworks maximize a cost‑weighted entropy, but IIC derives its update rule from a variational inference standpoint, which inherently adapts the entropy regularization coefficient rather than treating it as a fixed hyper‑parameter. This reduces the need for extensive hyper‑parameter tuning.
Finally, the authors discuss limitations and future directions. The relaxation assumes the policy family can represent the optimal solution; extending the method to high‑dimensional observation spaces (e.g., raw images) or multi‑objective settings remains open. They suggest integrating deep Bayesian networks to form a “Deep Variational Control” architecture, and they highlight the need for tighter theoretical bounds on convergence rates and sample complexity, as well as real‑time implementations on physical robots.
In summary, the work bridges stochastic optimal control and approximate Bayesian inference, introduces a practical iterative algorithm that is both model‑free and off‑policy, and demonstrates competitive performance on standard reinforcement learning benchmarks, thereby opening new avenues for inference‑driven control methods.