Efficient Inference in Markov Control Problems

Markov control algorithms that perform smooth, non-greedy updates of the policy have been shown to be very general and versatile, with policy gradient and Expectation Maximisation algorithms being particularly popular. For these algorithms, marginal inference of the reward weighted trajectory distribution is required to perform policy updates. We discuss a new exact inference algorithm for these marginals in the finite horizon case that is more efficient than the standard approach based on classical forward-backward recursions. We also provide a principled extension to infinite horizon Markov Decision Problems that explicitly accounts for an infinite horizon. This extension provides a novel algorithm for both policy gradients and Expectation Maximisation in infinite horizon problems.


💡 Research Summary

The paper addresses a fundamental computational bottleneck in modern non‑greedy policy‑update methods for Markov Decision Processes (MDPs), namely the need to compute the marginal distribution of the reward‑weighted trajectory under the current policy. Traditional solutions rely on the forward‑backward (FB) algorithm, which, while exact, incurs a quadratic cost in the planning horizon (O(H²·|S|·|A|)) and a comparable memory footprint. This makes it impractical for long‑horizon or infinite‑horizon problems, which are common in reinforcement learning and control.

The authors propose a novel exact inference scheme that reduces the complexity to linear in the horizon (O(H·|S|·|A|)) for finite‑horizon problems and extends naturally to the infinite‑horizon case. The key insight is that the reward‑weighted trajectory distribution can be expressed as a product of a forward message (the usual state‑distribution under the policy) and a specially constructed backward message that already incorporates the cumulative reward weighting. By computing the backward message once, propagating it recursively, and then pairing it with the forward message, the marginal for each time step is obtained without the double‑loop of standard FB.

Formally, the backward message is defined as
βₜ(sₜ,aₜ) = Σ_{sₜ₊₁,aₜ₊₁} P(sₜ₊₁|sₜ,aₜ) π(aₜ₊₁|sₜ₊₁) R(sₜ₊₁,aₜ₊₁) βₜ₊₁(sₜ₊₁,aₜ₊₁)
with β_H(s_H,a_H)=1. The forward message αₜ(sₜ) follows the usual recursion based on the transition model and the current policy. The marginal is then proportional to αₜ(sₜ) π(aₜ|sₜ) βₜ(sₜ,aₜ). Because βₜ already aggregates all future reward contributions, there is no need for a second pass over the horizon for each time step, yielding the linear‑time algorithm.
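As a concrete illustration, the single backward sweep, the forward recursion, and their pairing can be sketched in NumPy on a small tabular MDP. Everything below (sizes, the randomly generated model, variable names) is illustrative, not from the paper:

```python
import numpy as np

# Hypothetical tabular MDP; all quantities here are illustrative.
rng = np.random.default_rng(0)
S, A, H = 4, 3, 10                                          # states, actions, horizon

P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)    # P[s, a, s'] transition model
pi = rng.random((S, A));   pi /= pi.sum(-1, keepdims=True)  # pi[s, a] stochastic policy
R = rng.random((S, A))                                      # rewards in (0, 1), used as weights
p0 = np.full(S, 1.0 / S)                                    # initial state distribution

# Backward pass, computed once:
#   beta_t(s,a) = sum_{s',a'} P(s'|s,a) pi(a'|s') R(s',a') beta_{t+1}(s',a'),
# with beta_H = 1.
beta = np.ones((H + 1, S, A))
for t in range(H - 1, -1, -1):
    future = (pi * R * beta[t + 1]).sum(-1)        # aggregate over a', leaves (S',)
    beta[t] = np.einsum('kas,s->ka', P, future)

# Forward pass: alpha_t(s) is the state marginal under the current policy.
alpha = np.zeros((H + 1, S))
alpha[0] = p0
for t in range(H):
    alpha[t + 1] = np.einsum('s,sa,sak->k', alpha[t], pi, P)

# Pairing: marginal at each step proportional to alpha_t(s) pi(a|s) beta_t(s,a).
marg = alpha[:H, :, None] * pi[None] * beta[:H]
marg /= marg.sum(axis=(1, 2), keepdims=True)
```

Note that each pass is a single loop over the horizon, so the total cost grows linearly in H rather than the quadratic cost of re-running a backward pass for every time step.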

For infinite‑horizon problems, the authors introduce a fixed‑point operator 𝒯 acting on the backward message: β = 𝒯(β). Under standard discounting (γ<1) the operator 𝒯 is a contraction, so repeated application converges to the unique fixed point β* that captures the infinite sum of discounted rewards. This eliminates the need for ad‑hoc truncation or “trunk‑tail” approximations and provides an exact marginal even when the horizon is unbounded.
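The fixed-point computation can be sketched as plain iteration of 𝒯. The summary does not spell out the exact form of the operator, so the sketch below assumes the standard affine form that accumulates discounted rewards; the model, sizes, and names are again illustrative:

```python
import numpy as np

# Assumed operator (not spelled out in the summary):
#   (T beta)(s,a) = sum_{s',a'} P(s'|s,a) pi(a'|s') [ R(s',a') + gamma * beta(s',a') ]
rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.95

P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)    # transition model
pi = rng.random((S, A));   pi /= pi.sum(-1, keepdims=True)  # stochastic policy
R = rng.random((S, A))                                      # reward model

def T(beta):
    inner = (pi * (R + gamma * beta)).sum(-1)   # aggregate over a', leaves (S',)
    return np.einsum('kas,s->ka', P, inner)

# gamma < 1 makes T a gamma-contraction in the sup norm, so plain
# fixed-point iteration converges to the unique beta*.
beta = np.zeros((S, A))
for _ in range(2000):
    new = T(beta)
    if np.abs(new - beta).max() < 1e-12:
        beta = new
        break
    beta = new
```

At convergence β* = 𝒯(β*) holds to numerical precision, and β* plays the role of the infinite-horizon backward message, with no truncation of the horizon required.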

The paper demonstrates how the new inference routine can be plugged directly into two widely used policy‑learning frameworks:

  1. Policy Gradient – The gradient of the expected return J(θ) = E_{p}