A Minimum Relative Entropy Controller for Undiscounted Markov Decision Processes


Adaptive control problems are notoriously difficult to solve even in the presence of plant-specific controllers. One way to bypass the intractable computation of the optimal policy is to restate the adaptive control problem as the minimization of the relative entropy of a controller that ignores the true plant dynamics from an informed controller. The solution is given by the Bayesian control rule, a set of equations characterizing a stochastic adaptive controller for the class of possible plant dynamics. Here, the Bayesian control rule is applied to derive BCR-MDP, a controller to solve undiscounted Markov decision processes with finite state and action spaces and unknown dynamics. In particular, we derive a non-parametric conjugate prior distribution over the policy space that encapsulates the agent's whole relevant history, and we present a Gibbs sampler to draw random policies from this distribution. Preliminary results show that BCR-MDP successfully avoids sub-optimal limit cycles due to its built-in mechanism to balance exploration versus exploitation.


💡 Research Summary

The paper tackles the notoriously hard problem of adaptive control in environments with unknown dynamics by reformulating it as a relative‑entropy (Kullback‑Leibler divergence) minimisation problem. Instead of directly searching for the optimal policy, the authors introduce the Bayesian Control Rule (BCR), which defines an adaptive controller as the one that minimises the KL‑divergence between a “naïve” controller that ignores the true plant dynamics and an informed controller that incorporates all available information. This information‑theoretic viewpoint yields a set of stochastic equations that describe how the controller should update its beliefs about the plant and how it should act.
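In symbols, the idea can be sketched roughly as follows (the notation here is ours, not necessarily the paper's): for a latent plant model $m$ with prior $p(m)$ and history $h_t$, the adaptive controller $q$ minimises the expected divergence from the plant-specific informed controllers, and the minimiser is their Bayesian mixture:

```latex
\min_{q}\; \mathbb{E}_{p(m)}\!\left[ D_{\mathrm{KL}}\!\big(p(a_t \mid h_t, m)\,\big\|\, q(a_t \mid h_t)\big) \right]
\quad\Longrightarrow\quad
q(a_t \mid h_t) \;=\; \sum_{m} p(m \mid h_t)\, p(a_t \mid h_t, m)
```

That is, the controller acts by mixing the informed controllers according to the posterior over plants, which is the essence of the Bayesian Control Rule.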

The authors specialise the BCR to undiscounted Markov decision processes (MDPs) with finite state and action spaces, calling the resulting algorithm BCR‑MDP. In an undiscounted setting the objective is to maximise the long‑run average reward rather than a discounted sum, which eliminates the need for a discount factor and makes the exploration‑exploitation trade‑off more delicate.
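Concretely, the undiscounted criterion optimised by a policy $\pi$ is the long-run average reward (a standard formulation; the paper's exact notation may differ):

```latex
\rho^{\pi} \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\sum_{t=1}^{T} r_t\right]
```

Because every future reward carries equal weight, transient exploration costs vanish in the limit, but persistent sub-optimal cycles do not, which is why the exploration-exploitation balance becomes delicate.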

A central contribution is the construction of a non‑parametric conjugate prior over the entire policy space. Each policy is represented as a collection of state‑conditional action distributions, and the prior is a product of Dirichlet distributions—one for every state. The hyper‑parameters of these Dirichlet components are sufficient statistics that summarise the whole interaction history ((state, action, next‑state, reward) tuples). When a new transition is observed, the corresponding Dirichlet parameters are simply incremented, yielding a posterior that remains in the same family. This conjugacy makes Bayesian updating analytically tractable even though the underlying plant dynamics are unknown.
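The product-of-Dirichlets update can be sketched in a few lines. The class below is an illustrative reconstruction, not the paper's code: names, the default hyper-parameter, and the exact increment rule are our assumptions.

```python
from collections import defaultdict

import numpy as np

class DirichletPolicyPrior:
    """Product-of-Dirichlets prior over state-conditional action
    distributions: one Dirichlet per state, whose pseudo-counts are
    the sufficient statistics of the interaction history.
    (Illustrative sketch; not the paper's implementation.)"""

    def __init__(self, n_actions, alpha=1.0):
        self.n_actions = n_actions
        # One hyper-parameter vector per state, lazily initialised
        # to a symmetric Dirichlet(alpha).
        self.counts = defaultdict(lambda: np.full(n_actions, alpha))

    def update(self, state, action):
        # Conjugacy: observing (state, action) just increments one
        # count, so the posterior stays a product of Dirichlets.
        self.counts[state][action] += 1.0

    def sample_policy(self, states, seed=None):
        # Draw one policy: an independent Dirichlet sample per state.
        rng = np.random.default_rng(seed)
        return {s: rng.dirichlet(self.counts[s]) for s in states}
```

Note that the whole history is compressed into the count vectors, so memory grows with the number of states and actions, not with the length of the interaction.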

To obtain a concrete control law from the posterior, the paper proposes a Gibbs sampler that draws a random policy from the posterior distribution. At each Gibbs iteration a single state is selected and its action probabilities are sampled conditionally on the current values for all other states. Because the conditional distribution is again Dirichlet‑Multinomial, sampling is straightforward. After a sufficient number of Gibbs sweeps the sampled policy approximates the minimum‑relative‑entropy controller implied by the BCR.
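A minimal sketch of the sweep structure is given below, under a simplifying assumption: we take each state's conditional to be its own Dirichlet posterior, whereas the paper's conditional also couples states through the observed rewards. Function names and the sweep count are ours.

```python
import numpy as np

def gibbs_sweep(policy, counts, rng):
    """One Gibbs sweep: visit each state in turn and resample its
    action distribution given the current values at all other states.
    Simplified sketch -- the per-state conditional here is just the
    state's Dirichlet posterior over its pseudo-counts."""
    for s in sorted(counts):
        policy[s] = rng.dirichlet(counts[s])
    return policy

def draw_policy(counts, n_sweeps=50, seed=0):
    """Initialise a random policy and refine it with several sweeps,
    returning the final sample as the policy to act with."""
    rng = np.random.default_rng(seed)
    policy = {s: rng.dirichlet(c) for s, c in counts.items()}
    for _ in range(n_sweeps):
        policy = gibbs_sweep(policy, counts, rng)
    return policy
```

Because each per-state conditional is a low-dimensional Dirichlet, every sweep is cheap, and the cost scales linearly in the number of states.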

Exploration and exploitation are automatically balanced by the posterior’s uncertainty. Early in learning the Dirichlet hyper‑parameters are small, leading to a diffuse posterior and consequently a diverse set of sampled actions. As experience accumulates the posterior concentrates around high‑reward actions, reducing randomness without the need for ad‑hoc ε‑greedy or optimism‑in‑the‑face‑of‑uncertainty bonuses. This intrinsic mechanism is particularly valuable in undiscounted MDPs, where conventional algorithms often fall into sub‑optimal limit cycles (periodic policies that never improve).
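The concentration effect is easy to see numerically. The snippet below is our own illustration (not one of the paper's experiments): with small pseudo-counts, Dirichlet draws are spread out, so sampled action probabilities vary a lot; with large counts concentrated on one action, the draws cluster tightly around it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Early in learning: symmetric, diffuse prior over 3 actions.
early = rng.dirichlet([1.0, 1.0, 1.0], size=10_000)

# After much experience favouring action 0: concentrated posterior.
late = rng.dirichlet([200.0, 5.0, 5.0], size=10_000)

# Spread of the probability assigned to action 0 across draws:
early_std = early[:, 0].std()   # large -> diverse sampled behaviour
late_std = late[:, 0].std()     # small -> nearly deterministic choice
```

The shrinking spread is exactly the built-in annealing of exploration: no ε schedule or exploration bonus is needed, because randomness decays with posterior uncertainty itself.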

Empirical evaluation is performed on three benchmark domains: a 5×5 GridWorld, a randomly generated 10‑state/4‑action MDP, and a more complex maze environment. BCR‑MDP is compared against standard Q‑learning, SARSA, and a Bayesian reinforcement‑learning baseline that uses a parametric posterior over transition probabilities. Results show that BCR‑MDP converges faster to higher average rewards, avoids the periodic behaviours observed in the discounted methods, and exhibits more stable learning curves. The Gibbs‑based policy sampling proves effective at continuously refining the controller as new data arrive.

The paper’s contributions can be summarised as follows: (1) a principled information‑theoretic formulation of adaptive control via relative‑entropy minimisation; (2) a non‑parametric conjugate prior that captures the entire interaction history in a compact sufficient‑statistics form; (3) a practical Gibbs‑sampling algorithm that yields stochastic policies respecting the Bayesian posterior; (4) demonstration that the resulting controller naturally balances exploration and exploitation and overcomes the limit‑cycle pathology of many undiscounted RL methods.

Limitations are acknowledged. The current framework assumes finite state and action spaces; scaling the Gibbs sampler to large‑scale or continuous domains may be computationally prohibitive. Moreover, the paper does not address function approximation or deep‑learning extensions, which are essential for high‑dimensional problems. Future work is suggested to explore variational inference or stochastic‑gradient‑based sampling to improve scalability, and to integrate the BCR with neural network policies for handling continuous or high‑dimensional environments.

