Bayesian learning of noisy Markov decision processes
We consider the inverse reinforcement learning problem, that is, the problem of learning from, and then predicting or mimicking a controller based on state/action data. We propose a statistical model for such data, derived from the structure of a Markov decision process. Adopting a Bayesian approach to inference, we show how latent variables of the model can be estimated, and how predictions about actions can be made, in a unified framework. A new Markov chain Monte Carlo (MCMC) sampler is devised for simulation from the posterior distribution. This step includes a parameter expansion step, which is shown to be essential for good convergence properties of the MCMC sampler. As an illustration, the method is applied to learning a human controller.
💡 Research Summary
The paper tackles the inverse reinforcement learning (IRL) problem—learning a controller’s underlying objectives from observed state‑action trajectories—by embedding it within a Bayesian formulation of a Markov decision process (MDP). Traditional IRL methods often assume deterministic optimality or treat noise as an after‑thought, which limits their applicability to real‑world data that inevitably contain observation errors, sub‑optimal human behavior, and environmental stochasticity. To address this, the authors construct a probabilistic model that mirrors the generative structure of an MDP: a known transition kernel (P(s’|s,a)) and a stochastic policy (\pi(a|s;\beta,R)) defined through a soft‑max function over a latent reward function (R(s,a)) and a temperature‑like parameter (\beta). The parameter (\beta) governs the level of randomness in action selection; as (\beta\to\infty) the policy collapses to a deterministic optimal policy, while (\beta\to0) yields near‑uniform random actions, thereby explicitly modeling the “noise” in observed behavior.
Within the Bayesian framework, priors are placed on both the reward function and (\beta). The reward is typically given a Gaussian or Laplacian prior to encourage smoothness across neighboring state‑action pairs, whereas (\beta) receives a Gamma prior to enforce positivity and allow flexibility in the degree of stochasticity. Given a dataset (\mathcal{D}={(s_t,a_t)}_{t=1}^T), the posterior distribution (p(R,\beta|\mathcal{D})\propto p(\mathcal{D}|R,\beta)p(R)p(\beta)) is analytically intractable, prompting the use of Markov chain Monte Carlo (MCMC) for inference.
A central technical contribution is the design of a parameter‑expanded MCMC (PX‑MCMC) sampler. Standard Gibbs or Metropolis‑Hastings updates suffer from strong posterior coupling between the high‑dimensional reward vector and the scalar (\beta), leading to slow mixing and high autocorrelation. The authors introduce auxiliary variables and a deterministic transformation that expands the parameter space, effectively decorrelating (R) and (\beta) during sampling. In practice, the algorithm alternates between (i) a Gibbs step that samples the reward conditioned on the current (\beta) and auxiliary variables, and (ii) a Metropolis‑Hastings step that jointly proposes new values for (\beta) and the auxiliary variables. This expansion dramatically reduces the integrated autocorrelation time—empirically by a factor of five or more—allowing the chain to explore the posterior efficiently with comparable computational effort.
The methodology is validated on both synthetic benchmarks and a real‑world human‑pilot dataset. In synthetic experiments, where the true reward and (\beta) are known, the PX‑MCMC accurately recovers the ground‑truth parameters and outperforms naïve MCMC in convergence speed. For the human‑pilot study, participants controlled a flight simulator; their state‑action logs were fed into the model. The Bayesian approach successfully reproduced the observed actions with high posterior probability, and the inferred reward landscape highlighted intuitive objectives such as altitude maintenance and fuel efficiency, aligning with the pilots’ actual goals. Predictive performance on held‑out trajectories reached over 85 % accuracy, and the log‑likelihood of the Bayesian model exceeded that of maximum‑likelihood IRL baselines, demonstrating both better fit and better generalisation.
In summary, the paper makes three key contributions: (1) a principled Bayesian model that treats noisy observed behavior as a natural consequence of a stochastic MDP policy, enabling simultaneous estimation of rewards and decision noise; (2) a novel parameter‑expansion MCMC scheme that overcomes the mixing difficulties inherent in high‑dimensional latent‑variable inference, providing robust and scalable posterior sampling; and (3) an empirical demonstration that the approach yields interpretable reward functions and superior predictive power on realistic human‑control data. These advances broaden the applicability of IRL to domains such as robotics, human‑machine interaction, and autonomous systems, where understanding the hidden objectives behind noisy actions is essential.
Comments & Academic Discussion
Loading comments...
Leave a Comment