Gaussian-Mixture-Model Q-Functions for Policy Iteration in Reinforcement Learning
Departing from their conventional use in reinforcement learning (RL) as estimators of probability density functions, Gaussian mixture models (GMMs) are introduced in this paper in a novel function-approximation role: as direct surrogates for Q-functions within Bellman-residual losses. These parametric models, termed GMM-QFs, possess substantial representational capacity, as they are shown to be universal approximators over a broad class of functions. Their learnable parameters (a fixed number of mixing weights, together with Gaussian mean vectors and covariance matrices) are inferred from data via optimization on a Riemannian manifold. This geometric perspective on the parameter space naturally incorporates Riemannian optimization into the policy-evaluation step of standard policy-iteration frameworks. Rigorous theoretical results are established, and supporting numerical tests show that, even without access to experience data, GMM-QFs deliver competitive performance and, in some cases, outperform state-of-the-art approaches across a range of benchmark RL tasks, all while maintaining a significantly smaller computational footprint than deep-learning methods that rely on experience data.
💡 Research Summary
This paper proposes a novel use of Gaussian mixture models (GMMs) as direct function approximators for Q‑functions in reinforcement learning, rather than as density estimators. The authors define a GMM‑QF as a weighted sum of Gaussian components, each with its own mean and covariance, and prove that such models are universal approximators for a broad class of bounded continuous functions. By embedding the GMM‑QF into a Bellman‑residual loss, they formulate the policy‑evaluation step of a standard policy‑iteration (PI) algorithm as an optimization problem over a Riemannian manifold. The manifold is the Cartesian product of a probability simplex (for the mixing weights) and the space of positive‑definite covariance matrices, together with Euclidean spaces for the means. Using Riemannian gradient descent or trust‑region methods, the parameters are updated directly on this manifold, guaranteeing that the constraints (e.g., weights sum to one, covariances remain positive‑definite) are respected throughout training.
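To make the GMM-QF concrete: a state-action pair is evaluated by summing weighted Gaussian bells over the joint state-action vector. The NumPy sketch below is a hypothetical illustration (all names are ours, not the authors'), assuming mixing weights on the probability simplex and symmetric positive-definite covariances:

```python
import numpy as np

def gmm_qf(x, weights, means, covs):
    """Evaluate a GMM-QF at the joint state-action vector x.

    Q(x) = sum_k w_k * exp(-0.5 * (x - mu_k)^T Sigma_k^{-1} (x - mu_k)),
    i.e. a weighted sum of Gaussian bells used as a function
    approximator, not as a normalized probability density.
    """
    q = 0.0
    for w, mu, Sigma in zip(weights, means, covs):
        diff = x - mu
        q += w * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))
    return q

# Example with K = 2 components over a 2-D state-action space.
weights = np.array([0.6, 0.4])           # point on the probability simplex
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), 2.0 * np.eye(2)]      # symmetric positive-definite
q_val = gmm_qf(np.array([0.5, 0.5]), weights, means, covs)
```

During training, the Riemannian machinery described above keeps `weights` on the simplex and each matrix in `covs` positive-definite, so this evaluation remains well defined at every iterate.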
The overall PI scheme proceeds as follows: starting from an initial stationary policy, the algorithm collects on‑policy trajectories, computes the empirical Bellman residual, and solves the resulting Riemannian optimization to obtain a new GMM‑QF that approximates the fixed point of the Bellman operator for the current policy. A greedy improvement step then updates the policy by selecting actions that minimize the newly estimated Q‑function. This loop repeats until convergence. Notably, the method does not rely on a replay buffer or off‑policy experience data; each iteration uses only freshly generated on‑policy samples, which simplifies implementation and reduces memory requirements.
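The evaluate-improve loop above can be illustrated with a tabular policy-iteration analogue on a toy two-state MDP (a hypothetical sketch in the standard reward-maximizing convention; the paper's method replaces the exact linear-system evaluation below with the Riemannian GMM-QF fit on on-policy samples):

```python
import numpy as np

# Toy 2-state, 2-action MDP: P[s, a, s'] transition probs, R[s, a] rewards.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0   # action 0 stays in state 0
P[0, 1, 1] = 1.0   # action 1 moves to state 1
P[1, 0, 0] = 1.0
P[1, 1, 1] = 1.0
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
gamma = 0.9

pi = np.zeros(2, dtype=int)             # initial stationary policy
for _ in range(50):
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
    P_pi = P[np.arange(2), pi]          # (2, 2) transition matrix under pi
    R_pi = R[np.arange(2), pi]
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Greedy improvement on Q(s, a) = R(s, a) + gamma * E[V(s')].
    Q = R + gamma * P @ V
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):
        break                           # policy is stable: done
    pi = new_pi
```

Here the loop stabilizes at the policy that always takes action 1, which collects the recurring reward of 2 in state 1.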
Theoretical contributions include: (1) a universal approximation theorem for GMM‑QFs, (2) a proof that the Bellman operator remains a contraction under the proposed parameterization, and (3) convergence guarantees for the Riemannian optimization sub‑problem, which together ensure that the PI algorithm converges to the optimal policy under standard assumptions (discount factor < 1, bounded rewards). The authors also discuss computational complexity, showing that the number of learnable parameters is fixed (mixing weights, means, covariances) and independent of the amount of data, in contrast to kernel‑based non‑parametric approaches whose model size grows with the dataset.
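The empirical Bellman residual driving the policy-evaluation sub-problem is the mean squared gap between Q(s, a) and the one-step bootstrap target r + γ·Q(s′, π(s′)). A minimal sketch, with hypothetical names (the paper minimizes this quantity over the GMM-QF parameters on the manifold):

```python
import numpy as np

def bellman_residual(q_fn, transitions, policy, gamma=0.9):
    """Mean squared empirical Bellman residual over sampled transitions.

    transitions: iterable of (s, a, r, s_next) tuples.
    q_fn(s, a):  the current Q-function approximation (e.g. a GMM-QF).
    policy(s):   the stationary policy being evaluated (on-policy samples).
    """
    errs = [q_fn(s, a) - (r + gamma * q_fn(s2, policy(s2)))
            for (s, a, r, s2) in transitions]
    return float(np.mean(np.square(errs)))

# Sanity check with a constant Q and constant rewards: the residual
# reduces to (c - r - gamma * c)^2 for every transition.
const_q = lambda s, a: 1.0
data = [(0, 0, 1.0, 0), (1, 1, 1.0, 0)]
loss = bellman_residual(const_q, data, policy=lambda s: 0, gamma=0.9)
```

With c = 1, r = 1, and γ = 0.9, each per-transition error is 1 − 1 − 0.9 = −0.9, so the loss is 0.81.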
Empirical evaluation is performed on several continuous‑control benchmarks from OpenAI Gym and MuJoCo (e.g., CartPole‑continuous, Pendulum, MountainCarContinuous, HalfCheetah). GMM‑QFs with a modest number of components (K = 5–10) achieve performance comparable to or better than deep Q‑networks (DQN, Double‑DQN), DDPG, and TD3, while using orders of magnitude fewer parameters (hundreds versus hundreds of thousands). The experiments also demonstrate faster convergence in terms of wall‑clock time and reduced sensitivity to initialization compared with expectation‑maximization based GMM training. Additional ablations show that diagonal or low‑rank covariance approximations can further reduce computational load without substantial loss of performance.
The paper acknowledges limitations: the number of covariance parameters scales quadratically with the state‑action dimension, which may become burdensome in very high‑dimensional tasks; extensions to off‑policy or batch reinforcement learning are not addressed; and the Riemannian optimization step requires careful step‑size selection. Nonetheless, the work offers a compelling alternative to both kernel‑based and deep‑learning Q‑function approximators, combining the expressive power of mixture models with the geometric rigor of manifold optimization. It opens avenues for lightweight, online‑learning RL agents suitable for resource‑constrained platforms, and suggests future research directions such as low‑rank covariance modeling, integration with model‑based RL, and multi‑agent extensions.
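The quadratic scaling noted above is easy to quantify: with K components over a d-dimensional state-action space, each full covariance contributes d(d+1)/2 free parameters, versus d for a diagonal one. A quick hypothetical tally:

```python
def gmm_qf_param_count(K, d, covariance="full"):
    """Learnable parameters of a K-component GMM-QF in dimension d:
    K mixing weights + K mean vectors + K covariance blocks."""
    per_cov = d * (d + 1) // 2 if covariance == "full" else d
    return K * (1 + d + per_cov)

full_n = gmm_qf_param_count(K=10, d=20, covariance="full")
diag_n = gmm_qf_param_count(K=10, d=20, covariance="diag")
```

For K = 10 and d = 20 this gives 2310 parameters with full covariances versus 410 with diagonal ones, which is the saving behind the diagonal and low-rank ablations reported above.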