Model-free policy gradient for discrete-time mean-field control
We study model-free policy learning for discrete-time mean-field control (MFC) problems with finite state space and compact action space. In contrast to the extensive literature on value-based methods for MFC, policy-based approaches remain largely unexplored due to the intrinsic dependence of transition kernels and rewards on the evolving population state distribution, which prevents the direct use of likelihood-ratio estimators of policy gradients from classical single-agent reinforcement learning. We introduce a novel perturbation scheme on the state-distribution flow and prove that the gradient of the resulting perturbed value function converges to the true policy gradient as the perturbation magnitude vanishes. This construction yields a fully model-free estimator based solely on simulated trajectories and an auxiliary estimate of the sensitivity of the state distribution. Building on this framework, we develop MF-REINFORCE, a model-free policy gradient algorithm for MFC, and establish explicit quantitative bounds on its bias and mean-squared error. Numerical experiments on representative mean-field control tasks demonstrate the effectiveness of the proposed approach.
💡 Research Summary
The paper tackles model‑free reinforcement learning for discrete‑time mean‑field control (MFC) with a finite state space and compact action space. In MFC the transition kernel and reward functions depend on the evolving population distribution, which makes the classic likelihood‑ratio policy‑gradient estimator inapplicable. The authors first derive an exact policy‑gradient expression (Proposition 2.1) that decomposes into three terms: the standard REINFORCE term (RF), a measure‑derivative term from the reward (MD), and a mean‑field derivative term from the transition kernel and policy (MFD). Because MD and MFD involve derivatives with respect to the population distribution, they cannot be estimated directly without a model.
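The decomposition can be written schematically as follows. The notation here is illustrative only (s_t the representative state, a_t the action, μ_t the population distribution, G_t the return); the exact expressions are those of the paper's Proposition 2.1:

```latex
\nabla_\theta V(\theta)
  \;=\; \underbrace{\mathbb{E}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t,\mu_t)\, G_t\Big]}_{\text{RF}}
  \;+\; \underbrace{\mathbb{E}\Big[\sum_t \partial_\mu r(s_t,a_t,\mu_t)\cdot \nabla_\theta \mu_t\Big]}_{\text{MD}}
  \;+\; \underbrace{\big(\text{analogous } \mu\text{-derivative terms of the kernel and policy}\big)}_{\text{MFD}}
```

The RF term is the classical REINFORCE estimator; MD and MFD are the terms that require knowledge of how μ_t responds to θ, which is why they cannot be estimated from trajectories alone.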
To overcome this, the authors parametrize probability distributions on the state space by their logits (log‑probabilities) and introduce a small linear perturbation of these logits. They define a perturbed value function V^ε(θ) and prove (Theorem 2.4) that as the perturbation magnitude ε→0, the gradient of V^ε converges to the true gradient ∇_θ V. This construction yields a fully model‑free estimator that only requires simulated trajectories and an auxiliary estimate of the sensitivity of the logits (∇_θ l_t).
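The core analytic fact behind the perturbation scheme — that a small perturbation of the logits yields, in the limit ε→0, the exact directional derivative of the induced distribution — can be checked numerically. The sketch below is not the paper's estimator; it only illustrates, for a softmax parametrization of a distribution on a finite state space, that the finite-difference quotient of the perturbed logits converges to the analytic sensitivity as ε shrinks:

```python
import numpy as np

def softmax(logits):
    """Map logits to a probability distribution on a finite state space."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def softmax_jvp(logits, v):
    """Analytic directional derivative of softmax at `logits` in direction v:
    d/deps softmax(logits + eps*v) |_{eps=0} = p * (v - p @ v)."""
    p = softmax(logits)
    return p * (v - p @ v)

logits = np.array([0.5, -1.0, 2.0])   # hypothetical logits of mu_t
v = np.array([1.0, 0.0, -1.0])        # hypothetical perturbation direction

for eps in (1e-1, 1e-2, 1e-3):
    fd = (softmax(logits + eps * v) - softmax(logits)) / eps
    err = np.abs(fd - softmax_jvp(logits, v)).max()
    print(f"eps={eps:g}  max error={err:.2e}")  # error shrinks roughly linearly in eps
```

The O(ε) decay of the error mirrors the role of the perturbation magnitude in Theorem 2.4: the gradient of the perturbed value function V^ε approaches ∇_θ V as ε→0.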
Building on this, the MF‑REINFORCE algorithm is presented (Algorithm 1). For each sampled episode it computes the standard REINFORCE log‑likelihood term, estimates the logits’ sensitivities via the perturbation scheme, and assembles the MD and MFD contributions. Theoretical analysis (Theorems 3.3 and 3.5) provides explicit bounds: the bias scales as O(ε), while the mean‑squared error scales as O(ε² + 1/N) with N the number of trajectories, so the bias can be made arbitrarily small by shrinking ε and the variance controlled by increasing the sample size.
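The bias/variance trade-off behind the O(ε² + 1/N) mean-squared-error bound can be illustrated with a toy simulation. This is not the paper's estimator: we simply model an estimator whose bias is c·ε and whose noise has variance σ² (c and σ are made-up constants), average N independent copies, and measure the resulting MSE, which should behave like (cε)² + σ²/N:

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_GRAD = 1.0
C, SIGMA = 0.5, 1.0  # hypothetical bias constant and noise level

def empirical_mse(eps, N, reps=2000):
    """MSE of an N-sample average of estimates biased by C*eps with noise SIGMA."""
    noise = SIGMA * rng.standard_normal((reps, N)).mean(axis=1)
    estimates = TRUE_GRAD + C * eps + noise
    return np.mean((estimates - TRUE_GRAD) ** 2)

for eps, N in [(0.1, 10), (0.1, 1000), (0.01, 1000)]:
    print(f"eps={eps:g}  N={N:5d}  MSE≈{empirical_mse(eps, N):.4f}  "
          f"theory≈{(C*eps)**2 + SIGMA**2/N:.4f}")
```

Shrinking ε attacks the (cε)² bias term, while increasing N attacks the σ²/N variance term, exactly the two knobs the paper's Theorems 3.3 and 3.5 expose.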
Numerical experiments on three representative MFC tasks—a discrete line‑search problem, a buffer control system, and a cooperative multi‑agent game—demonstrate that MF‑REINFORCE converges faster and attains higher final rewards than existing value‑based MF‑Q‑learning methods. The algorithm remains stable when policies are represented by neural networks, confirming its practicality for high‑dimensional parameterizations.
In summary, the work introduces a novel perturbation‑based framework that enables unbiased, model‑free policy‑gradient learning for mean‑field control, supplies rigorous convergence guarantees, and validates the approach empirically. This opens the door to applying policy‑gradient reinforcement learning to large‑scale cooperative systems such as smart grids, traffic networks, and distributed resource allocation where mean‑field models are natural.