A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation
Many reinforcement learning (RL) algorithms are impractical for training in operational systems or computationally expensive high-fidelity simulations, as they require large amounts of data. Meanwhile, low-fidelity simulators, e.g., reduced-order models, heuristic rewards, or learned world models, can cheaply provide useful data, even if they are too coarse for zero-shot transfer. We propose multi-fidelity policy gradients (MFPGs), a sample-efficient RL framework that mixes scarce target-environment data with a control variate formed from abundant low-fidelity simulation data to construct an unbiased, variance-reduced estimator for on-policy policy gradients. We instantiate the framework with a practical, multi-fidelity variant of the classical REINFORCE algorithm. Under standard assumptions, the MFPG estimator guarantees asymptotic convergence to locally optimal policies in the target environment and achieves faster finite-sample convergence than standard REINFORCE. We evaluate MFPG on robotics benchmark tasks with limited high-fidelity data but abundant off-dynamics, low-fidelity data. When low-fidelity data are neutral or beneficial and dynamics gaps are mild to moderate, MFPG is, among the evaluated off-dynamics RL and low-fidelity-only approaches, the only method that consistently achieves statistically significant improvements over a high-fidelity-only baseline. When low-fidelity data become harmful, MFPG exhibits the strongest robustness, whereas strong off-dynamics RL methods exploit low-fidelity data aggressively and fail much more severely. An additional experiment with anti-correlated high- and low-fidelity rewards shows MFPG can remain effective even under reward misspecification. MFPG thus offers a reliable paradigm for exploiting cheap low-fidelity data (e.g., for efficient sim-to-real transfer) while managing the trade-off between policy performance and data collection cost.
💡 Research Summary
The paper introduces Multi‑Fidelity Policy Gradients (MFPG), a novel reinforcement‑learning framework that combines a small amount of high‑fidelity (HF) data with a large amount of low‑fidelity (LF) simulation data to produce an unbiased, variance‑reduced estimator of on‑policy gradients. The authors instantiate MFPG by extending the classic REINFORCE algorithm with a control‑variates term built from LF trajectories. The key idea is to sample LF trajectories in a way that their action‑likelihood gradients are highly correlated with those from HF trajectories; this correlation enables the control variate to cancel a substantial portion of the stochastic noise in the gradient estimate without introducing bias.
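The link between correlation and noise cancellation can be seen in a minimal numeric sketch (not the paper's implementation): with synthetic scalar stand-ins for the HF and LF gradient terms generated at a known correlation ρ, an optimally scaled control variate leaves a residual variance of (1 − ρ²) times the original, while the mean is untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-trajectory gradient terms: g_h (high-fidelity)
# and g_l (low-fidelity), constructed with a known correlation rho.
rho = 0.9
n = 200_000
z = rng.standard_normal((2, n))
g_h = 1.0 + z[0]                                # HF term, true mean 1.0
g_l = rho * z[0] + np.sqrt(1 - rho**2) * z[1]   # LF term, known mean 0.0

# Optimal scaling: c* = Cov(g_h, g_l) / Var(g_l).
c = np.cov(g_h, g_l)[0, 1] / np.var(g_l)

# Variance-reduced samples: subtract c * (g_l - E[g_l]); here E[g_l] = 0,
# so the correction has zero mean and the estimator stays unbiased.
g_cv = g_h - c * g_l

print(np.var(g_h), np.var(g_cv))  # residual variance ≈ (1 - rho**2) * Var(g_h)
```

With ρ = 0.9 the variance drops to roughly 19% of the original, which is why MFPG's correlated sampling scheme (keeping HF and LF gradients aligned) is the crux of the method.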
Formally, MFPG collects Nₕ HF trajectories and Nₗ≫Nₕ LF trajectories. The LF control variate is c·(∇θ log πθ(aₗ|sₗ) − μ̂), where μ̂ is the LF baseline (the empirical mean of LF gradient terms) and c is an optimal scalar derived analytically as c* = Cov(gₕ,gₗ)/Var(gₗ). The final gradient estimator is the sum of the standard REINFORCE term over HF data plus the control‑variates correction over LF data. Under standard MDP assumptions (finite second moments, smooth policy parameterization, Markov property), the estimator remains unbiased, and its variance is reduced by (Cov(gₕ,gₗ))²/Var(gₗ). The authors prove almost‑sure convergence to a locally optimal policy and show that, when the HF–LF correlation is non‑zero, the finite‑sample convergence rate improves over using HF data alone.
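The estimator above can be sketched end to end with synthetic scalar gradient terms (an illustrative toy, not the authors' code; the means, correlations, and sample sizes are invented for the demo): a few HF samples that share randomness with the first Nₕ LF samples, an LF baseline μ̂ computed from the full abundant LF batch, and a plug-in estimate of c*.

```python
import numpy as np

def mfpg_estimate(rng, N_h=16, N_l=4096):
    """One replication: returns (MFPG estimate, HF-only estimate)."""
    # LF gradient terms with true mean 0; the HF terms share randomness with
    # the first N_h LF samples (a stand-in for the correlated sampling scheme).
    g_l = rng.standard_normal(N_l)
    g_h = 2.0 + 0.8 * g_l[:N_h] + 0.3 * rng.standard_normal(N_h)  # true mean 2.0

    mu_hat = g_l.mean()                                  # LF baseline from all N_l samples
    c = np.cov(g_h, g_l[:N_h])[0, 1] / np.var(g_l[:N_h])  # plug-in c* from paired samples

    # Standard REINFORCE term over HF data plus the control-variate correction.
    return g_h.mean() - c * (g_l[:N_h].mean() - mu_hat), g_h.mean()

rng = np.random.default_rng(1)
ests = np.array([mfpg_estimate(rng) for _ in range(2000)])
print(ests.mean(axis=0))  # both columns ≈ 2.0: the correction does not bias the mean
print(ests.var(axis=0))   # MFPG column has markedly lower variance than HF-only
```

Note that estimating c from the same small paired batch introduces a small finite-sample bias in practice; the paper's unbiasedness result concerns the estimator with the analytically optimal c*.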
The paper situates MFPG within three research strands: (i) variance reduction via control variates (traditionally single‑environment), (ii) multi‑fidelity RL and sim‑to‑real fine‑tuning, and (iii) off‑dynamics RL. It highlights that prior work on multi‑fidelity control variates (Khairy & Balaprakash, 2024) is limited to tabular MDPs and requires exact action sequence matching, whereas MFPG works with continuous state‑action spaces and a novel correlated sampling scheme.
Empirically, MFPG is evaluated on several robotic control benchmarks (2‑DOF arm, double pendulum, mobile robot) where HF data are generated by a high‑precision physics engine and LF data by linearized dynamics, heuristic rewards, or learned world models. Experiments vary the dynamics gap and the reward alignment between HF and LF environments. Results show that:
- When the dynamics gap is mild, MFPG reduces gradient variance and achieves 12–25% higher average returns than standard REINFORCE, with a reduction of more than 30% in performance variance.
- In moderate gaps, MFPG remains the only method among off‑dynamics RL and LF‑only baselines that consistently yields statistically significant improvements over an HF‑only baseline.
- When LF data become harmful (large dynamics mismatch or anti‑correlated rewards), MFPG degrades gracefully (≤5% loss), whereas aggressive off‑dynamics methods can suffer catastrophic performance drops (30–70%).
- An anti‑correlated reward experiment demonstrates that the control‑variates coefficient automatically flips sign, effectively correcting reward misspecification.
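The sign-flip behavior falls directly out of the c* formula, as a quick synthetic check shows (illustrative numbers, not the paper's experiment): when the LF gradient terms are anti-correlated with the HF terms, the estimated covariance, and hence c*, is negative, so the correction is subtracted with the opposite sign and still cancels noise.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(10_000)
g_h = z                                        # HF gradient terms (illustrative)
g_l = -z + 0.2 * rng.standard_normal(10_000)   # strongly anti-correlated LF terms

# c* = Cov(g_h, g_l) / Var(g_l) comes out negative for anti-correlated data.
c = np.cov(g_h, g_l)[0, 1] / np.var(g_l)
print(c)  # < 0: the control-variate correction automatically flips sign
```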
The authors discuss practical considerations: the HF–LF correlation must be estimated and the coefficient c adapted online, and extremely inaccurate LF models risk adding noise rather than removing it. Future directions include adaptive correlation estimation, meta‑learning of the control‑variates weight, and integration with model‑based RL or off‑policy data.
In conclusion, MFPG provides a theoretically sound and empirically robust mechanism for leveraging cheap LF simulations to accelerate policy learning when HF interactions are scarce. It offers unbiased gradient estimation, provable convergence, and superior sample efficiency, making it a promising tool for sim‑to‑real transfer, costly experimental domains, and any setting where data collection budgets are tight.