The Surprising Difficulty of Search in Model-Based Reinforcement Learning
This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, showing that search is not a plug-and-play replacement for a learned policy. Surprisingly, we find that search can harm performance even when the model is highly accurate. Instead, we show that mitigating distribution shift matters more than improving model or value function accuracy. Building on this insight, we identify key techniques for enabling effective search, achieving state-of-the-art performance across multiple popular benchmark domains.
💡 Research Summary
This paper reexamines the role of search in model‑based reinforcement learning (MBRL), challenging the prevailing belief that model accuracy and long‑horizon prediction errors are the primary obstacles. The authors first present a theoretical counterexample using an “absorbing N‑chain” environment, where only a single action advances the agent toward a reward while all other actions lead to a zero‑reward absorbing state. They show that even with a perfect dynamics model and an exact value function, naive random‑sampling search fails with high probability as the search horizon or action space grows: with A actions and horizon n, each sampled rollout hits the unique rewarding trajectory with probability 1/Aⁿ, so the chance that any of m rollouts succeeds is 1 − (1 − 1/Aⁿ)ᵐ ≈ m/Aⁿ, which vanishes as n or A grows. This demonstrates that sampling‑based search becomes infeasible regardless of model fidelity.
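The N‑chain argument reduces to a one‑line probability calculation. A minimal sketch (the function name and the concrete numbers below are illustrative, not from the paper):

```python
# Success probability of naive random-shooting search in the absorbing
# N-chain: exactly one of the A**n horizon-n action sequences reaches the
# reward, so each sampled rollout succeeds with probability 1 / A**n.
def search_success_prob(num_actions: int, horizon: int, num_samples: int) -> float:
    """P(at least one of `num_samples` independent rollouts is the rewarding one)."""
    p_single = 1.0 / num_actions**horizon
    return 1.0 - (1.0 - p_single) ** num_samples

# Even a modest horizon makes success vanishingly unlikely:
for horizon in (2, 5, 10):
    print(horizon, search_success_prob(num_actions=4, horizon=horizon, num_samples=512))
```

With 4 actions and 512 sampled rollouts, success is near-certain at horizon 2 but drops below one in a thousand by horizon 10, matching the paper's point that no amount of model accuracy rescues this kind of search.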
Empirically, the paper compares two state‑of‑the‑art algorithms: MR.Q, a model‑free method that uses a dynamics model only for representation learning, and TD‑MPC2, a model‑based method that performs short‑horizon model‑predictive control (MPC). When MPC is added to MR.Q, performance deteriorates dramatically across a suite of continuous‑control tasks, even though MR.Q’s learned dynamics are more accurate than TD‑MPC2’s. Conversely, TD‑MPC2 benefits from MPC despite its higher dynamics error. These results refute the assumption that a better model automatically translates into better planning performance.
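To make the kind of search being discussed concrete, here is a minimal random‑shooting MPC sketch. The `dynamics`, `reward`, and `value` callables are hypothetical stand‑ins for learned models, and the uniform action sampling and fixed action bounds are simplifying assumptions, not details from the paper:

```python
import numpy as np

def random_shooting_mpc(state, dynamics, reward, value, action_dim,
                        horizon=3, num_samples=256, rng=None):
    """Return the first action of the highest-scoring sampled rollout.

    `dynamics`, `reward`, and `value` are placeholders for learned models:
    dynamics(s, a) -> s', reward(s, a) -> r, value(s) -> V(s).
    """
    if rng is None:
        rng = np.random.default_rng()
    best_score, best_action = -np.inf, None
    for _ in range(num_samples):
        # Sample a candidate action sequence uniformly from the action box.
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, score = state, 0.0
        for a in actions:
            score += reward(s, a)
            s = dynamics(s, a)
        score += value(s)  # terminal value estimate bootstraps the short rollout
        if score > best_score:
            best_score, best_action = score, actions[0]
    return best_action
```

Note that the value function is queried on states and actions the search itself generated, which is exactly where the distribution-shift problem described next enters.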
The authors identify the root cause as a distribution shift introduced by search. MPC selects actions under a policy that differs from the one whose data trained the value function (typically a non‑search policy), so search queries the Q‑function on out‑of‑distribution actions. The Q‑function tends to overestimate the value of such actions, and this overestimation bias degrades both value estimation and overall policy performance.
To mitigate this bias, the paper proposes using a pessimistic ensemble of value functions: multiple Q‑networks are trained independently, and during search the minimum Q‑value across the ensemble is taken as the estimate for each candidate action. This “minimum‑over‑ensemble” approach provides a conservative estimate that curbs overoptimistic predictions on unseen actions. The resulting algorithm, Model‑based Representations for Search and Q‑learning (MRS.Q), incorporates this technique while retaining the representation learning benefits of MR.Q.
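The minimum‑over‑ensemble idea can be sketched in a few lines. The class and method names below are illustrative, and the Q‑functions are simple callables standing in for independently trained networks:

```python
import numpy as np

class PessimisticQEnsemble:
    """Conservative value estimate: minimum over independently trained Q-heads.

    Each element of `q_functions` is a callable q(state, action) -> float;
    in practice each head would be a separately initialized neural network.
    """
    def __init__(self, q_functions):
        self.q_functions = q_functions

    def value(self, state, action):
        # Taking the minimum curbs overestimation on out-of-distribution
        # actions proposed by search: an action scores highly only if
        # *every* head agrees it is valuable.
        return min(q(state, action) for q in self.q_functions)

    def rank_actions(self, state, candidate_actions):
        # Score each search-proposed candidate pessimistically, pick the best.
        scores = [self.value(state, a) for a in candidate_actions]
        return candidate_actions[int(np.argmax(scores))]
```

If one head overestimates an unseen action while another does not, the minimum discards the optimistic outlier, whereas a mean would let it leak into the ranking.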
MRS.Q is evaluated on over 50 benchmark tasks spanning the DeepMind Control Suite and OpenAI Gym, using a single fixed hyper‑parameter configuration. Across the board, MRS.Q outperforms both leading model‑free baselines (e.g., SAC, TD3) and leading model‑based baselines (e.g., TD‑MPC2, MBPO). Notably, in high‑dimensional continuous control problems such as Humanoid and Walker, MRS.Q achieves more than double the performance of its competitors. Ablation studies confirm that the pessimistic ensemble is essential: removing it or using the mean Q‑value leads to severe performance drops, especially as the planning horizon or action dimensionality increases.
The key insights of the paper are: (1) search is not a plug‑and‑play replacement for a learned policy; (2) model accuracy alone does not guarantee successful planning; (3) the interaction between search and value learning creates a distribution shift that induces overestimation bias; and (4) addressing this bias via conservative ensemble estimates unlocks the true potential of search in MBRL. The work shifts the research focus from solely improving dynamics models to carefully managing the tension between model‑based search and value function learning, opening new avenues for more robust and sample‑efficient reinforcement learning.