Enter the Void - Planning to Seek Entropy When Reward is Scarce


Model-based reinforcement learning (MBRL) offers an intuitive way to increase the sample efficiency of model-free RL methods by simultaneously training a world model that learns to predict the future. These models account for the large majority of training compute and time, and they are subsequently used to train actors entirely in simulation, yet once this is done they are quickly discarded. We show in this work that utilising these models at inference time can significantly boost sample efficiency. We propose a novel approach that anticipates and actively seeks out informative states using the world model's short-horizon latent predictions, offering a principled alternative to traditional curiosity-driven methods that chase outdated estimates of high-uncertainty states. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multiple multi-step plans at every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, how long the planning horizon should be, and how strongly to pursue entropy. While our method can theoretically be applied to any model that trains its actors solely on model-generated data, we apply it to Dreamer to illustrate the concept. Our method finishes MiniWorld's procedurally generated mazes 50% faster than base Dreamer at convergence, and in only 60% of the environment steps that base Dreamer's policy needs; it displays reasoned exploratory behaviour in Crafter, achieving the same reward as base Dreamer in a third of the steps; and planning tends to improve sample efficiency on DeepMind Control tasks.


💡 Research Summary

The paper tackles a long‑standing inefficiency in model‑based reinforcement learning (MBRL): after training a world model, most algorithms discard it at inference time and rely solely on the learned policy. The authors propose to keep the world model alive during deployment and use it to actively seek high‑entropy (i.e., uncertain) states, thereby improving exploration and sample efficiency. Their method is built on DreamerV3, a state‑of‑the‑art latent‑space world model that combines a deterministic recurrent state with a stochastic latent variable (RSSM).

Key technical contributions are:

  1. Entropy‑based intrinsic objective – The prior distribution over the latent state, pϕ(zₜ|hₜ), is used as a proxy for future information gain. Maximising its Shannon entropy encourages the planner to visit regions where the model's belief is most uncertain, which in turn is expected to produce a large reduction in posterior uncertainty once the true observation arrives. This creates a natural min‑max interaction: the model training minimises KL(posterior‖prior) while the planner maximises prior entropy.
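For intuition, the entropy term is cheap to compute from the prior's logits. The sketch below (an illustration, not the paper's code) computes the Shannon entropy of DreamerV3-style independent categorical latents; the 32×32 latent shape is the standard DreamerV3 configuration, assumed here for concreteness.

```python
import numpy as np

def categorical_entropy(logits):
    """Total Shannon entropy (in nats) of independent categorical latents.

    logits: array of shape (groups, classes), e.g. DreamerV3's
    32 categoricals with 32 classes each. Entropy is summed over groups
    since the categoricals are modelled as independent.
    """
    # Numerically stable log-softmax per group.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_norm = np.log(np.exp(z).sum(axis=-1, keepdims=True))
    logp = z - log_norm
    p = np.exp(logp)
    return float(-(p * logp).sum())

# A uniform prior is maximally uncertain: entropy = groups * ln(classes).
uniform = np.zeros((32, 32))
print(categorical_entropy(uniform))  # ≈ 32 * ln(32)
```

A peaked prior (one logit much larger than the rest) drives this value toward zero, which is exactly the signal the planner uses: high-entropy priors mark states the model has not yet pinned down.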

  2. Short‑horizon rollout selection – At each environment step the planner generates N=256 imagined rollouts of length H=15 using the greedy actor together with the world model. Each rollout is scored by a weighted sum of predicted cumulative reward (λᵣ·R̂) and cumulative entropy (λ_H·Ĥ). The rollout τ* with the highest combined score is selected, and its action sequence is executed unless the meta‑planner decides otherwise.
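The selection step reduces to an argmax over weighted sums. A minimal sketch, with synthetic rollout predictions and illustrative weights (the actual λ values and rollout machinery live inside the world model and are not specified here):

```python
import numpy as np

def score_rollouts(rewards, entropies, lam_r=1.0, lam_h=0.1):
    """Pick the best of N imagined rollouts.

    rewards, entropies: arrays of shape (N, H) holding per-step
    predicted reward and prior entropy along each rollout.
    lam_r, lam_h: weights on cumulative reward vs. cumulative entropy
    (values here are placeholders, not the paper's).
    Returns the index of the highest-scoring rollout.
    """
    scores = lam_r * rewards.sum(axis=1) + lam_h * entropies.sum(axis=1)
    return int(np.argmax(scores))

# N=256 rollouts of horizon H=15, matching the summary above.
rng = np.random.default_rng(0)
rewards = rng.normal(size=(256, 15))
entropies = rng.uniform(0.0, 3.0, size=(256, 15))
best = score_rollouts(rewards, entropies)
```

With λ_H = 0 this collapses to greedy reward planning; raising λ_H trades predicted return for visits to uncertain states.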

  3. Reactive hierarchical meta‑planner – A lightweight PPO head monitors the ongoing plan and decides dynamically whether to commit to it or to abort and generate a new plan. The decision is based on recent changes in predicted entropy and reward, effectively detecting when a plan has become “stale”. This hierarchical design mitigates the common MPC drawback of replanning at every step, reducing dithering and allowing the agent to develop a notion of commitment.
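The paper uses a learned PPO head for this decision; as a rough stand-in for intuition only, the staleness check can be pictured as comparing what the plan predicted against what actually happened, with hypothetical drift tolerances:

```python
def should_replan(pred_entropy, obs_entropy, pred_reward, obs_reward,
                  ent_tol=0.5, rew_tol=0.5):
    """Heuristic sketch of the meta-planner's abort decision.

    The real method learns this decision with PPO; here we simply flag
    a plan as stale when observed entropy or reward drifts from the
    plan's predictions by more than a tolerance (thresholds are
    illustrative assumptions, not from the paper).
    """
    entropy_drift = abs(obs_entropy - pred_entropy)
    reward_drift = abs(obs_reward - pred_reward)
    return entropy_drift > ent_tol or reward_drift > rew_tol
```

Committing until drift exceeds a tolerance is what distinguishes this design from vanilla MPC, which would discard the plan and re-search at every step regardless.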

  4. Compatibility and simplicity – The approach does not modify Dreamer’s world‑model or actor losses; all KL coefficients and training objectives remain unchanged. Consequently, the method can be attached to any MBRL system that trains its policy exclusively on imagined data.

Empirical evaluation spans three domains:

  • MiniWorld 3D mazes – The entropy‑seeking planner reaches the goal 50% faster than vanilla Dreamer and does so using only 60% of the environment steps required by the baseline.
  • Crafter – The agent attains the same final score in roughly one‑third of the steps, displaying purposeful exploration that first uncovers useful resources before tackling higher‑level tasks.
  • DeepMind Control (vision‑based) – Sample efficiency improves modestly across several continuous‑control benchmarks, confirming that the method scales beyond discrete navigation tasks.

Failure modes and limitations are discussed in depth. Because Dreamer’s prior is unimodal, environments with inherently stochastic transitions can inflate entropy (the “white‑noise” problem), potentially leading the planner to chase aleatoric uncertainty. The authors argue that the simultaneous reward term mitigates pathological fixation on such states. A second issue arises when rare but important transitions are under‑represented in the data; the model’s prior may underestimate entropy for those states, causing under‑exploration. The paper suggests future work on mode‑seeking mechanisms (option discovery, teacher‑student learning) to address this.

In summary, the work demonstrates that re‑using the world model at inference time as an entropy‑driven planner yields substantial gains in exploration efficiency. By coupling a short‑horizon, entropy‑reward weighted rollout selector with a reactive PPO‑based meta‑planner, the method achieves a balance between commitment and adaptability that many existing MPC or curiosity‑based approaches lack. The results suggest a promising direction for making MBRL more practical in real‑world settings where data collection is expensive and rewards are sparse.
