Open-World Reinforcement Learning over Long Short-Term Imagination


Training visual reinforcement learning agents in a high-dimensional open world presents significant challenges. While various model-based methods have improved sample efficiency by learning interactive world models, these agents tend to be “short-sighted”, as they are typically trained on short snippets of imagined experiences. We argue that the primary challenge in open-world decision-making is improving the exploration efficiency across a vast state space, especially for tasks that demand consideration of long-horizon payoffs. In this paper, we present LS-Imagine, which extends the imagination horizon within a limited number of state transition steps, enabling the agent to explore behaviors that potentially lead to promising long-term feedback. The foundation of our approach is to build a $\textit{long short-term world model}$. To achieve this, we simulate goal-conditioned jumpy state transitions and compute corresponding affordance maps by zooming in on specific areas within single images. This facilitates the integration of direct long-term values into behavior learning. Our method demonstrates significant improvements over state-of-the-art techniques in MineDojo.


💡 Research Summary

LS‑Imagine tackles a fundamental bottleneck in visual model‑based reinforcement learning (MBRL) for open‑world environments: the short imagination horizon that limits long‑term planning. The authors introduce a “long short‑term world model” that simultaneously learns two transition branches. The short‑term branch predicts the next latent state one step ahead, similar to DreamerV3, while the long‑term (jumpy) branch predicts a distant future state in a single step conditioned on a goal. To generate these jumpy predictions, the method repeatedly zooms into localized regions of the current RGB observation, creates a synthetic 16‑frame video clip, and evaluates its relevance to the textual task instruction using the pre‑trained MineCLIP video‑language model. The relevance scores are aggregated across sliding windows to produce an affordance map, which highlights image pixels most likely to contribute to task completion.
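The zoom-in scoring procedure can be sketched as a sliding-window scan: crop candidate regions, score each crop's task relevance, and accumulate the scores per pixel into an affordance map. The sketch below is a deliberate simplification, assuming a generic `score_fn` in place of MineCLIP's video-text alignment on synthetic 16-frame zoom-in clips; window and stride fractions are illustrative choices, not the paper's settings.

```python
import numpy as np

def affordance_map(obs, score_fn, win_frac=0.4, stride_frac=0.1):
    """Slide a zoom-in window over the observation, score each crop's
    task relevance (a stand-in for MineCLIP scoring of a synthetic
    zoom-in clip), and average the scores falling on each pixel."""
    H, W = obs.shape[:2]
    wh, ww = int(H * win_frac), int(W * win_frac)
    sh, sw = max(1, int(H * stride_frac)), max(1, int(W * stride_frac))
    acc = np.zeros((H, W))   # summed relevance per pixel
    cnt = np.zeros((H, W))   # number of windows covering each pixel
    for y in range(0, H - wh + 1, sh):
        for x in range(0, W - ww + 1, sw):
            s = score_fn(obs[y:y + wh, x:x + ww])
            acc[y:y + wh, x:x + ww] += s
            cnt[y:y + wh, x:x + ww] += 1
    return acc / np.maximum(cnt, 1)
```

With a toy relevance function such as mean brightness, pixels inside a bright (i.e. "task-relevant") patch receive higher affordance than the background, mimicking how MineCLIP-scored windows highlight goal regions.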

An intrinsic reward is derived from the affordance map by weighting it with a centered 2‑D Gaussian, encouraging the agent to move the task‑relevant region toward the center of its view. The probability of invoking a jumpy transition is computed from the kurtosis of the affordance map: a relative kurtosis term captures how much a region stands out, and an absolute kurtosis term measures confidence. Their product, passed through a sigmoid, yields a jump probability; if it exceeds a dynamic threshold (mean plus one standard deviation of recent probabilities), a jump flag is set and the world model uses the long‑term branch for the next imagined step.
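These two signals can be illustrated in a few lines of numpy. The sketch below is an assumption-laden simplification: the Gaussian width, the use of plain excess kurtosis in place of the paper's relative/absolute kurtosis product, and the function names are all illustrative rather than the authors' exact formulation.

```python
import numpy as np

def gaussian_weight(H, W, sigma_frac=0.25):
    """Centered 2-D Gaussian used to weight the affordance map."""
    ys, xs = np.mgrid[0:H, 0:W]
    cy, cx = (H - 1) / 2, (W - 1) / 2
    sy, sx = H * sigma_frac, W * sigma_frac
    return np.exp(-(((ys - cy) / sy) ** 2 + ((xs - cx) / sx) ** 2) / 2)

def intrinsic_reward(amap):
    """Higher when high-affordance pixels sit near the image center,
    encouraging the agent to center the task-relevant region."""
    g = gaussian_weight(*amap.shape)
    return float((amap * g).mean())

def jump_probability(amap, eps=1e-8):
    """Kurtosis-based jump trigger (simplified): a peaked map, where
    one region stands out sharply, yields high excess kurtosis and
    hence a higher probability of invoking the long-term branch."""
    v = amap.ravel()
    mu, sd = v.mean(), v.std() + eps
    kurt = ((v - mu) ** 4).mean() / sd ** 4 - 3.0  # excess kurtosis
    return float(1.0 / (1.0 + np.exp(-np.clip(kurt, -30, 30))))  # sigmoid
```

In the full method this probability is then compared against a dynamic threshold (mean plus one standard deviation of recent probabilities) to set the jump flag; that running-statistics bookkeeping is omitted here.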

Training proceeds in a loop:

1. Compute affordance maps via exhaustive zoom-in and MineCLIP scoring.
2. Train a multimodal U-Net (Swin-Unet backbone) to predict affordance maps from raw images and instructions, enabling fast inference.
3. Train the dual-branch world model on a replay buffer containing both adjacent-step pairs and long-distance pairs filtered by high affordance scores.
4. Update the policy with an actor-critic algorithm that consumes mixed imagination rollouts (short- and jump-steps) and a composite reward consisting of the environment's sparse reward, the MineCLIP reward, and the affordance-driven intrinsic reward.
5. Collect new trajectories, recompute affordance maps, and refresh the buffer.
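The overall loop can be summarized as Python-style pseudocode. Every function name below is a hypothetical placeholder standing in for a component described above, not the authors' actual API.

```python
# Pseudocode sketch of the LS-Imagine training loop; all names are
# hypothetical placeholders for the components described in the text.
def train_ls_imagine(env, num_iters):
    buffer = collect_random_trajectories(env)            # bootstrap data
    for _ in range(num_iters):
        label_affordance_maps(buffer)                    # zoom-in + MineCLIP scoring
        unet = train_affordance_unet(buffer)             # fast amortized predictor
        world_model = train_world_model(
            short_pairs=adjacent_step_pairs(buffer),     # short-term branch
            long_pairs=high_affordance_pairs(buffer))    # jumpy branch
        policy = train_actor_critic(
            rollouts=mixed_imagination(world_model),     # short- and jump-steps
            reward=lambda s: (sparse_reward(s)
                              + mineclip_reward(s)
                              + intrinsic_reward(s)))    # affordance-driven bonus
        buffer.extend(collect_trajectories(env, policy, unet))
    return policy
```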

Experiments on MineDojo—a suite of Minecraft‑style open‑world tasks such as “cut a tree”, “explore a cave”, and “mine ore”—show that LS‑Imagine outperforms strong baselines including PPO‑with‑MineCLIP, DECKARD, DreamerV3, VPT, and Voyager. The method achieves higher success rates and learns faster, especially on tasks where the goal is initially far from the agent and requires a sequence of coordinated actions. Ablation studies confirm that both the jumpy transition mechanism and the affordance‑based intrinsic reward contribute significantly to performance gains.

The paper’s contributions are: (i) a novel world‑model architecture that blends short‑term and goal‑conditioned long‑term predictions; (ii) a systematic procedure for generating affordance maps via image zoom‑in and video‑text alignment; (iii) an intrinsic reward formulation that leverages these maps to bias exploration toward task‑relevant regions; and (iv) a behavior‑learning pipeline that integrates long‑term value estimates into policy optimization. Limitations include the need for an initial random data collection phase to bootstrap affordance maps, sensitivity of jump decisions to affordance quality, and the current focus on block‑based visual domains, leaving generalization to continuous physics environments as future work. Overall, LS‑Imagine demonstrates that extending imagination horizons through goal‑conditioned jumpy transitions and affordance‑guided intrinsic rewards can substantially improve exploration efficiency and sample efficiency in high‑dimensional open‑world reinforcement learning.

