ARROW: Augmented Replay for RObust World models
Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones, with the goal of improving performance on both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.
💡 Research Summary
The paper addresses the problem of catastrophic forgetting in continual reinforcement learning (CRL), where an agent must acquire new skills over a sequence of tasks while preserving performance on previously learned tasks. Existing model‑free CRL methods typically rely on large replay buffers to rehearse past experiences, but the memory requirements of such buffers quickly become prohibitive as the number of tasks grows. Inspired by the Complementary Learning Systems theory in neuroscience, the authors propose ARROW (Augmented Replay for Robust World models), a model‑based CRL algorithm that extends the state‑of‑the‑art DreamerV3 architecture with a memory‑efficient, distribution‑matching replay mechanism.
ARROW introduces two complementary replay buffers instead of a single FIFO buffer. The short‑term buffer (D₁) stores the most recent 2¹⁸ observations in a FIFO fashion, ensuring that the world model receives a strong recency bias and can quickly adapt to the current task. The long‑term buffer (D₂) also holds 2¹⁸ observations but uses reservoir sampling with random keys to maintain a uniform random subset of spliced rollouts drawn from the entire task history. This “global distribution‑matching” buffer preserves task diversity under a strict memory budget, mitigating the drift of the training distribution that typically causes forgetting. Both buffers store spliced rollouts of length 512 steps, which reduces storage overhead compared to storing full episodes while still providing sufficient trajectory diversity.
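The summary names the mechanism but gives no pseudocode. A minimal sketch of reservoir sampling with random keys, the technique attributed to D₂ above, is shown below: each incoming rollout is tagged with a uniform random key, and a min-heap keeps only the capacity-many largest keys, so every rollout ever seen has an equal chance of surviving. The class and method names are our own, not ARROW's API.

```python
import heapq
import random

class DistributionMatchingBuffer:
    """Long-term buffer (D2) sketch: holds a uniform random subset of
    all rollouts added so far, under a fixed capacity, via reservoir
    sampling with random keys."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.heap = []      # min-heap of (random_key, insertion_id, rollout)
        self.counter = 0    # insertion_id breaks ties between equal keys

    def add(self, rollout):
        key = self.rng.random()
        entry = (key, self.counter, rollout)
        self.counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
        elif key > self.heap[0][0]:
            # New key beats the smallest stored key: evict and insert.
            heapq.heapreplace(self.heap, entry)
        # Otherwise the rollout is discarded; over time, each rollout
        # survives with probability capacity / total_added.

    def sample(self, n):
        """Draw n stored rollouts uniformly without replacement."""
        return [entry[2] for entry in self.rng.sample(self.heap, n)]
```

Pairing this with a plain FIFO deque for D₁ reproduces the dual-buffer scheme: recent data keeps the world model current while D₂ anchors it to the global training distribution.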
The world model component is identical to DreamerV3’s Recurrent State‑Space Model (RSSM). It combines a deterministic hidden state hₜ with a stochastic latent state zₜ, learns to reconstruct observations and rewards, and is trained with KL‑balancing to stabilize latent dynamics. The actor‑critic controller consists of separate MLPs for policy and value estimation; both are trained exclusively on imagined trajectories generated by the world model (“dreaming”). Because the model can generate arbitrarily many imagined rollouts, ARROW reduces the need for on‑policy data collection, which is especially valuable when the environment is only intermittently accessible during task switches.
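The RSSM recurrence described above can be illustrated with a toy NumPy stand-in (random weights in place of DreamerV3's learned networks; all dimensions are arbitrary): the deterministic state hₜ is updated from the previous latents and action, and the stochastic latent zₜ is sampled from the posterior when an observation is available, or from the prior during imagination.

```python
import numpy as np

rng = np.random.default_rng(0)
H, Z, A, X = 8, 4, 2, 16  # deterministic, stochastic, action, obs dims

# Randomly initialized toy weights (stand-ins for learned networks).
W_h = rng.normal(0, 0.1, (H, H + Z + A))     # recurrent update
W_prior = rng.normal(0, 0.1, (2 * Z, H))     # prior p(z_t | h_t)
W_post = rng.normal(0, 0.1, (2 * Z, H + X))  # posterior q(z_t | h_t, x_t)

def rssm_step(h, z, a, x=None):
    """One RSSM transition: deterministic update, then a stochastic
    latent drawn from the posterior (if an observation is given) or
    from the prior (during imagination)."""
    h_next = np.tanh(W_h @ np.concatenate([h, z, a]))
    if x is None:
        stats = W_prior @ h_next                      # imagination
    else:
        stats = W_post @ np.concatenate([h_next, x])  # filtering
    mean, log_std = stats[:Z], stats[Z:]
    z_next = mean + np.exp(log_std) * rng.normal(size=Z)
    return h_next, z_next

# Imagined rollout: condition on one real observation, then unroll
# the prior alone ("dreaming") with no environment interaction.
h, z = np.zeros(H), np.zeros(Z)
h, z = rssm_step(h, z, a=np.ones(A), x=rng.normal(size=X))
for _ in range(15):
    h, z = rssm_step(h, z, a=np.ones(A))
```

In DreamerV3 the actor and critic are trained entirely on such prior-only rollouts, which is why the controller needs no fresh environment data at training time.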
To handle tasks that lack any shared structure (e.g., different Atari games), ARROW adopts the same fixed‑entropy regularization and pre‑scaled reward normalization used in DreamerV3. This prevents the policy from becoming overly stochastic or overly conservative when it encounters a new game without explicit task identifiers. No additional exploration module (such as Plan2Explore) is required, keeping the algorithm lightweight.
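The return normalization mentioned above can be sketched roughly as follows. The percentile-range scaling with a floor of 1 follows the DreamerV3 paper (returns are divided by a moving estimate of their 5th-to-95th percentile range, but small returns are never amplified); the decay constant and the class interface here are our assumptions.

```python
import numpy as np

class ReturnNormalizer:
    """Sketch of DreamerV3-style return scaling: divide returns by an
    EMA of their 5th-95th percentile range, floored at 1 so that
    sparse, small returns are never scaled up."""

    def __init__(self, decay=0.99):  # decay value is an assumption
        self.decay = decay
        self.scale = None  # EMA of the percentile range

    def __call__(self, returns):
        lo, hi = np.percentile(returns, [5, 95])
        span = hi - lo
        if self.scale is None:
            self.scale = span
        else:
            self.scale = self.decay * self.scale + (1 - self.decay) * span
        return returns / max(1.0, self.scale)
```

Because the divisor adapts per game, the same fixed entropy coefficient works across Atari titles with wildly different reward magnitudes, which is what lets ARROW switch games without task identifiers.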
Experiments are conducted on two benchmark suites. The first suite consists of six diverse Atari games (Ms. Pac‑Man, Boxing, Crazy Climber, Frostbite, Seaquest, Enduro) that share no visual or dynamics structure. The second suite builds on Procgen’s CoinRun environment and introduces six progressive visual and behavioral perturbations (No Background, Restricted Themes, Generated Assets, Monochrome Assets, Centered Agent, etc.), creating a continuum of tasks with increasing shared structure. For each suite the authors evaluate three training schedules: (i) a default task order, (ii) the reverse order, and (iii) a two‑cycle schedule where the same sequence of tasks is presented twice, allowing measurement of relearning and cross‑cycle adaptation.
All methods are constrained to the same replay memory budget of 2¹⁹ observations (524,288). ARROW's combined buffers thus hold 1,024 spliced rollouts (512 in D₁ and 512 in D₂). Competing baselines include DreamerV3 (model‑based) and TES‑SAC (a target‑entropy‑scheduled variant of Soft Actor‑Critic, model‑free), each with a single FIFO buffer of equivalent size. Single‑task agents are also trained to provide normalization baselines.
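The budget arithmetic can be checked directly. The 64×64×3 uint8 frame size used for the footprint estimate below is our assumption (a common Atari preprocessing choice), not something stated in the paper.

```python
budget = 2 ** 19       # total observations across both buffers
per_buffer = 2 ** 18   # capacity of D1 and of D2
rollout_len = 512      # spliced rollout length in steps

assert per_buffer * 2 == budget            # 524,288 observations total
rollouts = per_buffer // rollout_len       # 512 rollouts per buffer
assert rollouts * 2 == 1024                # 1,024 rollouts, as stated

# Rough raw-pixel footprint, assuming 64x64 RGB uint8 frames
# (the frame size is our assumption):
frame_bytes = 64 * 64 * 3
print(budget * frame_bytes / 2 ** 30)  # -> 6.0 (GiB of raw frames)
```

Even at this modest budget the raw frames run to gigabytes, which is why a fixed cap shared by all methods is the fair basis for the comparison.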
Performance is measured using a suite of CRL metrics: average episodic return, ACC, Min‑ACC, WC‑ACC (worst‑case ACC), forgetting (as defined by Kessler et al. 2023), forward transfer, backward transfer, and, for the two‑cycle setting, a new Maximum‑Forgetting (Max‑F) metric that captures the peak performance achieved after a task is revisited. The authors also report recovery speed when a previously seen task reappears.
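For concreteness, one common formulation of the forgetting metric is the drop from each task's best-ever performance to its final performance, averaged over tasks; the exact normalization in Kessler et al. 2023 may differ, so treat this as a sketch.

```python
import numpy as np

def forgetting(perf):
    """perf[t, i]: evaluation return on task i after training phase t.
    Common CRL forgetting score: for each task, the drop from its
    best-ever score to its final score, averaged over all tasks
    except the last-trained one (which cannot yet be forgotten)."""
    perf = np.asarray(perf, dtype=float)
    best = perf.max(axis=0)   # best score ever reached, per task
    final = perf[-1]          # score after the full task sequence
    return float(np.mean(best[:-1] - final[:-1]))

# Toy example: two tasks; performance on task 0 halves after
# training on task 1.
scores = [[10.0, 0.0],   # after training on task 0
          [5.0,  8.0]]   # after training on task 1
print(forgetting(scores))  # -> 5.0
```

Metrics like ACC and Min‑ACC are computed from the same `perf` matrix, which is why the paper can report the whole suite from one evaluation schedule.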
Results show that, under identical memory constraints, ARROW dramatically reduces forgetting on the Atari suite—approximately a 30 % reduction compared to both CLEAR (model‑free) and the vanilla DreamerV3 baseline—while maintaining comparable forward transfer scores. On the CoinRun suite, where tasks share dynamics and visual features, ARROW’s long‑term buffer preserves the common structure, leading to forward and backward transfer results on par with or better than the baselines. In the two‑cycle experiments, ARROW exhibits rapid performance recovery when a task is revisited, confirming that D₂ successfully retains critical experiences from earlier phases.
The paper’s contributions can be summarized as follows: (1) a novel dual‑buffer replay architecture (short‑term FIFO + long‑term distribution‑matching) tailored for model‑based continual RL; (2) a reservoir‑sampling strategy that enables global training‑distribution matching within a strict memory budget; (3) the use of spliced rollouts to balance storage efficiency with trajectory diversity; and (4) an extensive empirical evaluation across both non‑shared and shared‑structure task streams, demonstrating that model‑based approaches can achieve superior memory efficiency without sacrificing performance.
Future work suggested by the authors includes (i) dynamic allocation of buffer capacity based on task similarity or detected drift, (ii) integration with more sophisticated exploration mechanisms such as Plan2Explore, and (iii) scaling the approach to more complex, high‑dimensional continual learning domains such as robotic manipulation or autonomous driving. Overall, ARROW provides a compelling proof‑of‑concept that bio‑inspired replay strategies combined with world‑model learning can substantially improve the scalability and robustness of continual reinforcement learning systems.