A Continual Offline Reinforcement Learning Benchmark for Navigation Tasks

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Autonomous agents operating in domains such as robotics or video-game simulation must adapt to changing tasks without forgetting previous ones. This process, called Continual Reinforcement Learning, poses non-trivial difficulties, from preventing catastrophic forgetting to ensuring the scalability of the approaches considered. Building on recent advances, we introduce a benchmark providing a suite of video-game navigation scenarios, thus filling a gap in the literature and capturing key challenges: catastrophic forgetting, task adaptation, and memory efficiency. We define a varied set of tasks and datasets, evaluation protocols, and metrics to assess the performance of algorithms, including state-of-the-art baselines. Our benchmark is designed not only to foster reproducible research and accelerate progress in continual reinforcement learning for gaming, but also to provide a reproducible framework for production pipelines, helping practitioners identify and apply effective approaches.


💡 Research Summary

The paper introduces Continual NavBench, a novel benchmark designed to evaluate continual offline reinforcement learning (CORL) methods on navigation tasks inspired by video games. While existing RL benchmarks (e.g., Atari, Procgen, VizDoom) focus on online interaction and rarely address the challenges of sequential task learning, memory constraints, and production‑grade efficiency, Continual NavBench fills this gap by providing a suite of 3‑D maze environments, human‑collected offline datasets, and a set of evaluation protocols and metrics tailored to continual learning.

Two families of mazes are built in the Godot engine: SimpleTown (8 maps, 20 m × 20 m) and AmazeVille (8 maps, 60 m × 60 m). The latter includes high (non‑jumpable) and low (jumpable) blocks, creating richer dynamics. Human players generated roughly 10 hours of gameplay (≈2,800 trajectories), yielding 3,000 episodes for SimpleTown and 1,000 for AmazeVille. Each transition records position, orientation, velocity, RGB (64 × 64), a low‑resolution depth map (11 × 11), and contact flags, enabling both low‑dimensional state‑based and pixel‑based experiments.
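To make the per-transition record concrete, the sketch below models it as a Python dataclass. The field names, the quaternion orientation, and the 11-dimensional flattened state are illustrative assumptions, not the released dataset's actual schema:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Transition:
    # Field names and shapes are illustrative; the released dataset may
    # use different keys or encodings.
    position: np.ndarray     # (3,) agent position in metres
    orientation: np.ndarray  # (4,) e.g. a quaternion
    velocity: np.ndarray     # (3,)
    rgb: np.ndarray          # (64, 64, 3) uint8 first-person frame
    depth: np.ndarray        # (11, 11) low-resolution depth map
    contact: bool            # collision/contact flag
    action: int              # discrete navigation action


def as_state_vector(t: Transition) -> np.ndarray:
    """Concatenate the low-dimensional signals for state-based experiments."""
    return np.concatenate(
        [t.position, t.orientation.ravel(), t.velocity, [float(t.contact)]]
    )
```

The same record thus serves both experimental regimes: `as_state_vector` feeds state-based policies, while `rgb` and `depth` feed pixel-based ones.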

Task streams are defined in two categories. Random streams shuffle the order of mazes and repeat some maps, testing an agent’s ability to recognize and reuse previously learned strategies without explicit cues. Topological streams arrange mazes so that structural changes (e.g., opening/closing doors, adding/removing blocks) occur gradually, mimicking real‑world game updates. Six streams are provided (AR1, AR2, AT1, and AT2 for AmazeVille; ST1 and ST2 for SimpleTown), each containing four tasks, with some tasks appearing multiple times.
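A stream is then just an ordered list of maze identifiers, possibly with repeats. The layout below is a minimal sketch: the map names and orderings are invented for illustration and are not the benchmark's actual task sequences:

```python
# Hypothetical stream layouts; map identifiers and orderings are
# illustrative, not the benchmark's actual task sequences.
STREAMS = {
    "AR1": ["amazeville_3", "amazeville_1", "amazeville_3", "amazeville_6"],  # random, with a repeat
    "AT1": ["amazeville_1", "amazeville_2", "amazeville_3", "amazeville_4"],  # topological, gradual edits
    "ST1": ["simpletown_2", "simpletown_5", "simpletown_2", "simpletown_7"],
}


def iter_stream(name: str):
    """Yield (task_index, map_id, seen_before) over one continual run."""
    seen = set()
    for i, map_id in enumerate(STREAMS[name]):
        yield i, map_id, map_id in seen
        seen.add(map_id)
```

The `seen_before` flag captures what random streams test: the harness knows a map is repeated, but the agent receives no such cue.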

The benchmark adopts a hierarchical goal‑conditioned imitation learning backbone, Hierarchical Goal‑Conditioned Behavioral Cloning (HGCBC). A high‑level policy selects intermediate sub‑goals (waypoints) while a low‑level policy generates primitive actions toward the chosen sub‑goal. The loss functions are standard cross‑entropy on sub‑goal prediction (high level) and action prediction (low level). Hindsight Experience Replay (HER) is used to relabel transitions with alternative sub‑goals, enriching the sparse‑reward dataset.
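The two-level loss structure and hindsight relabelling described above can be sketched in plain NumPy. The dimensions (11-D state, 3-D goals, 64 discrete sub-goals, 8 actions), the tiny two-layer networks standing in for the paper's residual MLPs, and the final-state-style relabelling are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)


def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))


def mlp(din, dout, hidden=256):
    """Tiny two-layer stand-in for the paper's residual-MLP policies."""
    return {"W1": rng.normal(0, 0.05, (din, hidden)), "b1": np.zeros(hidden),
            "W2": rng.normal(0, 0.05, (hidden, dout)), "b2": np.zeros(dout)}


def forward(p, x):
    return gelu(x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]


def cross_entropy(logits, targets):
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()


# Hypothetical spaces: 11-D state, 3-D goal positions, 64 sub-goals, 8 actions.
high = mlp(11 + 3, 64)  # (state, final goal) -> sub-goal logits
low = mlp(11 + 3, 8)    # (state, sub-goal position) -> action logits


def hgcbc_loss(state, goal, subgoal, subgoal_id, action):
    """Sum of the high-level sub-goal CE and the low-level action CE."""
    l_high = cross_entropy(forward(high, np.concatenate([state, goal], -1)), subgoal_id)
    l_low = cross_entropy(forward(low, np.concatenate([state, subgoal], -1)), action)
    return l_high + l_low


def her_relabel(states, k=5):
    """Hindsight relabelling: the state reached k steps ahead becomes the sub-goal."""
    n = len(states)
    return [states[min(t + k, n - 1)] for t in range(n)]
```

Relabelling each transition with a state that was actually reached turns every trajectory into a successful demonstration for some sub-goal, which is what makes the sparse human data usable for cloning.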

Six standard continual‑learning metrics are reported: Performance (PER – average success rate across all tasks), Backward Transfer (BWT – change in performance on earlier tasks after learning later ones), Forward Transfer (FWT – benefit to new tasks from prior knowledge), Relative Model Size (MEM – parameter count relative to a reference model), Inference Cost (INF – time per forward pass), and Training Cost (TRN – total training time).
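For the learning-quality metrics, a common implementation works from a matrix `R` where `R[i, j]` is the success rate on task j after training stage i. The sketch below follows the standard GEM-style definitions; the paper's exact formulas may differ in normalization or baseline choice:

```python
import numpy as np


def per_bwt(R):
    """PER and BWT from a success-rate matrix R, where R[i, j] is
    performance on task j after training stage i (GEM-style conventions;
    the paper's exact definitions may differ)."""
    R = np.asarray(R, dtype=float)
    T = R.shape[1]
    per = R[-1].mean()                                         # final average success
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])  # negative => forgetting
    return per, bwt


def fwt(R, b):
    """Forward transfer relative to an untrained baseline b[j] on each task."""
    R = np.asarray(R, dtype=float)
    return np.mean([R[j - 1, j] - b[j] for j in range(1, R.shape[1])])
```

MEM, INF, and TRN are measured directly (parameter counts and wall-clock timings) rather than derived from `R`.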

A comprehensive set of baselines spanning four methodological families is evaluated:

  1. Naïve approaches – From‑Scratch (single policy trained on the latest task), Freeze (policy trained on first task and never updated), Finetune (single policy continuously updated or cloned per task).
  2. Replay‑based – Experience Replay (aggregate all past datasets in a single buffer and train a single policy).
  3. Weight‑regularization – Elastic Weight Consolidation (EWC) and L2‑regularization, which penalize changes to important parameters identified from previous tasks.
  4. Architectural – Progressive Neural Networks (PNN) that add a new column per task with lateral connections, and Hierarchical Subspace of Policies (HiSPO) that introduces new anchor sub‑networks per task and prunes them based on a loss threshold.
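Of these families, the weight-regularization penalty is compact enough to sketch directly. The snippet below shows a diagonal-Fisher EWC penalty added to the task loss, with L2-regularization recovered by setting the Fisher weights to ones; it is a minimal sketch, not the authors' implementation:

```python
import numpy as np


def diag_fisher(per_example_grads):
    """Diagonal Fisher estimate: mean squared per-example log-likelihood gradient."""
    g = np.asarray(per_example_grads, dtype=float)  # (n_samples, n_params)
    return (g ** 2).mean(axis=0)


def ewc_penalty(theta, theta_star, fisher, lam):
    """EWC regularizer (lam/2) * sum_i F_i * (theta_i - theta*_i)^2, added to
    the task loss; L2-regularization is the special case fisher = ones."""
    d = np.asarray(theta, dtype=float) - np.asarray(theta_star, dtype=float)
    return 0.5 * lam * np.sum(np.asarray(fisher, dtype=float) * d ** 2)
```

Here `theta_star` is the parameter vector frozen after the previous task, and the strength `lam` is the quantity swept per stream in the experiments.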

All experiments use identical network architectures (Residual MLPs with three layers of 256 units, LayerNorm, GELU) and training hyper‑parameters (batch size 64, learning rate 3 × 10⁻⁴, 10⁵ gradient steps). The sub‑goal horizon k is set to 5 for SimpleTown and 10 for AmazeVille; HER sampling temperatures are 100.0 and 15.0 respectively. For EWC and L2, five regularization strengths (10⁻² to 10²) are swept and the best performing configuration per stream is selected.
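The shared trunk (three 256-unit residual layers with LayerNorm and GELU) can be sketched as below. Whether normalization is applied pre- or post-residual, and the initialization scale, are assumptions on my part:

```python
import numpy as np

rng = np.random.default_rng(0)


def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)


def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))


def res_block_params(width=256):
    return {"W": rng.normal(0, 0.05, (width, width)), "b": np.zeros(width)}


def res_block(p, x):
    """Pre-norm residual block: x + GELU(LayerNorm(x) @ W + b)."""
    return x + gelu(layer_norm(x) @ p["W"] + p["b"])


# Three stacked 256-unit blocks, matching the stated policy trunk.
blocks = [res_block_params() for _ in range(3)]


def trunk(x):
    for p in blocks:
        x = res_block(p, x)
    return x
```

The skip connections keep gradients well-conditioned over the 10⁵ offline gradient steps, which matters when the same trunk is reused or regularized across tasks.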

Key findings:

  • Replay‑based and architectural methods achieve the highest PER and BWT, indicating strong resistance to catastrophic forgetting. PNN, in particular, shows almost zero degradation on earlier tasks.
  • However, Replay‑based methods incur large memory footprints (high MEM) and increased training time, while HiSPO’s model size grows with each new anchor, limiting scalability for long streams.
  • Weight‑regularization methods (EWC, L2) are the most memory‑efficient (low MEM, low INF) but suffer substantial drops in PER and BWT on the more complex AmazeVille streams, where dynamics change dramatically.
  • Naïve baselines are lightweight but exhibit severe negative backward transfer and limited forward transfer, confirming that simple finetuning or freezing is insufficient for continual navigation.
  • Across all methods, the trade‑off between forgetting mitigation and resource consumption is evident, underscoring the need for algorithms that can balance stability, plasticity, and efficiency.

The authors release the full codebase, datasets, and evaluation scripts, enabling reproducible research and facilitating integration into production pipelines where inference speed, memory budget, and model scalability are critical. By providing human‑generated trajectories, the benchmark also supports imitation‑learning approaches, allowing direct comparison with pure RL methods.

Implications and future directions: Continual NavBench establishes a standardized platform for studying continual offline RL in navigation, a domain previously lacking such resources. It encourages the community to develop memory‑aware replay strategies, meta‑learning based transfer mechanisms, and multimodal perception models that can operate under strict production constraints. The benchmark’s design—combining realistic game‑style mazes, human data, and comprehensive metrics—makes it a valuable tool for both academic research and industry deployment of adaptive game‑playing agents.

