Reducing Commitment to Tasks with Off-Policy Hierarchical Reinforcement Learning


In experimenting with off-policy temporal difference (TD) methods in hierarchical reinforcement learning (HRL) systems, we have observed unwanted on-policy learning under reproducible conditions. Here we present modifications to several TD methods that prevent unintentional on-policy learning from occurring. These modifications create a tension between exploration and learning. Traditional TD methods require commitment to finishing subtasks without exploration in order to update Q-values for early actions with high probability. One-step intra-option learning and temporal second difference traces (TSDT) do not suffer from this limitation. We demonstrate that our HRL system is efficient without commitment to completion of subtasks in a cliff-walking domain, contrary to a widespread claim in the literature that such commitment is critical for efficient learning. Furthermore, decreasing commitment as exploration progresses is shown to improve both online performance and the resultant policy in the taxicab domain, opening a new avenue for research into when it is more beneficial to continue with the current subtask or to replan.


💡 Research Summary

The paper investigates a subtle but critical problem that arises when off‑policy temporal‑difference (TD) methods are used within hierarchical reinforcement learning (HRL) systems. Traditional HRL approaches, such as Hierarchical Semi‑Markov Q‑learning (HSMQ) and variants that rely on all‑goals or all‑states updating, implicitly assume that once a subtask is started the agent must see it through to termination (a “commitment” to the subtask). This assumption is required because off‑policy backups in methods like Watkins’ Q(λ) or standard Q‑learning become invalid when a non‑greedy (exploratory) action is taken inside a subtask; the higher‑level task would otherwise incorporate the exploratory reward and perform an unintended on‑policy update.

The authors first illustrate the issue with a simple three‑armed bandit problem that is hierarchically decomposed: the root task can either execute a primitive action B (reward 10) or invoke a subtask that chooses between actions A (reward 1) and C (reward 100). If the subtask explores (e.g., chooses A with ε‑greedy probability), the root’s Q‑value for the subtask is contaminated by the exploratory reward, causing the root to prefer the primitive action B half of the time, even though the optimal policy is to always select the subtask and let it eventually choose C. In standard Q‑learning or Q(λ) this contamination forces the algorithm to either discard the backup or clear the eligibility trace whenever a non‑greedy action occurs, effectively enforcing a commitment to complete the subtask before any higher‑level learning can happen.
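The contamination effect is easy to reproduce. The sketch below is a minimal simulation of the bandit example, assuming tabular Q-values, ε-greedy exploration inside the subtask, and a naive HSMQ-style root backup that uses whatever reward the subtask actually obtained; all variable names and the learning rates are illustrative, not from the paper.

```python
import random

random.seed(0)
REWARDS = {"A": 1.0, "B": 10.0, "C": 100.0}  # the three arms from the example
EPS = 0.3            # subtask exploration rate
ALPHA_SUB = 0.1      # subtask learning rate
ALPHA_ROOT = 0.01    # root learning rate (smaller, to smooth the estimate)

q_sub = {"A": 0.0, "C": 0.0}      # subtask chooses between A and C
q_root = {"B": 0.0, "sub": 0.0}   # root chooses B or the subtask

for _ in range(20000):
    # subtask acts epsilon-greedily, so it sometimes explores arm A
    if random.random() < EPS:
        arm = random.choice(["A", "C"])
    else:
        arm = max(q_sub, key=q_sub.get)
    r = REWARDS[arm]
    q_sub[arm] += ALPHA_SUB * (r - q_sub[arm])

    # naive root backup: it absorbs the *realized* subtask reward, so the
    # exploratory pulls of A leak into Q(root, sub) -- an unintended
    # on-policy update
    q_root["sub"] += ALPHA_ROOT * (r - q_root["sub"])
    q_root["B"] += ALPHA_ROOT * (REWARDS["B"] - q_root["B"])

print(q_sub, q_root)
```

The subtask itself learns that C is best (Q near 100), but the root's estimate for the subtask settles around the *behavior-policy* return, roughly (1 − ε/2)·100 + (ε/2)·1 ≈ 85, well below the true off-policy value of 100.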

To overcome this dilemma, the paper proposes two modifications that allow genuine off‑policy learning without requiring full commitment:

  1. One‑step intra‑option learning – When a non‑greedy action occurs in any subtask, the higher‑level backup is simply skipped for that step, and the Q‑update uses the same action’s value from the successor state (i.e., Q(s,a) ← r + γ Q(s′,a)). Because the update is local and the exploration policy satisfies GLIE (greedy in the limit with infinite exploration), convergence is guaranteed without needing the subtask to finish.

  2. Temporal Second‑Difference Traces (TSDT) with gating – TSDT stores local TD errors (δ) for intra‑option learning and can perform off‑policy backups after non‑greedy actions. The authors modify TSDT so that when a non‑greedy action is taken, the trace does not receive a new entry, but previously stored δ’s remain. This preserves much of the trace’s informational flow while preventing contamination from exploratory steps. In deterministic domains, TSDT can propagate reward from the end of a task back to its start within a single episode, achieving performance comparable to Q(λ) with all‑states updating but without the commitment requirement.
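The two fixes above can be sketched in Python. The function and class names below are my own, and the gated structure uses eligibility-trace-style bookkeeping only to mirror the "no new entry on non-greedy steps" rule described above; it is not the paper's exact TSDT update.

```python
from collections import defaultdict


def intra_option_backup(Q, s, a, r, s_next, gamma=0.99, alpha=0.1):
    """One-step intra-option update: Q(s,a) <- r + gamma * Q(s',a).
    Valid even when `a` was exploratory, because the target evaluates
    the same action `a` rather than the greedy action."""
    target = r + gamma * Q[(s_next, a)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


class GatedTrace:
    """Gating sketch: TD errors still flow back through stored entries,
    but a non-greedy step adds no new entry to the trace."""

    def __init__(self, gamma=0.99, lam=0.9, alpha=0.1):
        self.gamma, self.lam, self.alpha = gamma, lam, alpha
        self.entries = []  # [(state, action, weight), ...]

    def step(self, Q, s, a, delta, greedy):
        # propagate the current TD error back to earlier (state, action) pairs
        for (ps, pa, w) in self.entries:
            Q[(ps, pa)] += self.alpha * w * delta
        # decay the stored weights
        self.entries = [(ps, pa, w * self.gamma * self.lam)
                        for (ps, pa, w) in self.entries]
        if greedy:  # gate: only greedy steps extend the trace
            self.entries.append((s, a, 1.0))
```

The key property in both cases is that an exploratory step neither corrupts the higher-level backup nor forces the trace to be cleared, so earlier greedy steps keep receiving credit.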

Building on these mechanisms, the authors introduce Off‑Policy Hierarchical Reinforcement Learning (OPHRL), an algorithm that executes in a non‑hierarchical (polling) fashion: at each decision step the current task is called, and the algorithm may abandon a subtask at any time. OPHRL incorporates task‑specific reward rejection and transformation functions to address hierarchical credit‑assignment issues (e.g., when a subtask receives a penalty that should not affect the parent). The algorithm guarantees convergence to the true value function for each task as long as the exploration policy is non‑starving.
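The polling-style execution can be pictured as follows. This is an illustrative skeleton, not the paper's pseudocode: task structure, names, and the single commitment parameter are all assumptions.

```python
import random


class Task:
    """Minimal task node: either a primitive action or a chooser over
    child subtasks with its own Q-values (illustrative only)."""

    def __init__(self, name, children=None, primitive=None):
        self.name = name
        self.children = children or []
        self.primitive = primitive
        self.q = {c.name: 0.0 for c in self.children}

    def greedy_child(self):
        return max(self.children, key=lambda c: self.q[c.name])


def poll(root, current, commit_prob, rng):
    """One polling decision. With probability `commit_prob` the currently
    active subtask is kept; otherwise the root re-plans, so a subtask can
    be abandoned at any step rather than run to termination."""
    if current is not None and rng.random() < commit_prob:
        subtask = current
    else:
        subtask = root.greedy_child()
    node = subtask
    while node.children:  # descend greedily to a primitive action
        node = node.greedy_child()
    return subtask, node.primitive
```

Calling `poll` once per environment step is what makes the execution non-hierarchical: the hierarchy is re-entered from the root at every decision point instead of blocking inside a subtask until it terminates.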

Empirical evaluation is performed in two benchmark domains:

  • Cliff‑walking – A classic gridworld where stepping off the safe path incurs a large negative reward. The authors compare a flat Q‑learning agent, a conventional HRL agent that commits to subtask completion, a naïve HRL agent (without the proposed fixes), and OPHRL with one‑step intra‑option learning and gated TSDT. Results show that naïve HRL fails to learn the optimal policy because the root task’s Q‑values are corrupted by subtask exploration. OPHRL, even without any commitment, learns as quickly as the flat agent and outperforms the committed HRL baseline. Moreover, when the probability of committing to a subtask is gradually reduced as exploration proceeds (a “commit‑reduction schedule”), online performance improves markedly and the final policy is superior.
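A commit-reduction schedule of the kind described above can be as simple as a linear anneal. The linear form and the parameter names here are my assumption; the summary only requires that the commitment probability decrease as exploration proceeds.

```python
def commit_probability(episode, start=1.0, end=0.0, decay_episodes=500):
    """Linearly anneal the probability of sticking with the current
    subtask: fully committed early in training, free to abandon
    subtasks later. `decay_episodes` is a tunable horizon."""
    frac = min(episode / decay_episodes, 1.0)
    return start + frac * (end - start)
```

The returned value would feed directly into the polling decision each step, so early episodes behave like a conventional committed HRL agent while later episodes may replan freely.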

  • Taxi domain – A more complex environment with stochastic passenger pick‑up/drop‑off and multiple subtasks (navigate, pick up, drop off). Similar trends are observed: OPHRL with decreasing commitment achieves higher cumulative reward during learning and converges to a policy with lower average steps per episode than both flat and committed HRL agents.

The paper’s key contributions are: (1) identifying the hidden on‑policy bias that off‑policy TD methods suffer in hierarchical settings when exploration occurs in subtasks; (2) providing concrete algorithmic fixes (one‑step intra‑option learning and gated TSDT) that preserve off‑policy correctness without forcing subtask completion; (3) demonstrating empirically that eliminating the commitment requirement not only preserves learning correctness but can actually accelerate learning and improve final policies; and (4) opening a new line of inquiry into adaptive commitment strategies—when it is beneficial to persist with a subtask versus abandoning it for exploration.

In summary, the work challenges the long‑standing belief that hierarchical reinforcement learning must enforce full task commitment to be efficient. By carefully modifying off‑policy TD updates and allowing dynamic reduction of commitment, the authors show that HRL can be both theoretically sound and practically more effective. This insight has significant implications for the design of scalable, hierarchical agents in complex, real‑world domains where exploration and flexibility are paramount.

