Context-Sensitive Abstractions for Reinforcement Learning with Parameterized Actions

Real-world sequential decision-making often involves parameterized action spaces that require both discrete action choices and continuous action parameters governing how an action is executed. Existing approaches exhibit severe limitations in this setting: planning methods demand hand-crafted action models, standard reinforcement learning (RL) algorithms are designed for either discrete or continuous actions but not both, and the few RL methods that handle parameterized actions typically rely on domain-specific engineering and fail to exploit the latent structure of these spaces. This paper extends the scope of RL algorithms to long-horizon, sparse-reward settings with parameterized actions by enabling agents to autonomously learn both state and action abstractions online. We introduce algorithms that progressively refine these abstractions during learning, adding fine-grained detail in the critical regions of the state-action space where greater resolution improves performance. Across several continuous-state, parameterized-action domains, our abstraction-driven approach enables TD($λ$) to achieve markedly higher sample efficiency than state-of-the-art baselines.


💡 Research Summary

The paper tackles the challenging setting of reinforcement learning (RL) where actions are parameterized: each decision consists of selecting a discrete action and simultaneously choosing continuous parameters that dictate how that action is executed. Existing solutions either rely on hand‑crafted models for planning, or they treat the discrete and continuous components separately, which leads to inefficiencies, especially in long‑horizon tasks with sparse rewards. To overcome these limitations, the authors propose a context‑sensitive abstraction framework that learns both state and action abstractions online, progressively refining them where higher resolution is most beneficial.

Core ideas

  1. Hierarchical state abstraction – The continuous state space is partitioned into a multi‑level grid. Upper levels contain few large cells, while deeper levels split cells into finer regions. Each cell tracks visitation counts and TD‑error statistics.
  2. Parameterized‑action abstraction – For every discrete action, the associated continuous parameter space is initially represented by a single coarse cluster. As learning proceeds, clusters are split based on error signals.
  3. Precision‑increase policy – When a cell’s TD‑error exceeds a predefined threshold or its visitation count surpasses a limit, the cell and its corresponding parameter clusters are subdivided, yielding a more detailed representation exactly where the agent’s performance is sensitive to resolution.
  4. Integration with TD(λ) – The critic updates the value of abstracted state‑action pairs using TD(λ), preserving eligibility traces for long‑range credit assignment. The actor learns a distribution over parameters for each discrete action; the distribution’s variance automatically shrinks as the abstraction becomes finer, naturally balancing exploration and exploitation.
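Ideas 1–3 can be sketched as a simple region structure that accumulates visitation and TD‑error statistics and splits itself when either exceeds a threshold. This is a minimal illustration, not the paper's implementation: the class name `Cell`, the mean‑|TD‑error| criterion, the bisect‑the‑widest‑dimension rule, and the 50/50 inheritance of parent statistics are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Cell:
    """One region of the continuous state (or parameter) space.

    Illustrative sketch: the paper's exact statistics and split rule are
    not reproduced here; this version tracks mean |TD-error| and visits.
    """
    low: Tuple[float, ...]       # lower corner of the region
    high: Tuple[float, ...]      # upper corner of the region
    visits: int = 0
    td_error_sum: float = 0.0    # accumulated |TD-error|
    children: List["Cell"] = field(default_factory=list)

    def record(self, td_error: float) -> None:
        """Update visitation and TD-error statistics after a visit."""
        self.visits += 1
        self.td_error_sum += abs(td_error)

    def mean_td_error(self) -> float:
        return self.td_error_sum / self.visits if self.visits else 0.0

    def should_split(self, err_threshold: float, visit_limit: int) -> bool:
        # Precision-increase rule: refine where error or visitation is high.
        return self.mean_td_error() > err_threshold or self.visits > visit_limit

    def split(self) -> List["Cell"]:
        """Bisect along the widest dimension; children inherit parent stats."""
        dim = max(range(len(self.low)), key=lambda i: self.high[i] - self.low[i])
        mid = (self.low[dim] + self.high[dim]) / 2.0
        hi_of_lo = tuple(mid if i == dim else h for i, h in enumerate(self.high))
        lo_of_hi = tuple(mid if i == dim else l for i, l in enumerate(self.low))
        self.children = [Cell(self.low, hi_of_lo), Cell(lo_of_hi, self.high)]
        for child in self.children:   # continuity: carry over half the stats
            child.visits = self.visits // 2
            child.td_error_sum = self.td_error_sum / 2.0
        return self.children
```

The same structure could serve as a parameter cluster for a discrete action, with `low`/`high` spanning that action's continuous parameter ranges.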

Algorithmic flow
At each step the agent selects a discrete action according to its policy, samples parameters from the current distribution, executes the joint action, observes the reward and next state, and then updates the critic with TD(λ). The TD‑error and visitation statistics are fed to the abstraction manager, which decides whether to split a state cell or a parameter cluster. Splits inherit statistics from their parent, ensuring continuity of learning.
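The critic update inside that loop can be sketched in tabular form, assuming abstract state cells and parameter clusters have already been mapped to integer indices. The function name, the max-over-actions bootstrap target, and the hyperparameter values are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict

def td_lambda_step(Q, traces, s, a, r, s_next, actions,
                   alpha=0.1, gamma=0.99, lam=0.9):
    """One critic update with TD(λ) over abstract state-action indices.

    s, a, s_next are indices of abstract cells/clusters; the returned
    TD-error would be handed to the abstraction manager to drive splits.
    Tabular form and hyperparameters are illustrative, not the paper's.
    """
    best_next = max(Q[(s_next, b)] for b in actions)   # bootstrap target
    td_error = r + gamma * best_next - Q[(s, a)]
    traces[(s, a)] += 1.0                              # accumulating trace
    for key in list(traces):
        Q[key] += alpha * td_error * traces[key]       # propagate credit
        traces[key] *= gamma * lam                     # decay every trace
    return td_error
```

Because the traces persist across steps, a single sparse reward at the end of an episode propagates credit back through all recently visited abstract state-action pairs in one update.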

Empirical evaluation
Four benchmark domains are used: (i) a robotic arm that must grasp objects, (ii) an autonomous‑driving lane‑change scenario, (iii) a real‑time strategy game with unit commands, and (iv) a warehouse‑robot item‑placement task. All environments feature continuous states, sparse rewards, and actions with one or more continuous parameters. The proposed method is compared against Parameterized‑Action DDPG, a Hybrid Actor‑Critic baseline, and a model‑based planner that requires a handcrafted transition model. Across all domains, the abstraction‑driven approach reaches the target performance 2–5× faster in terms of environment interactions; in the sparsest‑reward setting (the robotic arm), it converges more than five times faster than the strongest baseline. Final policies also achieve 12–18% higher average return. Ablation studies confirm that both the precision‑increase policy and TD(λ) are essential: removing adaptive splitting dramatically slows learning, while replacing TD(λ) with TD(0) reduces final performance due to poorer long‑term credit assignment.

Limitations and future work
The method relies on manually set error thresholds, which can be sensitive to domain characteristics. Splitting high‑dimensional parameter spaces incurs computational overhead, and the current experiments focus on online learning; extending the framework to batch or offline RL remains an open question. The authors suggest future directions such as automated threshold tuning, incorporation of non‑linear dimensionality‑reduction for parameter clustering, and multi‑agent extensions where agents share abstracted representations.

Contribution
In summary, the paper introduces a principled way to let RL agents discover useful abstractions for both states and parameterized actions while learning. By dynamically allocating representational detail to the most performance‑critical regions, the approach dramatically improves sample efficiency and final policy quality in long‑horizon, sparse‑reward tasks that were previously intractable for standard discrete‑or‑continuous RL algorithms.

