Efficient Hierarchical Reinforcement Learning via Realizable Abstractions
📝 Abstract
The main focus of Hierarchical Reinforcement Learning (HRL) is studying how large Markov Decision Processes (MDPs) can be more efficiently solved when addressed in a modular way, by combining partial solutions computed for smaller subtasks. Despite their very intuitive role for learning, most notions of MDP abstractions proposed in the HRL literature have limited expressive power or do not possess formal efficiency guarantees. This work addresses these fundamental issues by defining Realizable Abstractions, a new relation between generic low-level MDPs and their associated high-level decision processes. The notion we propose avoids non-Markovianity issues and has desirable near-optimality guarantees. Indeed, we show that any abstract policy for Realizable Abstractions can be translated into near-optimal policies for the low-level MDP, through a suitable composition of options. As demonstrated in the paper, these options can be expressed as solutions of specific constrained MDPs. Based on these findings, we propose RARL, a new HRL algorithm that returns compositional and near-optimal low-level policies, taking advantage of the Realizable Abstraction given as input. We show that RARL is Probably Approximately Correct, converges in a polynomial number of samples, and is robust to inaccuracies in the abstraction.
📄 Content
Hierarchical Reinforcement Learning (HRL) is the study of abstractions of decision processes and of how they can be used to improve the efficiency and compositionality of RL algorithms (Barto & Mahadevan, 2003; Abel et al., 2018). To pursue these objectives, most HRL methods augment the low-level Markov Decision Process (MDP) (Puterman, 1994) with some form of abstraction, often a simplified state representation or a high-level policy. As was identified early on (Dayan & Hinton, 1992), compositionality is arguably one of the most important features of HRL algorithms, as it is commonly associated with increased efficiency (Wen et al., 2020) and policy reuse for downstream tasks (Brunskill & Li, 2014; Abel et al., 2018; Tasse et al., 2020; 2022). A common intuition drives many authors in HRL: abstract states correspond to sets of ground states, and abstract actions correspond to sequences of ground actions. This intuition has been evident since the early work in HRL (Dayan & Hinton, 1992) and was largely inherited from hierarchical planning. However, the main question that remains unanswered is: to which sequence of ground actions should each abstract action correspond? Answering this question also requires identifying a suitable state abstraction, and the resulting notion of MDP abstraction has a strong impact on the applicability and the guarantees of the associated HRL methods.
There is no shared consensus on what “MDP abstraction” should refer to. In the literature, the term is loosely used for a variety of concepts, including state partitions (Abel et al., 2020; Wen et al., 2020), bottleneck states (Jothimurugan et al., 2021b), subtasks (Nachum et al., 2018; Jothimurugan et al., 2021a), options (Precup & Sutton, 1997; Khetarpal et al., 2020), entire MDPs (Ravindran & Barto, 2002; Cipollone et al., 2023), or even natural language (Jiang et al., 2019). In addition, most HRL methods have been validated only experimentally (Nachum et al., 2018; Jinnai et al., 2020; Jothimurugan et al., 2021b; Lee et al., 2021; 2022b), leading to a limited theoretical understanding of general abstractions and their use in RL. Some notable exceptions (Brunskill & Li, 2014; Fruit et al., 2017; Fruit & Lazaric, 2017; Wen et al., 2020) give formal definitions of state and temporal abstractions and provide formal near-optimality and efficiency guarantees. However, they do not define a high-level decision process, instead enforcing requirements directly on the ground MDP that are often impractical.
In this work, we propose a new formal definition of MDP abstractions, based on a high-level decision process. This definition enables algorithms, such as the one proposed in this paper, that do not require specific knowledge about the ground MDP. In particular, we identify a new relation that links generic low-level MDPs to their high-level representations, and we use a second-order MDP as the high-level decision process in order to overcome non-Markovian dependencies. As we show, the abstractions we propose are widely applicable, do not incur the non-Markovian effects that are often found in HRL (Bai et al., 2016; Nachum et al., 2018; Jothimurugan et al., 2021b), and provide near-optimality guarantees for their associated low-level policies. Such near-optimal policies result purely from the composition of smaller options, without any global constraint; it is because of this feature that we call them Realizable Abstractions. Another important feature of our work is that we do not restrict the cardinality of the state and action spaces of the ground MDP, which need not be finite. Instead, we require that the abstract decision process have finite state and action sets, so that we can compute an exact tabular representation of the abstract value function.
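To make the composition idea concrete, here is a minimal sketch, not the paper's implementation, of how an abstract policy together with a dictionary of options induces a ground policy: in each ground state, compute the abstract state, let the abstract policy pick an abstract action, and follow the option realizing that action until it terminates. All names here (`phi`, `options`, `run_composed_policy`) are our own illustrative assumptions.

```python
def run_composed_policy(env_step, s0, phi, abstract_policy, options, horizon):
    """Roll out the composition of options prescribed by an abstract policy.

    env_step(s, a) -> s'         : ground transition function
    phi(s) -> z                  : state abstraction (ground -> abstract)
    abstract_policy[z] -> a_bar  : abstract action chosen in abstract state z
    options[(z, a_bar)]          : pair (act, terminated) where act(s) gives
                                   the ground action and terminated(s) tests
                                   the option's termination condition
    """
    s, trajectory, t = s0, [s0], 0
    while t < horizon:
        z = phi(s)                                  # current abstract state
        act, terminated = options[(z, abstract_policy[z])]
        # Follow the option until its termination condition fires.
        while t < horizon and not terminated(s):
            s = env_step(s, act(s))
            trajectory.append(s)
            t += 1
        # Stop if the abstract state did not change (goal region reached
        # or step budget exhausted); otherwise continue from the new z.
        if phi(s) == z:
            break
    return trajectory
```

In a realizable abstraction, each option is only responsible for steering the ground state toward the next abstract state; the sketch above shows that no global coordination beyond the abstract policy is needed.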
We also address the associated algorithmic question of how to learn the ground options that realize each high-level behavior. As we show, the realization problem, that is, the problem of learning suitable options from experience, can be cast as a Constrained MDP (CMDP) (Altman, 1999) and solved with off-the-shelf online RL algorithms for CMDPs (Zhang et al., 2020; Ding et al., 2022). Based on these principles, we develop a new algorithm, called RARL (for “Realizable Abstractions RL”), which learns compositional policies for the ground MDP and is Probably Approximately Correct (PAC) (Fiechter, 1994). An additional novelty of this work is that the proposed algorithm iteratively refines the high-level decision process given as input by sampling in the ground MDP, and exploits the solution of the current abstraction to drive exploration in the ground MDP.
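As a rough illustration of solving a constrained objective, one textbook technique (a Lagrangian relaxation with a fixed multiplier; this is not necessarily the method used by the cited CMDP algorithms) folds the cost into the reward and runs ordinary value iteration on the shaped reward. The toy tabular setup below is entirely our own assumption.

```python
import numpy as np

def lagrangian_value_iteration(P, r, c, lam, gamma=0.95, iters=500):
    """Approximately solve max_pi E[sum_t gamma^t (r - lam * c)] by value
    iteration on the Lagrangian-shaped reward.

    P   : (S, A, S) transition probability tensor
    r   : (S, A) reward matrix
    c   : (S, A) cost matrix (the constrained quantity)
    lam : fixed Lagrange multiplier penalizing cost
    """
    shaped = r - lam * c                  # fold cost into reward
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = shaped + gamma * P @ V        # (S, A) Bellman backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V            # greedy policy and its value
```

With a suitably chosen multiplier, the greedy policy trades off reaching the target against accumulated cost; a full CMDP solver would additionally adjust `lam` (e.g. by dual ascent) to meet the constraint exactly.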
The contributions of this work are theoretical and algorithmic. We propose Realizable Abstractions (Definition 2), we show a formal relation between the abstract and the ground values (Theorem 1 and Corollary 3), and we provide original insights on the conditions that must be met to reduce the effective horizon in the abstraction.
This content is AI-processed based on ArXiv data.