Model-Based Reinforcement Learning Combining ω-Regular Objectives and Constraints


📝 Abstract

Reinforcement learning (RL) commonly relies on scalar rewards, which have limited ability to express temporal, conditional, or safety-critical goals and can lead to reward hacking. Temporal logic, expressible via the more general class of $\omega$-regular objectives, addresses this by precisely specifying rich behavioural properties. Even so, measuring performance by a single scalar (be it reward or satisfaction probability) masks safety-performance trade-offs that arise in settings with a tolerable level of risk. We address both limitations simultaneously by combining $\omega$-regular objectives with explicit constraints, allowing safety requirements and optimisation targets to be treated separately. We develop a model-based RL algorithm based on linear programming, which in the limit produces a policy maximising the probability of satisfying an $\omega$-regular objective while also adhering to $\omega$-regular constraints within specified thresholds. Furthermore, we establish a translation to constrained limit-average problems with optimality-preserving guarantees.


📄 Content

Reinforcement learning (RL) (Sutton & Barto, 2018) aims to train agents to accomplish tasks in initially unknown environments purely through interaction. Tasks are often specified using reward functions, but this approach can fail to capture high-level objectives. Many real-world goals are temporal (“eventually reach a target”), conditional (“if signal A occurs, ensure B”), or safety-critical (“always avoid unsafe regions”), and scalar rewards cannot express them naturally. Reward-based formulations are also prone to reward hacking, where agents achieve high reward without fulfilling the intended task.

Consequently, there has been growing interest (Wolff et al., 2012; Ding et al., 2014; Voloshin et al., 2022a; Cai et al., 2023; Fu & Topcu, 2014) in specifying tasks using temporal logics, such as Linear Temporal Logic (LTL), or more generally ω-regular objectives (Baier & Katoen, 2008). These formalisms allow rich combinations of safety, liveness, and sequencing requirements to be expressed compositionally. For instance, LTL can encode tasks like “reach a goal state, but always avoid unsafe regions” or “visit regions A and B infinitely often”, providing clarity and modularity absent in scalar rewards.

Despite their expressivity, ω-regular objectives have limitations. They measure performance as a single satisfaction probability, which can obscure important distinctions between qualitatively different behaviours. In a simple reach-avoid task, one policy may slightly increase the probability of reaching the goal by frequently entering unsafe regions, while another policy may achieve a marginally lower reach probability with substantially higher safety. Optimising solely for the reach-avoid probability would favour the first policy, even though the second is arguably much safer (see Fig. 1). This motivates enriching ω-regular objectives with explicit constraints, separating must-have properties, such as safety, from optimisation targets, such as reachability. Constraints extend expressivity beyond what a single ω-regular formula can capture, more faithfully reflecting the trade-offs present in practical decision-making problems.

Fig. 1 (caption): Example MDP where the objective is to reach target states (labelled t) whilst avoiding unsafe states (labelled u). The LTL objective (□¬u) ∧ (◇t) defines runs where unsafe states are always avoided and a target state is eventually reached. Its optimal policy selects action b, resulting in the unsafe state with 40% probability. Selecting a is always safe and reaches the target state with 50% probability.
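The trade-off in the Fig. 1 example can be made concrete with a quick calculation. The outcome distributions below are a hypothetical reconstruction from the numbers in the caption: action b reaches the unsafe state with probability 0.4 (we assume the remaining 0.6 reaches the target, which is why b is optimal for the single LTL objective), while action a reaches the target with probability 0.5 and is otherwise safe.

```python
# Hypothetical reconstruction of the Fig. 1 reach-avoid example.
# Assumption: probability mass not stated in the caption goes to the
# target under b, and to a safe non-target state under a.
outcomes = {
    "a": {"target": 0.5, "safe": 0.5, "unsafe": 0.0},
    "b": {"target": 0.6, "safe": 0.0, "unsafe": 0.4},
}

for act, dist in outcomes.items():
    reach_avoid = dist["target"]   # P(eventually t AND never u)
    p_unsafe = dist["unsafe"]      # P(eventually u)
    print(act, reach_avoid, p_unsafe)

# Optimising the single objective (G !u) & (F t) picks b (0.6 > 0.5),
# even though b enters the unsafe state 40% of the time; an explicit
# constraint on P(eventually u) rules b out and selects the safe action a.
```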

Our contributions are as follows:

• We introduce a reinforcement learning framework with both ω-regular (LTL) objectives and constraints, which is strictly more expressive than existing formulations and requires more general policy classes.

• We develop a model-based algorithm for this framework leveraging linear programs.

• We present a translation to constrained limit-average problems and establish a corresponding optimality guarantee.
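On the Fig. 1 example, the constrained optimisation the LP addresses can be illustrated with a minimal sketch. This is not the paper's algorithm (which operates on occupancy measures of the full product MDP); with a single decision state and the hypothetical outcome probabilities assumed above, both the objective and the constraint are linear in the one policy parameter p = Pr(play b), so the optimum lies at a vertex of the feasible interval.

```python
# Sketch of the constrained problem on the Fig. 1 example (hypothetical
# numbers, not the paper's full occupancy-measure LP).  A stationary
# randomised policy is a single parameter p = Pr(play b).

def reach_prob(p):       # P(satisfy (G !u) & (F t)) under the policy
    return 0.5 * (1 - p) + 0.6 * p

def unsafe_prob(p):      # P(eventually u) under the policy
    return 0.4 * p

threshold = 0.1          # constraint: unsafe probability <= 10%

# Objective and constraint are linear in p, so the constrained optimum
# is the largest feasible p.
best_p = min(threshold / 0.4, 1.0)
print(best_p, reach_prob(best_p))   # 0.25 0.525
```

Note how the constrained optimum randomises between a and b: it accepts exactly as much risk as the threshold allows, a behaviour no single ω-regular formula expresses.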

Multiple ω-regular objectives have been extensively studied in probabilistic model checking, where the MDP is assumed to be known (e.g., Etessami et al., 2008; Forejt et al., 2011). In particular, similar LP-based planning formulations have been explored in this setting.

In contrast, reinforcement learning typically assumes no prior knowledge of the MDP, and to the best of our knowledge, there is very limited prior work addressing ω-regular or LTL constraints (as opposed to a single ω-regular objective). For unconstrained ω-regular objectives, Perez et al. (2024) propose a model-based approach that estimates a surrogate MDP from data and performs planning on the approximation, while Le et al. (2024) introduce an optimality-preserving translation to average-reward problems.

Finally, the field of Safe RL (see (Gu et al., 2024;Wachi et al., 2024) for surveys) typically studies constrained problems defined by scalar reward and cost signals under discounting, which thus have limited ability to specify natural temporal properties.

We work with finite MDPs M = (S, A, s₀, P), and use D(X) to denote the set of distributions over a countable set X. The transition kernel P of an MDP is a function P : S × A → D(S).
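A finite MDP with this signature can be written down directly; the container below is an illustrative sketch (names and the example instance, based on the Fig. 1 MDP with assumed probabilities, are not from the paper).

```python
# Minimal finite-MDP container: states S, actions A, initial state s0,
# and a transition kernel P : S x A -> D(S) stored as nested dicts.
from dataclasses import dataclass

@dataclass
class MDP:
    states: set
    actions: set
    s0: str
    P: dict  # P[s][a] maps successor states to probabilities

    def validate(self):
        # Each P(s, a) must be a probability distribution over states.
        for s in self.states:
            for a, dist in self.P[s].items():
                assert abs(sum(dist.values()) - 1.0) < 1e-9
                assert all(t in self.states for t in dist)

# Hypothetical instance modelled on Fig. 1 (t = target, u = unsafe).
m = MDP(
    states={"s0", "t", "u"},
    actions={"a", "b"},
    s0="s0",
    P={
        "s0": {"a": {"t": 0.5, "s0": 0.5}, "b": {"t": 0.6, "u": 0.4}},
        "t":  {"a": {"t": 1.0}},
        "u":  {"a": {"u": 1.0}},
    },
)
m.validate()
```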

ω-regular specifications. Let AP be a finite set of atomic propositions. An ω-regular specification is a language L ⊆ (2^AP)^ω over infinite words of valuations of AP. The class of ω-regular languages coincides with the languages recognised by deterministic Rabin automata (DRAs) and includes all languages definable in Linear Temporal Logic (LTL).
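As an illustration, the formula (□¬u) ∧ (◇t) from Fig. 1 is recognised by a small deterministic automaton over AP = {t, u}. The encoding below is a hand-written sketch of such a DRA with one Rabin pair, not output from any translation tool.

```python
# Hand-built deterministic Rabin automaton for (G !u) & (F t).
# States: q0 = "no t seen yet", q1 = "t seen, no u", q2 = rejecting sink.
# Rabin pair (E, F) = ({q2}, {q1}): a run is accepting iff it visits q2
# only finitely often (here: never, since q2 is a sink) and q1 infinitely
# often.

def step(q, valuation):
    """One DRA transition; `valuation` is the set of propositions true now."""
    if q == "q2" or "u" in valuation:
        return "q2"          # unsafe proposition seen: reject forever
    if q == "q1" or "t" in valuation:
        return "q1"          # target reached, still safe
    return "q0"

RABIN_PAIRS = [({"q2"}, {"q1"})]

# Track the automaton state along a finite trace prefix:
q = "q0"
for valuation in [set(), {"t"}, set()]:
    q = step(q, valuation)
print(q)   # "q1": the prefix reached the target without hitting u
```

Running a product of this automaton with the MDP is what reduces LTL satisfaction to an automata-theoretic acceptance condition on the product.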

Policies. In standard constrained MDPs with discounted objectives and constraints, it is sufficient to consider stochastic, stationary policies (Altman, 2021), which are functions π : S → D(A). As we will see, this is not the case for ω-regular objectives and constraints. A more general class of policies π : U × S → D(A) has access to potentially countably infinite memory U, such as the entire history of transitions (U = (S × A)*), and assigns distributions over actions based on the memory element and the current state.
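The two policy classes can be contrasted with a small sketch (illustrative code, not from the paper): a stationary policy reads only the current state, while a memory policy also reads a memory element, here the history of state-action pairs.

```python
# Stationary policy pi : S -> D(A), as a dict of action distributions.
stationary = {"s0": {"a": 0.25, "b": 0.75}}

# Memory policy pi : U x S -> D(A) with U = (S x A)* (the history so far).
# Hypothetical behaviour: play b until b has been tried twice, then
# switch to a -- a dependence on the past that no stationary policy can
# reproduce.
def memory_policy(history, state):
    b_count = sum(1 for (_, act) in history if act == "b")
    return {"a": 1.0} if b_count >= 2 else {"b": 1.0}

print(memory_policy([], "s0"))                          # {'b': 1.0}
print(memory_policy([("s0", "b"), ("s0", "b")], "s0"))  # {'a': 1.0}
```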

