Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version)
In this work, we propose a novel framework for the logical specification of non-Markovian rewards in Markov Decision Processes (MDPs) with large state spaces. Our approach leverages Linear Temporal Logic Modulo Theories over finite traces (LTLfMT), a more expressive extension of classical temporal logic in which predicates are first-order formulas of arbitrary first-order theories rather than simple Boolean variables. This enhanced expressiveness enables the specification of complex tasks over unstructured and heterogeneous data domains, promoting a unified and reusable framework that eliminates the need for manual predicate encoding. However, the increased expressive power of LTLfMT introduces additional theoretical and computational challenges compared to standard LTLf specifications. We address these challenges from a theoretical standpoint, identifying a fragment of LTLfMT that is tractable yet sufficiently expressive for reward specification in an infinite-state-space context. From a practical perspective, we introduce a method based on reward machines and Hindsight Experience Replay (HER) to translate first-order logic specifications and address reward sparsity. We evaluate this approach in a continuous-control setting using the theory of non-linear real arithmetic, showing that it enables natural specification of complex tasks. Experimental results show that a tailored implementation of HER is crucial for solving tasks with complex goals.
💡 Research Summary
The paper tackles two fundamental challenges in logic‑based reward specification for reinforcement learning: (i) the inability of traditional Linear Temporal Logic over finite traces (LTLf) to handle non‑Boolean, heterogeneous state information, and (ii) the severe reward sparsity that typically accompanies temporally‑specified goals. To address (i), the authors introduce LTLfMT (Linear Temporal Logic Modulo Theories), an extension of LTLf where atomic propositions are first‑order formulas interpreted over arbitrary background theories (e.g., non‑linear real arithmetic, integer arithmetic, database theories). By delegating the evaluation of these formulas to off‑the‑shelf SMT solvers, the framework eliminates the need for hand‑crafted labeling functions and enables natural expression of complex predicates such as Euclidean distance constraints, object identifiers, and weight limits.
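To make the idea concrete: once every free variable is bound to a value from the current state, checking a first-order atomic predicate reduces to evaluating an arithmetic formula (the paper delegates this to SMT solvers; here a pure-Python sketch suffices). The predicate names, state fields, and thresholds below are illustrative assumptions, not the paper's API.

```python
import math

# Hypothetical first-order predicates over a continuous state.
# "pos", "weight", and the thresholds are illustrative assumptions.

def near_target(state, target, eps=0.1):
    # NRA-style atom: sqrt((x - tx)^2 + (y - ty)^2) < eps
    return math.dist(state["pos"], target) < eps

def weight_ok(state, limit=5.0):
    # Linear-arithmetic atom: weight <= limit
    return state["weight"] <= limit

state = {"pos": (0.95, 1.02), "weight": 3.2}
assert near_target(state, (1.0, 1.0)) and weight_ok(state)
```

The same predicate definitions can be reused across tasks and robot morphologies, which is the reusability the summary attributes to the LTLfMT framework.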
From a theoretical standpoint, the authors identify a tractable fragment of LTLfMT that (a) avoids additional undecidability beyond that of the underlying theory, (b) admits straightforward translation into finite automata (reward machines), and (c) remains expressive enough to capture most practical goals. The translation procedure constructs a reward machine whose states encode the progress of the temporal formula, while transitions are derived directly from the LTL operators (Next, Until, etc.). The resulting product MDP, combining the original environment with the reward machine, restores the Markov property and can be solved with standard RL algorithms.
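A reward machine of this kind can be sketched as a small finite automaton whose transitions fire on the truth values of the (first-order) predicates. The state names, labels, and reward values below are illustrative assumptions; the agent's effective state in the product MDP is the pair (environment state, machine state), which is what restores the Markov property.

```python
# Minimal reward-machine sketch. States u0 -> u1 -> u_acc track progress
# through a two-step temporal goal; labels name predicates that hold in
# the current environment state. All names are illustrative assumptions.
class RewardMachine:
    def __init__(self):
        # delta[(machine_state, predicate_label)] -> (next_state, reward)
        self.delta = {
            ("u0", "at_goal"): ("u1", 0.0),    # first subgoal reached
            ("u1", "delivered"): ("u_acc", 1.0),  # task complete
        }
        self.state = "u0"

    def step(self, true_predicates):
        # Advance on the first matching transition; otherwise self-loop.
        for p in true_predicates:
            key = (self.state, p)
            if key in self.delta:
                self.state, reward = self.delta[key]
                return reward
        return 0.0

rm = RewardMachine()
assert rm.step({"at_goal"}) == 0.0 and rm.state == "u1"
assert rm.step({"delivered"}) == 1.0 and rm.state == "u_acc"
```

Because the reward now depends only on the current product state and action, any off-the-shelf RL algorithm can be applied to the product MDP.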
For (ii), the paper proposes a novel combination of Counterfactual Experiences for Reward Machines (CRM) and Hindsight Experience Replay (HER). CRM replays every environment transition from each reward‑machine state, generating counterfactual experiences that expose the agent to rewards it would otherwise rarely observe, while HER retrospectively treats the final achieved state as a new goal and replays the episode accordingly. Crucially, the structured nature of LTLfMT goals (first‑order formulas over continuous variables) aligns naturally with HER’s goal‑relabeling mechanism, allowing the same logical expression to serve both as a reward predicate and as a goal generator.
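HER's relabeling step can be sketched in a few lines: a failed episode is replayed as if its final achieved state had been the goal, so the last transition becomes a success. The buffer layout, distance-based reward, and threshold below are illustrative assumptions, not the paper's implementation.

```python
import math

def reward(achieved, goal, eps=0.05):
    # Illustrative goal predicate: a distance test, as in the NRA examples.
    return 1.0 if math.dist(achieved, goal) < eps else 0.0

def her_relabel(episode):
    # Hindsight: pretend the final achieved state was the goal all along,
    # then recompute each transition's reward against that new goal.
    new_goal = episode[-1]["achieved"]
    return [
        {**t, "goal": new_goal, "reward": reward(t["achieved"], new_goal)}
        for t in episode
    ]

episode = [  # a failed episode: the original goal (1, 1) was never reached
    {"achieved": (0.2, 0.0), "goal": (1.0, 1.0), "reward": 0.0},
    {"achieved": (0.6, 0.4), "goal": (1.0, 1.0), "reward": 0.0},
]
relabeled = her_relabel(episode)
assert relabeled[-1]["reward"] == 1.0  # final step now counts as a success
```

In the combined HER‑CRM scheme described above, relabeled transitions of this kind would additionally be replayed from every reward‑machine state.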
Empirical evaluation focuses on continuous‑control tasks in MuJoCo, using the Non‑Linear Real Arithmetic (NRA) theory to specify complex objectives such as “reach a target location, pick up an object with a given ID, and keep its weight below a threshold, then deliver it to another location.” Experiments compare three setups: (1) baseline RL with only the raw LTLfMT reward, (2) RL with CRM, and (3) RL with the combined HER‑CRM approach. Results show that HER‑CRM dramatically accelerates learning, achieving higher success rates and lower sample complexity than CRM alone. Moreover, the same logical specification is reused across different robot morphologies and task variations without any additional engineering, demonstrating the reusability promised by the LTLfMT paradigm.
The authors release all code and benchmark data, facilitating reproducibility. In summary, the contribution is threefold: (1) a principled extension of temporal logic that integrates first‑order theories, (2) a tractable fragment and automata‑based compilation pipeline, and (3) a synergistic HER‑CRM scheme that mitigates reward sparsity in continuous domains. This work bridges formal methods and deep RL, offering a scalable, interpretable, and reusable framework for specifying sophisticated, non‑Markovian rewards in high‑dimensional environments.