Grounding LTL Tasks in Sub-Symbolic RL Environments for Zero-Shot Generalization


In this work we address the problem of training a Reinforcement Learning agent to follow multiple temporally-extended instructions expressed in Linear Temporal Logic in sub-symbolic environments. Previous multi-task work has mostly relied on knowledge of the mapping between raw observations and symbols appearing in the formulae. We drop this unrealistic assumption by jointly training a multi-task policy and a symbol grounder with the same experience. The symbol grounder is trained only from raw observations and sparse rewards via Neural Reward Machines in a semi-supervised fashion. Experiments on vision-based environments show that our method achieves performance comparable to using the true symbol grounding and significantly outperforms state-of-the-art methods for sub-symbolic environments.


💡 Research Summary

The paper tackles the challenging problem of training reinforcement‑learning agents to follow temporally‑extended instructions expressed in Linear Temporal Logic (LTL) when the environment provides only raw sensory data (e.g., images) and no explicit mapping from observations to the atomic propositions that appear in the LTL formulas. Existing multi‑task LTL‑based RL methods typically assume a known labeling function L : S → P that tells which propositions are true in each state. This assumption is unrealistic for many real‑world domains such as robotics or video‑game agents, where the mapping must be learned from scratch.
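The "known labeling function" assumption the paper drops can be made concrete with a small sketch. The symbol names and the state encoding below are hypothetical, purely to illustrate what L : S → P provides and why it is unavailable when the agent only sees pixels:

```python
# Illustrative sketch of a hand-coded labeling function L : S -> P.
# Proposition names and the state dictionary are hypothetical.
def labeling_function(state):
    """Map a symbolic state to the set of atomic propositions true in it."""
    props = set()
    if state.get("cell") == "lava":
        props.add("lava")
    if state.get("item") is not None:
        props.add("at_" + state["item"])
    return props

# Previous multi-task methods assume this mapping is given; with raw
# image observations no such function exists and it must be learned.
print(labeling_function({"cell": "lava", "item": None}))  # {'lava'}
```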

To address this, the authors build on Neural Reward Machines (NRMs) and formulate a Semi‑Supervised Symbol Grounding (SSSG) problem. An NRM receives a sequence of raw observations and sparse three‑valued rewards (+1 for a trace that guarantees satisfaction of the LTL goal, −1 for a trace that guarantees falsification, and 0 otherwise). This reward structure follows the LTL₃ semantics and provides minimal but sufficient feedback. By training on many different LTL tasks simultaneously, the agent experiences many progression steps of the formulas, which supplies enough indirect supervision for the grounder to infer the correct mapping between images and symbols, despite the sparsity of the reward signal.
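The three-valued reward described above can be sketched as a function of the progressed task formula: once the formula has collapsed to True, satisfaction is guaranteed on every continuation; once it has collapsed to False, falsification is guaranteed; otherwise the trace is still undetermined. The string-based formula representation is a deliberate simplification, not the paper's implementation:

```python
# Sketch of the LTL3-style three-valued reward (assumed simplification:
# the progressed formula is represented as a plain string).
def ltl3_reward(progressed_formula):
    """+1 if satisfaction is guaranteed, -1 if falsification is, else 0."""
    if progressed_formula == "True":
        return +1
    if progressed_formula == "False":
        return -1
    return 0  # trace still undetermined under LTL3 semantics

print(ltl3_reward("True"), ltl3_reward("False"), ltl3_reward("F a"))  # 1 -1 0
```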

The system consists of four neural modules: (1) a grounder Lθ that maps an observation to a probability distribution over the proposition set P; (2) an image‑feature extractor f_imgθ; (3) an LTL‑feature encoder f_LTLθ that embeds the original formula and all its progressed versions φ(t); and (4) an RL module that receives the concatenated image and LTL embeddings and learns a policy πθ and a value function Vθ (using PPO, SAC, or a similar deep actor‑critic algorithm). At each timestep the most likely proposition p = argmax Lθ(s) is used to progress the current LTL formula via the standard progression operator. The progressed formula is re‑encoded, concatenated with the visual features, and fed to the policy network to select the next action. Unlike LTL2Action, all components are trainable end‑to‑end, so the grounder improves together with the policy.
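The progression step above follows the standard (textbook) LTL progression operator: given the propositions observed at the current step, the formula is rewritten into the residual obligation for the rest of the trace. A minimal recursive sketch over a nested-tuple formula encoding (this encoding is an assumption for illustration, not the paper's exact data structure):

```python
# Minimal sketch of the standard LTL progression operator. Formulas are
# nested tuples, e.g. ("eventually", ("and", "a", ("eventually", "b")))
# for F(a & F b). Hypothetical encoding; simplifications are applied
# eagerly so satisfied/falsified formulas collapse to "True"/"False".
def prog(phi, true_props):
    if phi in ("True", "False"):
        return phi
    if isinstance(phi, str):                     # atomic proposition
        return "True" if phi in true_props else "False"
    op = phi[0]
    if op == "not":
        sub = prog(phi[1], true_props)
        return {"True": "False", "False": "True"}.get(sub, ("not", sub))
    if op == "and":
        l, r = prog(phi[1], true_props), prog(phi[2], true_props)
        if "False" in (l, r):
            return "False"
        if l == "True":
            return r
        return l if r == "True" else ("and", l, r)
    if op == "or":
        l, r = prog(phi[1], true_props), prog(phi[2], true_props)
        if "True" in (l, r):
            return "True"
        if l == "False":
            return r
        return l if r == "False" else ("or", l, r)
    if op == "next":
        return phi[1]
    if op == "eventually":                       # F phi: prog(phi) | F phi
        sub = prog(phi[1], true_props)
        if sub == "True":
            return "True"
        return phi if sub == "False" else ("or", sub, phi)
    raise ValueError("unknown operator: " + op)

# "Reach a, then reach b": progressing through {a} and then {b} satisfies it.
phi = ("eventually", ("and", "a", ("eventually", "b")))
phi1 = prog(phi, {"a"})
print(prog(phi1, {"b"}))  # True
```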

The authors evaluate the approach on two environments: (a) a Minecraft‑like discrete grid world where the agent sees a top‑down image of the grid and must collect items in a specific order while avoiding lava, and (b) a continuous 2‑D navigation task with visual observations. In both cases the true labeling function is hidden. Training is performed on a set of co‑safe LTL formulas (Φ_train); testing uses unseen formulas (Φ_test) to assess zero‑shot generalization. Baselines include (i) an upper bound that uses the true labeling function, (ii) the state‑of‑the‑art multi‑task method Kuo et al. 2020, and (iii) LTL2Action with a perfect grounder.
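The zero-shot protocol amounts to a disjoint split between training and test formulas, so any success on Φ_test comes purely from the learned progression encoding. A hypothetical sketch (the item names and sequencing-task template are illustrative, not the paper's exact task set):

```python
# Hypothetical sketch of the zero-shot train/test split over co-safe
# sequencing tasks. Formula strings and items are illustrative only.
import itertools
import random

items = ["wood", "grass", "iron", "gold"]
# "Reach a, then reach b" tasks: F(a & F b) for every ordered item pair.
tasks = ["F({} & F {})".format(a, b)
         for a, b in itertools.permutations(items, 2)]
random.seed(0)
random.shuffle(tasks)
train, test = tasks[:9], tasks[3 * 3:]           # 9 train / 3 held-out
assert not set(train) & set(test)                # no test task seen in training
print(len(train), len(test))  # 9 3
```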

Results show that the proposed method attains performance within 1–2 % of the upper bound, dramatically outperforming Kuo et al. by 15–20 % on success rate, and matching or exceeding LTL2Action despite not having access to the ground truth symbols. Crucially, on unseen test formulas the learned policy generalizes zero‑shot: it immediately exploits the LTL progression encoding to handle new task structures without any additional training. Ablation studies confirm that the joint training of the grounder and policy is essential; fixing the grounder to a random mapping leads to failure, while pre‑training the grounder alone yields slower convergence.

Key contributions are: (1) a novel semi‑supervised framework that learns symbol grounding directly from raw observations and sparse LTL‑derived rewards; (2) an end‑to‑end architecture that removes the need for a known labeling function, extending LTL2Action to fully sub‑symbolic settings; (3) empirical evidence of strong zero‑shot generalization across unseen LTL tasks; and (4) a comprehensive evaluation on both discrete and continuous visual domains. The paper opens avenues for future work on non‑co‑safe LTL specifications, multi‑agent collaborative grounding, and deployment on real‑world robotic platforms with high‑dimensional sensory inputs.

