Task-Guided IRL in POMDPs that Scales
In inverse reinforcement learning (IRL), a learning agent infers a reward function encoding the underlying task using demonstrations from experts. However, many existing IRL techniques make the often unrealistic assumption that the agent has access to full information about the environment. We remove this assumption by developing an algorithm for IRL in partially observable Markov decision processes (POMDPs). We address two limitations of existing IRL techniques. First, they require an excessive amount of data due to the information asymmetry between the expert and the learner. Second, most of these techniques require solving the computationally intractable forward problem – computing an optimal policy given a reward function – in POMDPs. The developed algorithm reduces the information asymmetry and increases data efficiency by incorporating task specifications expressed in temporal logic into IRL. Such specifications may be interpreted as side information available to the learner a priori, in addition to the demonstrations. Further, the algorithm avoids a common source of algorithmic complexity by using causal entropy, rather than entropy, as the measure of the likelihood of the demonstrations. Nevertheless, the resulting problem is nonconvex due to the so-called forward problem. We handle the intrinsic nonconvexity of the forward problem in a scalable manner through a sequential linear programming scheme that is guaranteed to converge to a locally optimal policy. In a series of examples, including experiments in a high-fidelity Unity simulator, we demonstrate that, even with a limited amount of data and POMDPs with tens of thousands of states, our algorithm learns reward functions and policies that satisfy the task while inducing behavior similar to the expert's by leveraging the provided side information.
💡 Research Summary
This paper introduces a scalable inverse reinforcement learning (IRL) algorithm designed for partially observable Markov decision processes (POMDPs). Traditional IRL methods assume full observability and require solving the forward problem—computing an optimal policy for a given reward—which is computationally intractable in POMDPs. Moreover, the information asymmetry between an expert (who has full knowledge) and a learner (who only receives observations) forces existing approaches to rely on large amounts of demonstration data.
To overcome these challenges, the authors incorporate three key innovations. First, they treat task specifications expressed in temporal logic (e.g., LTL or STL) as side information that is available a priori to the learner. These specifications encode safety constraints, ordering requirements, and goal conditions, thereby regularizing the reward inference and dramatically reducing the number of demonstrations needed.
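As a toy illustration of how such a specification constrains behavior (hypothetical code, not from the paper), a finite trace can be checked against a simple co-safe spec of the form "always avoid unsafe states, and eventually reach the goal":

```python
def satisfies_spec(trace, goal, unsafe):
    """Evaluate the spec G(not unsafe) AND F(goal) over a finite trace.

    `trace` is a list of state labels; `goal` is the target label and
    `unsafe` is a set of forbidden labels. A learner can use such a
    check as side information, e.g. to restrict the feasible policy
    set before inferring rewards from demonstrations.
    """
    avoids_unsafe = all(state not in unsafe for state in trace)
    reaches_goal = any(state == goal for state in trace)
    return avoids_unsafe and reaches_goal


# A trace that skirts the pit and reaches the goal satisfies the spec.
ok = satisfies_spec(["start", "corridor", "goal"], "goal", {"pit"})
# Entering the pit violates the safety part, even if the goal is reached.
bad = satisfies_spec(["start", "pit", "goal"], "goal", {"pit"})
```

In the paper's setting the specification is imposed as a constraint during optimization rather than as a post-hoc filter, but the check above conveys what the spec rules in and out.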
Second, instead of the conventional entropy‑based likelihood, the method adopts causal entropy as the probabilistic model of the demonstrations. Causal entropy conditions each action only on past observations and actions, never on future ones, which aligns naturally with the partial‑observability setting and enables efficient computation of the demonstration likelihood without full Bayesian inference.
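As a sketch of the distinction, in standard maximum-causal-entropy notation (not notation taken from the paper): for actions $A_{1:T}$ and observations $O_{1:T}$, the causal entropy conditions each action only on the past,

```latex
H(A_{1:T} \,\|\, O_{1:T}) \;=\; \sum_{t=1}^{T} \mathbb{E}\!\left[ -\log \pi\!\left( A_t \mid O_{1:t},\, A_{1:t-1} \right) \right],
```

whereas the ordinary conditional entropy $H(A_{1:T} \mid O_{1:T})$ conditions every action on the full observation sequence, including observations the agent has not yet received.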
Third, the intrinsic non‑convexity of the forward problem is addressed through a sequential linear programming (SLP) scheme. Starting from an initial policy, the algorithm linearizes the forward dynamics and the temporal‑logic constraints, solves a linear program to update the reward parameters and policy, and repeats until convergence. Although SLP yields only a locally optimal solution, the authors prove convergence and demonstrate that the resulting policies satisfy the temporal‑logic specifications.
The overall procedure iterates between (1) evaluating the causal‑entropy likelihood of the expert demonstrations under the current reward estimate, (2) solving the linearized forward problem with the temporal‑logic constraints, and (3) updating the reward parameters. This loop continues until the policy and reward parameters stabilize.
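The linearize-solve-update pattern can be sketched on a one-dimensional toy problem (illustrative only; the paper's linear programs are over policy and reward variables, not a scalar): minimize a nonconvex objective by repeatedly minimizing its linearization over a trust region, shrinking the region whenever the linear model's step fails to improve the true objective.

```python
def slp_minimize(f, grad, x0, lo, hi, radius=0.5, shrink=0.5, tol=1e-8, max_iter=200):
    """Sequential linear programming on a 1-D toy problem.

    At each iterate, minimize the linearization f(x) + g * d over the
    trust region [x - radius, x + radius] intersected with [lo, hi].
    In 1-D that LP's solution is simply the endpoint in the downhill
    direction. Accept the step only if the true objective improves;
    otherwise shrink the trust region. This drives the iterates to a
    local (not global) minimizer, mirroring the SLP guarantee above.
    """
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        step = -radius if g > 0 else radius   # LP solution: move downhill
        cand = min(max(x + step, lo), hi)     # clip to the feasible box
        if f(cand) < f(x):
            x = cand                          # linear model agreed with f
        else:
            radius *= shrink                  # model too coarse: shrink
            if radius < tol:
                break
    return x


# Nonconvex objective with local minima near x = -1.30 and x = 1.13.
f = lambda x: x**4 - 3 * x**2 + x
grad = lambda x: 4 * x**3 - 6 * x + 1
x_star = slp_minimize(f, grad, x0=1.5, lo=-2.0, hi=2.0)
```

Started at `x0 = 1.5`, the iterates settle into the nearby local minimum rather than the global one, which is exactly the locally-optimal behavior the convergence guarantee describes.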
Experimental validation covers two domains. In a small grid‑world POMDP, the algorithm recovers rewards that reproduce expert trajectories while respecting the supplied logical constraints. In a high‑fidelity Unity simulator, the method tackles a POMDP with tens of thousands of states and thousands of observations using only 20–30 expert demonstrations. Results show that the proposed approach matches expert behavior, satisfies all temporal‑logic specifications, and outperforms baseline IRL methods by factors of five in data efficiency and ten in computation time. The learned reward functions are interpretable, and the resulting policies are computationally light enough for real‑time execution.
In summary, the paper makes three principal contributions: (1) leveraging temporal‑logic task specifications as side information to mitigate data scarcity, (2) employing causal entropy to simplify likelihood computation in partially observable settings, and (3) introducing a scalable SLP‑based optimizer that handles the non‑convex forward problem while guaranteeing convergence to a locally optimal solution. The combination of these ideas enables IRL to scale to large‑scale POMDPs, opening the door for practical deployment in robotics, autonomous driving, and other domains where full observability is unrealistic and data collection is expensive.