OSIL: Learning Offline Safe Imitation Policies with Safety Inferred from Non-preferred Trajectories

This work addresses the problem of offline safe imitation learning (IL), where the goal is to learn safe, reward-maximizing policies from demonstrations that lack per-timestep safety cost or reward information. In many real-world domains, online learning in the environment can be risky, and specifying accurate safety costs is difficult. However, it is often feasible to collect trajectories that reflect undesirable or unsafe behavior, implicitly conveying what the agent should avoid; we refer to these as non-preferred trajectories. We propose a novel offline safe IL algorithm, OSIL, that infers safety from non-preferred demonstrations. We formulate safe policy learning as a Constrained Markov Decision Process (CMDP). Instead of relying on explicit safety cost and reward annotations, OSIL reformulates the CMDP problem by deriving a lower bound on the reward-maximizing objective and learning a cost model that estimates the likelihood of non-preferred behavior. Our approach allows agents to learn safe, reward-maximizing behavior entirely from offline demonstrations. We empirically demonstrate that our approach learns safer policies that satisfy cost constraints without degrading reward performance, outperforming several baselines.
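For context, the CMDP formulation referenced above can be written in its standard form. This is a sketch of the generic constrained objective, not taken verbatim from the paper; the thresholds \(d_i\) and discount \(\gamma\) are standard CMDP notation assumed here:

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\right] \le d_i,
\qquad i = 1, \dots, k.
```

OSIL's key departure, as the abstract notes, is that neither \(r\) nor \(c_i\) is observed: the reward term is handled via a lower bound and the costs via a learned model.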


💡 Research Summary

The paper introduces OSIL, a novel offline safe imitation learning framework that learns a policy which both maximizes task performance and respects safety constraints without any per‑step reward or cost annotations. The setting assumes access to two offline datasets: (i) a small set of “non‑preferred” trajectories that achieve high returns but incur large safety violations, and (ii) a large “union” dataset of high‑return trajectories whose safety costs vary widely. The key insight is that the non‑preferred trajectories implicitly encode unsafe behavior, and this signal can be harvested to learn a cost model.

OSIL first formulates the problem as a Constrained Markov Decision Process (CMDP) but, unlike standard CMDPs, it does not assume access to the true reward \(r(s,a)\) or costs \(c_i(s,a)\). Instead, it defines a parametric cost model \(\tilde{c} = g \circ f\), where \(f\) is an encoder mapping state-action pairs to a unit-norm latent vector and \(g\) is a linear head producing a scalar cost that estimates the likelihood of non-preferred behavior.
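The cost model \(\tilde{c} = g \circ f\) described above can be sketched in PyTorch. This is an illustrative implementation, not the authors' code: the hidden sizes, ReLU activation, and the sigmoid used to squash the linear head's output into a likelihood-like scalar are all assumptions; only the overall structure (unit-norm encoder followed by a linear head) comes from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CostModel(nn.Module):
    """Sketch of the parametric cost model c~ = g . f.

    f: encoder mapping (state, action) pairs to a unit-norm latent vector.
    g: linear head producing a scalar cost, passed through a sigmoid
       (an assumption) so it can be read as a likelihood of
       non-preferred behavior.
    """

    def __init__(self, state_dim: int, action_dim: int, latent_dim: int = 64):
        super().__init__()
        # Hypothetical architecture: one hidden layer; sizes are illustrative.
        self.f = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.g = nn.Linear(latent_dim, 1)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        z = self.f(torch.cat([state, action], dim=-1))
        z = F.normalize(z, dim=-1)       # project latent onto the unit sphere
        return torch.sigmoid(self.g(z))  # scalar cost estimate in (0, 1)
```

In a training loop, this model would be fit so that non-preferred trajectories receive high cost and the remaining (union) data low cost; the exact loss is part of OSIL's derivation and is not reproduced here.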

