Causal Imitation Learning under Expert-Observable and Expert-Unobservable Confounding
We propose a general framework for causal Imitation Learning (IL) with hidden confounders that subsumes several existing settings. Our framework accounts for two types of hidden confounders: (a) variables observed by the expert but not by the imitator, and (b) confounding noise hidden from both. By leveraging trajectory histories as instruments, we reformulate causal IL within our framework as a Conditional Moment Restriction (CMR) problem. We propose DML-IL, an algorithm that solves this CMR problem via instrumental variable regression, and we upper bound its imitation gap. Empirical evaluation on continuous state-action environments, including MuJoCo tasks, demonstrates that DML-IL outperforms existing causal IL baselines.
💡 Research Summary
This paper tackles a fundamental challenge in imitation learning (IL): the presence of hidden confounders that bias the learner's policy when only offline demonstrations are available. Existing works typically address either (a) expert-observable confounders—variables known to the expert but omitted from the dataset—or (b) expert-unobservable confounders—pure noise that affects both states and actions but is invisible to everyone. Real-world systems, however, often contain a mixture of both types, and these confounders may evolve over time and directly influence transition dynamics. To bridge this gap, the authors propose a unified causal IL framework that simultaneously models (i) expert-observable hidden variables $u_t^o$ and (ii) expert-unobservable additive noise $u_t^\epsilon$.
The environment is formalized as an MDP $(S, A, U, P, r, \mu_0, T)$ where each hidden confounder at time $t$ is split into $(u_t^o, u_t^\epsilon)$. The transition function depends on both components, while the reward depends only on $u_t^o$. Two key structural assumptions enable identification: (1) a confounding-noise horizon $k$ such that $u_t^\epsilon$ is independent of $u_{t-k}^\epsilon$ (i.e., the noise decorrelates after a bounded number of steps), and (2) additive noise in the expert's action generation:

$$a_t = \pi_E(s_t, u_t^o) + u_t^\epsilon.$$
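To see why trajectory history can act as an instrument under these assumptions, the following is a minimal sketch on a hypothetical linear instance (the AR(1) dynamics, all coefficients, and the one-step noise horizon are invented for illustration; this is plain two-stage least squares, not the DML-IL algorithm itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear instance of the setting above. The noise u_eps is
# hidden from everyone: it enters the expert's action additively and also
# drives the next state, which confounds naive behavioural cloning. With a
# confounding-noise horizon of k = 1 step, the previous state s_{t-1} is
# uncorrelated with the current noise, so it can serve as an instrument.
T = 20000
theta = 1.5                          # true expert response to the state
s = np.zeros(T)
a = np.zeros(T)
u_eps = rng.normal(size=T)           # expert-unobservable noise (i.i.d.)
for t in range(1, T):
    s[t] = 0.5 * s[t - 1] + 0.8 * u_eps[t] + 0.1 * rng.normal()
    a[t] = theta * s[t] + u_eps[t]   # additive-noise expert action

# Naive regression of a_t on s_t is biased: s_t and u_eps[t] are correlated.
ols = (s[1:] @ a[1:]) / (s[1:] @ s[1:])

# Two-stage least squares with z_t = s_{t-1} (history) as the instrument.
z, x, y = s[:-1], s[1:], a[1:]
x_hat = z * ((z @ x) / (z @ z))      # stage 1: project state on instrument
iv = (x_hat @ y) / (x_hat @ x)       # stage 2: regress action on projection

print(f"true theta = {theta}, OLS = {ols:.2f}, IV = {iv:.2f}")
```

The naive estimate overshoots the true coefficient because the hidden noise pushes state and action in the same direction, while the instrumented estimate recovers it; DML-IL generalises this idea from a single lagged state to full trajectory histories within the CMR formulation.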