Offline Discovery of Interpretable Skills from Multi-Task Trajectories
Hierarchical Imitation Learning is a powerful paradigm for acquiring complex robot behaviors from demonstrations. A central challenge, however, lies in discovering reusable skills from long-horizon, multi-task offline data, especially when the data lacks explicit rewards or subtask annotations. In this work, we introduce LOKI, a three-stage end-to-end learning framework designed for offline skill discovery and hierarchical imitation. The framework begins with a two-stage, weakly supervised skill discovery process: Stage one performs coarse, task-aware macro-segmentation by employing an alignment-enforced Vector Quantized VAE guided by weak task labels. Stage two then refines these segments at a micro-level using a self-supervised sequential model, followed by an iterative clustering process to consolidate skill boundaries. The third stage leverages these precise boundaries to construct a hierarchical policy within an option-based framework, complete with a learned termination function β for explicit skill switching. LOKI achieves high success rates on the challenging D4RL Kitchen benchmark and outperforms standard HIL baselines. Furthermore, we demonstrate that the discovered skills are semantically meaningful, aligning with human intuition, and exhibit compositionality by successfully sequencing them to solve a novel, unseen task.
💡 Research Summary
The paper introduces LOKI, a three‑stage framework for offline discovery of interpretable robot skills from long‑horizon, multi‑task demonstration data. The authors identify three major challenges in hierarchical imitation learning (HIL): ambiguous skill boundaries, lack of reusable and human‑aligned skills, and poor scalability to multi‑task settings. LOKI addresses these by first performing a coarse, task‑aware macro‑segmentation using an Enforced Vector‑Quantized Variational Auto‑Encoder (EVQ‑VAE). The EVQ‑VAE extends the standard VQ‑VAE by conditioning the quantization on weak task labels and adding a codebook divergence loss that forces distinct codebook vectors to stay far apart, thus preventing collapse. Entropy of the softmax‑normalized distances between encoder outputs and codebook entries is used as a signal for change‑point detection; low entropy indicates task‑specific (extrinsic) skills, while high entropy signals more generic (intrinsic) behaviors. The PELT algorithm identifies these entropy spikes, yielding “General Skills” (GS) that roughly correspond to macro‑segments of the trajectories.
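The entropy signal driving Stage 1 can be sketched concretely. Below is a minimal NumPy illustration (not the authors' implementation) of the two pieces described above: the per-timestep entropy of softmax-normalized distances between encoder outputs and codebook entries, and a toy spike detector standing in for the PELT algorithm, which the paper applies to the same signal. Array shapes and the z-score threshold are assumptions for illustration.

```python
import numpy as np

def codebook_entropy(encodings, codebook):
    """Per-timestep entropy of the softmax over negative squared distances
    between encoder outputs (T, d) and codebook vectors (K, d).
    Low entropy = one code clearly wins (task-specific/extrinsic skill);
    high entropy = ambiguous assignment (generic/intrinsic behavior)."""
    d2 = ((encodings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)   # (T,)

def entropy_changepoints(entropy, z_thresh=2.0):
    """Toy stand-in for PELT: flag timesteps whose entropy jump exceeds
    z_thresh standard deviations of the step-to-step differences."""
    diffs = np.abs(np.diff(entropy))
    mu, sigma = diffs.mean(), diffs.std() + 1e-12
    return np.where(diffs > mu + z_thresh * sigma)[0] + 1
```

In practice a proper PELT solver (e.g., from the `ruptures` package) would replace the threshold rule, but the interface is the same: entropy trace in, macro-segment boundaries out.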
In the second stage, each GS is refined into finer “Independent Skills” (IS) via a weakly‑supervised micro‑segmentation. A sliding‑window sequential VAE reconstructs fixed‑length windows sampled from the macro‑segments. The loss combines reconstruction error with a KL‑divergence term that regularizes the posterior over latent codes against a prior conditioned only on the window’s initial state and the weak task label. This design ensures that the latent code z, which represents a skill, can be inferred from the current state alone, a crucial property for execution without future information. Temporal stability of the latent codes and an iterative clustering with task‑skill alignment further sharpen the skill boundaries.
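The key constraint in Stage 2 is the shape of the KL term: the posterior sees the whole window, but the prior is conditioned only on the window's first state and the weak task label. A minimal NumPy sketch of that loss, under assumed diagonal-Gaussian posteriors and priors (the network architectures producing these statistics are omitted):

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over the latent dimension (last axis)."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0,
        axis=-1,
    )

def window_vae_loss(x_window, x_recon, mu_q, logvar_q, mu_p, logvar_p, beta=1.0):
    """Reconstruction + beta * KL(q(z | window) || p(z | s0, task)).
    Because the prior depends only on the window's initial state and the
    weak task label, the skill code z remains inferable at execution time
    without access to future observations."""
    recon = np.mean((x_window - x_recon) ** 2)
    kl = np.mean(gaussian_kl(mu_q, logvar_q, mu_p, logvar_p))
    return recon + beta * kl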
The third stage builds an executable hierarchical policy. Each discovered IS becomes an option with a low‑level policy π_low and a learned termination function β. A high‑level meta‑controller π_high selects which option to activate based on the current state and task goal. By learning β, LOKI explicitly models when a skill should terminate, reducing discontinuities at skill switches and enabling smooth composition.
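The control flow of this third stage follows the standard call-and-return options model: the meta-controller picks an option, the option's low-level policy runs until its termination function fires, and control returns to the meta-controller. A minimal sketch of that loop (all names and the toy environment interface are hypothetical, not LOKI's API):

```python
import random

class Option:
    """One discovered Independent Skill: a low-level policy plus a
    learned termination probability beta(state)."""
    def __init__(self, policy, beta):
        self.policy = policy      # state -> action
        self.beta = beta          # state -> termination prob in [0, 1]

def run_hierarchy(env_step, pi_high, options, state, horizon, rng=random.random):
    """Call-and-return execution: pi_high selects an option from the current
    state; that option acts until its beta fires, then control returns."""
    trace, active = [], None
    for _ in range(horizon):
        if active is None:
            active = pi_high(state)                 # meta-controller choice
        action = options[active].policy(state)
        state = env_step(state, action)
        trace.append((active, action))
        if rng() < options[active].beta(state):     # learned termination
            active = None                           # hand control back
    return state, trace
```

Learning β rather than fixing a constant option length is what lets skill switches land at natural boundaries, which is the smoothness property the paragraph above describes.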
Experiments are conducted on the D4RL Kitchen benchmark, which involves complex cooking tasks requiring multiple sub‑behaviors (e.g., opening a fridge, picking ingredients, turning on a stove). LOKI achieves significantly higher success rates than prior HIL baselines such as OptionGAN and HAC, while using the same offline data. Qualitative analysis shows that the discovered skills align with human‑intuitive actions, and the authors demonstrate compositionality by recombining existing skills to solve a novel, unseen recipe. Ablation studies confirm that both the task‑conditional quantization in EVQ‑VAE and the KL‑regularized micro‑segmentation are essential for performance and interpretability.
Key contributions include: (1) a novel EVQ‑VAE architecture that leverages weak supervision for macro‑segmentation; (2) a weakly‑supervised micro‑segmentation pipeline that yields high‑precision, task‑independent skill boundaries; (3) integration of learned termination functions within an option‑based hierarchical policy; and (4) empirical evidence of skill interpretability and compositionality on a challenging benchmark.
The paper also discusses limitations: sensitivity to hyper‑parameters such as window size and codebook cardinality, potential over‑segmentation in very long trajectories, and the current focus on low‑level joint control without direct integration of vision‑language inputs. Future work is suggested to extend LOKI to vision‑language‑action models, real‑time robot deployment, and automated hyper‑parameter tuning for broader applicability. Overall, LOKI presents a compelling solution for extracting human‑aligned, reusable skills from offline multi‑task data and demonstrates that such skills can be effectively leveraged for hierarchical imitation learning.