CoMI-IRL: Contrastive Multi-Intention Inverse Reinforcement Learning


Inverse Reinforcement Learning (IRL) seeks to infer reward functions from expert demonstrations. When demonstrations originate from multiple experts with different intentions, the problem is known as Multi-Intention IRL (MI-IRL). Recent deep generative MI-IRL approaches couple behavior clustering and reward learning, but typically require prior knowledge of the true number of behavioral modes $K^*$. This reliance on expert knowledge limits their adaptability to new behaviors and permits analysis only of the learned rewards, not of the behavior modes used to train them. We propose Contrastive Multi-Intention IRL (CoMI-IRL), a transformer-based unsupervised framework that decouples behavior representation and clustering from downstream reward learning. Our experiments show that CoMI-IRL outperforms existing approaches without a priori knowledge of $K^*$ or labels, while allowing visual interpretation of behavior relationships and adaptation to unseen behaviors without full retraining.


💡 Research Summary

CoMI‑IRL tackles the Multi‑Intention Inverse Reinforcement Learning (MI‑IRL) problem by completely separating behavior clustering from reward learning. Traditional MI‑IRL methods embed a latent categorical code c into a single model, coupling clustering and reward inference and requiring a pre‑specified number of behavior modes K. This coupling creates three major drawbacks: (i) the true number of intentions K* is often unknown, leading to misspecification; (ii) adding new behaviors forces a full retraining of the model; (iii) errors in clustering and reward learning are entangled, making diagnosis and interpretation difficult.

CoMI‑IRL replaces the coupled architecture with an unsupervised contrastive pipeline. First, each trajectory is quantile‑normalized and split into separate state and action streams. Both streams pass through Random Fourier Feature (RFF) encoders, shallow MLPs, and 1‑D CNNs to capture short‑range patterns while mitigating the spectral bias of pure MLPs. Positional (time) embeddings and modality embeddings (state vs. action) are added, the two streams are interleaved, and a CLS token is prepended. A Transformer encoder processes the whole sequence, and the final CLS token is L2‑normalized to lie on the unit hypersphere, yielding a compact trajectory embedding zτ.
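A minimal sketch of this embedding pipeline follows. All shapes and dimensions are hypothetical; to keep the example self-contained, the Transformer encoder and CLS readout are replaced by a simple mean-pool, the RFF projection is redrawn per call rather than fixed and followed by learned MLP/CNN layers, and the positional/modality embeddings are crude scalar offsets:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_encode(x, n_features=64, sigma=1.0):
    """Random Fourier Features: phi(x) = sqrt(2/D) * cos(x W + b).
    W, b are redrawn per call here; a real encoder would fix them and
    stack MLP / 1-D CNN layers on top."""
    d = x.shape[-1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ W + b)

def embed_trajectory(states, actions):
    s = rff_encode(states)                 # (T, D) state tokens
    a = rff_encode(actions)                # (T, D) action tokens
    T, D = s.shape
    pos = np.arange(T)[:, None] / T        # crude positional embedding
    tokens = np.empty((2 * T, D))
    tokens[0::2] = s + pos                 # modality offset 0: states
    tokens[1::2] = a + pos + 1.0           # modality offset 1: actions
    z = tokens.mean(axis=0)                # stand-in for Transformer + CLS
    return z / np.linalg.norm(z)           # project onto unit hypersphere

states = rng.normal(size=(50, 4))          # toy 4-D state trajectory
actions = rng.normal(size=(50, 2))         # toy 2-D action trajectory
z_tau = embed_trajectory(states, actions)  # unit-norm embedding z_tau
```

The final L2 normalization is what places every trajectory embedding on the unit hypersphere, so cosine similarity between trajectories reduces to a dot product in all downstream steps.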

Contrastive learning is performed at multiple granularities. Two dropout‑augmented views of each trajectory form a positive pair for a symmetric InfoNCE loss (LCLS). Additionally, n short segments are sampled from each trajectory; InfoNCE is applied between the global embedding and each segment (LSEG) and between all segment pairs (LPAIR) to enforce local‑global consistency and tighten intra‑cluster cohesion. A Deep InfoMax (DIM) term maximizes mutual information between global and local features, stabilizing the representation. The total loss is a weighted sum: L = αLCLS + βLDIM + γLSEG + δLPAIR.
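The symmetric InfoNCE term at the heart of these losses can be sketched as follows. This is a generic implementation, not the paper's exact code; the batch, noise level, and temperature are illustrative, and small Gaussian noise stands in for the dropout augmentation:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Symmetric InfoNCE: row i of z1 and row i of z2 form a positive
    pair; every other row in the batch acts as a negative."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature       # (N, N) scaled cosine sims

    def xent_diag(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))     # positives sit on the diagonal

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 16))
# Two "dropout-style" views of the same batch (noise stands in for dropout).
view1 = z + 0.01 * rng.normal(size=z.shape)
view2 = z + 0.01 * rng.normal(size=z.shape)

l_cls = info_nce(view1, view2)             # analogous to L_CLS
l_mismatched = info_nce(view1, rng.normal(size=z.shape))
# Full objective (weights a..d hypothetical): L = a*L_CLS + b*L_DIM + c*L_SEG + d*L_PAIR
```

Matched views yield a much smaller loss than mismatched ones, which is exactly the signal that pulls augmented views (and, via LSEG/LPAIR, segments of the same trajectory) together on the hypersphere.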

The learned embeddings are clustered without any prior K. A k‑nearest‑neighbor graph is built using cosine similarity as edge weights. If the graph naturally splits into multiple connected components, each component becomes a cluster. When the graph remains fully connected, the Leiden community‑detection algorithm is applied to find modular subgraphs. To enrich the similarity measure, Jacobian‑based sensitivity features (estimated via finite differences) are incorporated as auxiliary edge weights, capturing dynamics independent of reward structure.
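A dependency-free sketch of the connected-components stage is below. The Leiden refinement (used when the graph stays in one component) and the Jacobian-based auxiliary edge weights are omitted here; the data, k, and cluster geometry are synthetic assumptions:

```python
import numpy as np

def knn_graph_clusters(Z, k=5):
    """Build a cosine-similarity kNN graph and label each connected
    component as a cluster, with no prior K."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = Zn @ Zn.T
    np.fill_diagonal(sim, -np.inf)          # no self-edges
    nbrs = np.argsort(-sim, axis=1)[:, :k]  # k most similar neighbors
    n = len(Z)
    adj = [set() for _ in range(n)]
    for i in range(n):                      # symmetrize the adjacency
        for j in nbrs[i]:
            adj[i].add(int(j)); adj[int(j)].add(i)
    labels, c = [-1] * n, 0
    for start in range(n):                  # flood-fill components
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            u = stack.pop()
            if labels[u] == -1:
                labels[u] = c
                stack.extend(adj[u])
        c += 1
    return labels

rng = np.random.default_rng(2)
# Two well-separated synthetic "behavior modes" in a toy 2-D embedding space.
mode_a = rng.normal(loc=[5.0, 0.0], scale=0.1, size=(10, 2))
mode_b = rng.normal(loc=[-5.0, 0.0], scale=0.1, size=(10, 2))
labels = knn_graph_clusters(np.vstack([mode_a, mode_b]))
```

Because the two modes point in opposite directions, every point's k nearest cosine neighbors lie in its own mode, so the graph splits into exactly two components and the cluster count emerges from the data rather than from a preset K.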

For each discovered cluster, a single‑intention IRL algorithm (e.g., GAIL, AIRL, or other deep IRL variants) is run independently, producing a distinct reward function. This decoupling ensures that reward learning operates on semantically coherent behavior groups identified purely from trajectory similarity.
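The dispatch step itself is simple and can be sketched as a group-by over cluster labels. `toy_irl` below is a deliberately trivial placeholder (mean feature vector per cluster) standing in for a real GAIL/AIRL run; all names and shapes are illustrative:

```python
import numpy as np
from collections import defaultdict

def learn_rewards_per_cluster(trajectories, labels, irl_fn):
    """Run an independent single-intention IRL solver on each discovered
    cluster; `irl_fn` stands in for GAIL, AIRL, or another deep IRL variant."""
    groups = defaultdict(list)
    for traj, c in zip(trajectories, labels):
        groups[c].append(traj)
    return {c: irl_fn(trajs) for c, trajs in groups.items()}

# Placeholder "IRL": reward weights = mean feature vector of the cluster
# (a real run would train a reward/discriminator network instead).
def toy_irl(trajs):
    return np.mean([t.mean(axis=0) for t in trajs], axis=0)

rng = np.random.default_rng(3)
trajs = [rng.normal(size=(20, 4)) for _ in range(6)]
labels = [0, 0, 1, 1, 2, 2]                # cluster labels from the graph step
rewards = learn_rewards_per_cluster(trajs, labels, toy_irl)
```

One reward model per cluster means a clustering mistake degrades only that cluster's reward, which is what makes errors diagnosable compared with coupled latent-code approaches.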

When new, unseen trajectories appear, CoMI‑IRL adapts without retraining the entire pipeline. The Behavioral Encoder is fine‑tuned on a mixture of old and new data, with a stability regularizer (Lstab) that penalizes cosine deviation between the frozen reference embeddings and the updated embeddings, thus preserving the original geometry. After fine‑tuning, a two‑stage clustering is performed: (1) re‑cluster the previously seen data while enforcing the original number of baseline clusters K, recovering centroids and radii in the new embedding space; (2) assign new trajectories that fall within a radius to the corresponding baseline cluster, and treat trajectories outside all radii as novel candidates. Novel candidates are then sub‑clustered using the same graph‑based method, yielding new clusters and triggering reward learning for the newly discovered intentions.
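Two pieces of this adaptation scheme lend themselves to short sketches: the stability regularizer and the radius-based assignment of stage 2. Both helpers, the centroids, and the radii below are hypothetical illustrations, not the paper's exact formulation:

```python
import numpy as np

def stability_loss(z_ref, z_new):
    """L_stab sketch: mean (1 - cosine similarity) between frozen
    reference embeddings and their fine-tuned counterparts, penalizing
    drift of the embedding geometry."""
    num = np.sum(z_ref * z_new, axis=1)
    den = np.linalg.norm(z_ref, axis=1) * np.linalg.norm(z_new, axis=1)
    return float(np.mean(1.0 - num / den))

def assign_or_flag(z, centroids, radii):
    """Stage 2: assign a new embedding to the nearest baseline cluster if
    it lies within that cluster's radius; otherwise flag it as a novel
    candidate (None) for graph-based sub-clustering."""
    d = np.linalg.norm(centroids - z, axis=1)
    i = int(np.argmin(d))
    return i if d[i] <= radii[i] else None

# Hypothetical baseline clusters recovered after fine-tuning (stage 1).
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
radii = np.array([0.2, 0.2])
known = assign_or_flag(np.array([1.05, 0.0]), centroids, radii)  # inside cluster 0
novel = assign_or_flag(np.array([0.5, 0.5]), centroids, radii)   # outside all radii
```

Identical reference and updated embeddings give a stability loss of zero, so the regularizer only pushes back when fine-tuning on new data starts rotating or collapsing the old geometry.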

Experiments evaluate (a) the impact of mismatched K versus K*, (b) scenarios where K is unknown, and (c) continual‑learning settings where new behaviors are added. Metrics include clustering quality (NMI, ARI) and reward reconstruction error (MSE). Across all settings, CoMI‑IRL outperforms fixed‑K baselines, achieving higher NMI/ARI scores and lower reward errors. Notably, performance remains stable even when K is under‑ or over‑specified, and the method efficiently incorporates new behaviors with minimal drift to existing clusters. Visualizations of the embedding space reveal clear separation of behavior modes, confirming interpretability.

In summary, CoMI‑IRL introduces (1) a Transformer‑based contrastive encoder that yields high‑quality, behavior‑centric embeddings, (2) a K‑free graph‑based clustering mechanism that discovers intent modes directly from data, (3) independent single‑intention IRL per cluster for flexible reward modeling, and (4) a lightweight adaptation scheme for unseen behaviors. This combination addresses the core limitations of prior MI‑IRL approaches and opens the door to scalable, interpretable reward inference from heterogeneous expert demonstrations in robotics, autonomous driving, gaming, and other domains where multiple intentions coexist.

