Continual Robot Skill and Task Learning via Dialogue
Interactive robot learning is a challenging problem because the robot works alongside human users who expect it to perpetually and sample-efficiently learn novel skills to solve novel tasks. In this work we present a framework for robots to continually learn tasks and visuomotor skills, querying for novel skills via dialogue interactions with human users. Our robot agent maintains a skill library and uses an existing LLM to conduct grounded dialogue interactions that query unknown skills from real human users. We develop a novel visuomotor control policy, Action Chunking Transformer with Low-Rank Adaptation (ACT-LoRA), that can continually learn novel skills from only a few demonstrations, which is critical in human-robot interaction scenarios. The paper has twin goals: first, to demonstrate better continual learning in simulation; and second, to demonstrate our dialogue-based learning framework in a realistic human-robot interaction use case. Our ACT-LoRA policy consistently outperforms a GMM-LoRA baseline on multiple continual-learning simulation benchmarks, achieving >300% improvement on novel skills while matching performance on existing skills. Moreover, in our IRB-approved human-subjects study, we demonstrate that our dialogue-based continual-learning framework allows users to teach robots cooking skills successfully (100%) while spending a higher proportion of time on an auxiliary distraction task in the test phase of the study than with a non-learning language-based agent (p < 0.001).
💡 Research Summary
The paper introduces a novel framework called COLADA (Continual Learning Aided by Dialog Agent) that enables a robot to continuously acquire new visuomotor skills through natural language dialogue with human users. The core idea is to let the robot recognize when it lacks a required skill, ask the human for a demonstration, and then learn that skill from only a few (typically three to five) examples. This addresses two major challenges in human‑robot interaction: (1) the robot’s ability to identify its own knowledge gaps and query for help, and (2) the sample‑efficiency needed for learning new motor behaviors in real‑time.
The system consists of three main components. First, a large language model (LLM) drives a dialogue state machine that converts high‑level user instructions into a sequence of low‑level motor primitives (τ). Second, a skill library stores previously learned skills together with their textual descriptions. Each description is embedded using CLIP’s text encoder, and the robot computes cosine similarity between the requested skill and all stored skills. If the similarity exceeds a pre‑set threshold (ϵ_text = 0.95), the matching skill is executed directly. Otherwise, the robot generates a natural‑language query (e.g., “Can you show me how to slice cheese?”) and asks the human for a tele‑operated demonstration. Third, the newly demonstrated data are used to fine‑tune a policy called ACT‑LoRA.
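The skill-library lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `match_skill` function and the toy embeddings are hypothetical, standing in for CLIP's text encoder and the paper's stored skill descriptions, with the threshold ε_text = 0.95 taken from the summary.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_skill(query_emb, skill_library, eps_text=0.95):
    """Return the name of the best-matching stored skill, or None if no
    stored skill clears the similarity threshold (triggering a query to
    the human for a demonstration)."""
    best_name, best_sim = None, -1.0
    for name, emb in skill_library.items():
        sim = cosine_similarity(query_emb, emb)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= eps_text else None
```

In the full system, `query_emb` and the library entries would come from CLIP's text encoder; a `None` result is what prompts the robot to generate a natural-language query such as "Can you show me how to slice cheese?".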
ACT‑LoRA builds on the Action Chunking Transformer (ACT), which predicts short action chunks from visual observations. To enable continual learning without catastrophic forgetting, the authors augment ACT with Low‑Rank Adaptation (LoRA) modules. The base transformer weights (ψ₀) are shared across all skills, while each skill has its own LoRA adapter (ψ_i). When a new skill is introduced, ψ₀ and all existing adapters are frozen; only the new adapter ψ_{K+1} is updated using a behavior‑cloning L1 loss on the few demonstration trajectories. This design preserves performance on previously learned skills while allowing rapid acquisition of new ones.
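The adapter-per-skill scheme can be illustrated with a single linear layer. This is a hedged NumPy sketch of the general LoRA mechanism, not the authors' transformer code: the `LoRALinear` class and its initialization scales are hypothetical, but it captures the key property that the frozen base weights (ψ₀) are shared while each skill owns an independent low-rank adapter (ψ_i).

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """A linear layer with frozen shared weights W0 (psi_0) and one
    low-rank adapter (A_i, B_i) per skill (psi_i)."""

    def __init__(self, d_in, d_out):
        self.W0 = rng.normal(size=(d_out, d_in))  # frozen; never updated
        self.adapters = {}                        # skill_id -> (A, B)

    def add_skill(self, skill_id, rank=4):
        d_out, d_in = self.W0.shape
        # Standard LoRA init: B starts at zero, so the new adapter is a
        # no-op until trained; existing adapters are left untouched.
        A = rng.normal(size=(rank, d_in)) * 0.01
        B = np.zeros((d_out, rank))
        self.adapters[skill_id] = (A, B)

    def forward(self, x, skill_id):
        A, B = self.adapters[skill_id]
        # Effective weight = W0 + B @ A; only A and B would receive
        # gradients when fine-tuning this skill.
        return (self.W0 + B @ A) @ x
```

Because each skill routes through its own (A, B) pair and W0 is frozen, training skill K+1 cannot change the function computed for skills 1..K, which is what prevents catastrophic forgetting in this design.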
The authors evaluate the approach in two settings. In simulation, they use the RLBench benchmark with multiple continual‑learning scenarios. ACT‑LoRA achieves a 60.75 % ± 2.02 success rate on pre‑trained skills (1000 trajectories) and 77.67 % ± 9.36 when only five trajectories per new skill are available, outperforming a GMM‑LoRA baseline by more than 300 % on novel‑skill performance. In a real‑world user study, 16 non‑expert participants teach a robot to make two types of sandwiches (veggie and lettuce). The robot lacks “sprinkling pepper” and “spreading butter” actions, which participants demonstrate via a space‑mouse. COLADA attains an overall task success rate of 87.5 % and a 100 % success rate on the newly taught skills. Moreover, participants spent significantly more time on an auxiliary email‑writing task during evaluation (p < 0.001, Z = 3.61), indicating that the robot’s learning reduced the user’s burden. Subjective measures (Godspeed Likability, SUS, NASA‑TLX) also favored COLADA over two baselines: an “inarticulate” agent that never asks for help, and an “inverse‑semantics” agent that asks for help at every unknown skill.
Key contributions include: (1) a dialogue‑driven skill‑query mechanism that lets the robot explicitly acknowledge unknown capabilities; (2) the ACT‑LoRA policy, which combines the fine‑grained control of ACT with the parameter‑efficient continual‑learning properties of LoRA; and (3) empirical evidence of effectiveness both in simulation and in a real‑world user study. Limitations are noted: reliance on CLIP text embeddings may cause mismatches when language is ambiguous, and the number of LoRA adapters grows linearly with the number of skills, potentially impacting memory and inference speed. Future work could explore multimodal embeddings, adapter compression, and richer dialogue strategies that incorporate clarification, feedback, and correction loops.