From Videos to Conversations: Egocentric Instructions for Task Assistance
Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single-person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost-efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question-answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi-turn expert-novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.
💡 Research Summary
The paper addresses a critical bottleneck in developing AI assistants for augmented‑reality (AR) procedural guidance: the lack of large‑scale multimodal conversational datasets that are grounded in real‑world egocentric video of task execution. Existing resources either provide only monologue‑style instructional videos (e.g., NIV, Ego4D) or limited two‑person interaction data that are costly to collect (e.g., HoloAssist). To overcome this, the authors propose a fully automatic “Monologue‑to‑Dialogue Conversion” (MDC) pipeline that transforms single‑person instructional videos into expert‑novice multi‑turn dialogues with turn‑level video grounding.
The pipeline consists of three stages. First, an “Instruction Formation” step uses a multimodal large language model (LLM), specifically Gemma‑3, to ingest video frames together with subtitles (or, for subtitle‑free videos, action timestamps) and generate a structured, step‑by‑step procedural specification. Because Gemma‑3 supports up to 128 K tokens, it can process the entire video transcript in one pass, preserving rich visual context.
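The long-context ingestion described above can be sketched as a single request that packs the full transcript and sampled frames into one prompt. This is an illustrative sketch only: the model identifier, message schema, and field names below are assumptions, not the authors' actual API.

```python
# Hypothetical sketch of the "Instruction Formation" stage: pack sampled video
# frames and the full subtitle transcript into one long-context request for a
# multimodal LLM. The "gemma-3" identifier and request schema are assumptions.

def build_instruction_prompt(subtitles, frame_refs, task_name):
    """Assemble one request covering the whole video in a single pass."""
    transcript = "\n".join(
        f"[{start:.1f}-{end:.1f}s] {text}" for start, end, text in subtitles
    )
    return {
        "model": "gemma-3",  # assumed identifier
        "system": (
            "You are given an instructional video. Produce a numbered, "
            "step-by-step procedure with a start/end timestamp per step."
        ),
        "frames": frame_refs,  # e.g. URIs of frames sampled from the video
        "user": f"Task: {task_name}\nTranscript:\n{transcript}",
    }

req = build_instruction_prompt(
    [(0.0, 4.2, "First, unplug the appliance."),
     (4.2, 9.0, "Next, remove the back panel screws.")],
    ["frame_000.jpg", "frame_120.jpg"],
    "appliance repair",
)
print(req["user"].splitlines()[0])  # Task: appliance repair
```

For subtitle-free videos, the same structure would carry action timestamps in place of the transcript lines.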
Second, a “Dialogue Generation” step prompts the same LLM to simulate a teaching conversation: the expert explains each step, the novice asks clarification questions, makes mistakes, or requests the next instruction. The prompts explicitly encode user speech style (concise vs. verbose), possible error types (omission, addition, modification, slip, correction), and the phrasing of expert corrections. This results in a realistic multi‑turn dialogue where each user turn is paired with a corresponding expert response.
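The turn-level controls described above (speech style, error injection, expert response type) can be pictured as a small per-turn configuration fed to the generation prompt. The error taxonomy below comes from the summary; the field names and sampling logic are illustrative assumptions.

```python
# Hedged sketch of the "Dialogue Generation" controls: per turn, choose the
# simulated novice's speech style and optionally inject one of the five error
# types, which in turn determines the expert's response mode.
import random

ERROR_TYPES = ["omission", "addition", "modification", "slip", "correction"]
USER_STYLES = ["concise", "verbose"]  # concise vs. verbose user speech

def turn_spec(step_idx, inject_error, rng):
    """Decide how the simulated novice behaves on one dialogue turn."""
    return {
        "step": step_idx,
        "user_style": rng.choice(USER_STYLES),
        "error": rng.choice(ERROR_TYPES) if inject_error else None,
        "expert_action": "correct_user" if inject_error else "give_next_step",
    }

rng = random.Random(0)  # seeded for reproducibility
session = [turn_spec(i, inject_error=(i == 3), rng=rng) for i in range(5)]
print(session[3]["expert_action"])  # correct_user
```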
Third, a “Video Localization” step aligns each dialogue turn with the appropriate egocentric video segment. The alignment uses a rule‑based matching of step timestamps and leverages the LLM’s visual‑language grounding ability to verify that the selected clip indeed depicts the described action. The outcome is a one‑to‑one mapping between user utterances and short video clips (average length 12.5 seconds).
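The rule-based part of this alignment amounts to giving each dialogue turn the timestamp span of the procedural step it discusses; the LLM verification pass then checks the selected clip. A minimal sketch, with assumed data shapes:

```python
# Illustrative sketch of rule-based turn-to-clip alignment: each turn inherits
# the (start, end) span of the step it refers to. The paper additionally
# verifies each clip with the LLM's visual-language grounding; that check is
# omitted here.

def align_turns(turns, step_spans):
    """Map each turn to the video segment of its procedural step."""
    return [
        {"turn": t["text"], "clip": step_spans[t["step"]]}
        for t in turns
        if t["step"] in step_spans
    ]

step_spans = {0: (0.0, 10.5), 1: (10.5, 23.0)}  # seconds, per step
turns = [{"text": "What do I do first?", "step": 0},
         {"text": "Done, what's next?", "step": 1}]
aligned = align_turns(turns, step_spans)
print(aligned[1]["clip"])  # (10.5, 23.0)
```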
Using this pipeline, the authors construct HowToDIV, a new multimodal dataset built on top of the Narrated Instruction Videos (NIV) and EgoPER corpora. HowToDIV contains 507 sessions, 6,636 dialogue turns, and roughly 24 hours of egocentric video across three domains (cooking, mechanical repair, planting) and nine specific tasks. Sessions average 13 turns, with user utterances ranging from very brief (≈3 words) to more elaborate (≈11 words). Seventy‑five sessions deliberately include user errors, covering five error categories, and the expert responses provide explicit corrections.
Quality control combines human and automatic checks. Two annotators independently evaluated 175 randomly sampled turns for instruction correctness, dialogue naturalness, and video‑step alignment; 93.2 % of sampled turns were deemed usable. Automated filters removed duplicate, overly long/short, profane, or mis‑aligned turns, affecting less than 4 % of the generated data.
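The automated filters can be sketched as a set of simple predicates over generated turns. The word-count thresholds below are assumptions for illustration, not the paper's actual values, and the profanity/alignment checks are omitted.

```python
# Sketch of the automated filtering pass: drop exact-duplicate turns and turns
# that are too short or too long. Thresholds are assumed, not from the paper.

def filter_turns(turns, min_words=2, max_words=60):
    seen, kept = set(), []
    for t in turns:
        n = len(t.split())
        if t in seen or n < min_words or n > max_words:
            continue  # duplicate or out of length bounds
        seen.add(t)
        kept.append(t)
    return kept

sample = ["Unplug the appliance first.",
          "Unplug the appliance first.",   # duplicate, dropped
          "Ok",                            # too short, dropped
          "Now remove the panel screws."]
print(filter_turns(sample))
```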
For benchmarking, the authors fine‑tune two open‑weight multimodal LLMs—Gemma‑3 and Qwen‑2.5—on HowToDIV and evaluate three tasks: (1) next‑step prediction, (2) answering user questions, and (3) correcting user errors. Standard metrics (BLEU, ROUGE) and an LLM‑as‑Judge approach are reported. Both models achieve procedural accuracy above 70 % and naturalness scores between 0.68 and 0.73, demonstrating that automatically generated data can support competent conversational agents.
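As a reference point for the n-gram metrics mentioned above, ROUGE-1 recall reduces to unigram overlap between a candidate response and the reference. The from-scratch implementation below is a minimal sketch for clarity, not the authors' evaluation code, which likely uses a standard metrics library.

```python
# Minimal illustrative ROUGE-1 recall: fraction of reference unigrams that
# appear in the candidate (clipped by count). Not the authors' implementation.
from collections import Counter

def rouge1_recall(reference, candidate):
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(ref[w], cand[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

score = rouge1_recall("unplug the appliance first",
                      "first unplug the appliance")
print(score)  # 1.0 (same words, different order)
```

BLEU works analogously but measures n-gram precision with a brevity penalty, and the LLM-as-Judge scores come from prompting a separate model rather than from string overlap.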
The paper’s contributions are threefold: (1) a cost‑effective, fully automated pipeline for converting instructional monologues into grounded dialogues, (2) the release of HowToDIV, the first dataset that simultaneously provides procedural steps, multi‑turn expert‑novice dialogue, egocentric video grounding, and explicit error modeling, and (3) baseline performance results that establish a benchmark for future AR‑based task‑assistance research. The authors suggest future directions such as expanding to more domains, integrating real‑time action recognition for closed‑loop assistance, and combining human‑in‑the‑loop annotation to further improve data fidelity.