IndEgo: An Industrial Multimodal Dataset


📝 Abstract

We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo


📄 Content

Egocentric Vision and AI are gaining significant attention, driven by both their economic potential and the recent development of general-purpose foundational models [1,2,3,4]. One of the key goals of this field is to develop helpful AI assistants that understand users' actions, intentions, and needs, and provide valuable guidance [5,4]. Such assistants would be valuable for learning and acquiring new skills, navigating new environments, and improving user experience. Additionally, egocentric research is also relevant to the development of embodied agents that can learn from demonstrations, assist, and collaborate with humans [6,7,8,9,10].

Figure 2: A scenario of a disassembly process from the IndEgo dataset. The two participants work collaboratively on the task. The semi-dense point cloud and the user trajectories are generated by processing the raw data from the Aria device [4,11]. The egocentric perspective of the two participants, with the projected eye-gaze point, can be seen in relation to the 3D environment and the exocentric view. Bottom: the annotations from each worker's perspective, and the keysteps in the process. Right: the corresponding task graph for the procedure. The flow of activities is from top to bottom; dependencies are shown with an arrow, and a separate marker denotes labour-intensive steps.

To enable and accelerate research in these areas, several datasets and benchmarks have been introduced [12,13,3]. Current datasets and initiatives focus heavily on daily activities and procedures [12,13,14]. This is in part because these are applicable across different cultures and lifestyles, and also because it is relatively easy to design experiments and collect data in settings such as the kitchen [13,15,16]. Some recent datasets also target industry-like contexts [17,18]; however, datasets recorded in true industrial settings remain scarce. Industrial scenarios offer a rich domain for egocentric vision research due to their complexity, diversity of tasks, and dynamic environments [18,17]. Workers in industrial settings perform intricate manual operations, manipulate various tools and components, and navigate cluttered and unpredictable spaces [19,20]. These conditions present unique challenges for egocentric vision systems, such as accurately recognising actions and gestures in real time, identifying and localising tools and parts amidst visual occlusions, and adapting to varying lighting and environmental conditions. These scenarios are underrepresented in current egocentric vision datasets and research.

Egocentric Vision-based assistive technologies can offer several unique and practical benefits for improving productivity, safety, and efficiency in industrial operations. To investigate the future application potential of wearable egocentric vision-based assistants, we present the IndEgo dataset, consisting of diverse Industrial tasks and processes from an Egocentric perspective, with a supplementary Exocentric view for reference in several cases. IndEgo comprises 3,460 unique egocentric video recordings (totalling 197.1 hours) and 1,092 accompanying exocentric video recordings (totalling 96.8 hours). The dataset includes scripted and unscripted actions from participants in a dedicated industrial facility, collected in diverse conditions (lighting, time of day, setup). Figure 1 shows an example. The dataset also includes collaborative work, where two participants work together on a common task in various roles (partners, teacher-student, leader-assistant). AI assistants and embodied agents of the near future will work alongside humans on such tasks, which makes this a relevant subdomain for further exploration and research.
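As a quick back-of-envelope check of the reported statistics, the counts and total hours above imply the average clip length per view. This is an illustrative sketch only, not part of the dataset tooling; the numbers are taken directly from the paragraph above.

```python
# Average recording length implied by the IndEgo statistics reported above.
ego_videos, ego_hours = 3460, 197.1   # egocentric recordings and total hours
exo_videos, exo_hours = 1092, 96.8    # exocentric recordings and total hours

def avg_minutes(n_videos: int, total_hours: float) -> float:
    """Average clip length in minutes, given a clip count and total hours."""
    return total_hours * 60 / n_videos

ego_avg = avg_minutes(ego_videos, ego_hours)  # roughly 3.4 min per egocentric clip
exo_avg = avg_minutes(exo_videos, exo_hours)  # roughly 5.3 min per exocentric clip
print(f"ego: {ego_avg:.1f} min/clip, exo: {exo_avg:.1f} min/clip")
```

So egocentric clips average about three and a half minutes, while the fixed exocentric recordings run somewhat longer, consistent with short, task-scoped egocentric sessions.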

Figure 3: Grouped bar charts of frequencies (left axis) and cumulative durations in seconds (right axis) for the fine-grained action annotations: top 20 nouns (left), verbs (middle), and adjectives (right).

Table 1: A breakdown of the IndEgo dataset, showing the key categories and related statistics. T_avg gives the average duration of a recording, #Ego the number of videos from the egocentric perspective, T_Ego the total cumulative time of egocentric data, #Exo the number of videos from the fixed exocentric perspective, and T_Exo the total cumulative time of exocentric data.

Our dataset covers diverse industrial contexts, which are not represented by current egocentric/exocentric datasets. This highlights the multimodality and human-centric attributes of IndEgo.

PC cabinets, and proprietary assemblies. Figure 9 shows some examples. A detailed description is available in Supplementary Materials.

Participants. The data was collected by 20 participants, 15 male and 5 female. The selection was done in an unbiased manner, and the ratio reflects the natural skew seen in industrial work. They have varying degrees of experience in industrial skilled work (beginner to expert), and come from different nationalities and ethnic backgrounds.
