EgoAVU: Egocentric Audio-Visual Understanding
Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, because text labels with coherent joint-modality information are difficult to obtain, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine that automatically generates egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they are heavily biased toward visual signals, often neglecting audio cues or failing to associate audio with its visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, enabling up to a 113% performance improvement on EgoAVU-Bench. These benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, yielding up to a 28% relative performance gain. Code will be released to the community.
💡 Research Summary
The paper introduces EgoAVU, a fully automated data engine designed to address the scarcity of high-quality multimodal annotations for egocentric video understanding. Existing egocentric datasets such as Ego4D provide human-written narrations that focus mainly on visual actions and largely ignore auditory cues, while current multimodal large language models (MLLMs) are biased toward visual inputs and struggle to correctly align sounds with their visual sources.

EgoAVU tackles these issues through a four-stage pipeline. First, it enriches the original narrations by separately generating detailed image captions (using Qwen2.5-VL) and audio captions, creating a rich multimodal context. Second, token-based video filtering selects clips that exhibit diverse and strong audio-visual dynamics, ensuring a varied training set. Third, a Multimodal Context Graph (MCG) is constructed to capture relationships among objects, actions, and sounds; open-source LLMs parse this graph to produce coherent, dense audio-visual narrations. Fourth, these narrations are used to automatically generate question-answer pairs covering grounding, temporal reasoning, scene understanding, and hallucination detection.

The resulting 3 million QA samples form the EgoAVU-Instruct training corpus, while a manually verified 3K-sample subset constitutes the EgoAVU-Bench benchmark. Experiments reveal that off-the-shelf MLLMs (e.g., Qwen2.5-Omni, VideoLLaMA2) heavily favor visual cues, often omitting or mis-attributing audio information. Fine-tuning these models on EgoAVU-Instruct dramatically improves performance, achieving up to a 113% relative gain on EgoAVU-Bench. The fine-tuned models also transfer to other egocentric benchmarks such as EgoTempo and EgoIllusion, delivering up to 28% relative improvements. All components rely solely on open-source models, and the authors plan to release the code and datasets publicly.
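To make the MCG stage concrete, here is a minimal sketch of a graph over object, action, and sound nodes whose edges are linearized into clauses that an LLM could then densify into a narration. The class name, node kinds, and relation labels below are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class MultimodalContextGraph:
    """Hypothetical sketch of an MCG: nodes are objects, actions, or
    sounds; edges record cross-modal correlations between them."""
    nodes: dict = field(default_factory=dict)   # name -> kind
    edges: list = field(default_factory=list)   # (src, relation, dst)

    def add_node(self, name: str, kind: str) -> None:
        # kind is one of "object", "action", "sound" (assumed taxonomy)
        self.nodes[name] = kind

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self.edges.append((src, relation, dst))

    def to_narration(self) -> str:
        # Linearize edges into simple clauses; in the pipeline, an LLM
        # would rewrite these into a dense audio-visual narration.
        clauses = [f"{s} {r} {d}" for s, r, d in self.edges]
        return "; ".join(clauses) + "."


g = MultimodalContextGraph()
g.add_node("camera wearer", "object")
g.add_node("chopping", "action")
g.add_node("knife-on-board sound", "sound")
g.add_edge("camera wearer", "performs", "chopping")
g.add_edge("chopping", "produces", "knife-on-board sound")
print(g.to_narration())
# -> camera wearer performs chopping; chopping produces knife-on-board sound.
```

The key design point this illustrates is that cross-modal correlations (an action producing a sound) are stored explicitly as edges, so the downstream narration generator can attribute each sound to its visual source rather than listing modalities independently.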
EgoAVU thus provides a scalable solution for generating high‑quality audio‑visual language data, paving the way for more balanced multimodal reasoning in embodied AI applications like augmented reality, robotics, and human‑computer interaction.