See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Reading time: 5 minutes
...

📝 Original Info

  • Title: See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
  • ArXiv ID: 2512.02231
  • Date: 2025-12-01
  • Authors: Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee

📝 Abstract

Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.

💡 Deep Analysis

Figure 1. Motivation of AV-SpeakerBench. Existing video benchmarks often contain visually solvable questions, such as counting visible people, where state-of-the-art multimodal models can answer correctly even when the audio stream is muted (left; examples from Video-MME [13]). In contrast, questions in AV-SpeakerBench (right) are explicitly designed to require audiovisual fusion: the correct answer depends on who speaks, when they speak, and how speech events unfold over time.

📄 Full Content

Multimodal large language models (MLLMs) have rapidly evolved in recent years, extending language models into image [7,35,54,69], video [34,54,66], and audio understanding [18,28,64]. As this evolution continues, recent efforts have moved beyond pairwise modality fusion toward building omni-models that aim to jointly process vision, audio, and language in a unified architecture [22,51,59,60]. Such capability is essential for real-world applications like video dialog agents, meeting transcription systems, and human-AI interaction platforms, where the model must see, hear, and reason over multiple signals simultaneously. However, evaluating whether current models can truly integrate multiple modalities, rather than treating one as the dominant source, remains an open challenge.

In particular, audiovisual speaker perception has been a long-standing research problem [26,27,46,47,49,55], yet modern MLLMs are rarely evaluated on this ability. This gap arises for two reasons. First, existing speaker datasets are built around closed-set labels or frame-level supervision, making them incompatible with open-ended, language-based evaluation. Second, current video QA benchmarks seldom focus on speaker-level reasoning: many questions can be solved using a single modality [13,52], while others emphasize coarse audio-visual matching or non-speech acoustic events [30,61,70]. As a result, current benchmarks do not systematically evaluate whether multimodal models can jointly determine who is speaking, what is being said, and in what visual context.

To address this gap, we introduce AV-SpeakerBench, a new benchmark for evaluating fine-grained audiovisual reasoning centered on human speakers in real-world video. AV-SpeakerBench is explicitly designed to test whether multimodal models can jointly interpret visual, auditory, and linguistic information at the speaker level, a capability not captured by existing datasets. Below, we outline the benchmark's key design principles.

Speaker-centric task formulation. AV-SpeakerBench frames every question around the speaker as the fundamental reasoning unit, shifting evaluation from scene-level understanding to human-centered audiovisual grounding. Each video includes multiple visible individuals, enabling identity-based questions that require models to determine who speaks, when, and in what visual context. By spanning diverse conversational settings and speaker configurations, factors known to increase difficulty in audiovisual perception [26,46,47,49,55], the benchmark tests whether models can reliably resolve speaker behavior amid visually complex and varied conversational dynamics.

Fusion-driven question design. AV-SpeakerBench uses a four-choice MCQ format in which auditory and visual cues are jointly encoded in both questions and answer options. This design ensures that solving each item requires cross-modal fusion: for example, associating spoken phrases with visible speakers, interpreting speech in relation to visual events, or resolving multiple voices in a shared scene. Together, these constructions reflect how audiovisual understanding naturally relies on coordinating what is heard with what is seen.
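As a rough illustration, the sketch below shows what a fusion-grounded four-choice item could look like as a Python dictionary. The field names (video_id, span, task_type, question, options, answer) and the example content are hypothetical assumptions for this write-up, not the benchmark's actual schema.

```python
# Hypothetical sketch of a fusion-grounded MCQ item.
# Field names and content are illustrative, not the real AV-SpeakerBench format.
example_item = {
    "video_id": "clip_0001",                 # source video clip (made-up ID)
    "span": [12.4, 27.9],                    # temporal segment in seconds
    "task_type": "speaker_identification",   # one of the benchmark's task types
    "question": "Who says 'let's move on to the budget' while pointing at the screen?",
    "options": [
        "A. The woman in the red blazer",
        "B. The man seated at the far end of the table",
        "C. The person standing by the whiteboard",
        "D. The off-screen voice heard at the start",
    ],
    "answer": "C",
}
```

Note how neither the audio alone (several people speak) nor the video alone (the quoted phrase is not visible) suffices; the item is only answerable by fusing the two streams.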

High-quality human annotation. All clips are manually selected and temporally segmented to isolate moments where speech-driven interaction occurs. Annotators then identify segments that genuinely require audiovisual reasoning, such as aligning an utterance with the correct visible speaker, and compose MCQs grounded in these spans. Each clip and question undergoes multi-stage expert review to ensure temporal precision and cross-modal validity.

Dataset summary and evaluation scope. Our IRB-approved AV-SpeakerBench contains 3,212 curated question-answer pairs across 12 task types, all centered on speakers as the core reasoning unit. These tasks span temporal localization, speaker identification, speech-content reasoning, utterance counting, paralinguistic attribute comparison, and more, collectively evaluating a model’s ability to integrate recognition, grounding, and temporal reasoning across audio and vision rather than depend on static or unimodal cues.
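For context, a minimal scoring harness for such four-choice items might look like the sketch below. It assumes the hypothetical item fields from the earlier sketch and a simple mapping from question index to predicted letter; it is not the authors' official evaluation code.

```python
from collections import defaultdict

def score_predictions(items, predictions):
    """Compute overall and per-task accuracy for four-choice MCQ items.

    Illustrative harness under assumed field names (see the item sketch above),
    not the benchmark's official evaluation script.
    """
    correct = 0
    per_task = defaultdict(lambda: [0, 0])  # task_type -> [num_correct, num_total]
    for idx, item in enumerate(items):
        pred = predictions.get(idx, "").strip().upper()[:1]  # keep the leading letter
        hit = pred == item["answer"]
        correct += hit
        per_task[item["task_type"]][0] += hit
        per_task[item["task_type"]][1] += 1
    overall = correct / max(len(items), 1)
    breakdown = {task: c / n for task, (c, n) in per_task.items()}
    return overall, breakdown

# Example usage with the hypothetical item above:
# overall, by_task = score_predictions([example_item], {0: "C"})
```

Reporting both an overall score and a per-task breakdown matters here because the 12 task types stress different fusion skills, and an aggregate number alone can hide weaknesses in, say, temporal localization versus speaker identification.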

Experimental findings and implications. Comprehensive evaluation across open-source and proprietary models reveals a clear performance gap between current MLLMs and human accuracy. Among open-source models, only recent omni-model approaches, such as Qwen3-Omni [60], show substantial progress, with the 30B variant reaching parity with Gemini 2.0 Flash while still remaining far behind Gemini 2.5 Pro.


Reference

This content is AI-processed based on open access ArXiv data.
