MTAVG-Bench: A Comprehensive Benchmark for Evaluating Multi-Talker Dialogue-Centric Audio-Video Generation
Recent advances in text-to-audio-video (T2AV) generation have enabled models to synthesize videos with synchronized audio featuring multi-participant dialogues. However, existing evaluation benchmarks remain largely designed for human-recorded videos or single-speaker settings. As a result, errors that occur in generated multi-talker dialogue videos, such as identity drift, unnatural turn transitions, and audio-visual misalignment, cannot be effectively captured and analyzed. To address this issue, we introduce MTAVG-Bench, a benchmark for evaluating audio-visual multi-speaker dialogue generation. MTAVG-Bench is built via a semi-automatic pipeline in which 1.8k videos are generated by multiple popular models with carefully designed prompts, yielding 2.4k manually annotated QA pairs. The benchmark evaluates multi-speaker dialogue generation at four levels: audio-visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression. We benchmark 12 proprietary and open-source omni-models on MTAVG-Bench, with Gemini 3 Pro achieving the strongest overall performance, while leading open-source models remain competitive in signal fidelity and consistency. Overall, MTAVG-Bench enables fine-grained failure analysis for rigorous model comparison and targeted video generation refinement.
💡 Research Summary
The paper introduces MTAVG‑Bench, a comprehensive benchmark specifically designed to evaluate text‑to‑audio‑video (T2AV) models that generate multi‑speaker dialogue videos. Existing benchmarks focus on human‑recorded content or single‑speaker scenarios and therefore miss critical failure modes unique to multi‑talker dialogue, such as identity drift, unnatural turn transitions, and audio‑visual misalignment. MTAVG‑Bench addresses this gap through a semi‑automatic pipeline that creates 1.8k synthetic videos using multiple popular T2AV systems, then annotates them with 2.4k high‑quality question‑answer (QA) pairs.
The benchmark is organized into four hierarchical evaluation levels, each comprising several fine‑grained dimensions:
- Signal Fidelity – assesses low‑level perceptual quality of video (VQ) and speech (SQ).
- Attribute Consistency – checks scene continuity (SC), speaker identity stability (CC), and lip‑speech synchronization (LS).
- Social Interaction – evaluates correct mapping of utterances to speakers (SA) and logical turn‑taking (TT).
- Cinematic Expression – measures affective‑expressive alignment (EA) and speaker‑centric camera framing (CA).
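As a rough illustration (not the authors' released code), the four levels and nine dimension codes above could be encoded as a simple mapping, with a per-level score computed as the mean QA accuracy across its dimensions. The flat-average aggregation here is an assumption for the sketch; the paper's actual scoring scheme may weight dimensions differently.

```python
# Hypothetical encoding of the MTAVG-Bench taxonomy. The dimension codes
# follow the list above; the aggregation rule is an illustrative assumption.
TAXONOMY = {
    "signal_fidelity": ["VQ", "SQ"],
    "attribute_consistency": ["SC", "CC", "LS"],
    "social_interaction": ["SA", "TT"],
    "cinematic_expression": ["EA", "CA"],
}

def level_scores(dim_accuracy: dict[str, float]) -> dict[str, float]:
    """Average per-dimension QA accuracies into one score per level."""
    return {
        level: sum(dim_accuracy[d] for d in dims) / len(dims)
        for level, dims in TAXONOMY.items()
    }
```

Given per-dimension QA accuracies for one model, `level_scores` yields the four level-wise numbers that the leaderboard discussion below refers to (e.g., a Social Interaction score averaging the SA and TT accuracies).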
For each dimension, the authors generate diagnostic multiple‑choice or pairwise‑preference questions, ensuring that evaluation captures not only scalar scores but also the specific error type. An automated agent first filters out videos that appear error‑free; the remaining samples, each containing at least one observable failure, are reviewed by human annotators who map failures to the appropriate dimensions and craft QA items with LLM assistance.
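The filter-then-annotate curation described above can be sketched as a two-stage loop. Everything below is a structural illustration under stated assumptions: `detect_failures` stands in for the automated agent pass and `human_annotate` for the LLM-assisted human annotation step; neither name comes from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    """One curated benchmark item (hypothetical schema)."""
    video_id: str
    failures: list = field(default_factory=list)   # dimension codes, e.g. ["TT", "LS"]
    qa_pairs: list = field(default_factory=list)   # diagnostic QA items

def curate(videos, detect_failures, human_annotate):
    """Keep only videos with at least one observable failure, then have
    annotators map failures to dimensions and author QA items."""
    kept = []
    for vid in videos:
        failures = detect_failures(vid)            # automated agent pass
        if not failures:
            continue                               # error-free videos are filtered out
        sample = Sample(video_id=vid, failures=failures)
        sample.qa_pairs = human_annotate(sample)   # LLM-assisted QA authoring
        kept.append(sample)
    return kept
```

The design point the sketch captures is that human effort is spent only on samples the agent already flagged, which is what makes the pipeline "semi-automatic".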
The authors benchmark twelve models, including commercial systems (Gemini 3 Pro, VEO 3, Sora 2, WAN 2.5) and leading open‑source omni‑models (Stable‑Video‑Diffusion, Open‑Sora, LDM‑Audio, etc.). Results show that Gemini 3 Pro achieves the highest overall score, particularly excelling in Signal Fidelity and Attribute Consistency. However, all models perform poorly on Social Interaction and Cinematic Expression, with average scores below 40 % in these high‑level categories. Open‑source models are competitive on low‑level metrics but lag significantly on turn‑taking logic and affective alignment, highlighting the difficulty of learning coherent dialogue structure and expressive cinematography from current training data.
Key contributions of the work are:
- Introduction of the first benchmark dedicated to multi‑speaker dialogue generation, filling a critical evaluation void.
- A hierarchical taxonomy that progresses from perceptual fidelity to cinematic coherence, enabling nuanced diagnostics of where models fail.
- A publicly released dataset of failure‑focused videos and 2.4k QA pairs, providing a valuable resource for model debugging, targeted fine‑tuning, and future research on dialogue‑aware generation.
The analysis reveals that while modern T2AV systems can produce visually realistic footage, they still lack the robust cross‑modal semantic reasoning required for coherent multi‑turn conversations and expressive storytelling. MTAVG‑Bench therefore serves both as a diagnostic tool and as a roadmap for the next generation of models, which must jointly master audio‑visual fidelity, speaker consistency, dialogue logic, and cinematic expression.