DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models
Accurate dialogue description in audiovisual video captioning is crucial for downstream understanding and generation tasks. However, existing models generally struggle to produce faithful dialogue descriptions within audiovisual captions. To mitigate this limitation, we propose DiaDem, a powerful audiovisual video captioning model capable of generating captions with more precise dialogue descriptions while maintaining strong overall performance. We first synthesize a high-quality dataset for supervised fine-tuning (SFT), then employ a difficulty-partitioned two-stage GRPO strategy to further enhance dialogue descriptions. To enable systematic evaluation of dialogue description capabilities, we introduce DiaDemBench, a comprehensive benchmark designed to evaluate models across diverse dialogue scenarios, emphasizing both speaker attribution accuracy and utterance transcription fidelity in audiovisual captions. Extensive experiments on DiaDemBench reveal that even commercial models still exhibit substantial room for improvement in dialogue-aware captioning. Notably, DiaDem not only outperforms the Gemini series in dialogue description accuracy but also achieves competitive performance on general audiovisual captioning benchmarks, demonstrating its overall effectiveness.
💡 Research Summary
The paper introduces DiaDem, a novel audiovisual video captioning system that focuses on accurately describing dialogue—both transcribing utterances and attributing them to the correct speakers. While recent multimodal large language models (MLLMs) have achieved impressive overall caption quality, they often fail to capture “who said what,” especially in multi‑party or overlapping speech scenarios. To address this gap, the authors make three primary contributions.
First, they construct a high‑quality supervised fine‑tuning (SFT) corpus consisting of 70 K dialogue‑rich audiovisual captions and an additional 15 K general‑purpose captions. The dialogue data are generated by a pipeline that extracts speech segments, runs state‑of‑the‑art automatic speech recognition (ASR), and aligns the resulting transcripts with visual speaker cues (e.g., clothing, pose, facial features). Human verification is applied to a subset to keep error rates low, and speaker descriptions are explicitly embedded in the text so the model can learn to associate visual attributes with speaker identities.
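The paper does not publish this pipeline's code, but the described flow (speech detection, ASR, visual speaker description, and assembly into speaker-utterance pairs) can be sketched roughly as below. All names and data structures here are hypothetical stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical data structures; the paper does not specify its internal format.
@dataclass
class SpeechSegment:
    start: float           # segment start, in seconds
    end: float             # segment end, in seconds
    transcript: str = ""   # filled in by the ASR step

@dataclass
class DialogueTurn:
    speaker_desc: str      # visual description, e.g. "man in a red jacket"
    utterance: str

def build_dialogue_annotation(
    video_path: str,
    detect_speech: Callable[[str], List[SpeechSegment]],
    transcribe: Callable[[str, SpeechSegment], str],
    describe_speaker: Callable[[str, SpeechSegment], str],
) -> List[DialogueTurn]:
    """Sketch of the SFT data pipeline: speech detection -> ASR -> visual
    speaker alignment. The three callables stand in for off-the-shelf
    components (a speech detector, an ASR model, and a vision module that
    describes the active speaker by clothing, pose, or facial features)."""
    turns: List[DialogueTurn] = []
    for seg in detect_speech(video_path):
        seg.transcript = transcribe(video_path, seg)
        speaker = describe_speaker(video_path, seg)
        turns.append(DialogueTurn(speaker_desc=speaker, utterance=seg.transcript))
    return turns
```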
Second, they propose a difficulty‑partitioned two‑stage GRPO (Group Relative Policy Optimization) strategy. A manually annotated set of 3 K dialogue samples is split into easy, medium, and hard groups based on factors such as the number of speakers, utterance length, and speech overlap. In stage 1, the model is fine‑tuned on the easy subset with a high reward signal, establishing a solid baseline for dialogue understanding. In stage 2, the reward is re‑weighted to emphasize hard examples, and further policy‑gradient updates sharpen the model's ability to separate overlapping utterances and correctly assign speakers (a toy version of this partitioning and re‑weighting is sketched below). This staged approach allows the model to improve dialogue handling without degrading its overall captioning performance.
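The paper does not give the exact partitioning criteria or reward weights, but the idea can be illustrated with a toy bucketing heuristic and stage-dependent weights; the thresholds and numbers below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class DialogueSample:
    num_speakers: int
    total_utterance_chars: int
    has_overlap: bool

def difficulty_bucket(s: DialogueSample,
                      many_speakers: int = 3,
                      long_chars: int = 200) -> str:
    """Toy easy/medium/hard split driven by the factors named in the paper
    (speaker count, utterance length, overlap); thresholds are illustrative."""
    score = (int(s.num_speakers >= many_speakers)
             + int(s.total_utterance_chars >= long_chars)
             + int(s.has_overlap))
    return ("easy", "medium", "hard")[min(score, 2)]

def stage_reward_weight(bucket: str, stage: int) -> float:
    """Illustrative re-weighting: stage 1 emphasizes easy samples, stage 2
    shifts the reward weight toward hard ones (exact weights are assumptions)."""
    weights = {1: {"easy": 1.0, "medium": 0.5, "hard": 0.2},
               2: {"easy": 0.2, "medium": 0.5, "hard": 1.0}}
    return weights[stage][bucket]

# Example: a three-speaker clip with overlapping speech lands in the hard bucket.
sample = DialogueSample(num_speakers=3, total_utterance_chars=350, has_overlap=True)
bucket = difficulty_bucket(sample)
print(bucket, stage_reward_weight(bucket, stage=2))  # -> hard 1.0
```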
Third, the authors release DiaDemBench, a benchmark specifically designed to evaluate dialogue description quality in audiovisual captions. DiaDemBench contains 1,039 short video clips (≤ 20 s) covering a wide variety of dialogue settings: single‑shot vs. multi‑shot, varying numbers of speakers, overlapping speech, and multiple languages. Each clip is manually annotated with a sequence of (speaker description, utterance) tuples. The evaluation protocol consists of two metrics: (1) utterance transcription accuracy (ASR), measured by normalized Levenshtein similarity, and (2) speaker reference accuracy (REF), measured by a judge model (Gemini‑2.5‑Flash) that checks whether the speaker description in a predicted tuple matches the visual content of the video. To align predicted and ground‑truth dialogue lists, the authors employ a dynamic‑programming matcher that maximizes total similarity while allowing adaptive merging of consecutive utterances from the same speaker; this merging mitigates mismatches caused by differing segmentation conventions between human annotators and model outputs (see the sketch after this paragraph).
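As a concrete illustration of the transcription metric and the alignment step, the sketch below implements normalized Levenshtein similarity and a small monotonic dynamic-programming matcher that lets a reference utterance match a concatenation of consecutive predicted utterances. This is a rough approximation of the described adaptive merging under assumed simplifications, not the paper's released evaluation code.

```python
import functools
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Standard edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def align_score(refs: List[str], preds: List[str], max_merge: int = 3) -> float:
    """Monotonic DP alignment: each reference utterance may match up to
    `max_merge` consecutive predicted utterances (concatenated), which loosely
    mimics merging segments the model split differently than the annotators.
    Returns the maximum total similarity over all such alignments."""
    @functools.lru_cache(maxsize=None)
    def best(i: int, j: int) -> float:
        if i == len(refs) or j == len(preds):
            return 0.0
        score = max(best(i + 1, j), best(i, j + 1))  # leave a ref or pred unmatched
        merged = ""
        for k in range(j, min(j + max_merge, len(preds))):
            merged = (merged + " " + preds[k]).strip()
            score = max(score, similarity(refs[i], merged) + best(i + 1, k + 1))
        return score
    return best(0, 0)

# Example: the second and third predictions are merged to match one reference line.
refs = ["hello there", "how are you doing today"]
preds = ["hello there", "how are you", "doing today"]
print(round(align_score(refs, preds), 3))  # -> 2.0 (both references matched exactly)
```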
Experimental results show that DiaDem achieves a total DiaDemBench score of 78.4 %, substantially outperforming Gemini‑3‑Pro (63.6 %) and Gemini‑2.5‑Pro (38.7 %). The improvement is especially pronounced in the REF component, where DiaDem gains roughly 12 percentage points, indicating far better speaker‑utterance pairing. On standard audiovisual captioning benchmarks such as ActivityNet‑Captions and YouCook2, DiaDem matches or slightly exceeds the performance of the underlying AVoCaDO model, confirming that the dialogue‑focused fine‑tuning does not sacrifice general caption quality. Human evaluations further corroborate these findings: participants rate DiaDem’s captions higher in both “speaker‑utterance consistency” and overall naturalness, with average scores of 4.3/5.
The paper also discusses limitations. When visual speaker cues are ambiguous (e.g., faces occluded), errors remain, suggesting a need for stronger multimodal speaker embeddings. The synthetic SFT data are predominantly Chinese and English, so cross‑lingual generalization requires further study. Moreover, the REF metric relies on a single judge model, which could introduce its own biases; future work may adopt ensemble judges for more robust assessment.
In conclusion, DiaDem demonstrates that targeted data synthesis, difficulty‑aware staged reinforcement learning, and a dedicated evaluation suite can dramatically improve dialogue description in audiovisual captioning. The introduced DiaDemBench provides a much‑needed benchmark for this under‑explored aspect of multimodal generation. The authors envision extensions to multilingual data, robustness to limited visual cues, and real‑time streaming scenarios, positioning DiaDem as a foundational step toward dialogue‑aware multimodal AI applications such as video summarization, interactive storytelling, and educational content generation.