SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
Tianyu Xie 1,2, Jinfa Huang 5, Yuexiao Ma 1,3, Rongfang Luo 4, Yan Yang 4, Wang Chen 1,2, Yuhui Zeng 1,2, Ruize Fang 1,2, Yixuan Zou 1,2, Xiawu Zheng 1,2 (corresponding author), Jiebo Luo 5, Rongrong Ji 1,2,3

1 Media Analytics and Computing Lab, Xiamen University, Xiamen, China
2 Institute of Artificial Intelligence, Xiamen University, Xiamen, China
3 School of Informatics, Xiamen University, Xiamen, China
4 Sichuan Agricultural University, Yaan, China
5 Department of Computer Science, University of Rochester, Rochester, NY, USA

ABSTRACT

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogue. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmark 12 leading OLMs and uncover significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics yield actionable signals for bridging the perception-interaction divide in future OLMs.

Email: Tianyu Xie (teery@stu.xmu.edu.cn)
Project Page: github.com/MAC-AutoML/SocialOmni
Data: huggingface.co/datasets/alexisty/SocialOmni

1 Introduction

Omni-modal large language models (OLMs) support real-time multimodal conversation by continuously integrating audio, vision, and text within a unified generation loop [7, 10, 18, 25, 50, 51]. In such settings, success depends not only on producing correct content but also on genuine interaction competence: perceiving and responding to dynamic dialogue cues, deciding when to speak, and generating socially coherent responses. As summarized in Table 1, existing OLM benchmarks remain anchored to static, accuracy-centric understanding tasks [14, 22, 26, 57], leaving this interaction capability largely unevaluated. This gap motivates benchmarks that evaluate interaction competence beyond mere answer correctness. However, as shown in Table 1, existing evaluation paradigms remain insufficient to capture the full spectrum of dynamic conversational abilities.
Table 1. Positioning of OLM benchmarks under a social-interactivity lens. We compare existing representative benchmarks by whether they explicitly operationalize who (speaker identification), when (turn timing), how (interruption generation), robustness to conflict (audio-visual inconsistency), and the temporal granularity of evaluation. (✓: explicitly evaluated by task design; ~: partially covered via indirect proxies, e.g., QA/localization outcomes; ✗: not explicitly evaluated. Granularity ranges from •◦◦◦ (global-level) to •••• (frame-level).)

Benchmark | Type | Who | When | How | Conflict | Temporal Granularity
OmniBench [26] | Understanding | ✗ | ✗ | ✗ | ✗ | •◦◦◦
OmniVideoBench [22] | Understanding | ~ | ~ | ✗ | ✗ | ••◦◦
WorldSense [14] | Understanding | ~ | ~ | ✗ | ✗ | ••◦◦
OmniEval [57] | Understanding | ~ | ~ | ✗ | ✗ | •◦◦◦
Daily-Omni [61] | Understanding | ~ | ~ | ✗ | ✗ | ••◦◦
JointAVBench [3] | Understanding | ~ | ~ | ✗ | ✗ | ••◦◦
OmniMMI [47] | Interaction | ~ | ~ | ~ | ✗ | •••◦
Omni-SafetyBench [39] | Understanding | ✗ | ✗ | ✗ | ~ | •◦◦◦
SocialOmni (Ours) | Interaction | ✓ | ✓ | ✓ | ✓ | ••••

Prior work can be broadly categorized into two groups. Answer-centric benchmarks focus on what a model knows by posing static question-answering or retrieval tasks over pre-segmented audio-visual clips [14, 22, 26, 57], measuring propositional accuracy alone. While effective for isolating perceptual and reasoning skills, these benchmarks treat queries independently and thus fail to assess coherent understanding across multi-turn dialogues, neglecting crucial conversational dynamics. In contrast, behavior-centric benchmarks explore how models act within context, probing skills such as multi-speaker perception [3, 20], socially grounded reasoning [6, 38], or daily conversational inference [61]. Although these benchmarks move beyond answer correctness by targeting interactive behaviors, they typically isolate single facets (e.g., speaker diarization or emotion recognition) without simultaneously evaluating perception, reasoning, and social appropriateness. Consequently, neither family addresses the integrated, multimodal, and social complexities of real-world dialogue, where models must interpret evolving context, understand multimodal cues, and respond coherently and appropriately in real time.

This limitation is consequential. In live dialogue, utility depends jointly on semantic correctness and social timing: a delayed turn entry, a premature interruption, or an incoherent topic continuation can each substantially degrade user experience even when the propositional content is accurate [44]. If evaluation remains fixated on correctness alone, model selection will systematically over-reward offline comprehension while under-penalizing such interaction failures [52]. To close this gap, we propose SocialOmni, a benchmark that operationalizes social-interactivity evaluation across three core dimensions:

SocialOmni: Three Dimensions of Social Interactivity
Who (speaker identification): identifying speakers by integrating multimodal information, including visual cues, acoustic features, and contextual dialogue history, across multiple speakers.
When (interruption timing control): determining the optimal timing and strategy for interruptions by analyzing dialogue dynamics and turn-taking patterns in real time.
How (natural interruption generation): producing a response that fits the ongoing dialogue context while maintaining coherence with speaker intent and conversation flow.

Accordingly, SocialOmni tests the end-to-end pipeline from precise audio-visual grounding, to the turn-entry decision, to adaptive on-the-fly continuation under strict latency constraints. Beyond defining evaluation targets, these dimensions expose concrete architectural challenges for current OLMs: Who requires fine-grained audio-visual alignment beyond the temporal granularity of most video encoders; When demands nuanced fusion of prosodic, lexical, and visual turn-taking cues under dynamically shifting salience; and How stresses robust real-time generation of contextually grounded continuations under cross-modal attention and latency constraints. SocialOmni comprises 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances across 15 dialogue domains, with controlled audio-visual inconsistency scenarios designed to probe robustness under cross-modal conflict.

We evaluate 12 OLMs and observe two recurring patterns. First, models exhibit markedly different error profiles across who, when, and how, indicating that substantial gains on one axis do not imply robustness on the others. Second, we observe a pronounced decoupling between perceptual accuracy and interruption-generation quality: models that excel at speaker identification do not always produce natural interruptions. These results show that understanding-centric benchmarks alone are insufficient to characterize conversational social competence, motivating dedicated interaction-oriented evaluation.

Our contributions are threefold: i) New omni-model benchmark. We introduce SocialOmni, a comprehensive benchmark for evaluating audio-visual social interaction understanding along three axes: who, when, and how. ii) New dual-axis evaluation protocol. We propose a protocol that couples frame-level perception diagnosis with multi-judge generation scoring, enabling perception-generation decoupling analysis. iii) New robustness probes. We design controlled mismatch probes that systematically quantify model robustness and generalization under realistic audio-visual conflict scenarios.

2 Related Work

Omni-Modal Large Language Models. Multimodal modeling has rapidly evolved from perception-centric paradigms such as CLIP [40] and instruction-tuned VLMs like Flamingo and LLaVA [1, 29] toward omni-modal large language models (OLMs) that natively couple text, vision, and audio within a unified interaction loop [7, 10, 18, 25, 48, 50, 51]. Recent studies on multimodal perception and representation learning also broaden the design space of modern MLLMs [23, 55, 56]. From a system-design perspective, OLM stacks range from dispatch designs, where a central LLM orchestrates external ASR, VAD, diarization, and visual grounding modules, to native designs that tighten cross-modal coupling inside a single generation loop [7, 10, 12, 13, 18, 25, 49, 50, 52, 54]. Simultaneously, scalable deployment motivates parallel work on adaptation, pruning, quantization, and efficiency optimization for large models [15-17, 34-37, 59].
Yet even highly capable systems remain largely evaluated under turn-level request-response protocols, leaving open whether they can proactively decide when to enter a conversation, whom to address in multi-party settings, and how to realize interruptions in socially coherent ways, as highlighted by classical turn-taking theory [42, 44]. Moreover, common architectural choices such as sparse temporal sampling, coarse cross-modal alignment, and turn-level segmentation systematically mask timing errors that surface only under fine-grained, real-time conditions, an issue closely related to recent efforts on semantic-boundary-based frame selection, event-anchored sampling, query-oriented token budgeting, and retrieval-augmented long-video comprehension [4, 5, 32, 33]. This motivates benchmarks that explicitly test interaction competence beyond answer correctness.

Answer-Centric Benchmarks for OLMs. Comprehensive broad-coverage omni suites and modality-specific understanding benchmarks [9, 14, 21, 22, 24, 26, 31, 45, 53, 57, 60] evaluate what a model knows by posing question-answering or retrieval tasks over pre-segmented multimodal stimuli and measuring the propositional accuracy of the response. Cross-modal QA suites [26, 57] pair audio-visual clips with factual questions and score models on answer correctness under unified metrics, while domain-specific benchmarks such as MMMU [53] and AudioBench [45] probe expert-level comprehension within individual modalities through multiple-choice or open-ended question answering, again using answer accuracy as the sole evaluation signal. Video understanding benchmarks [9, 22, 24, 60] extend this paradigm to temporal reasoning by querying event ordering or causal relations across frames, yet still treat each question as an isolated, single-turn trial. Although these efforts have substantially expanded perceptual and reasoning coverage, they share a common structural limitation: evaluation is confined to static prompt-response pairs and does not enforce frame-level temporal alignment, turn-entry decisions, or interruption handling within an unfolding dialogue. Consequently, strong answer accuracy does not imply reliable interaction behavior under real-time, multi-party constraints, leaving the behavior-centric dimension largely unaddressed.

Behavior-Centric Benchmarks for OLMs. Recent efforts have begun to probe interactive behavior. Social reasoning benchmarks [20, 38] target multi-speaker inference and social attribute understanding, yet do not evaluate turn-entry timing or interruption strategy. Spoken-dialogue and full-duplex
benchmarks [2, 19, 27, 28, 43, 46] emphasize turn-taking timing and interruption detection, yet predominantly operate under audio-only stimuli with limited speaker grounding or multimodal conflict control. Multimodal interaction benchmarks [3, 6, 39, 47, 61] introduce joint audio-visual conversational settings, but frequently lack frame-level temporal supervision and diagnostic control of cross-modal conflict. Although each line of work advances one facet, to the best of our knowledge no existing benchmark simultaneously operationalizes the integrated triad required for full-duplex multi-party conversation: speaker attribution (who), turn-entry decision (when), and interruption realization (how). In real dialogue the three are causally entangled: deciding when presupposes a correct attribution of who, and the appropriateness of how depends on both. Evaluating them in isolation can therefore systematically overestimate interactive competence by masking failure cascades. The joint evaluation of multi-party interaction competence under fine-grained temporal alignment and controlled cross-modal conflict thus remains an open problem.

[Figure 1. Overview of SocialOmni. (a) Benchmark data distribution across 15 subcategories and four domains (2,209 videos), with consistent/inconsistent stratification and perception/generation task splits. (b) Overview of the proposed evaluation tasks and metrics. (c) Performance comparison of 12 OLMs on both Task I and Task II.]

3 SocialOmni: Evaluating Omni-Modal Multi-Party Interactivity

We propose SocialOmni, a comprehensive benchmark for evaluating the social interactivity of omni-modal large language models (OLMs) in multi-party conversational settings. Unlike existing video-understanding benchmarks that treat the model as a passive observer, SocialOmni requires jointly recognizing who is speaking, judging when to take the floor, and deciding how to respond: three tightly coupled abilities that underpin natural conversation yet have not been assessed in a unified framework. In what follows, we first introduce how the benchmark is constructed and curated (§3.1-3.2), then formalize the unified who-when-how task design (§3.3) and the accompanying evaluation protocol (§3.4).

3.1 Benchmark Construction

Rigorously evaluating the social interactivity of OLMs requires dialogue videos that span a wide spectrum of conversational types while maintaining high audio-visual quality and appropriate redistribution licenses. We first compile a search-term database targeting diverse multi-party dialogue scenarios, rank terms by the volume of retrievable videos on public platforms with CC-BY-compatible licenses, and retain only those yielding sufficient results of high production quality. This procedure produces 15 dialogue subcategories organized into four domains: Entertainment, Sports, Art, and Fashion under the Entertainment domain; Business, Technology, Education, and General under the Professional domain; Daily, Food, Travel, and Health under the Daily Life domain; and Emotion, Real, and Others under the Narrative domain. In total, we crawl over 3,000 raw videos across these subcategories. Eight trained annotators independently review every video and extract segments of 10-30 s containing clear multi-party dialogue. Each clip is assigned to the perception or generation task according to the criteria detailed in §3.3. After stringent filtering for audio clarity, face visibility, and turn-structure quality, 2,209 clips survive, with a mean duration of 25.0 s. We then apply Whisper [41] and FunASR [11] to every surviving clip to obtain automatic transcripts, which serve dual purposes: they provide the raw material for constructing perception answer options and act as reference text for evaluating generation quality. Prompt templates and parsing rules appear in Appendix A.14.
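The transcription pass can be reproduced with off-the-shelf tooling. Below is a minimal sketch using the open-source whisper package to obtain segment-level transcripts with timestamps; the checkpoint name, file layout, and the handling of a second FunASR pass are illustrative assumptions rather than the exact SocialOmni pipeline.

```python
# Minimal transcription sketch for the curated clips (illustrative; the exact
# SocialOmni pipeline, model sizes, and FunASR merging rules may differ).
import json
from pathlib import Path

import whisper  # pip install openai-whisper

CLIP_DIR = Path("clips/")          # assumed layout: one video file per clip
OUT_PATH = Path("transcripts.json")

model = whisper.load_model("large")  # assumption: any Whisper checkpoint works here

transcripts = {}
for clip in sorted(CLIP_DIR.glob("*.mp4")):
    result = model.transcribe(str(clip))
    # Keep segment-level timestamps: they later anchor the query timestamp t
    # for Task I options and the turn-entry boundaries for Task II.
    transcripts[clip.name] = [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]

OUT_PATH.write_text(json.dumps(transcripts, ensure_ascii=False, indent=2))
```

A second ASR pass (e.g., with FunASR) can be run analogously and cross-checked against the Whisper output to flag low-confidence segments before annotation.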
3.2 Statistics and Quality Control

Table 1 positions SocialOmni relative to prior benchmarks. The benchmark comprises 2,209 evaluation instances divided into two complementary splits: the perception split contains 2,000 multiple-choice questions (1,725 consistent and 275 inconsistent), while the generation split provides 209 open-ended items, each accompanied by multi-reference responses. As shown in Fig. 1(a), the 15 subcategories are deliberately balanced so that no single conversational style dominates: the General category contributes the most clips (394) and Fashion the fewest (70). The generation subset is intentionally kept compact to maintain manageable variance in open-ended judging, yet it preserves full domain coverage. Inter-annotator agreement reaches 94.2% on the perception split and 91.8% on the generation split, confirming high annotation reliability. Full agreement statistics, a size-rationale analysis for the generation split, and complete subcategory definitions appear in Appendices A.1, A.5, and A.2, respectively.

3.3 Task Design

SocialOmni frames real-time multi-party interaction as a unified who-when-how problem. Recognizing who is speaking at a given moment is fundamentally a perceptual ability, whereas deciding when to take the floor and how to respond demands genuine generative interaction. We therefore operationalize the benchmark through two complementary tasks that together cover the full arc of a conversational turn.

Task I: Who — Perception. This task evaluates the ability to identify the active speaker at timestamp t within video V and audio A. Candidate choices are synthesized by permuting two orthogonal axes, speaker identity and textual content, derived automatically from the ASR transcripts. The resulting four-way classification includes the ground truth (correct speaker, correct content) alongside three distractors: wrong speaker with correct content, correct speaker with wrong content, and wrong speaker with wrong content. This design decouples errors in visual grounding from errors in speech recognition. Each clip is additionally labeled as consistent (the on-screen person matches the audio source) or inconsistent (the camera shows a different person), enabling fine-grained diagnosis of robustness to cross-modal mismatch. Representative examples of both types appear in Fig. 2.
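To make the option-construction rule concrete, the following sketch crosses speaker identity with transcript content to build the four Task I candidates. The function name, the way distractor speakers and texts are drawn, and the shuffling seed are illustrative assumptions, not the benchmark's exact generation script.

```python
# Illustrative Task I option construction: permute {speaker} x {content}
# to obtain the ground truth plus three controlled distractors.
import random
from dataclasses import dataclass

@dataclass
class Option:
    speaker: str   # textual description of the candidate speaker
    text: str      # utterance content attributed to that speaker
    is_correct: bool

def build_options(gt_speaker: str, gt_text: str,
                  distractor_speaker: str, distractor_text: str,
                  seed: int = 0) -> list[Option]:
    """Return four shuffled options: (correct, correct), (wrong, correct),
    (correct, wrong), (wrong, wrong)."""
    options = [
        Option(gt_speaker, gt_text, True),
        Option(distractor_speaker, gt_text, False),          # wrong speaker, correct content
        Option(gt_speaker, distractor_text, False),          # correct speaker, wrong content
        Option(distractor_speaker, distractor_text, False),  # wrong speaker, wrong content
    ]
    random.Random(seed).shuffle(options)
    return options

# Example usage with ASR-derived fields (hypothetical values):
opts = build_options(
    gt_speaker="person on the left, actively speaking",
    gt_text="Mike, you never listen!",
    distractor_speaker="person on the right, closed lips",
    distractor_text="Let's move on to the next topic.",
)
for label, opt in zip("ABCD", opts):
    print(f"({label}) {opt.speaker}: \"{opt.text}\"")
```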
Task II: When & How — Generation. Given a video prefix $V_{\le t}$ with the corresponding audio prefix $A_{\le t}$, the model first addresses when to speak, a binary turn-taking decision at timestamp $t$. If the answer is affirmative, it then addresses how to respond by generating a context-appropriate utterance. Clips for this task are selected under stricter criteria: speaker turns must alternate with sufficient clarity for a human observer to pinpoint transition boundaries unambiguously. The annotated boundaries serve as ground truth for the when sub-question, and each clip is paired with multi-reference continuations to support robust evaluation of the how sub-question. All annotations undergo two rounds of adjudication (independent labeling followed by cross-review) to ensure consistency. Further details appear in Appendix A.9.

[Figure 2. Illustration of the SocialOmni evaluation pipeline. Given a multimodal conversation stream (Zone 1), SocialOmni constructs both audio-vision consistent and inconsistent variants (Zone 2), then evaluates models on speaker perception (Task I) and turn-entry generation (Task II) with LLM-based judging (Zone 3).]

3.4 Evaluation Metrics

We design evaluation metrics for each of the three axes. For who, we use top-1 accuracy and macro-F1; for when, we measure the signed response offset and assign each prediction to one of five timing categories; for how, we adopt an LLM-as-a-judge score. Perception is evaluated independently, while timing and response quality are evaluated jointly: the model first decides when to speak, and only then is its response scored.

Perception metric (who). The perception split contains $N_p = 2{,}000$ clips. Each clip is paired with a query timestamp and four candidate descriptions of who is saying what at that moment; the model selects the correct one, and we report top-1 accuracy with non-parsable outputs counted as incorrect. Because the benchmark deliberately includes both consistent clips ($N_{\text{cons}} = 1{,}725$) and inconsistent clips ($N_{\text{incons}} = 275$), we also report accuracy on each subset separately. Their difference defines the consistency gap $\Delta_{\text{cons}} \triangleq \text{Acc}_{\text{cons}} - \text{Acc}_{\text{incons}}$, which quantifies the model's reliance on visual-audio alignment: a large positive gap reveals that the model struggles when the visible face does not match the speaker's voice. To check for systematic positional bias (e.g., always selecting option A), we additionally report macro-averaged F1 across the four answer positions on parsable outputs. The complete procedure is summarized in Algorithm 1.
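A compact sketch of the who-metric computation follows; it assumes per-clip records with the predicted option, the ground-truth option, and the consistency label, and uses scikit-learn only for the macro-F1. The record layout and field names are illustrative.

```python
# Illustrative computation of the perception (who) metrics from Sec. 3.4:
# overall / per-subset accuracy, the consistency gap, and macro-F1 over options.
from sklearn.metrics import f1_score

def who_metrics(records):
    """records: list of dicts with keys
    'pred' in {'A','B','C','D', None}, 'gold' in {'A','B','C','D'},
    and 'consistent' (bool). None marks a non-parsable output."""
    def acc(subset):
        # Non-parsable predictions (None) never match gold, so they count as incorrect.
        return sum(r["pred"] == r["gold"] for r in subset) / max(len(subset), 1)

    cons = [r for r in records if r["consistent"]]
    incons = [r for r in records if not r["consistent"]]

    # Macro-F1 is computed on parsable outputs only, over the four answer positions.
    parsable = [r for r in records if r["pred"] is not None]
    macro_f1 = f1_score(
        [r["gold"] for r in parsable],
        [r["pred"] for r in parsable],
        labels=["A", "B", "C", "D"],
        average="macro",
    )
    return {
        "acc_all": acc(records),
        "acc_cons": acc(cons),
        "acc_incons": acc(incons),
        "delta_cons": acc(cons) - acc(incons),  # consistency gap
        "macro_f1": macro_f1,
    }
```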
Turn-taking timing metric (when). The generation split contains $N_g = 209$ clips, each annotated with a ground-truth turn-entry timestamp $\tau^\star_i$ and a candidate speaker $X_i$. To simulate real-time reception, we incrementally extend the visible prefix by one second at each step and query the model with "Should $X_i$ speak now?" We evaluate strides of 0.5 s, 1 s, and 2 s; the 1 s stride provides a favorable trade-off between evaluation cost and temporal precision (Appendix A.6). Let $\hat{\tau}_i$ denote the first timestamp at which the model answers YES. The signed response offset $\Delta\tau_i \triangleq \hat{\tau}_i - \tau^\star_i$ captures the deviation from the ideal entry point, where negative values indicate premature interruption and positive values indicate delayed response. Based on thresholds of (1, 2, 5) s, we assign each clip to one of five timing categories: Interrupted ($\Delta\tau_i < -1$ s) means the model disrupts the ongoing turn; Perfect ($-1 \le \Delta\tau_i \le 2$ s) indicates an acceptable entry window; Delayed ($2 < \Delta\tau_i \le 5$ s) marks a noticeably late but still relevant response; TooLate ($\Delta\tau_i > 5$ s) signals that the conversational window has passed; and NoResponse means the model never answers YES. We collapse these into three summary groups: Early (E) = Interrupted, On-time (O) = Perfect, and Late (L) = Delayed ∪ TooLate ∪ NoResponse. The primary when-score is the On-time rate O, the fraction of clips in the Perfect window. Threshold justification is given in Appendix A.7.

Response quality metric (how). For every clip in which the model decides to speak, it produces a response $\hat{s}_i$, which we assess via an LLM-as-a-judge protocol [30, 58] with three independent judges: GPT-4o [18], Gemini 2.5 Pro [12], and Qwen3-Omni [50]. Each judge receives the full ASR transcript, the annotated reference continuation, and the model's response, then assigns a score on a four-level scale {25, 50, 75, 100}; this coarse granularity reduces judge hesitation and improves inter-judge agreement [30, 58]. The per-clip score is the three-judge mean $\bar{s}_i = \tfrac{1}{3}(s_i^{(1)} + s_i^{(2)} + s_i^{(3)})$, and the dataset-level how-score averages $\bar{s}_i$ over all clips with non-empty responses. Two auxiliary metrics accompany the how-score: the response coverage $\text{Cov} = |\mathcal{G}|/N_g$, the fraction of clips for which the model produces a valid utterance, and the large-gap rate $R_{\text{gap}}$, the fraction of clips on which at least two judges disagree by $\ge 25$ points. The coupled evaluation pipeline for the when and how tasks is given in Algorithm 2.

Algorithm 1 Perception Evaluation (who)
Require: Clips $\{(V_i, A_i, t_i, O_i, y^\star_i)\}_{i=1}^{N_p}$; model $f$
Ensure: $\text{Acc}_{\text{all}}$, $\text{Acc}_{\text{cons}}$, $\text{Acc}_{\text{incons}}$, $\Delta_{\text{cons}}$, macro-F1
1: for each clip $i$ do
2:   Feed video $V_i$, audio $A_i$, timestamp $t_i$, and options $O_i$ to $f$; obtain prediction $\hat{y}_i$
3:   if $\hat{y}_i$ is non-parsable then
4:     Mark as incorrect
5:   else
6:     Compare $\hat{y}_i$ with ground truth $y^\star_i$; update per-subset counters
7:   end if
8: end for
9: Compute overall, consistent, and inconsistent accuracy
10: $\Delta_{\text{cons}} \leftarrow \text{Acc}_{\text{cons}} - \text{Acc}_{\text{incons}}$
11: Compute macro-F1 over the four answer positions
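The following sketch mirrors the when/how scoring defined above (and in Algorithm 2): it maps a signed offset to the five timing categories and the E/O/L summary groups, and aggregates the three judge scores into the how-score, coverage, and large-gap rate. Variable names and the record layout are illustrative assumptions.

```python
# Illustrative when/how scoring helpers matching the definitions in Sec. 3.4.
ALPHA, BETA, GAMMA = 1.0, 2.0, 5.0  # timing thresholds in seconds

def timing_category(delta_tau):
    """Map a signed offset (hat_tau - tau_star, in seconds) to a timing label.
    delta_tau is None when the model never answered YES."""
    if delta_tau is None:
        return "NoResponse"
    if delta_tau < -ALPHA:
        return "Interrupted"
    if delta_tau <= BETA:
        return "Perfect"
    if delta_tau <= GAMMA:
        return "Delayed"
    return "TooLate"

SUMMARY = {"Interrupted": "E", "Perfect": "O",
           "Delayed": "L", "TooLate": "L", "NoResponse": "L"}

def how_metrics(judge_scores, n_clips):
    """judge_scores: list of (s1, s2, s3) triples, one per clip with a response;
    n_clips: total number of generation clips (N_g)."""
    per_clip = [sum(s) / 3 for s in judge_scores]
    how_score = sum(per_clip) / max(len(per_clip), 1)
    coverage = len(judge_scores) / n_clips
    # Large-gap rate: at least one judge pair differs by >= 25 points.
    large_gap = sum(max(s) - min(s) >= 25 for s in judge_scores) / max(len(judge_scores), 1)
    return {"how": how_score, "coverage": coverage, "R_gap": large_gap}

# Example: a clip entered 0.4 s late with judge scores (75, 75, 100).
print(timing_category(0.4), SUMMARY[timing_category(0.4)])  # Perfect O
print(how_metrics([(75, 75, 100)], n_clips=209))
```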
4 Experiments

We organize the experiments around two questions: (1) Where do current omni-modal models stand on the three axes of social interaction? (2) What capabilities are still missing, and where do models fail collectively? We first introduce the setup (§4.1), then present the main results with a unified leaderboard and capability profiles (§4.2), and finally conduct a diagnostic analysis across three layers: perception reliability, timing-response behavior, and failure cases (§4.3).

4.1 Experiment Setup

Models. We evaluate twelve omni-modal large language models spanning commercial APIs and open-source systems. Commercial: GPT-4o [18], Gemini 2.5 Pro/Flash [7], and Gemini 3 Flash/Pro Preview [12]. Open-source: Qwen3-Omni, Qwen3-Omni-Thinking, Qwen2.5-Omni [50], OmniVinci [51], Baichuan-Omni-1.5 [25], VITA-1.5 [10], and MiniOmni2 [48]. MiniOmni2 lacks a stable generation interface and is evaluated on perception only. We use the default system prompts and generation settings for all models to ensure fair comparison across platforms.

Inputs and prompting. All models receive raw video (decoded at 30 fps) and audio (native sampling rate) under a unified interface. Ground-truth transcripts are never exposed to the evaluated models; they are used solely by the judges for response-quality scoring. Prompt templates are fixed across all models (Appendix A.14).

4.2 Main Results

No single model dominates all three axes. The leader differs by axis: Qwen3-Omni on who (69.25%), Gemini 3 Pro Preview on when (67.31%), and Gemini 2.5 Flash on how (85.08). Every radar polygon in Figure 3 is visibly lopsided, confirming that a single aggregate score would mask critical axis-specific gaps.

Open-source models lag substantially behind commercial systems. The gap is particularly pronounced on response quality: the best open-source how score (Qwen2.5-Omni, 66.15) trails the best commercial score (Gemini 2.5 Flash, 85.08) by nearly 19 points. Models such as VITA-1.5 (12.49) and Baichuan-Omni-1.5 (27.27) produce fluent but contextually irrelevant responses. On when, the gap is narrower but consistently favors the commercial APIs. On who, the picture is mixed: Qwen3-Omni leads all models, while most other open-source systems remain below the commercial median.

Perception and generation abilities do not correlate. The rank inversion is striking: Qwen3-Omni-Thinking achieves a competitive who score yet falls among the lowest on how (18.06), while GPT-4o shows a low who score (36.75%) but a strong how score (69.64). This decoupling confirms that conversational interactivity must be evaluated as a multi-dimensional profile.
Algorithm 2 Generation Evaluation (when-how)
Require: Clips $\{(V_i, A_i, \tau^\star_i, X_i)\}_{i=1}^{N_g}$; model $f$; stride $\delta = 1$ s; judges $\{J_k\}_{k=1}^{3}$; thresholds $(\alpha, \beta, \gamma) = (1, 2, 5)$ s
Ensure: Timing distribution, $\overline{\Delta\tau}$, $\text{Score}_{\text{how}}$, Cov, $R_{\text{gap}}$
1: for each clip $i$ do
2:   for $t = \delta, 2\delta, \ldots, T_i$ do
3:     Show the first $t$ seconds to $f$ and ask "Should $X_i$ speak now?"
4:     if $f$ answers YES then
5:       Record entry time $\hat{\tau}_i \leftarrow t$; break
6:     end if
7:   end for
8:   if $f$ never answers YES then
9:     Label as NoResponse; continue
10:   end if
11:   Compute offset $\Delta\tau_i \leftarrow \hat{\tau}_i - \tau^\star_i$
12:   Assign timing category: Interrupted / Perfect / Delayed / TooLate
13:   Ask $f$ to generate a response $\hat{s}_i$ given the first $\hat{\tau}_i$ seconds
14:   for each judge $J_k$ do
15:     $s_i^{(k)} \leftarrow J_k(\text{transcript}, \text{reference}, \hat{s}_i) \in \{25, 50, 75, 100\}$
16:   end for
17:   Per-clip score: $\bar{s}_i \leftarrow \tfrac{1}{3}\sum_k s_i^{(k)}$
18: end for
19: Aggregate the timing distribution and mean offset $\overline{\Delta\tau}$
20: $\text{Score}_{\text{how}} \leftarrow$ average of $\bar{s}_i$ over clips with responses
21: Cov $\leftarrow$ fraction of clips with responses
22: $R_{\text{gap}} \leftarrow$ fraction of clips where judges disagree by $\ge 25$ points

[Figure 3. Cross-axis capability profiles over normalized who-when-how dimensions (Who, Who-Cons., Who-Incons., Robustness, When-Acc., When-F1, On-time, How). Each polygon shows one model. No single model dominates all axes, revealing distinct strengths and weaknesses.]

Table 2. SocialOmni main performance across the who-when-how axes. Who is top-1 accuracy on the perception split (2,000 items). When is timing accuracy on the generation split (209 items). How is the judge score (/100). '–' indicates not supported due to interface constraints. *MiniOmni2 is evaluated on perception only because no stable generation interface is available.

Model | Who (%) | When (%) | How (/100)
GPT-4o [18] | 36.75 | 46.89 | 69.64
Gemini 2.5 Pro [7] | 44.69 | 55.67 | 72.32
Gemini 2.5 Flash [7] | 47.03 | 61.50 | 85.08
Gemini 3 Flash Preview [12] | 53.23 | 61.06 | 79.08
Gemini 3 Pro Preview [12] | 64.99 | 67.31 | 81.77
Qwen3-Omni [50] | 69.25 | 63.64 | 45.57
Qwen3-Omni-Thinking [50] | 54.60 | 46.41 | 18.06
Qwen2.5-Omni [50] | 36.75 | 57.42 | 66.15
OmniVinci [51] | 35.86 | 41.63 | 55.86
VITA-1.5 [10] | 36.95 | 43.37 | 12.49
Baichuan-Omni-1.5 [25] | 25.65 | 46.88 | 27.27
MiniOmni2* [48] | 16.72 | – | –

Table 3. Perception-task (who) speaker identification metrics (bootstrap 95% CI, 10,000 resamples, seed = 42).

Model | Acc. | Acc. [95% CI] | F1-m | F1-m [95% CI]
GPT-4o [18] | 36.75 | [34.66, 38.89] | 35.80 | [33.63, 37.97]
Gemini 2.5 Pro [7] | 44.69 | [42.52, 46.88] | 44.53 | [42.39, 46.67]
Gemini 2.5 Flash [7] | 47.03 | [44.82, 49.24] | 46.75 | [44.56, 48.93]
Gemini 3 Flash Preview [12] | 53.23 | [51.04, 55.41] | 53.36 | [51.18, 55.53]
Gemini 3 Pro Preview [12] | 64.99 | [62.86, 67.06] | 65.02 | [62.93, 67.08]
Qwen3-Omni [50] | 69.25 | [67.19, 71.23] | 68.81 | [66.72, 70.81]
Qwen3-Omni-Thinking [50] | 54.60 | [52.43, 56.76] | 53.99 | [51.65, 56.25]
Qwen2.5-Omni [50] | 36.75 | [34.66, 38.89] | 33.38 | [31.24, 35.49]
OmniVinci [51] | 35.86 | [32.64, 39.14] | 31.09 | [27.75, 34.33]
VITA-1.5 [10] | 36.97 | [34.86, 39.03] | 34.43 | [32.29, 36.50]
Baichuan-Omni-1.5 [25] | 25.65 | [23.78, 27.61] | 16.67 | [15.49, 17.86]

4.3 Diagnostic Analysis

The main results reveal what the landscape looks like; we now ask why. We structure the diagnosis into three layers: perception reliability, timing-and-response coupling, and universal failure modes.

4.3.1 Who: Perception Reliability

Table 3 supplements who accuracy with macro-F1 and 95% bootstrap confidence intervals. The overall ranking is broadly preserved, but several models show a notable accuracy-to-F1 drop, indicating uneven performance across answer positions, a hallmark of positional selection bias (e.g., consistently favoring option A). This validates the use of macro-F1 as a complementary reliability metric: models that appear competitive on accuracy alone may be unreliable when class balance is enforced.
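The confidence intervals in Tables 3 and 4 can be reproduced with a standard percentile bootstrap over per-clip correctness [8]; a minimal sketch follows. The resample count and seed match those stated in the table captions, while the function name and input format are illustrative.

```python
# Percentile-bootstrap 95% CI for an accuracy estimate (cf. Tables 3 and 4).
import numpy as np

def bootstrap_ci(correct, n_resamples=10_000, seed=42, level=0.95):
    """correct: array-like of 0/1 per-clip correctness indicators."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    n = len(correct)
    # Resample clips with replacement and recompute accuracy each time.
    idx = rng.integers(0, n, size=(n_resamples, n))
    accs = correct[idx].mean(axis=1)
    lo, hi = np.percentile(accs, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return correct.mean(), (lo, hi)

# Example on synthetic data: 209 generation clips with roughly 47% accuracy.
rng = np.random.default_rng(0)
demo = (rng.random(209) < 0.47).astype(int)
print(bootstrap_ci(demo))
```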
4.3.2 When + How: Timing Behavior and Response Quality

Interruption vs. delay. Figure 4 decomposes every model's timing predictions into E/O/L phases, revealing two opposing failure modes. Aggressive models (e.g., Qwen2.5-Omni, E = 22.5%; VITA-1.5, E = 21.9%) frequently interrupt the ongoing speaker before the turn boundary, showing poor turn-taking awareness. Conservative models (e.g., OmniVinci, L = 54.5%; GPT-4o, L = 45.5%) rarely interrupt but miss the conversational window entirely, sacrificing responsiveness for caution. The best when performers (e.g., Gemini 3 Pro, E = 5.3%, L = 27.4%) achieve low E and low L simultaneously, reflecting a well-calibrated entry strategy that balances timing precision with conversational naturalness.

Table 4. Turn-taking timing (when) reliability on the generation task ($\delta = 0.2$ s, bootstrap 95% CI).

Model | Acc. | Acc. [95% CI] | P | R | F1
GPT-4o [18] | 46.89 | [40.19, 53.59] | 70.37 | 28.57 | 40.64
Gemini 2.5 Pro [7] | 55.67 | [48.77, 62.56] | 75.95 | 45.80 | 57.14
Gemini 2.5 Flash [7] | 61.50 | [54.50, 68.00] | 73.45 | 63.85 | 68.31
Gemini 3 Flash Preview [12] | 61.06 | [54.33, 67.79] | 73.21 | 61.65 | 66.94
Gemini 3 Pro Preview [12] | 67.31 | [61.06, 73.56] | 87.36 | 57.14 | 69.09
Qwen3-Omni [50] | 63.64 | [56.94, 69.86] | 76.64 | 61.65 | 68.33
Qwen3-Omni-Thinking [50] | 46.41 | [39.71, 53.11] | 60.61 | 45.11 | 51.72
Qwen2.5-Omni [50] | 57.42 | [50.72, 64.11] | 65.94 | 68.42 | 67.16
OmniVinci [51] | 41.63 | [34.93, 48.33] | 70.37 | 14.29 | 23.75
VITA-1.5 [10] | 43.37 | [36.73, 50.51] | 55.21 | 43.80 | 48.85
Baichuan-Omni-1.5 [25] | 46.89 | [40.19, 53.59] | 59.32 | 52.63 | 55.78

Precision-recall trade-off. Figure 5 shows that models with ostensibly similar On-time rates can occupy very different positions in precision-recall space: a high-precision, low-recall model is overly cautious (correct when it speaks, but missing valid entry points), while the reverse signals a trigger-happy strategy. Timing behavior is therefore a two-dimensional trade-off that the single On-time percentage in Table 2 does not capture.

Timing-quality coupling. Comparing Figure 4 with the how column of Table 2, premature entry (high E) does not necessarily degrade response quality; some models generate reasonable continuations even when entering slightly early. Conversely, very late entry (high L) consistently correlates with lower how scores, because the model misses the relevant conversational context. Good timing is a necessary but not sufficient condition for good response quality.

[Figure 4. Timing-phase decomposition for turn entry. Early/On-time/Late rates expose whether a model tends to interrupt prematurely or miss the optimal conversational window during dialogue.]
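For completeness, the sketch below shows one plausible way to derive the precision, recall, and F1 reported for when decisions, assuming each probe ("Should X_i speak now?") is treated as positive when it falls inside the acceptable entry window. The exact delta-window formulation used for Table 4 is specified in Appendix A.8, so treat this as an assumption-laden approximation rather than the paper's definition.

```python
# Illustrative precision/recall/F1 for binary turn-entry decisions.
# Assumption: each probe is a (model_says_yes, in_entry_window) pair; the exact
# windowing rule used for Table 4 is defined in Appendix A.8 and may differ.
def when_prf(probes):
    """probes: list of (pred_yes: bool, gold_yes: bool) per queried timestamp."""
    tp = sum(p and g for p, g in probes)
    fp = sum(p and not g for p, g in probes)
    fn = sum((not p) and g for p, g in probes)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: a cautious model that answers YES rarely but usually correctly.
probes = ([(True, True)] * 20 + [(False, True)] * 30
          + [(True, False)] * 5 + [(False, False)] * 45)
print(when_prf(probes))  # high precision (0.80), lower recall (0.40)
```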
4.3.3 Failure Cases

Beyond aggregate metrics, we inspect cases where the majority of evaluated models fail, thereby identifying systemic bottlenecks rather than individual model weaknesses.

Perception failures. Two dominant patterns emerge. (i) Cross-modal temporal incoherence: when the camera cuts to a reaction shot while the speaker continues off-screen, most models attribute the utterance to the visually salient face rather than maintaining speaker-identity binding across frames. This reflects a failure to reconcile "who was speaking before the cut" with "who is visible now", a deficit in temporal cross-modal coherence rather than in either modality alone. (ii) Correct transcription, wrong speaker: models often select the option matching the correct ASR content yet assign it to the wrong on-screen person. The perception pipeline effectively collapses to text matching, bypassing genuine voice-face grounding such as timbre or lip-sync alignment. These two modes together explain the majority of perception errors and are amplified in the inconsistent subset, where visual cues are deliberately unreliable.

[Figure 5. Precision-recall operating points for when decisions. Iso-F1 guides highlight the trade-off between cautious and trigger-happy turn-entry strategies in dialogue systems.]

Generation failures. Two parallel patterns appear on the generation side. (i) Premature interruption: models frequently trigger a turn entry at prosodic pauses or hesitations that merely resemble turn-final cues, indicating reliance on shallow silence-gap detection rather than integration of discourse-level signals such as syntactic incompleteness, sustained eye contact, or rising intonation. (ii) Contextually incoherent continuation: even when a model times the interruption correctly, the generated content is often generic or tonally mismatched, ignoring the emotional tenor, topic trajectory, and interpersonal dynamics established in the prior context. This directly instantiates the perception-generation decoupling central to SocialOmni: correct perception does not guarantee socially appropriate generation.

Overall, these four failure modes confirm that social interactivity must be evaluated as a joint who-when-how profile; strong performance on any single axis does not preclude systemic failure on the others.

5 Conclusion

In this paper, we present SocialOmni, a comprehensive benchmark for joint who-when-how evaluation of omni-modal large language models, where Task I targets speaker identification (who) and Task II targets turn timing (when) and response generation (how). Experiments on 12 OLMs show a rank decoupling between perceptual accuracy and generation quality, alongside heterogeneous robustness under speaker-camera mismatch. These findings suggest that understanding accuracy alone cannot characterize conversational social competence, underscoring the need for interaction-oriented evaluation.

Limitations and Future Work. The generation subset serves as a controlled diagnostic and does not exhaustively cover all dialogue transitions.
The Task II evaluation, especially its response-quality component, relies on transcribed model outputs and may underweight visual grounding and prosodic cues. In the future, we will scale SocialOmni to multi-turn interaction trajectories, incorporate human evaluation for pragmatically subtle cases, and extend modality coverage to prosody- and gesture-aware assessments.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: A visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
[2] Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, and Shinji Watanabe. Talking turns: Benchmarking audio foundation models on turn-taking dynamics. arXiv preprint arXiv:2503.01174, 2025.
[3] Jianghan Chao, Jianzhang Gao, Wenhui Tan, Yuchong Sun, Ruihua Song, and Liyun Ru. JointAVBench: A benchmark for joint audio-visual reasoning evaluation. arXiv preprint arXiv:2512.12772, 2025.
[4] Wang Chen, Yongdong Luo, Yuhui Zeng, Luojun Lin, Tianyu Xie, Fei Chao, Rongrong Ji, and Xiawu Zheng. Event-anchored frame selection for effective long-video understanding. arXiv preprint arXiv:2603.00983, 2026.
[5] Wang Chen, Yuhui Zeng, Yongdong Luo, Tianyu Xie, Luojun Lin, Jiayi Ji, Yan Zhang, and Xiawu Zheng. Wavelet-based frame selection by detecting semantic boundary for long video understanding. arXiv preprint arXiv:2603.00512, 2026.
[6] Sanjoy Chowdhury, Karren Dai Yang, Xudong Liu, Fartash Faghri, Pavan Kumar Anasosalu Vasu, Oncel Tuzel, Dinesh Manocha, Chun-Liang Li, and Raviteja Vemulapalli. AMUSE: Audio-visual benchmark and alignment framework for agentic multi-speaker understanding. arXiv preprint arXiv:2512.16250, 2025.
[7] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[8] Bradley Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1-26, 1979.
[9] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108-24118, 2025.
[10] Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, et al. VITA-1.5: Towards GPT-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957, 2025.
[11] Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, and Shiliang Zhang. FunASR: A fundamental end-to-end speech recognition toolkit. arXiv preprint arXiv:2305.11013, 2023.
[12] Google. Gemini 3: Introducing the latest Gemini AI model from Google. https://blog.google/products-and-platforms/products/gemini/gemini-3/, 2025. Google Blog, Nov 18, 2025. Accessed: 2026-03-01.
[13] Google AI for Developers. Release notes, Gemini API. https://ai.google.dev/gemini-api/docs/changelog, 2026. Documents launch/update records for gemini-3-pro-preview and gemini-3-flash-preview. Accessed: 2026-03-01.
[14] Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. WorldSense: Evaluating real-world omnimodal understanding for multimodal LLMs. arXiv preprint arXiv:2502.04326, 2025.
[15] Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, and Rongrong Ji. Determining layer-wise sparsity for large language models through a theoretical perspective. arXiv preprint arXiv:2502.14770, 2025.
[16] Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji, and Liujuan Cao. Discovering important experts for mixture-of-experts models pruning through a theoretical perspective. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[17] Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Yang Liu, Jing Lin, Yiwu Yao, and Rongrong Ji. Dynamic low-rank sparse adaptation for large language models. arXiv preprint, 2025.
[18] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[19] Shixin Jiang, Jiafeng Liang, Jiyuan Wang, Xuan Dong, Heng Chang, Weijiang Yu, Jinhua Du, Ming Liu, and Bing Qin. From specific-MLLMs to omni-MLLMs: A survey on MLLMs aligned with multi-modalities. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8617-8652, 2025.
[20] Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, and Xue Feng. SIV-Bench: A video benchmark for social interaction understanding and reasoning. arXiv preprint arXiv:2506.05425, 2025.
[21] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. SEED-Bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092, 2023.
[22] Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Jiafu Tang, Zhen Song, Dingling Zhang, Yinghui He, Haoxian Liu, Yuxuan Wang, Qiufeng Wang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Kefeng Wang, Runzhe Wen, Ying Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhao-Hui Zhang, and Jiaheng Liu. OmniVideoBench: Towards audio-visual understanding evaluation for omni MLLMs.
arXiv preprint arXiv:2510.10689, 2025.
[23] Danyang Li, Tianhao Wu, Bin Lin, Zhenyuan Chen, Yang Zhang, Yuxuan Li, Ming-Ming Cheng, and Xiang Li. WOW-Seg: A word-free open world segmentation model. In The Fourteenth International Conference on Learning Representations, 2026.
[24] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195-22206, 2024.
[25] Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, et al. Baichuan-Omni-1.5 technical report. arXiv preprint arXiv:2501.15368, 2025.
[26] Yizhi Li, Ge Zhang, Yi Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Jian Yang, Siwei Wu, Xingwei Qu, Jinjie Shi, Xinyue Zhang, Zhen Yang, Xiangzhou Wang, Zhaoxiang Zhang, Zachary Liu, Emmanouil Benetos, Wenhao Huang, and Chenghua Lin. OmniBench: Towards the future of universal omni-language models. arXiv preprint arXiv:2409.15272, 2024.
[27] Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, and Hung-yi Lee. Full-Duplex-Bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities. arXiv preprint arXiv:2503.04721, 2025.
[28] Zhaojiang Lin, Yong Xu, Kai Sun, Jing Zheng, Yin Huang, Surya Teja Appini, Krish Narang, Renjie Tao, Ishan Kapil Jain, Siddhant Arora, et al. WearVox: An egocentric multichannel voice assistant benchmark for wearables. arXiv preprint arXiv:2601.02391, 2025.
[29] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[30] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023.
[31] Yuanzhan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, 2023.
[32] Yongdong Luo, Xiawu Zheng, Guilin Li, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, et al. Video-RAG: Visually-aligned retrieval-augmented long video comprehension. arXiv preprint arXiv:2411.13093, 2024.
[33] Yongdong Luo, Wang Chen, Xiawu Zheng, Weizhong Huang, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Jiebo Luo, et al. QuoTA: Query-oriented token assignment via CoT query decouple for long video comprehension. arXiv preprint arXiv:2503.08689, 2025.
[34] Yuexiao Ma, Taisong Jin, Xiawu Zheng, Yan Wang, Huixia Li, Yongjian Wu, Guannan Jiang, Wei Zhang, and Rongrong Ji. OMPQ: Orthogonal mixed precision quantization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9029-9037, 2023.
[35] Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. AffineQuant: Affine transformation quantization for large language models. arXiv preprint arXiv:2403.12544, 2024.
[36] Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. Outlier-aware slicing for post-training quantization in vision transformer. In Forty-first International Conference on Machine Learning, 2024.
[37] Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, et al. Flow caching for autoregressive video generation. arXiv preprint arXiv:2602.10825, 2026.
[38] Leena Mathur, Marian Qian, Paul Pu Liang, and Louis-Philippe Morency. Social Genome: Grounded social reasoning abilities of multimodal models. In Conference on Empirical Methods in Natural Language Processing, 2025.
[39] Leyi Pan, Zheyu Fu, Yunpeng Zhai, Shuchang Tao, Sheng Guan, Shiyu Huang, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Felix Henry, Lijie Wen, and Aiwei Liu. Omni-SafetyBench: A benchmark for safety evaluation of audio-visual large language models. arXiv preprint arXiv:2508.07173, 2025.
[40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763, 2021.
[41] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 28492-28518, 2023.
[42] Harvey Sacks, Emanuel A. Schegloff, and Gail Jefferson. A simplest systematics for the organization of turn-taking for conversation. Language, 50(4):696-735, 1974.
[43] Ramaneswaran Selvakumar, Ashish Seth, Nishit Anand, Utkarsh Tyagi, Sonal Kumar, Sreyan Ghosh, and Dinesh Manocha. MultiVox: A benchmark for evaluating voice assistants for multimodal interactions. arXiv preprint arXiv:2507.10859, 2025.
[44] Gabriel Skantze. Turn-taking in conversational systems and human-robot interaction: A review. Computer Speech & Language, 67:101178, 2021.
[45] Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy Chen. AudioBench: A universal benchmark for audio large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4297-4316, 2025.
[46] Ke Wang, Houxing Ren, Zimu Lu, Mingjie Zhan, and Hongsheng Li. VoiceAssistant-Eval: Benchmarking AI assistants across listening, speaking, and viewing. arXiv preprint arXiv:2509.22651, 2025.
[47] Yuxuan Wang, Yueqian Wang, Borun Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng.
OmniMMI: A comprehensive multi-modal interaction benchmark in streaming video contexts. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[48] Zhifei Xie and Changqiao Wu. Mini-Omni2: Towards open-source GPT-4o with vision, speech and duplex capabilities. arXiv preprint arXiv:2410.11190, 2024.
[49] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025.
[50] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, and Junyang Lin. Qwen3-Omni technical report. arXiv preprint arXiv:2509.17765, 2025.
[51] Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, et al. OmniVinci: Enhancing architecture and data for omni-modal understanding LLM. arXiv preprint arXiv:2510.15870, 2025.
[52] Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, and Chao Zhang. SALMONN-omni: A codec-free LLM for full-duplex speech understanding and generation. arXiv preprint arXiv:2411.18138, 2024.
[53] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv preprint arXiv:2311.16502, 2023.
[54] Danyang Zhang, Junhao Song, Ziqian Bi, Yingfang Yuan, Tianyang Wang, Joe Yeong, and Junfeng Hao. Mixture of experts in large language models. arXiv preprint arXiv:2507.11181, 2025.
[55] Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, and Xiang Li. UniChange: Unifying change detection with multimodal large language model. arXiv preprint arXiv:2511.02607, 2025.
[56] Yang Zhang, Danyang Li, Yuxuan Li, Xin Zhang, Tianyu Xie, Mingming Cheng, and Xiang Li. Crystal: Spontaneous emergence of visual latents in MLLMs. arXiv preprint arXiv:2602.20980, 2026.
[57] Yiman Zhang, Ziheng Luo, Qiangyu Yan, Wei He, Borui Jiang, Xinghao Chen, and Kai Han. OmniEval: A benchmark for evaluating omni-modal models with visual, auditory, and textual inputs. arXiv preprint arXiv:2506.20960, 2025.
[58] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
[59] Xiawu Zheng, Yuexiao Ma, Teng Xi, Gang Zhang, Errui Ding, Yuchao Li, Jie Chen, Yonghong Tian, and Rongrong Ji. An information theory-inspired strategy for automatic network pruning. arXiv preprint arXiv:2108.08532, 2021.
[60] Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: Benchmarking multi-task long video understanding. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13691–13701, 2025. URL https://openaccess.thecvf.com/content/CVPR2025/html/Zhou_MLVU_Benchmarking_Multi-task_Long_Video_Understanding_CVPR_2025_paper.html.
[61] Ziwei Zhou, Rui Wang, and Zuxuan Wu. Daily-Omni: Towards audio-visual reasoning with temporal alignment across modalities. arXiv preprint arXiv:2505.17862, 2025.

Appendix

Appendix Table of Contents
Appendix A  Additional Method Details for SocialOmni
A.1 Inter-Annotator Agreement
A.2 Domain and Subcategory Definitions
A.3 Option Balance in the Perception MCQ
A.4 Consistency Labeling and Boundary Cases
A.5 Generation Subset Size (209 Items)
A.6 Q1 Step Size and Temporal Granularity
A.7 Q1 Timing-Label Mapping
A.8 Delta-Window Binary Metrics for Q1
A.9 Multiple References for Generation Calibration
A.10 Generation Judging Scope and Visual Grounding
A.11 Judge Configuration, Disagreement, and Tie Statistics
A.12 Modality Ablation Implementation
A.13 Reproducibility
A.14 Prompt Templates and Parsing Rules
A.15 Perception Results and Macro Metrics
A.16 Statistical Definitions
A.17 Human Feedback on a Challenging Subset
A.18 Human Feedback Discussion
A.19 Representative Failure Cases

A  Additional Method Details for SocialOmni

A.1 Inter-Annotator Agreement
This subsection reports detailed inter-annotator agreement statistics. We use raw percent agreement as the primary IAA metric. Percent agreement is directly interpretable for our annotation types (4-way perception labels and binary consistency labels), and all disagreements are resolved by a senior reviewer via adjudication. We release adjudication flags and, where licensing permits, brief rationales and error categories alongside the final labels. Chance-corrected measures (e.g., Cohen's κ) are left for future work.

A.2 Domain and Subcategory Definitions
SocialOmni organizes 15 dialogue subcategories into four domains used for benchmark stratification: entertainment, professional, daily life, and narrative. Entertainment covers interactive media formats (talk shows, podcasts). Professional covers task-oriented or formal settings (interviews, debates). Daily life covers naturally occurring everyday conversations (family interactions, street dialogue). Narrative covers scripted conversational scenes from movies and drama clips. Domain labels are used for split balancing and per-domain analysis; exact source metadata and clip-level assignments are released subject to licensing constraints.

A.3 Option Balance in the Perception MCQ
The correct-option distribution across the 2,000 perception items is A: 569, B: 561, C: 453, D: 417. This imbalance reflects natural speaker prominence and camera-framing biases in real dialogue footage rather than annotation artifacts. We deliberately avoid artificial rebalancing, which would distort real-world statistics and introduce selection bias. To ensure fairness, we report per-option accuracy and detailed confusion matrices, and stratify all analyses by domain, consistency split, and model type.
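To make this reporting concrete, the snippet below is a minimal Python sketch of how per-option accuracy and a 4-way confusion matrix can be computed from parsed predictions. It is not the released evaluation code; the function name and the treatment of non-parsable outputs as an explicit "unparsable" column are our own assumptions.

```python
# Minimal sketch (illustrative, not the released evaluation code):
# per-option accuracy and a 4-way confusion matrix for the who MCQ.
from collections import Counter, defaultdict

OPTIONS = ["A", "B", "C", "D"]

def per_option_accuracy(gold, pred):
    """gold, pred: lists of option letters; non-parsable predictions may be None."""
    correct = Counter()
    total = Counter()
    confusion = defaultdict(Counter)  # confusion[gold_letter][pred_letter] -> count
    for g, p in zip(gold, pred):
        total[g] += 1
        if p == g:
            correct[g] += 1
        confusion[g][p if p in OPTIONS else "unparsable"] += 1
    acc = {o: (correct[o] / total[o] if total[o] else 0.0) for o in OPTIONS}
    return acc, confusion
```

The same routine, applied to the full 2,000-item gold and prediction lists, yields the per-option accuracies and confusion matrices reported with the stratified analyses.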
A.4 Consistency Labeling and Boundary Cases
Each consistency label is assigned by one annotator, independently verified by a second reviewer, and adjudicated on disagreement. Reviewers must cite visible evidence at timestamp $t$ (face visibility, clothing cues, on-screen positioning) to justify the label. Representative boundary cases include: (i) the active speaker is partially visible (small on-screen area); (ii) the speaker is visible but identity cues are weak (heavy occlusion); and (iii) reaction shots with off-screen speech. Illustrative examples appear in the supplementary figures; label rationales and complete annotation guidelines are released where licensing permits.

A.5 Generation Subset Size (209 Items)
This subsection explains the size choice for the 209-item generation split. The generation subset is kept relatively small to prioritize comparability over scale. Open-ended dialogue-continuation annotation introduces variance along three axes: (1) decision-point selection, (2) reference quality, and (3) judge sensitivity. Scaling without tight control can amplify evaluation noise and hinder fair cross-model comparison. Our protocol fixes prompts, scoring rubrics, and judges, and includes multi-reference calibration (§A.9) to make variance explicit.

A.6 Q1 Step Size and Temporal Granularity
This subsection explains the temporal granularity used for Q1. The perception task uses frame-level timestamps (30 fps) to evaluate who. For streaming Q1, we query at a 1 s step as a compute-stable approximation of real-time turn entry when benchmarking many models. This trades temporal resolution for evaluation cost. Our timing categories use multi-second windows (e.g., "perfect" spans $[-1, 2]$ s), which reduces sensitivity to small step-size changes in typical settings. We encourage future work to adopt finer steps (e.g., 0.5 s) when compute permits and to report step-size sensitivity explicitly.

A.7 Q1 Timing-Label Mapping
This subsection defines the timing labels used for Q1 evaluation. Let $\Delta\tau_i = \hat{\tau}_i - \tau^{\star}_i$ denote the response offset for item $i$. The timing label $c_i$ is assigned from this offset: responses are labeled Interrupted if $\Delta\tau_i < -\theta_1$, Perfect if $-\theta_1 \le \Delta\tau_i \le \theta_2$, Delayed if $\theta_2 < \Delta\tau_i \le \theta_3$, TooLate if $\Delta\tau_i > \theta_3$, and NoResponse if $\hat{\tau}_i = \emptyset$. The default thresholds are $(\theta_1, \theta_2, \theta_3) = (1, 2, 5)$ s.

A.8 Delta-Window Binary Metrics for Q1
This subsection defines the binary metrics used for tolerance-window evaluation in Q1. To align with prior binary turn-taking formulations, we define the tolerance-window target at decision time $t$ as:

$$ y^{(\delta)}_{1,t} = \mathbb{1}\{\, 0 < \tau_X - t \le \delta \,\}, \qquad (1) $$

where $\tau_X$ is the ground-truth turn-entry time of the candidate speaker. We evaluate with $\delta \in \{0.2, 0.5, 1.0\}$ s and report:

$$ \mathrm{Prec} = \frac{TP}{TP + FP}, \quad \mathrm{Rec} = \frac{TP}{TP + FN}, \quad \mathrm{F1} = \frac{2\,\mathrm{Prec}\cdot\mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}. \qquad (2) $$

Undefined ratios are set to 0 for stable aggregation. These metrics are reported alongside offset-based diagnostics.
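To make the mapping and metrics above concrete, the following is a minimal Python sketch of the timing-label assignment (§A.7) and the tolerance-window precision/recall/F1 (§A.8). The thresholds and the undefined-ratio convention follow the definitions above; the function and variable names are illustrative and not part of the released toolkit.

```python
# Illustrative sketch of the Q1 timing-label mapping (A.7) and the
# tolerance-window binary metrics (A.8). Names are ours; thresholds are the defaults.

def timing_label(pred_time, gold_time, theta=(1.0, 2.0, 5.0)):
    """Map a predicted turn-entry time (seconds) to a timing label."""
    if pred_time is None:
        return "NoResponse"
    t1, t2, t3 = theta
    delta = pred_time - gold_time
    if delta < -t1:
        return "Interrupted"
    if delta <= t2:
        return "Perfect"
    if delta <= t3:
        return "Delayed"
    return "TooLate"

def window_f1(decisions, gold_times, query_times, delta=0.2):
    """Binary tolerance-window metrics. At query time t, the target is 1
    iff 0 < tau_X - t <= delta; `decisions` are the model's YES/NO answers."""
    tp = fp = fn = 0
    for yes, tau_x, t in zip(decisions, gold_times, query_times):
        target = 0.0 < (tau_x - t) <= delta
        if yes and target:
            tp += 1
        elif yes and not target:
            fp += 1
        elif (not yes) and target:
            fn += 1
    prec = tp / (tp + fp) if (tp + fp) else 0.0  # undefined ratios -> 0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1
```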
A.9 Multiple References for Generation Calibration
This subsection describes the multi-reference calibration used for generation evaluation. Dialogue continuation is inherently multi-solution. For a fixed subset of $K_{mr} = 30$ tasks, we collect multiple semantically equivalent reference rewrites from annotators. These references calibrate judge tolerance to diverse valid continuations; we report score variance across references.

A.10 Generation Judging Scope and Visual Grounding
This subsection clarifies what is covered by generation judging and how visual grounding is examined. The generation task targets interruption-continuation quality (appropriateness, coherence, pragmatics), which is primarily determined by dialogue context. Visual cues affect who attribution and when decisions; these are probed via (1) the inconsistency split, (2) modality ablations, and (3) a visually grounded subset. We provide a subset where the candidate response is required to reference a visible event or entity; judges are instructed to penalize hallucinated visual references. Performance on this subset is reported separately, and the corresponding prompts are released.

A.11 Judge Configuration, Disagreement, and Tie Statistics
This subsection summarizes the generation judges, score interpretation, and disagreement statistics. The three judges are GPT-4o, Gemini 3 Pro, and Qwen3-Omni. Each judge outputs a single score in {25, 50, 75, 100} under deterministic decoding ($\tau_{dec} = 0$ where supported) with fixed prompts to minimize prompt and sampling variance. A score of 100 denotes a fluent response that is grounded in context and pragmatically appropriate. A score of 75 is used for responses that are generally appropriate but remain somewhat generic or incomplete. A score of 50 indicates partial relevance together with noticeable grounding or coherence problems. A score of 25 is assigned to responses that are irrelevant, contradictory, overly generic, or pragmatically inappropriate. A coarse discrete scale is used to improve stability and reduce judge variance. We report tie rates (the fraction of samples with identical aggregated scores across models) and rank-discrimination statistics (e.g., Kendall's $\tau$ between judge and aggregated rankings). A large-gap event is defined as $|s^{(a)} - s^{(b)}| \ge 20$, corresponding to at least one near-step disagreement on the 4-level scale. The threshold of 20 (slightly below the 25-point step) is intended to capture near-step disagreements while reducing sensitivity to minor judge calibration drift. We report large-gap frequency and its association with ambiguous contexts.

A.12 Modality Ablation Implementation
This subsection describes how the modality ablations are implemented. For the video ablation, we preserve the original audio waveform and replace the video stream with the static first frame replicated at the original frame rate, keeping the input container and frame count unchanged while removing visual dynamics. For the audio ablation, we preserve the original video frames and replace the audio waveform with zeros (silence), removing acoustic cues without altering video timing. This design avoids out-of-distribution artifacts (e.g., random noise) while keeping the model interface identical, isolating each modality's contribution.
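The sketch below illustrates the two ablations on decoded arrays. It assumes the inputs are already decoded into a frame array and a waveform array, which is a simplification of the actual pipeline operating on encoded audio-visual files; it is meant only to convey the intended semantics (frozen first frame, zeroed waveform), not the exact implementation.

```python
# Illustrative sketch of the two modality ablations on decoded arrays.
# Assumption: `frames` is a (T, H, W, 3) uint8 array and `waveform` a (N,) float array;
# the benchmark itself works on encoded files, so this is only a conceptual analogue.
import numpy as np

def video_static_ablation(frames: np.ndarray) -> np.ndarray:
    """Replace every frame with the first frame, keeping the frame count unchanged."""
    first = frames[:1]                       # shape (1, H, W, 3)
    return np.repeat(first, len(frames), axis=0)

def audio_silent_ablation(waveform: np.ndarray) -> np.ndarray:
    """Replace the waveform with silence of identical length and dtype."""
    return np.zeros_like(waveform)
```

Both variants preserve container-level timing (same frame count, same sample count), so only the informative content of one modality is removed.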
A.13 Reproducibility
This subsection lists the metadata and prompts released for reproduction. Subject to licensing constraints, we release the metadata required to reproduce each sample: video identifiers/URLs where permitted, timestamp $t$, candidate speaker $X$ for generation items, aligned transcript segments, consistency labels, and adjudication flags. Evaluation prompts and judge prompts for the generation task are also released.

A.14 Prompt Templates and Parsing Rules
This subsection summarizes the prompt templates and the corresponding parsing constraints. We use fixed prompt cards with strict output-parsing constraints to reduce prompt-induced variance across heterogeneous APIs and open-source checkpoints. For who, only a single option letter in {A, B, C, D} is accepted. For when, only unambiguous YES/NO outputs are accepted after normalization. For how, non-empty continuations are retained for judging; empty responses are counted as no-response.

SocialOmni Prompt Cards (who–when–how)

Who (Perception). System: You are a precise video-audio reasoning assistant. You must answer ONLY with the option letter (A, B, C, or D).

When (Q1 Decision). You are a conversation participant watching a video. Based on what you see, answer: Is it your turn to speak now? YES or NO.

How (Generation). You are another participant in this conversation. Watch the video carefully. When the other person finishes speaking and it is your turn, respond naturally in English. Do not interrupt while they are still speaking.

A.15 Perception Results and Macro Metrics
This subsection reports the detailed perception results (Table 5) and the macro metrics used alongside accuracy.

Table 5. SocialOmni perception task results (2,000 items). $\Delta_{cons} = \mathrm{Acc}_{cons} - \mathrm{Acc}_{incons}$.
Model | Overall (%) | Cons. (%) | Incons. (%) | $\Delta_{cons}$ (%)
GPT-4o [18] | 36.75 | 38.14 | 28.00 | +10.1
Gemini 2.5 Pro [7] | 44.69 | 45.88 | 37.23 | +8.7
Gemini 2.5 Flash [7] | 47.03 | 48.52 | 37.59 | +10.9
Gemini 3 Flash Preview [12] | 53.23 | 53.66 | 50.55 | +3.1
Gemini 3 Pro Preview [12] | 64.99 | 66.24 | 57.04 | +9.2
Qwen3-Omni [50] | 69.25 | 69.97 | 64.73 | +5.2
Qwen3-Omni-Thinking [50] | 54.55 | 53.74 | 59.64 | −5.9
Qwen2.5-Omni [49] | 36.75 | 36.46 | 38.55 | −2.1
OmniVinci [51] | 15.15 | 15.01 | 16.00 | −1.0
VITA-1.5 [10] | 36.95 | 36.81 | 37.82 | −1.0
Baichuan-Omni-1.5 [25] | 25.65 | 25.97 | 23.64 | +2.3
MiniOmni2* [48] | 16.72 | 17.57 | 4.55 | +13.0

$\Delta_{cons}$ is a useful robustness indicator and should be interpreted jointly with overall accuracy. Gemini 2.5 Pro and Gemini 3 Pro differ in overall accuracy (44.69% vs. 64.99%) yet exhibit comparable consistency gaps (+8.7 and +9.2), showing that higher absolute accuracy does not eliminate split-specific brittleness. Qwen3-Omni-Thinking shows a negative gap (−5.9%), i.e., lower accuracy on the consistent split than on the inconsistent split. This pattern is consistent with the possibility that more deliberative reasoning can interfere with immediate cue integration in mismatch-heavy scenes, so the result may reflect more than random variation. For $C = 4$ perception classes, macro-averaged F1 is computed over the parsable subset $\mathcal{P}$:

$$ \mathrm{F1}_{macro} = \frac{1}{C}\sum_{c=1}^{C} \frac{2\,\mathrm{Pr}_c\,\mathrm{Re}_c}{\mathrm{Pr}_c + \mathrm{Re}_c}, \qquad (3) $$

where $\mathrm{Pr}_c$ and $\mathrm{Re}_c$ are per-class precision and recall. Non-parsable outputs are excluded from $\mathcal{P}$ but counted as incorrect for top-1 accuracy (defined in Sec. 3.4).

A.16 Statistical Definitions
This subsection collects the statistical definitions used in the appendix analyses. We first define the generation aggregation:

$$ \bar{s}_i = \frac{1}{J}\sum_{j=1}^{J} s^{(j)}_i, \qquad J = 3, \qquad (4) $$

$$ \mathcal{G} = \{\, i \in \{1, \ldots, N_{gen}\} : \hat{y}_{1,i} = 1 \;\wedge\; \hat{s}_i \neq \emptyset \,\}. \qquad (5) $$

We then define the cross-task association statistics:

$$ r = \frac{\sum_{m \in \mathcal{M}} (a_m - \bar{a})(q_m - \bar{q})}{\sqrt{\sum_m (a_m - \bar{a})^2}\,\sqrt{\sum_m (q_m - \bar{q})^2}}, \qquad (6) $$

$$ p = \frac{1 + \sum_{b=1}^{B_{perm}} \mathbb{1}\big[\,|r^{(b)}| \ge |r|\,\big]}{1 + B_{perm}}, \qquad (7) $$

$$ \mathrm{CI}_{0.95}(r) = \Big[\, Q_{0.025}\{r^{\star(b)}\},\; Q_{0.975}\{r^{\star(b)}\} \,\Big]. \qquad (8) $$

Finally, we summarize the evaluation metrics used in the appendix. For who, we report top-1 accuracy on the full set and on consistent/inconsistent splits, with macro precision, recall, and F1 computed over the parsable subset $\mathcal{P}$; non-parsable outputs are treated as incorrect. For when (Q1), we adopt a primary tolerance window $\delta = 0.2$ s around each annotated boundary and report accuracy, precision, recall, and F1. To examine timing biases, we decompose predictions into Early/On-time/Late (E/O/L) phases over the responded subset $\mathcal{R} = \{\, i : \hat{\tau}_i \neq \emptyset \,\}$:

$$ p_E = \frac{|\{ i \in \mathcal{R} : \hat{\tau}_i < \tau^{\star}_i - \delta \}|}{N_R}, \quad p_O = \frac{|\{ i \in \mathcal{R} : |\hat{\tau}_i - \tau^{\star}_i| \le \delta \}|}{N_R}, \quad p_L = \frac{|\{ i \in \mathcal{R} : \hat{\tau}_i > \tau^{\star}_i + \delta \}|}{N_R}, \qquad (9) $$

with no-response rate $p_{NR} = 1 - N_R / N_{gen}$. We sweep $\delta \in \{0.2, 0.5, 1.0\}$ s; full results appear in the supplementary material. For how, each judge assigns $s^{(k)}_i \in \{25, 50, 75, 100\}$ under deterministic decoding; the item-level aggregate is $g_i = \frac{1}{3}\sum_{k=1}^{3} s^{(k)}_i$, and the model-level score is $\bar{g} = \frac{1}{N_{speak}}\sum_i g_i$, where $N_{speak}$ is the number of items for which the model decides to speak. Unless stated otherwise, all confidence intervals are 95% bootstrap CIs with $B = 10{,}000$ replicates [8] using the percentile method $[\theta^{*}_{(0.025)}, \theta^{*}_{(0.975)}]$.
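For concreteness, the following is a minimal Python sketch of the resampling procedures defined above: the percentile bootstrap CI of Eq. (8) and the permutation p-value of Eq. (7) for the Pearson correlation of Eq. (6). The numpy-based details (random generator, default replicate counts) are implementation assumptions rather than the released scripts.

```python
# Illustrative sketch of the resampling statistics defined above:
# percentile bootstrap CI and a two-sided permutation p-value for Pearson r.
import numpy as np

def pearson_r(a, q):
    a, q = np.asarray(a, float), np.asarray(q, float)
    a, q = a - a.mean(), q - q.mean()
    return float((a * q).sum() / np.sqrt((a ** 2).sum() * (q ** 2).sum()))

def bootstrap_ci(a, q, n_boot=10_000, seed=0):
    """95% percentile bootstrap CI for Pearson r, resampling items with replacement."""
    rng = np.random.default_rng(seed)
    a, q = np.asarray(a, float), np.asarray(q, float)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # one bootstrap resample of item indices
        stats.append(pearson_r(a[idx], q[idx]))
    return float(np.quantile(stats, 0.025)), float(np.quantile(stats, 0.975))

def permutation_p(a, q, n_perm=10_000, seed=0):
    """Two-sided permutation p-value with add-one smoothing, as in Eq. (7)."""
    rng = np.random.default_rng(seed)
    r_obs = abs(pearson_r(a, q))
    count = sum(abs(pearson_r(a, rng.permutation(q))) >= r_obs for _ in range(n_perm))
    return (1 + count) / (1 + n_perm)
```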
A.17 Human Feedback on a Challenging Subset
This subsection reports an additional human-feedback analysis without changing the claims in the main paper. We follow the same who/when/how axes as the benchmark protocol. The verification subset contains 200 items for who, 200 items for when (Q1), and 50 judged items for how (Q2). Items were selected from cases on which current models often fail. For this reason, this subset is used to examine failure modes rather than as an IID substitute for the full benchmark. The asymmetric size for how is intentional. Evaluating open-ended dialogue response quality requires careful judgment of empathy, grounding, and social appropriateness, so we use a smaller high-precision set. Each how item requires longer review and stricter consistency checks than binary who/when judgments; annotation effort is thus allocated to depth and quality control.

Table 6 reports two references: (i) full-benchmark model anchors and (ii) a same-subset model/human comparison. Full-benchmark rows serve only as scale anchors and are not used to estimate a human baseline on the full benchmark. Human performance on the selected subset reaches 72.50% on who, 80.00% on when, and 55.15/100 on how. The comparison is descriptive: even on cases that are hard for current models, human performance remains higher. Figure 6 shows the same comparison with consistent color semantics across axes.

Table 6. Human feedback on the selected subset, with full-benchmark model references for scale.
Group | Who (%) | When Q1 (%) | How Q2 (/100)
Full benchmark (model mean) | 41.81 | 53.36 | 55.75
Full benchmark (best model) | 69.25 | 66.99 | 85.08
Model mean on selected subset | 1.25 | 12.50 | 0.00
Human on selected subset | 72.50 | 80.00 | 55.15

Figure 6. Selected-subset human feedback across who/when/how. The pale interval spans the full-benchmark mean to the best reported model; the dark connector links the same-subset model mean and the selected-subset human score.

Table 7 and Figure 7 summarize correlation statistics. The table reports Pearson $r$ and Spearman $\rho$ with $p$-values, 95% bootstrap confidence intervals, and sample size $n$. The negative item-level association on when (human vs. ensemble) is interpreted as a signal of shallow heuristics: models may rely too much on local acoustic cues (e.g., brief pause-like gaps) and too little on the discourse-level completion cues used by human raters. For the when model-level comparison, the aligned sample remains limited ($n = 11$), so those coefficients should be interpreted as exploratory rather than definitive.

Table 7. Correlation statistics on the selected challenging subset. The negative item-level association on when reflects a divergence in which models are often misled by deceptive acoustic cues.
Setting | Axis | n | Pearson r | p_r | 95% CI (r) | Spearman ρ | p_ρ | 95% CI (ρ)
Model-level (full vs. subset) | Who | 12 | 0.5249 | 0.0797 | [0.0811, 0.8799] | 0.5388 | 0.0707 | [0.0518, 0.9149]
Model-level (full vs. subset) | When | 11 | −0.4332 | 0.4663 | [−1.0000, 1.0000] | −0.2000 | 0.7471 | [−1.0000, 1.0000]
Item-level (human vs. ensemble) | Who | 200 | −0.2117 | 0.1898 | [−0.5447, 0.1534] | −0.2117 | 0.1898 | [−0.5447, 0.1534]
Item-level (human vs. ensemble) | When | 200 | −0.4663 | 0.0382 | [−0.7003, −0.1667] | −0.4405 | 0.0519 | [−0.6692, −0.1667]
Figure 7. Correlation estimates on the selected challenging subset. Each point shows the estimated correlation; horizontal segments show the corresponding 95% confidence interval. The dashed vertical line marks zero association.

A.18 Human Feedback Discussion
This subsection discusses how the human-feedback results should be read. One finding is the negative item-level association between human judgments and model ensembles on the when axis (Pearson $r = -0.4663$, $p = 0.0382$; Spearman $\rho = -0.4405$, $p = 0.0519$). We interpret this pattern as evidence of shallow heuristic over-reliance: in challenging social scenes, current Omni-LLMs tend to trigger turn-taking decisions from surface acoustic cues (e.g., brief silence-like gaps or local energy drops), while human raters rely on higher-order semantic completion and pragmatic intent. Because the selected subset intentionally includes pseudo-silence cases (pauses inside an unfinished turn), the model decision tendency becomes inversely aligned with human-grounded timing labels. This result is consistent with the view that even competitive systems still lack stable audiovisual-semantic binding for distinguishing a pause-for-thought from true turn completion. The negative association is therefore consistent with a gap in social-temporal reasoning that standard IID-style evaluation may show less clearly.

A.19 Representative Failure Cases
This subsection presents representative qualitative examples for the three benchmark axes. We present three failure cases, one for each benchmark axis. Each case links the benchmark target, representative model outputs, and the failure pattern visible in the visual and textual evidence.

For who, Figure 8 shows a Level-1 item asking who speaks between 0:05 and 0:07. The correct answer is the man on the right saying "That's so true." The figure combines a dominant left-side close-up, a later right-side context frame, and the continuous audio track. Under this combination of visual and audio cues, all checked models fail: GPT-4o predicts option C, Qwen3-Omni predicts option A, and multiple Gemini variants predict option D. The error pattern is that models follow the most salient visible face rather than the audiovisual speaker binding. This case illustrates a saliency-driven speaker-attribution error.

For when, Figure 9 presents a Level-2 Q1 item asking whether the woman speaks at the 22nd second. The annotated answer is No. The figure shows adjacent frames from the same ongoing turn together with a pause-like waveform gap, while the ASR transcript remains semantically unfinished: "orange juice......a grapefruit..." GPT-4o is correct, but Gemini 3 Flash, Gemini 3 Pro, Gemini 2.5 Flash, and Gemini 2.5 Pro all answer Yes. The failure pattern is premature triggering: a brief acoustic gap is treated as turn completion even though the utterance is still pragmatically open. This case reflects shallow silence-gap reasoning rather than more complete turn-completion modeling.

Figure 8. Who failure case. The visually dominant frame favors the wrong speaker; the correct answer requires cross-modal speaker binding. The case illustrates a saliency-driven attribution error.
Figure 9. When failure case. The waveform contains a pause-like gap, but the turn remains unfinished. The case illustrates premature turn triggering from shallow silence-gap cues.

Figure 10. How failure case. The dialogue establishes an interpersonal context, but the response remains generic rather than empathetic. The case illustrates context-response decoupling in generation.

For how, Figure 10 shows a Level-2 Q2 generation item. The scene is interpersonal, and the transcript centers on collateral, a guarantor, and the speaker's discomfort with asking family for help. The reference continuation is empathetic: "I understand your feelings about it, Richard." In contrast, several models produce generic problem-solving replies: GPT-4o ("We need to find another solution."), Gemini 3 Flash ("But what are we going to do?"), Gemini 3 Pro ("But is there any other way?"), and Gemini 2.5 Pro ("There has to be another way."), all scored 0 by the LLM judge protocol. Only Gemini 2.5 Flash matches the reference and receives 100. This case shows context-response decoupling: recognizing the topic does not ensure a grounded or socially appropriate continuation.

These examples are broadly aligned with the quantitative findings in the main paper and the human-feedback appendix. Across the three axes, the errors include incorrect cross-modal identity binding for who, premature turn triggering for when, and a weakly grounded continuation for how.