CAN HIERARCHICAL CROSS-MODAL FUSION PREDICT HUMAN PERCEPTION OF AI DUBBED CONTENT?

Ashwini Dasare, Nirmesh Shah, Ashishkumar Gudmalwar, Pankaj Wasnik
Sony Research India
Email: {nirmesh.shah, ashish.gudmalwar1, pankaj.wasnik}@sony.com

ABSTRACT

Evaluating AI-generated dubbed content is inherently multi-dimensional, shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion Scores (MOS) remain the gold standard but are costly and impractical at scale. We present a hierarchical multimodal architecture for perceptually meaningful dubbing evaluation, integrating complementary cues from audio, video, and text. The model captures fine-grained features such as speaker identity, prosody, and content from audio, facial expressions and scene-level cues from video, and semantic context from text, which are progressively fused through intra- and inter-modal layers. Lightweight LoRA adapters enable parameter-efficient fine-tuning across modalities. To overcome limited subjective labels, we derive Proxy MOS by aggregating objective metrics with weights optimized via active learning. The proposed architecture was trained on 12k Hindi-English bidirectional dubbed clips, followed by fine-tuning with human MOS. Our approach achieves strong perceptual alignment (PCC > 0.75), providing a scalable solution for automatic evaluation of AI-dubbed content.

Index Terms: AI dubbing evaluation, active learning, hierarchical fusion, multimodal, proxy MOS, MOS

1. INTRODUCTION

AI-based dubbing has advanced rapidly with progress in neural machine translation (NMT), text-to-speech (TTS), and audio-visual (AV) synchronization [1, 2]. Despite these developments, assessing the quality of dubbed content remains an open problem [3]. Current evaluation methods focus on isolated dimensions such as speech naturalness, intelligibility, or AV synchrony, but these do not capture the overall perceptual quality [4]. In practice, human evaluators judge dubbed content in a multifaceted way [5-7], simultaneously considering prosody, speaker identity, semantic consistency, emotional congruence, and temporal alignment. The overall quality of AI-dubbed content from this holistic perspective is still not well characterized, making it difficult to compare systems or optimize toward natural and coherent dubbing [8].

A key challenge is the lack of scalable human ratings. Mean Opinion Scores (MOS) remain the gold standard, but collecting them is costly, time-consuming, and infeasible at the scale required for modern dubbing systems [9]. To overcome this bottleneck, we introduce Proxy MOS, a weak-supervision strategy that aggregates multiple objective metrics into pseudo-perceptual labels. Unlike simple averaging, the weights are adaptively learned through an active learning scheme guided by a small subset of human ratings, improving alignment between proxy labels and human perception.

To model the multifaceted nature of dubbing quality, we design a hierarchical multimodal architecture that integrates audio, video, and text features through progressive intra- and inter-modal fusion. Fine-grained cues from audio (speaker identity, prosody), text (semantic intent), and video (character and affective context) jointly influence perceived dubbing quality, motivating multimodal modeling.
Pre-trained models serve as encoders for each modality, and lightweight LoRA adapters are employed to project representations into a shared space, enabling parameter-efficient adaptation across modalities. Intra-modal fusion aligns complementary cues within each modality, while inter-modal fusion captures dependencies across modalities, ensuring a holistic representation of dubbing quality.

In this paper, we propose a two-stage training pipeline. In the first stage, the model is pre-trained on Proxy MOS labels derived from over 12,000 dubbed clips, generated using state-of-the-art NMT, TTS, and audio-visual synchronization pipelines applied to the publicly available MELD [10] and M2H2 [11] datasets. In the second stage, the model is fine-tuned on a smaller set of human MOS ratings to refine perceptual alignment. Our contributions are summarized as follows:

• We propose a hierarchical multimodal framework that fuses audio, visual, and textual features through intra- and inter-modal fusion layers for automatic perceptual evaluation of AI-dubbed content.
• We incorporate lightweight LoRA adapter layers for parameter-efficient adaptation across modalities.
• We propose a two-stage training pipeline: (1) an active learning-based weak supervision framework that derives Proxy MOS from multiple objective metrics, and (2) fine-tuning with human-rated MOS for stronger perceptual alignment.

Fig. 1: Block diagram of the proposed hierarchical cross-modal fusion architecture. (Frozen content, speaker, emotion, temporal, FER, and semantic encoders with trainable LoRA adapters feed per-modality intra-modal fusion blocks, followed by inter-modal fusion and a regression head that outputs the DubScore prediction.)

2. PROPOSED METHOD

2.1. Proposed Hierarchical Multimodal Architecture

We propose a hierarchical multimodal network designed to capture both modality-specific cues and cross-modal dependencies critical for perceptual dubbing evaluation, as shown in Figure 1. The architecture leverages state-of-the-art pre-trained encoders for audio, video, and text, each selected for its demonstrated ability to extract semantically and perceptually rich representations. For video, we extract visual embeddings using TimeSformer [12], a transformer-based video representation model that captures long-range spatio-temporal dependencies through divided space-time attention; its output embeddings are 768-dimensional. In addition, to incorporate facial affective cues, we employ a Facial Expression Recognition (FER) encoder [13, 14], which maps cropped face sequences into 512-dimensional emotion-related features. For audio, we use Wav2Vec2.0 [15], which outputs 768-dimensional frame-level embeddings that are temporally pooled and linearly projected to 256 dimensions before fusion. To capture speaker traits and paralinguistic cues, we adopt ECAPA-TDNN [16], yielding 192-dimensional speaker embeddings. Moreover, we include Emo2Vec [17], a deep emotional embedding extractor for speech, producing 256-dimensional representations. Finally, for the text encoder, we use Sentence-BERT [18], producing 768-dimensional embeddings that capture semantic similarity at the sentence level. To efficiently adapt these encoders, we employ Low-Rank Adaptation (LoRA) [19], which introduces lightweight trainable matrices into attention and projection layers while keeping pretrained weights frozen.
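To make the adapter mechanism concrete, below is a minimal PyTorch sketch of a LoRA-augmented projection in the spirit of the adapted embeddings formalized in Eq. (1) below; the module name, the zero-initialized low-rank branch, and the alpha/rank scaling follow common LoRA practice and are illustrative assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear projection plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)                 # P_m: frozen projection
        self.proj.weight.requires_grad_(False)
        self.proj.bias.requires_grad_(False)
        self.lora_down = nn.Linear(in_dim, rank, bias=False)   # A_m: low-rank adapter (down)
        self.lora_up = nn.Linear(rank, out_dim, bias=False)    # A_m: low-rank adapter (up)
        nn.init.zeros_(self.lora_up.weight)                    # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # \tilde{h} = P(h) + A(h), with A realized as a scaled low-rank bottleneck
        return self.proj(h) + self.scale * self.lora_up(self.lora_down(h))

# Example: adapt 768-d Wav2Vec2.0 embeddings into a 256-d shared space (rank r = 16, as in Sec. 3.1)
adapter = LoRALinear(in_dim=768, out_dim=256, rank=16)
shared = adapter(torch.randn(4, 768))   # (batch, 256)
```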
LoRA adaptation reduces the number of trainable parameters while maintaining flexibility.

The hierarchical design reflects how human evaluators perceive dubbed content: heterogeneous cues are first consolidated within each modality and then integrated across modalities. For instance, audio encoders provide speech content, speaker identity, and prosodic cues, while video encoders capture both global context and fine-grained facial expressions. Fusing these directly risks information loss or modality dominance, hence the need for staged fusion. Formally, given embeddings $\{h_m^{(i)}\}_{i=1}^{N_m}$ for modality $m \in \{\text{audio}, \text{video}, \text{text}\}$, we adapt them via

$$\tilde{h}_m^{(i)} = P_m\big(h_m^{(i)}\big) + A_m\big(h_m^{(i)}\big), \qquad (1)$$

where $P_m$ is a linear projection and $A_m$ is a LoRA adapter. Intra-modal fusion aggregates modality-specific cues using attention-based gating:

$$z_m = \sum_{i=1}^{N_m} \alpha_m^{(i)} \tilde{h}_m^{(i)}, \qquad \alpha_m^{(i)} = \frac{\exp\big(w^\top \tilde{h}_m^{(i)}\big)}{\sum_j \exp\big(w^\top \tilde{h}_m^{(j)}\big)}. \qquad (2)$$

Inter-modal fusion: to account for modality reliability, we apply gating before cross-modal modeling:

$$\hat{z}_m = g_m(z_m)\, z_m, \qquad g_m(z_m) = \frac{\exp\big(\phi(z_m)\big)}{\sum_{m'} \exp\big(\phi(z_{m'})\big)}. \qquad (3)$$

The gated modality-level vectors are concatenated and processed by a 3-layer, 4-head transformer encoder:

$$H = \mathrm{TransformerEncoder}\big([\hat{z}_{\text{audio}}; \hat{z}_{\text{video}}; \hat{z}_{\text{text}}]\big). \qquad (4)$$

Finally, a regression head predicts a single perceptual dubbing score (DubScore) using an L2 (MSE) loss. This hierarchical strategy consolidates intra-modal cues before cross-modal fusion, ensuring both modality-specific refinement and robust interaction aligned with perceptual judgments.

2.2. Active Learning for Proxy MOS

Human MOS annotations are limited and costly to obtain, making it infeasible to label the entire dataset. To overcome this, we generate Proxy MOS labels by aggregating multiple objective metrics, including PEAVS [20] for audio-visual synchrony, EmoSync [21] for visual emotion alignment, LogF0RMSE for speaker consistency, UTMOS [22] for speech quality, and SpeechBERTScore [23] for semantic alignment. The aggregated Proxy MOS is defined as

$$\text{Proxy MOS} = \sum_i w_i \cdot O_i, \qquad (5)$$

where $O_i$ denotes the $i$-th objective metric and $w_i$ its learnable weight. The challenge is to estimate $w_i$ using only a limited set of human MOS labels. To this end, we employ an active learning (AL) strategy that incrementally refines the weights in three stages. Importantly, the reported percentages (33%, 66%, 100%) refer to fractions of the labeled training subset, not the entire corpus, consistent with standard AL practice where human annotations are scarce.

Stage I: Initialization. A small labeled subset (33% of available MOS annotations) is used to estimate initial weights by maximizing the correlation between Proxy MOS and human MOS. This provides a first approximation of perceptual alignment.

Stage II: Uncertainty- and Diversity-Based Sampling. To expand the labeled pool, we query the most informative samples from the remaining unlabeled subset. Samples are ranked by prediction uncertainty (entropy of Proxy MOS) and then filtered to maximize diversity across dubbing conditions (e.g., clip length, acoustic background, speaker identity). This hybrid uncertainty-diversity criterion ensures that annotations are both informative and representative. An additional 33% of labeled data is added in this stage (total 66%).

Stage III: Refinement. The process is repeated to reach 100% of the labeled subset.
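As a rough illustration of this weak-supervision loop, the sketch below aggregates the objective metrics into a Proxy MOS with learnable weights (Eq. (5)) and runs one Stage II query round. The correlation-maximizing weight fit mirrors Stage I, while the bootstrap-variance uncertainty estimate and the greedy farthest-point diversity filter are illustrative stand-ins for the entropy- and diversity-based criteria described above; the paper does not commit to these particular estimators, and all function names and thresholds here are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def fit_proxy_weights(O_lab: np.ndarray, mos_lab: np.ndarray) -> np.ndarray:
    """Fit non-negative metric weights w so that Proxy MOS = O @ w maximizes
    Pearson correlation with human MOS on the labeled subset (Stage I / III)."""
    k = O_lab.shape[1]
    neg_corr = lambda w: -np.corrcoef(O_lab @ w, mos_lab)[0, 1]
    res = minimize(neg_corr, x0=np.full(k, 1.0 / k), bounds=[(1e-6, None)] * k)
    return res.x

def query_next_batch(O_unlab, cond_feats, O_lab, mos_lab, n_query=100, n_boot=20, seed=0):
    """Stage II sketch: rank unlabeled clips by Proxy MOS uncertainty (variance over
    bootstrap weight fits), then keep a diverse subset over dubbing-condition features
    (clip length, acoustic background, speaker identity, ...)."""
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(mos_lab), len(mos_lab))        # bootstrap resample of labels
        boots.append(O_unlab @ fit_proxy_weights(O_lab[idx], mos_lab[idx]))
    uncertainty = np.var(np.stack(boots), axis=0)
    candidates = list(np.argsort(-uncertainty)[: 3 * n_query])   # most uncertain candidate pool
    chosen = [candidates.pop(0)]
    while len(chosen) < n_query and candidates:
        # farthest-point step: pick the candidate farthest from everything chosen so far
        dists = [np.linalg.norm(cond_feats[c] - cond_feats[chosen], axis=1).min() for c in candidates]
        chosen.append(candidates.pop(int(np.argmax(dists))))
    return np.array(chosen)
```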
We then optimize the weights $w_i$ by maximizing Pearson correlation with the expanded human MOS set:

$$\max_{w} \; \rho\big(\text{ProxyMOS}(w), \text{HumanMOS}\big), \qquad (6)$$

where $\rho$ denotes Pearson correlation. Proxy MOS weights are thus learned by maximizing Pearson correlation with human MOS (Eq. (6)), while active learning selects informative clips based on uncertainty and diversity. The learned weights are used to compute Proxy MOS labels for the full 12k dubbed clips, which provide weak supervision for training the hierarchical multimodal network at scale. The model is then fine-tuned on a limited set of human MOS annotations, yielding a perceptually aligned predictor.

3. EXPERIMENTAL RESULTS AND DISCUSSION

3.1. Datasets and Experimental Setup

We perform our evaluation on two publicly available datasets, namely MELD [10] and M2H2 [11]. MELD (English) was dubbed into Hindi, and M2H2 (Hindi) into English. Both datasets contain video clips along with speaker tags, emotion tags, and transcripts. For creative translation, we used the Gemini-9B model, prompting it with speaker attributes and emotion tags to preserve both semantic and affective context. Translated utterances were synthesized into the target language using the F5-TTS system to generate expressive speech. Finally, to ensure audio-visual synchrony, we applied a global time-stretching algorithm that aligns synthesized audio with the original video. This pipeline produced approximately 6k dubbed clips from MELD and 4k from M2H2, covering diverse speakers and emotions. Additionally, we included 2k original ground-truth clips to stabilize Proxy MOS training, resulting in a total of 12k clips. Details of the pre-trained encoders are provided in Section 2.1. For adaptation, LoRA modules were attached to each encoder with a rank of r = 16, empirically chosen to balance performance and parameter efficiency. Training was performed using the Adam optimizer with a learning rate of 1e-4, batch size 64, and dropout 0.2, for 50 epochs. All results are reported using 4-fold cross-validation to ensure robustness.

3.2. Subjective Evaluation

For subjective evaluation, we recruited 30 participants (aged 25-40, no reported hearing or vision impairments). A total of 1,350 ratings were collected, with each participant evaluating 45 dubbed video clips. In addition to overall dubbing quality, participants rated six rubric-based aspects aligned with established dubbing evaluation standards [24]: audio-visual synchrony, speaker consistency, voice clarity, emotional expressiveness, prosody in speech, and overall dubbing quality. The human-rated MOS set was divided into an 80%-20% train-test split for fine-tuning the proposed network. To further assess the reliability of the human-rated MOS, we conducted an inter-listener agreement analysis (Table 1). The results show good internal consistency (Cronbach's α = 0.82). The intraclass correlation coefficients indicate moderate reliability, with ICC1 = 0.69 (individual ratings) and ICC2 = 0.59 (aggregated MOS).

Table 1: Inter-listener agreement metrics.

  Metric                    Value
  Cronbach's Alpha          0.82
  ICC1 (One-way random)     0.69
  ICC2 (Two-way random)     0.59

3.3. Experimental Results

We evaluate models using Pearson's correlation coefficient (PCC), Spearman's rank correlation coefficient (SRCC), and mean squared error (MSE) between the predicted DubScore and the human-rated MOS. MOS labels ([1, 5]) are normalized during training and rescaled at inference.
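For clarity on how the reported numbers are computed, here is a minimal evaluation sketch; the min-max mapping between [1, 5] and [0, 1] is an assumption, since the paper only states that MOS labels are normalized for training and rescaled at inference.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def normalize_mos(mos: np.ndarray) -> np.ndarray:
    """Map MOS labels from [1, 5] to [0, 1] for training (assumed min-max scheme)."""
    return (mos - 1.0) / 4.0

def rescale_mos(pred: np.ndarray) -> np.ndarray:
    """Map normalized predictions back to the [1, 5] MOS scale at inference."""
    return pred * 4.0 + 1.0

def evaluate(pred_dubscore: np.ndarray, human_mos: np.ndarray) -> dict:
    """PCC, SRCC, and MSE between predicted DubScore and human-rated MOS."""
    return {
        "PCC": pearsonr(pred_dubscore, human_mos)[0],
        "SRCC": spearmanr(pred_dubscore, human_mos)[0],
        "MSE": float(np.mean((pred_dubscore - human_mos) ** 2)),
    }

# Toy usage with synthetic scores
rng = np.random.default_rng(0)
mos = rng.uniform(1.0, 5.0, size=100)
pred = rescale_mos(np.clip(normalize_mos(mos) + rng.normal(0.0, 0.1, size=100), 0.0, 1.0))
print(evaluate(pred, mos))
```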
Table 2 reports the performance of the proposed network under different modality configurations, including unimodal, bimodal, and the full multi-modal settings. This ablation highlights the individual and complementary contributions of each modality. Among single modalities, audio provides the strongest predictive signal, text offers moderate performance, and video contributes little when used alone, especially in complex scenes with limited visual identity cues. Bimodal fusion improves results, with audio-text outperforming other pairs. The proposed full multi-modal system achieves the best alignment with human perception, confirming that hierarchical integration of all modalities provides complementary benefits beyond unimodal or bimodal systems.

Table 2: Performance comparison of the proposed network under different modality configurations. Here, Audio (A), Video (V), and Text (T) denote the considered modalities.

  Modality   PCC ↑   SRCC ↑   MSE ↓
  A          0.68    0.60     4.30
  V          0.05    0.01     3.84
  T          0.34    0.43     3.84
  A+V        0.71    0.65     3.88
  A+T        0.73    0.76     4.39
  V+T        0.50    0.54     3.77
  A+V+T      0.76    0.77     3.88

Since Proxy MOS relies on learning weights for objective measures, we employ active learning (AL) to make this process data-efficient. A random-sampling baseline is used for comparison to isolate the benefit of AL. We adopt a three-stage setup, where labeled data is incrementally expanded from 33% to 100% of the training set, with the most informative samples selected at each step. Table 3 shows that calibration steadily improves as the label budget grows: Average Predictive Variance (APV) decreases, Prediction Interval Coverage Probability (PICP) increases, Mean Prediction Interval Width (MPIW) tightens, and Expected Calibration Error (ECE) reduces, together indicating more reliable weight estimation. In addition, Table 4 compares AL with random sampling, showing consistent gains in PCC, SRCC, predictive accuracy (MSE), and explained variance (R²), with significant improvements at full budget. These results justify the use of AL for Proxy MOS weight learning, which then serves as weak supervision for our hierarchical multimodal network.

Table 3: Uncertainty calibration metrics at different label proportions.

  #Labeled   APV ↓   PICP ↑   MPIW ↓   ECE ↓
  33%        0.51    78%      0.82     0.14
  66%        0.26    82%      0.65     0.09
  100%       0.16    86%      0.58     0.06

Table 4: Performance comparison of Active Learning (AL) vs. Random (Ra) sampling. P-values indicate the statistical significance of the improvement of AL over the corresponding Random baseline at the same label budget; they are not applicable to the baseline (Ra) rows.

  Strategy        PCC ↑   SRCC ↑   R² ↑   p-value
  Ra (33%)        0.68    0.67     0.46   –
  AL (S0, 33%)    0.71    0.69     0.50   0.18
  Ra (66%)        0.73    0.71     0.55   –
  AL (S1, 66%)    0.77    0.75     0.61   0.07
  Ra (100%)       0.76    0.74     0.62   –
  AL (S2, 100%)   0.82    0.81     0.69   0.03

We also considered an equal-weights strategy as a baseline for Proxy MOS calculation; Table 5 reports the results. While equal weighting provides a reasonable baseline, AL-based weak supervision consistently yields higher correlations (PCC, SRCC) and lower errors (MSE). Moreover, when combined with fine-tuning on human-rated MOS, AL achieves the overall best performance. These findings confirm that the proposed AL-based Proxy MOS is more effective than simple averaging, and that its combination with human supervision provides the most reliable perceptual predictions.

Table 5: Performance comparison across different training strategies.
Here, EW = equal weight, WS = weak supervision with Proxy MOS, and FT = fine-tuning with human MOS.

  Strategy      PCC ↑   SRCC ↑   MSE ↓
  EW: WS        0.22    0.25     8.14
  EW: WS + FT   0.35    0.33     5.14
  AL: WS        0.68    0.67     2.96
  AL: WS + FT   0.76    0.77     2.70

The radar chart (Fig. 2) shows that individual objective metrics yield low PCC and SRCC scores when predicting overall dubbing quality. In contrast, the proposed architecture achieves consistently higher correlations, demonstrating its effectiveness as a unified perceptual evaluation metric.

Fig. 2: Effectiveness of different objective metrics (AVSync, EmoSync, UTMOS, AutoPCP, LogF0RMSE, SpeechBERT) and the proposed method in predicting overall dubbing quality scores (PCC and SRCC).

4. CONCLUSION

This paper presents a hierarchical multimodal architecture for dubbing quality assessment that fuses audio, video, and text cues through intra- and inter-modal layers, achieving strong alignment with human perception. An adaptive active learning strategy with parameter-efficient LoRA fine-tuning enables scalable training using Proxy MOS with limited human annotations. Experimental results demonstrate scalable, perceptually aligned dubbing quality prediction and establish a foundation for future work in automated, human-centered audio-visual quality assessment.

5. REFERENCES

[1] Yihan Wu et al., "VideoDubber: Machine translation with speech-aware length control for video dubbing," in AAAI, 2023, vol. 37, pp. 13772-13779.
[2] Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, and Rajiv Ratn Shah, "DubWise: Video-guided speech duration control in multimodal LLM-based text-to-speech for dubbing," in INTERSPEECH, Kos Island, Greece, 2024.
[3] Giselle Spiteri Miggiani, "Rethinking creativity in dubbing: Potential impact of AI dubbing technologies on creative practices, roles and viewer perceptions," Translation Spaces, 2025.
[4] Chaoyi Wang et al., "Towards film-making production dialogue, narration, monologue adaptive moving dubbing benchmarks," arXiv, 2025.
[5] Huriye Atilgan et al., "Integration of visual information in auditory cortex promotes auditory scene analysis through multisensory binding," Neuron, vol. 97, no. 3, pp. 640-655, 2018.
[6] Zahid Akhtar and Tiago H. Falk, "Audio-visual multimedia quality assessment: A comprehensive survey," IEEE Access, vol. 5, pp. 21090-21117, 2017.
[7] Chuanji Gao et al., "Audiovisual integration in the human brain: a coordinate-based meta-analysis," Cerebral Cortex, vol. 33, no. 9, pp. 5574-5584, 2023.
[8] Laurena Bernabo, "How, when, and why to use AI: Strategic uses of professional perceptions and industry lore in the dubbing industry," International Journal of Communication, vol. 19, pp. 18, 2025.
[9] Gérard Bailly, Elisabeth André, Erica Cooper, Benjamin Cowan, Jens Edlund, Naomi Harte, Simon King, Esther Klabbers, Sébastien Le Maguer, Zofia Malisz, et al., "Hot topics in speech synthesis evaluation," in Proc. SSW 2025, 2025, pp. 1-7.
[10] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea, "MELD: A multimodal multi-party dataset for emotion recognition in conversations," in ACL, 2019, pp. 527-536.
[11] Dushyant Singh Chauhan et al., "M2H2: A multimodal multiparty Hindi dataset for humor recognition in conversations," in ACM ICMI, 2021.
[12] Gedas Bertasius, Heng Wang, and Lorenzo Torresani, "Is space-time attention all you need for video understanding?," in ICML, 2021.
[13] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in CVPR, 2019.
[14] Shan Li and Weihong Deng, "Deep facial expression recognition: A survey," IEEE Transactions on Affective Computing, 2023.
[15] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 12449-12460.
[16] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN-based speaker verification," in Interspeech, 2020.
[17] Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen, "emotion2vec: Self-supervised pre-training for speech emotion representation," in Findings of the ACL, 2024, pp. 15747-15760.
[18] Nils Reimers and Iryna Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in EMNLP, 2019.
[19] Edward J. Hu, Yelong Shen, Phillip Wallis, et al., "LoRA: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021.
[20] Lucas Goncalves, Prashant Mathur, Chandrashekhar Lavania, Metehan Cekic, Marcello Federico, and Kyu J. Han, "Perceptual evaluation of audio-visual synchrony grounded in viewers' opinion scores," in European Conference on Computer Vision. Springer, 2024, pp. 288-305.
[21] Elena Ryumina, Maxim Markitantov, Dmitry Ryumin, Heysem Kaya, and Alexey Karpov, "Zero-shot audio-visual compound expression recognition method based on emotion probability fusion," in CVPR Workshops, 2024, pp. 4752-4760.
[22] Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari, "UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022," in Interspeech, 2022, pp. 4521-4525.
[23] Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, and Hiroshi Saruwatari, "SpeechBERTScore: Reference-aware automatic evaluation of speech generation leveraging NLP evaluation metrics," in Interspeech 2024, 2024, pp. 4943-4947.
[24] Giselle Spiteri Miggiani, "Quality assessment tools for studio and AI-generated dubs and voice-overs," 2024.