Multi-Source Evidence Fusion for Audio Question Answering
Authors: Aivo Olev, Tanel Alumäe
Tallinn University of Technology, Estonia
{aivo.olev,tanel.alumae}@taltech.ee

Abstract

Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech's solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, in which systems are evaluated on reasoning process quality, specifically the factual accuracy, logical soundness, and completeness of their reasoning chains. Our multi-source ensemble pipeline uses two LALMs that generate independent observations, while a separate text-only reasoning model cross-checks these against outputs from 25 acoustic tools organized into reliability tiers. By grounding every inference step in explicit, reliability-tagged evidence, the system produces dense, verifiable reasoning chains. Our system ranked first in the challenge, outperforming all competing systems by a wide margin in the challenge's reasoning quality metric.

Index Terms: audio reasoning, reasoning quality, large audio language models, evidence combination, tool reliability

1. Introduction

Audio understanding has advanced rapidly with the emergence of large audio language models (LALMs) [1, 2, 3, 4]. Yet these models often produce opaque reasoning that resists verification. This limitation is masked by benchmarks that evaluate only final-answer accuracy [5]. The Interspeech 2026 Audio Reasoning Challenge [6, 5, 7] addresses this gap: its tasks require combining perception with multi-step reasoning across music theory, speaker analysis, scene understanding, and temporal reasoning, and systems are ranked primarily on reasoning process quality (factuality, logic, and completeness of reasoning chains), not accuracy alone.
No single model excels at all of these capabilities, motivating agent-based systems that orchestrate multiple specialized tools. This paper describes our entry to the Agent Track, focusing on how we combine heterogeneous audio tools and speech LLMs to produce both accurate and transparently reasoned answers.

A central challenge for such systems is that their information sources have fundamentally different reliability characteristics. LALMs provide rich, high-level observations but are prone to hallucination: they may fabricate timestamps, report visual information from audio-only input, or overcount events. Traditional acoustic tools (e.g., beat detection, spectral analysis) produce reproducible measurements but can only answer narrow questions and often struggle with real-world data that might contain noise, different types of sounds, and out-of-domain audio events. Automatic speech recognition occupies a middle ground: it is generally reliable but subject to errors. It is therefore difficult to build an agent that reasons correctly when combining evidence from sources that span this reliability spectrum. Ensemble methods assume roughly equal participant reliability [8, 9], and agentic frameworks treat tools as equally reliable oracles [10, 11].

Our system incorporates three design decisions that proved effective in the challenge setting:

1. Dual-source evidence fusion, in which two speech LLMs independently report observations and a downstream reasoning model cross-validates them.
2. A four-tier tool reliability framework with confidence caps, relevance scoring, corroboration bonuses, and domain-appropriateness adjustments for 25 tools.
3. A three-stage contradiction detection mechanism with hypothesis-driven targeted verification.

The system ranked first in the Agent Track according to the challenge's primary metric, reasoning quality (MMAR-Rubrics: 69.8), and achieved the second-highest accuracy (76.9%).
Our evidence-based architecture encourages dense and verifiable reasoning. Each claim is supported by reliability-tagged observations, tool measurements, or corroboration across sources, producing reasoning chains with many checkable factual statements (Section 5.3). Ablation experiments show that dual-source evidence fusion yields a statistically significant improvement in accuracy.

While the reliability weights and confidence caps were tuned for this specific challenge, the underlying architectural principles of evidence separation, tiered reliability, and contradiction-driven verification may generalize to other reasoning tasks that involve heterogeneous sources.

2. Related work

LALMs have advanced rapidly [1, 12, 13, 14], yet persistent hallucination across 14 distinct failure modes [15] and strong affirmation bias [16] limit single-model reliability. Reinforcement learning improves reasoning but not audio perception itself [17], motivating multi-source approaches.

Agentic frameworks such as ReAct [10] and Toolformer [11] treat tools as equally reliable oracles. Audio-specific agents, such as AudioGPT [18], AudioToolAgent [19], and AudioGenie-Reasoner [20], coordinate multiple models but lack explicit reliability hierarchies or confidence caps. Sun et al. [21] showed that tools can produce plausible but incorrect results without error signals, motivating reliability-weighted combination. Multi-model ensemble methods, such as Mixture-of-Agents [8] and multi-agent debate [9], assume roughly homogeneous participant reliability. Hwang et al. [22] demonstrated that explicit reliability modeling improves robustness under heterogeneous source quality.

Sycophancy and anchoring bias cause LLMs to produce unfaithful reasoning when exposed to suggested answers [23, 24]. Our evidence-only design therefore hides LALM answer predictions from the reasoning agent to prevent anchoring on potentially incorrect predictions, though our ablation suggests the effect is small in this setting (Section 5.2).

Table 1: Tool reliability tiers with representative tools, default confidence caps, and evidence weights.

Tier           Representative tools                                      Cap    Weight
Analytic       Beat detection, energy dynamics, spectral features        0.90   1.0
Probabilistic  Whisper ASR, diarization, source separation, instruments  0.75   0.75
Heuristic      Chord analysis, environment detection, scene context      0.60   0.50
LALMs          StepAudioR1, Qwen3-Omni                                   0.70   0.40

3. System architecture

The system takes an audio file, a question, and multiple-choice options as input and produces a selected answer with accompanying reasoning. Figure 1 illustrates the pipeline. Two LALMs independently report observations about the audio. The reasoning agent receives these observations together with tool outputs tagged with confidence and relevance scores, but does not see the source models' answer predictions.

Two open-weights LALMs, StepAudioR1 [14] and Qwen3-Omni [13], independently analyze the audio. Each model receives four queries: one for the full audio and one for each of three equal-duration segments, enabling focused temporal analysis. Models are prompted to report observations, not to select an answer. The synthesis step merges observations across segments, applies corroboration bonuses when segment and full-audio analyses agree, and classifies the audio content type. A reasoning LLM call performs semantic corroboration of both speech LLM outputs, classifying each observation as corroborated (both sources agree; confidence 0.80–0.95), source-specific (one source only; 0.50–0.70), or disagreement (conflicting claims with credibility assessment).
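As a rough sketch of this evidence bookkeeping (our illustration, not the pipeline's actual code), the following shows how an evidence item's score could be assembled. The per-tier base confidences are invented placeholders; the tier caps, the 1.5x corroboration multiplier, the 1.3x direct-answer bonus, the 0.70 LALM ceiling, and the confidence-times-relevance weighting are the values reported in this paper.

```python
from dataclasses import dataclass

# Caps as reported in Table 1.
TIER_CAP = {"analytic": 0.90, "probabilistic": 0.75, "heuristic": 0.60, "lalm": 0.70}

# Per-tier base confidences are NOT given in the paper; these are
# placeholder values chosen only so the capping behaviour is visible.
TIER_BASE = {"analytic": 0.60, "probabilistic": 0.50, "heuristic": 0.40, "lalm": 0.45}

@dataclass
class Evidence:
    tier: str                   # key into TIER_CAP / TIER_BASE
    relevance: float            # 0..1, tracked independently of confidence
    corroborated: bool          # cross-source or segment/full-audio agreement
    direct_answer: bool         # the item directly answers the question
    domain_factor: float = 1.0  # < 1.0 when a tool runs outside its domain

def confidence(ev: Evidence) -> float:
    """Tier base confidence times the 1.5x corroboration multiplier and the
    1.3x direct-answer bonus, clamped to the tier cap (LALMs never exceed 0.70)."""
    c = TIER_BASE[ev.tier] * ev.domain_factor
    if ev.corroborated:
        c *= 1.5
    if ev.direct_answer:
        c *= 1.3
    return min(c, TIER_CAP[ev.tier])

def weight(ev: Evidence) -> float:
    """The argumentation stage weights each item by confidence x relevance."""
    return confidence(ev) * ev.relevance
```

Under this scheme an LALM observation that is both corroborated and directly answers the question still tops out at 0.70, while an analytic measurement can reach 0.90.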
We classify each tool into four reliability tiers based on output determinism and reproducibility (Table 1). Each evidence item receives a confidence score combining the tier's base confidence, a 1.5x corroboration multiplier, and a 1.3x direct-answer bonus. LALM evidence is capped at 0.70 regardless of bonuses. All caps, weights, and multipliers were set empirically during challenge development rather than learned from data. Confidence and relevance are tracked independently; the argumentation stage weights evidence by their product. Tools applied outside their primary domain (e.g., beat detection on speech) receive a reduced confidence via a domain appropriateness factor.

When the unified analysis reveals disagreements or insufficient confidence, the system enters a two-step verification loop. In Step 1, the reasoning agent iteratively selects from 12 whole-audio tools (23 for music content). After gathering evidence, a three-stage contradiction detector processes the combined evidence: (1) heuristic reclassification adjusts LALM confidence based on keyword overlap with tool outputs (plus or minus 0.15 for reproducible tools); (2) hallucination risk assessment assigns per-item risk levels based on reliability tier, corroboration status, and domain-specific guards (e.g., cosine-clustering speaker counts at least 3x the diarization estimate are marked as segmentation artifacts); (3) an LLM contradiction detector classifies inter-tool conflicts, intra-tool inconsistencies, and reliability hierarchy violations, while checking four logical pitfalls: treating absence of detection as proof of absence, dismissing single-source claims instead of marking them speculative, diarization over-segmentation, and transcription disagreements where segment timings do not actually overlap.
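Two of the detector's heuristic pieces reduce to a few lines. The sketch below is our simplification: the naive token-overlap test and its 0.3 threshold stand in for the real keyword matcher, while the 0.15 adjustment and the 3x speaker-count guard are the values described above.

```python
def reclassify_lalm_confidence(lalm_claim: str, tool_output: str,
                               confidence: float) -> float:
    """Stage 1 heuristic: nudge an LALM claim's confidence by +/-0.15
    depending on keyword overlap with a reproducible tool's output.
    The token-overlap test and 0.3 threshold are illustrative stand-ins."""
    claim_words = set(lalm_claim.lower().split())
    tool_words = set(tool_output.lower().split())
    overlap = len(claim_words & tool_words) / max(1, len(claim_words))
    delta = 0.15 if overlap >= 0.3 else -0.15
    return max(0.0, min(1.0, confidence + delta))

def speaker_count_artifact(cosine_cluster_count: int,
                           diarization_count: int) -> bool:
    """Stage 2 domain guard: cosine-clustering speaker counts at least 3x
    the diarization estimate are marked as segmentation artifacts."""
    return cosine_cluster_count >= 3 * diarization_count
```

Claims flagged by these guards are downgraded rather than discarded, consistent with the non-dismissal policy described next.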
A non-dismissal policy keeps LALM claims flagged as speculative rather than hallucinated unless actively contradicted by reproducible tool evidence, preserving potentially valid observations for downstream weighing. Each unresolved contradiction generates a verification hypothesis with specific tool calls and time ranges for Step 2, where segment-specific tools examine targeted audio regions to resolve conflicts.

The final stage uses two sequential LLM calls. An answer selection prompt presents the question, formatted audio observations (with source labels but without the source models' answer predictions), reliability evaluations with numeric confidence scores, and tool verification results with confidence and relevance scores; the model selects the best-supported answer. A subsequent reasoning generation call receives the same evidence plus the pre-selected answer and produces a prose justification following a seven-section template (what is heard, evidence synthesis, conflict resolution, reliability assessment, tool cross-references, per-choice evaluation, conclusion); a completeness check ensures all observations, tool results, and conflicts are addressed. Separating selection from elaboration prevents the model from committing to a narrative before weighing all evidence, and lets the reasoning model focus first on decision making and then on explanation.

4. Experimental setup

4.1. Dataset and metrics

The Interspeech 2026 Audio Reasoning Challenge uses the 1,000-sample MMAR benchmark [7], spanning speech, music, and mixed-modality scenarios across signal, perception, semantic, and cultural reasoning layers. The challenge evaluates two metrics: answer accuracy and reasoning quality via the MMAR-Rubrics protocol [5].
For each sample, k = 5 checkable criteria are generated from the human-annotated reasoning; an LLM judge assesses each criterion as satisfied or not, and the rubrics score is the fraction of satisfied criteria. Only correct answers receive a reasoning score; incorrect answers score zero regardless of reasoning quality. This two-gate design rewards systems that both answer correctly and produce verifiable, evidence-grounded reasoning chains. The criteria assess the factuality, logic, and completeness of the reasoning chain, directly rewarding systems whose output traces each step from evidence to conclusion.

4.2. Implementation details

The reasoning agent uses moonshotai/Kimi-K2-Thinking [25] (temperature 0.6). LALM evidence comes from stepfun-ai/Step-Audio-R1 [14] and Qwen/Qwen3-Omni-30B-A3B-Thinking [13]. ASR uses Whisper large-v3 [26] and Canary [27]; source separation uses Demucs [28]; speaker diarization uses pyannote.audio [29]; speaker embeddings use ECAPA-TDNN [30] via SpeechBrain [31]; pitch estimation uses CREPE [32]; audio event detection uses PANNs [33]; acoustic features use librosa [34].

[Figure 1 flow: Audio + Question + Choices -> StepAudioR1 (full + 3 seg.) and Qwen3-Omni (full + 3 seg.) -> Unified Evidence Analysis -> Step 1: Evidence (12 [23] tools, up to 3 rounds) -> Contradiction Detection -> Step 2: Validate (5 [8] tools, up to 2 rounds) -> Answer Selection -> Answer + Reasoning]

Figure 1: Multi-source ensemble pipeline. Two speech LLMs analyze audio across four segments each; unified analysis identifies corroborations and disagreements. Step 1 gathers evidence from reliability-tiered tools (up to 3 rounds), three-stage contradiction detection generates verification hypotheses, and Step 2 performs targeted segment-level validation. Bracket counts indicate additional music-specific tools.

The end-to-end latency of our system averages 8–10 minutes per sample, limiting real-time applicability. Outside a competition setting, much of this cost is avoidable: a non-reasoning LLM could likely handle most pipeline stages with minor accuracy loss, and many questions can be answered from speech LLM observations without invoking multiple tool verification loops.

5. Results

Our system obtained a reasoning score of 69.8 and 76.9% accuracy, placing it first on the Agent Track leaderboard (Table 2).

Table 2: Agent Track leaderboard (top 10) and single-model baselines from the MMAR benchmark [7]. Rubrics = MMAR-Rubrics composite score (ranking criterion); Acc. = raw accuracy. Single-model baselines do not have Rubrics scores.

Rank  Team / Model          Rubrics  Acc.
Agent Track
1     TalTech (ours)        69.83    76.9%
2     Team B                66.23    77.4%
3     Team C                66.09    75.1%
4     Team D                64.61    72.2%
5     Team E                63.00    71.0%
Commercial and open-weight models (single)
-     Gemini 2.5 Pro        -        74.7%
-     GPT-4o Audio          -        63.5%
-     Qwen3-Omni-Thinking   -        66.4%

Table 3: Accuracy by LALM agreement level.

Agreement    N     Correct  Accuracy
Unanimous    128   121      94.5%
Majority     565   470      83.2%
Conflicting  307   178      58.0%
Overall      1000  769      76.9%

Descriptive analysis of the submission debug logs revealed the following observations:

• Agreement predicts accuracy. Table 3 shows a 36.5 pp gap between unanimous (94.5%) and conflicting (58.0%) cases, confirming that inter-model agreement is a strong difficulty proxy.
• Confidence is well calibrated. Accuracy increases monotonically with confidence: 91.1% at >= 0.80 (n = 237), 74.4% at 0.60–0.79 (n = 722), and 39.4% at 0.40–0.59 (n = 33).
• Corroboration improves accuracy. Samples with at least 6 corroborated items reach 86.2% (n = 282) vs. 53.8% with zero (n = 13).
• Tool evidence changes 8.5% of answers, overriding both LALM predictions in 85/1,000 cases. Speech-only questions reach 85.7% while music drops to 58.7%.
• Performance varies across reasoning layers. Table 4 breaks down performance by MMAR question sub-categories [7]. Semantic tasks are easiest (84.0%), while cultural reasoning (70.2%) and signal-level tasks (72.1%) are hardest. Music theory (61.9%), temporal analysis (57.1%), and audio difference analysis (37.5%) are the most challenging sub-categories, aligning with the low usefulness of temporal and rhythm tools (Section 5.1).

Table 4: Performance by MMAR reasoning layer and sub-category. %Max = reasoning quality given correct answers.

Layer       Sub-category          N     Acc.   Reas.  %Max
Signal      Acoustic Quality      18    88.9   84.1   94.6
            Anomaly Detection     17    70.6   65.5   92.8
            Audio Difference      8     37.5   25.8   68.9
Perception  Correlation           50    84.0   75.2   89.5
            Counting & Stats      99    62.6   54.0   86.2
            Environ. Perc.        149   82.6   75.5   91.4
            Music Theory          63    61.9   49.6   80.2
            Spatial               15    73.3   62.2   84.9
            Temporal              28    57.1   50.5   88.3
Semantic    Content               304   83.9   76.5   91.2
            Emotion & Intent.     60    83.3   77.0   92.4
            Speaker               48    85.4   77.6   90.9
Cultural    Aesthetic Eval.       8     62.5   60.0   96.0
            Culture of Speaker    52    76.9   63.0   81.8
            Imagination           10    70.0   56.0   80.0
            Professional Knowl.   71    66.2   57.8   87.2
Overall                           1000  76.9   68.7*  89.3

* The reasoning score (68.7) differs from our leaderboard score (69.8) because we ran a single evaluation with an LLM judge, whose inherent variability produces minor score fluctuations across runs.

5.1. Tool usage and usefulness

To assess per-tool usefulness, an LLM judge (Claude Opus 4 [36]) independently evaluated each of the 1,000 pipeline debug logs, which contain the full pipeline state (source predictions, tool invocations, reliability evaluations, contradiction traces, and argumentation). For every tool invocation the judge assigned a usefulness score (1–5) and classified its contribution type (direct answer, confirm/deny hypothesis, resolve conflict, or irrelevant). Scores should be interpreted as indicative, as judge ratings were not validated against human annotations. Table 5 reports the resulting per-tool statistics across 1,000 samples.

Table 5: Per-tool statistics across 1,000 evaluation samples (25 enabled tools, 21 observed), scored by an LLM judge. Sorted by average usefulness score (1–5). Contribution roles: Direct = direct answer, Confirm = confirm hypothesis, Deny = deny hypothesis, Resolve = resolve conflict, Irrel. = not useful. Mus. = music-specific tool; Int. = LLM-interpreted output.

Tool                          Model / Toolkit                   Mus./Int.  Samples  Avg   Direct  Confirm  Deny  Resolve  Irrel.  %Irrel.
Speech LLM query (Qwen3)      Qwen3-Omni                                   926      3.51  303     342      78    81       121     13.1%
Transcription                 Canary [27] / Whisper lg-v3                  602      3.46  233     181      61    4        122     20.3%
Diarization + transcription   Whisper lg-v3 + ECAPA-TDNN [30]              572      3.33  177     225      20    15       134     23.5%
Speech LLM query (StepAudio)  StepAudioR1                                  132      2.87  15      67       11    13       27      20.3%
Melody transcription          CREPE [32] + librosa              Y Y        74       2.74  9       30       7     2        26      35.1%
Instrument detection          PANNs [33] (CNN14)                Y Y        213      2.35  8       92       24    4        85      39.9%
Harmonic analysis             librosa (H/P sep., chroma)        Y          199      2.25  7       83       9     4        96      48.2%
Beat & onset detection        librosa (beat track, onset)       Y Y        104      2.23  3       41       6     6        48      46.2%
Environment detection         librosa (RT60) + heuristics                  46       2.04  0       17       3     4        22      47.8%
Synthetic speech detection    AASIST [35] + prosody             Y          5        2.00  0       0        1     2        2       40.0%
Energy dynamics               librosa (RMS, dynamic range)                 784      1.99  6       235      58    20       465     59.3%
Spectral features             librosa (centroid, MFCC)                     736      1.88  4       205      20    8        500     67.8%
Audio quality                 librosa + pyloudnorm (LUFS)       Y          17       1.77  2       3        1     0        11      64.7%
Scene context                 librosa + PANNs [33] + speech                63       1.44  0       17       1     0        45      71.4%
Chord progression             librosa (chroma cqt) + templates  Y Y        6        1.33  0       0        1     0        5       83.3%
Speaker count                 SpeechBrain [31] ECAPA-TDNN                  394      1.26  1       23       12    2        355     90.3%
Event sequence                librosa + PANNs AudioSet          Y          264      1.26  2       22       3     2        235     89.0%
Audio effects                 librosa (reverb, delay, EQ)       Y          4        1.25  0       1        0     0        3       75.0%
Temporal segments             librosa (segment comparison)      Y          166      1.22  1       11       3     1        150     90.4%
Tempo tracking                librosa (windowed beat track)     Y Y        22       1.09  0       0        1     0        21      95.5%
Rhythm analysis               librosa (tempogram) + rules       Y Y        8        1.00  0       0        0     0        8       100.0%

Enabled but not observed during evaluation: source separation (Demucs htdemucs), vocal technique (librosa vibrato/registers), instrument sequence (PANNs AudioSet, windowed), and rhythm patterns (librosa tempogram matching).

Of 25 enabled tools, 21 were invoked at least once. LALM queries and ASR tools rank highest in usefulness (avg. 3.3–3.5), while temporal and rhythm analysis tools show high irrelevance rates (over 89%), suggesting limited utility for the current question distribution.

5.2. Ablation: Multi-source evidence fusion

We run argumentation-level replay ablations using debug logs from the final submission: all upstream artifacts (LALM observations, tool outputs, contradiction traces) are fixed; only the evidence presented to the argumentation model changes. We use McNemar's test with Holm–Bonferroni correction (alpha/2 = 0.025) to assess statistical significance between the results.

Table 6: Ablation results (argumentation-level replay). Δ = accuracy change vs. baseline in pp; Nd = discordant pairs; p = p-value of McNemar's test.

Configuration       Acc.   Δ     Nd    p
Baseline replay     76.6%  -     -     -
Step-Audio-R1 only  72.3%  -4.3  109   < .001
Qwen3-Omni only     73.4%  -3.2  110   .003

Table 6 presents replay ablations testing dual- vs. single-source evidence. Both single-source conditions significantly degrade accuracy.

5.3. Discussion

The ablation results suggest that multi-source fusion gains come not from averaging equally capable models, but from supplementing a strong general-purpose model (Qwen3-Omni) with a specialist that captures complementary observations. The 85 cases where tool evidence overrides both speech LLMs further demonstrate that acoustic tools provide an independent verification channel.

The evidence-based architecture is particularly well suited to the challenge's MMAR-Rubrics evaluation, which scores reasoning chains against instance-specific factuality criteria only when the answer is correct.
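Concretely, the two-gate behaviour of this metric reduces to a few lines; the sketch below is our paraphrase of the protocol described in Section 4.1, not the official scoring code.

```python
def mmar_rubrics_score(answer_correct: bool, criteria_satisfied: list[bool]) -> float:
    """Two-gate scoring: the reasoning score is the fraction of satisfied
    rubric criteria (k = 5 per sample), but only when the answer is correct;
    a wrong answer scores zero regardless of reasoning quality."""
    if not answer_correct:
        return 0.0
    return sum(criteria_satisfied) / len(criteria_satisfied)
```

A correct answer with four of five criteria satisfied thus scores 0.8, while an incorrect answer with a flawless chain still scores 0.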
Table 2 illustrates this alignment: our system places first on Rubrics (69.83) despite slightly lower raw accuracy than Team B (76.9% vs. 77.4%). Three architectural features likely contribute to the reasoning quality advantage noted in Section 1. First, multi-source evidence gathering produces reasoning dense in citable claims: each observation, tool measurement, and corroboration can satisfy a rubric criterion. Second, the seven-section reasoning template (Section 3) enforces completeness by requiring that every piece of evidence is referenced, conflicts are resolved, and each choice is evaluated. Third, using a strong reasoning model probably contributed to the reasoning scores in its own right. The MMAR-Rubrics evaluation criteria were revealed only after the end of the challenge [5]; these architectural choices were not optimized for the rubrics but happen to align with them.

6. Conclusion

We described our Interspeech 2026 Audio Reasoning Challenge solution: an ensemble pipeline that manages unreliable audio observations through reliability-tiered evidence combination, dual-source speech LLM fusion, and contradiction-driven verification, placing first on reasoning quality at competitive accuracy on the MMAR benchmark. Ablations confirm that dual-source fusion provides a statistically significant gain of 3.2–4.3 pp. Example reasoning outputs are available at https://anonymous.4open.science/r/audio-reasoning-solution-3F43/. Future work will investigate scaling beyond multiple-choice settings.

7. Acknowledgments

This research was supported by the Estonian Centre of Excellence in AI (EXAI) and the National Program for Estonian Language Technology (project EKTB104), both funded by the Estonian Ministry of Education and Research, and by the Estonian Language Data Research Infrastructure (KeTA). Some of the experiments were carried out on the TalTech HPC [37].

8.
Generative AI use disclosure

Generative AI tools were used for editing and polishing the manuscript and for the post-hoc LLM-as-judge tool evaluation reported in Table 5. All authors take full responsibility for the content of the paper.

9. References

[1] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, "Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models," arXiv preprint arXiv:2311.07919, 2023.
[2] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, "SALMONN: Towards generic hearing abilities for large language models," in International Conference on Learning Representations (ICLR), 2024.
[3] Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catanzaro, "Audio Flamingo: A novel audio language model with few-shot learning and dialogue abilities," in Proceedings of the 41st International Conference on Machine Learning (ICML), vol. 235. PMLR, 2024, pp. 25125–25148.
[4] Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. Glass, "Listen, think, and understand," in International Conference on Learning Representations (ICLR), 2024.
[5] Z. Ma, R. Xu, Y. Ma, C.-H. H. Yang, B. Li, J. Kim, J. Xu, J. Li, C. Busso, K. Yu, E. S. Chng, and X. Chen, "The Interspeech 2026 audio reasoning challenge: Evaluating reasoning process quality for audio reasoning models and agents," arXiv preprint arXiv:2602.14224, 2026.
[6] "Audio reasoning challenge – Interspeech 2026," https://audio-reasoning-challenge.github.io/, 2026, based on the MMAR benchmark with Chain-of-Thought annotations.
[7] Z. Ma, Y. Ma, Y. Zhu, C. Yang, Y.-W. Chao, R. Xu, W. Chen, Y. Chen, Z. Chen, J. Cong et al., "MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix," in Advances in Neural Information Processing Systems (NeurIPS), 2025.
[8] J. Wang, J. Wang, B. Athiwaratkun, C. Zhang, and J. Zou, "Mixture-of-Agents enhances large language model capabilities," in The Thirteenth International Conference on Learning Representations (ICLR), 2025.
[9] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, "Improving factuality and reasoning in language models through multiagent debate," in Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
[10] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," in International Conference on Learning Representations (ICLR), 2023.
[11] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," in Advances in Neural Information Processing Systems, vol. 36, 2023.
[12] Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin, C. Zhou, and J. Zhou, "Qwen2-Audio technical report," arXiv preprint arXiv:2407.10759, 2024.
[13] J. Xu, Z. Guo, H. Hu, Y. Chu et al., "Qwen3-Omni technical report," arXiv preprint arXiv:2509.17765, 2025.
[14] F. Tian, X. T. Zhang, Y. Zhang, H. Zhang, Y. Li, D. Liu, Y. Deng, D. Wu, J. Chen, L. Zhao et al., "Step-Audio-R1 technical report," arXiv preprint arXiv:2511.15848, 2025.
[15] J. Cheng et al., "AHa-Bench: A comprehensive audio hallucination benchmark for large audio language models," arXiv preprint, 2025.
[16] C.-Y. Kuan, W.-P. Huang, and H.-y. Lee, "Understanding sounds, missing the questions: The challenge of object hallucination in large audio-language models," in Proc. Interspeech 2024, 2024, pp. 4144–4148.
[17] A. Rouditchenko, S. Bhati, E. Araujo, S. Thomas, H. Kuehne, R. Feris, and J. Glass, "Omni-R1: Do you really need audio to fine-tune your audio LLM?" arXiv preprint, 2025.
[18] R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y. Wu, Z. Hong, J. Huang, J. Liu, Y. Ren, Z. Zhao, and S. Watanabe, "AudioGPT: Understanding and generating speech, music, sound, and talking head," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 21, 2024, pp. 23802–23804.
[19] G. Wijngaard, E. Formisano, and M. Dumontier, "AudioToolAgent: An agentic framework for audio-language models," arXiv preprint arXiv:2510.02995, 2025.
[20] Y. Rong, C. Li, D. Yu, and L. Liu, "AudioGenie-Reasoner: A training-free multi-agent framework for coarse-to-fine audio deep reasoning," arXiv preprint arXiv:2509.16971, 2025.
[21] Y. Sun, E. Chang, and Y. Bisk, "Detecting silent errors in faulty tools," in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 14272–14289.
[22] J. Hwang, J. Park, H. Park, D. Kim, S. Park, and J. Ok, "Retrieval-augmented generation with estimation of source reliability," arXiv preprint arXiv:2410.22954, 2024.
[23] M. Turpin, J. Michael, E. Perez, and S. R. Bowman, "Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
[24] M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman et al., "Towards understanding sycophancy in language models," in International Conference on Learning Representations (ICLR), 2024.
[25] Kimi Team, "Kimi K2: Open agentic intelligence," arXiv preprint arXiv:2507.20534, 2025.
[26] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Proceedings of the 40th International Conference on Machine Learning (ICML), vol. 202. PMLR, 2023, pp. 28492–28518.
[27] K. C. Puvvada, Z. Huang, F. Jia, J. Balam, O. Hrinchuk, V. Lavrukhin, B. Ginsburg, and O. Kuchaiev, "Less is more: Accurate speech recognition & translation without web-scale data," arXiv preprint arXiv:2406.19674, 2024.
[28] S. Rouard, F. Massa, and A. Défossez, "Hybrid Transformers for music source separation," in ICASSP 2023 – IEEE International Conference on Acoustics, Speech and Signal Processing, 2023.
[29] H. Bredin, "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe," in Proc. INTERSPEECH 2023, 2023.
[30] B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Proc. Interspeech 2020, 2020, pp. 3830–3834.
[31] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch et al., "SpeechBrain: A general-purpose speech toolkit," arXiv preprint arXiv:2106.04624, 2021.
[32] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, "CREPE: A convolutional representation for pitch estimation," in ICASSP 2018 – IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 161–165.
[33] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.
[34] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference, 2015, pp. 18–24.
[35] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, "AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks," in ICASSP 2022 – IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 6367–6371.
[36] Anthropic, "Claude: Anthropic's AI assistant," https://www.anthropic.com/claude, 2025.
[37] H. Herrmann, T. Kaevand, and L. Anton, "BASE: TalTech's HPC infrastructure 2020–2024," TalTech Data Repository, Mar. 2025.