The ICASSP 2026 HumDial Challenge: Benchmarking Human-like Spoken Dialogue Systems in the LLM Era


Driven by the rapid advancement of Large Language Models (LLMs), particularly Audio-LLMs and Omni-models, spoken dialogue systems have evolved significantly, progressively narrowing the gap between human-machine and human-human interactions. Achieving truly "human-like" communication necessitates a dual capability: emotional intelligence to perceive and resonate with users' emotional states, and robust interaction mechanisms to navigate the dynamic, natural flow of conversation, such as real-time turn-taking. Therefore, we launched the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026 to benchmark these dual capabilities. Anchored by a sizable dataset derived from authentic human conversations, this initiative establishes a fair evaluation platform across two tracks: (1) Emotional Intelligence, targeting long-term emotion understanding and empathetic generation; and (2) Full-Duplex Interaction, systematically evaluating real-time decision-making under "listening-while-speaking" conditions. This paper summarizes the dataset, track configurations, and the final results.


💡 Research Summary

The paper introduces the first Human‑like Spoken Dialogue Systems Challenge (HumDial) held at ICASSP 2026, aiming to benchmark two fundamental capabilities required for truly human‑like interaction in the era of large language models (LLMs): emotional intelligence and full‑duplex interaction. Leveraging a large corpus of authentic multi‑turn conversations, the challenge defines two tracks.

Track I (Emotional Intelligence) evaluates systems on three tasks: (1) Emotion Trajectory Detection – identifying and summarizing emotional changes across 3‑5 turn dialogues; (2) Emotional Reasoning – inferring the causal triggers behind a user’s emotions; and (3) Empathy Generation – producing empathetic responses in both text and speech. The dataset is built by first prompting Gemini 2.5‑pro to generate coherent user thought flows, then having professional actors record the scripts. Six balanced emotion categories are covered. Evaluation combines automated scoring by Qwen3‑Omni‑30B (for trajectory, reasoning, and textual empathy) with human assessment (20 annotators, split evenly between Chinese and English) of emotional appropriateness and audio naturalness for the speech output. The final score is a weighted sum: 0.2 × T1 + 0.2 × T2 + 0.1 × textual empathy + 0.25 × emotional appropriateness + 0.25 × audio naturalness.
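The Track I weighting scheme can be sketched in a few lines of Python. The component scores below are illustrative placeholders (not reported results); the weights are the ones stated above.

```python
# Hypothetical component scores on a 0-5 scale (illustrative values only,
# not actual challenge results).
scores = {
    "trajectory": 4.9,           # T1: Emotion Trajectory Detection
    "reasoning": 4.8,            # T2: Emotional Reasoning
    "text_empathy": 4.0,         # textual empathy (automated scoring)
    "emo_appropriateness": 3.8,  # human-rated emotional appropriateness
    "naturalness": 3.6,          # human-rated audio naturalness
}

# Weights as defined by the challenge: 0.2 / 0.2 / 0.1 / 0.25 / 0.25
weights = {
    "trajectory": 0.2,
    "reasoning": 0.2,
    "text_empathy": 0.1,
    "emo_appropriateness": 0.25,
    "naturalness": 0.25,
}

# Final Track I score is the weighted sum of the five components.
final_score = sum(weights[k] * scores[k] for k in weights)
print(round(final_score, 3))
```

Note how the two human-rated speech dimensions together carry half the weight, reflecting the track's emphasis on spoken, not just textual, empathy.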

Track II (Full‑Duplex Interaction) focuses on real‑time decision‑making when a system must listen while speaking. Two scenarios are defined: Interruption (user‑initiated interventions such as follow‑up questions, negations, repetitions, topic switches, or silence) and Rejection (handling non‑instructional speech like back‑channels, pauses, third‑party speech, or speech directed at others). Scripts are generated with DeepSeek and recorded by actors, ensuring that interruptions occur at semantically meaningful moments rather than random timestamps. Evaluation is performed in standardized Docker containers on NVIDIA RTX A6000 GPUs, measuring three dimensions: Interruption success rate and latency, Rejection success rate and early‑interrupt rate, and overall first‑response delay. The final score combines these as 0.4 × Interruption + 0.4 × Rejection + 0.2 × Delay.
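The Track II combination can be sketched similarly. A caveat: the paper reports latency in seconds, so some normalization into a delay *score* (higher is better) is presumably applied before weighting; the helper below simply assumes all three components are already normalized to [0, 1].

```python
def track2_score(interruption: float, rejection: float, delay_score: float) -> float:
    """Weighted Track II score: 0.4 x Interruption + 0.4 x Rejection + 0.2 x Delay.

    All components are assumed to be pre-normalized to [0, 1]; in particular,
    delay_score is assumed to be a normalized latency score where higher is
    better (the exact normalization is not specified in this summary).
    """
    return 0.4 * interruption + 0.4 * rejection + 0.2 * delay_score

# Illustrative call using the success rates reported for the top system and a
# hypothetical delay score of 0.85 (the actual normalized value is unknown).
overall = track2_score(0.793, 0.722, 0.85)
print(round(overall, 3))
```

The equal 0.4 weights on interruption and rejection mean a system cannot rank highly by being trigger-happy: aggressive barge-in detection that inflates interruption success tends to depress the rejection score symmetrically.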

Results: Over 100 teams registered, 15 submitted valid systems. In Track I, top teams (NJU‑TencentHY, BJTU Unisound, SenseDialog) achieved near‑perfect scores on trajectory detection and reasoning (≈4.9/5), but empathy generation lagged, with average scores around 4.0 for textual empathy and lower for audio naturalness. This indicates that while LLMs excel at logical emotional analysis, they still struggle to synthesize emotionally resonant speech. In Track II, Cookie asr attained the highest overall rank by balancing a strong interruption success rate (79.3 %), respectable rejection performance (72.2 %), and low latency (1.26 s). Badcat achieved the highest interruption success (89.7 %) but lower rejection. Across submissions, rejection scores were consistently lower than interruption scores, highlighting the difficulty of distinguishing background or third‑party speech from actionable user input.

The authors discuss that current Audio‑LLMs, despite their powerful unified understanding‑generation pipelines, need further research on multimodal emotional generation, real‑time turn‑taking prediction, and robust noise‑rejection mechanisms. They release the HumDial datasets publicly and plan follow‑up studies benchmarking commercial and open‑source models against these baselines, inviting the community to refine and extend the benchmark. The challenge thus provides a comprehensive, realistic platform for measuring progress toward truly human‑like spoken dialogue systems.

