Conversational Behavior Modeling Foundation Model With Multi-Level Perception


Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high-quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full-duplex spoken dialogue systems.


💡 Research Summary

The paper tackles a fundamental gap in current full‑duplex spoken dialogue systems: while existing models treat conversation as a sequence‑generation problem (next‑segment or next‑dual‑token prediction), they ignore the cognitive chain that humans follow—perceiving high‑level intent, reasoning about it, and then producing an action. To bridge this, the authors propose a three‑stage framework that mirrors human perception‑reasoning‑generation loops: (1) hierarchical speech‑act perception, (2) Graph‑of‑Thoughts (GoT) reasoning, and (3) behavior generation with natural‑language rationales.

Hierarchical Perception
The perception module predicts two levels of labels for every one‑second chunk of audio: a high‑level communicative intent (Constative, Directive, Commissive, Acknowledgment) and a low‑level interaction act (Backchannel, Interruption, Turn‑taking, Continuation). Empirical analysis shows that the conditional distribution of low‑level acts varies strongly with the high‑level intent, confirming that modeling this hierarchy captures genuine causal structure. The module is strictly causal and streaming: it only accesses the current and past audio blocks, never future information.
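The strictly causal, two-level interface described above can be sketched as follows. This is a hypothetical stand-in, not the paper's model: the label names come from the summary, but the classifier logic is a placeholder heuristic (a real system would use a neural acoustic model), and the conditioning of the low-level act on the high-level intent is illustrated with a fixed lookup table.

```python
# Label sets taken from the summary; everything else is an illustrative assumption.
INTENTS = ["Constative", "Directive", "Commissive", "Acknowledgment"]
ACTS = ["Backchannel", "Interruption", "Turn-taking", "Continuation"]

class CausalPerceiver:
    """Predicts (intent, act) for each 1-second chunk using only past audio."""

    def __init__(self):
        self.history = []  # past chunks only; no lookahead is ever stored

    def step(self, chunk):
        """Consume one 1-second chunk; return the two-level label for it."""
        self.history.append(chunk)  # context = current chunk + all past chunks
        intent = self._predict_intent(self.history)
        act = self._predict_act(intent)  # low-level act conditioned on intent
        return intent, act

    def _predict_intent(self, ctx):
        # Placeholder heuristic: a simple statistic of the current chunk
        # selects an intent. A real model would classify acoustic features.
        return INTENTS[int(sum(ctx[-1])) % len(INTENTS)]

    def _predict_act(self, intent):
        # The conditional distribution of acts varies with the intent; here
        # that dependency is reduced to a deterministic table for clarity.
        conditional = {
            "Acknowledgment": "Backchannel",
            "Directive": "Turn-taking",
            "Commissive": "Continuation",
            "Constative": "Continuation",
        }
        return conditional[intent]
```

The key structural point is that `step` only ever appends to `history`: the prediction for chunk *t* cannot depend on chunks after *t*, mirroring the streaming constraint.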

Graph‑of‑Thoughts Reasoning
At each time step the detected high‑ and low‑level acts become nodes in a dynamic graph. Edges encode temporal adjacency and inferred causal links, forming a sliding‑window graph that reflects the evolving “chain of thought” in the dialogue. A transformer‑based encoder‑decoder processes this graph to (a) predict the next behavior (both intent and act) and (b) generate a concise natural‑language rationale. The rationale is not post‑hoc; during training it is supervised by human‑verified “anchor” sentences that were used to construct the graph, ensuring that explanations are grounded in observable evidence.
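The sliding-window graph construction can be sketched as below. This is an assumed data structure, not the paper's implementation: node and edge encodings, the window size, and the edge types ("temporal" between adjacent steps, "causal" from intent to act within a step) are all illustrative choices based on the summary's description.

```python
from collections import deque

class BehaviorGraph:
    """Sliding-window Graph-of-Thoughts over detected (intent, act) nodes.

    Hypothetical sketch: each time step contributes an intent node and an
    act node; edges encode temporal adjacency between steps and the
    intent->act link within a step. Window size is an assumption.
    """

    def __init__(self, window=5):
        self.steps = deque(maxlen=window)  # (t, intent, act) tuples
        self.edges = []  # (src_node, dst_node, relation)

    def add_step(self, t, intent, act):
        if self.steps:
            prev_t = self.steps[-1][0]
            # temporal adjacency: previous step's act precedes this intent
            self.edges.append(((prev_t, "act"), (t, "intent"), "temporal"))
        # within-step causal link: high-level intent drives low-level act
        self.edges.append(((t, "intent"), (t, "act"), "causal"))
        self.steps.append((t, intent, act))
        # prune edges whose source fell out of the sliding window
        oldest = self.steps[0][0]
        self.edges = [e for e in self.edges if e[0][0] >= oldest]
```

A downstream transformer would consume a serialization of `steps` and `edges` to predict the next (intent, act) node and emit its rationale; that decoding step is omitted here.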

ConversationGoT‑120h Dataset
To train such a system, the authors built a 120‑hour hybrid corpus. Textual dialogues are generated with GPT‑4o, first creating rich speaker profiles, then a sequence of 8‑10 topics per conversation, split into joint and speaker‑driven topics. For each 1‑second segment, high‑ and low‑level labels and a rationale (derived from anchor sentences) are manually verified. Audio is synthesized using 1,166 high‑quality LibriSpeech voices via CosyVoice2, yielding realistic prosody and background noise. The dataset is deliberately causal: labels are assigned without reference to future segments, preventing leakage.
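A per-segment record in such a corpus might look like the sketch below. The field names and the causality check are assumptions based on the summary (1-second segments, two-level labels, anchor-derived rationales, no reference to future segments); the paper's actual schema may differ.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One 1-second labeled segment (hypothetical schema)."""
    start_sec: int   # segment onset within the dialogue
    speaker: str     # synthesized voice identity
    intent: str      # high-level: Constative / Directive / Commissive / Acknowledgment
    act: str         # low-level: Backchannel / Interruption / Turn-taking / Continuation
    rationale: str   # derived from human-verified anchor sentences

def is_causally_ordered(segments):
    """Sanity check: segments arrive in strictly increasing time order,
    so a label at time t could not have been written with future context."""
    times = [s.start_sec for s in segments]
    return all(a < b for a, b in zip(times, times[1:]))
```

A loader would stream these records in order, which is what makes the no-leakage property checkable at ingestion time rather than something trusted on faith.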

Training and Evaluation
Training proceeds in two stages. First, the hierarchical perception model is trained to output the two‑level labels from audio. Second, the GoT reasoning module is trained to construct the graph, predict the next node, and generate the rationale. Experiments on both synthetic dialogues (from ConversationGoT‑120h) and real‑world full‑duplex corpora (e.g., ARS‑2025) demonstrate substantial gains: behavior detection accuracy improves by 7–12 percentage points over strong baselines, and rationale quality scores (BLEU‑4 and human judgments) are markedly higher. Visualizations of the inferred graphs show that the model captures realistic intent‑action patterns, such as Acknowledgment frequently leading to Backchannel.

Limitations and Future Work
The authors acknowledge several constraints: (1) a fixed 1‑second granularity may be too coarse for ultra‑fast interruptions; (2) down‑mixing two‑channel audio to mono loses speaker‑identification cues important for multi‑speaker settings; (3) using a large language model (GPT‑5) for rationale generation incurs latency and computational cost, limiting deployment on low‑power devices. Future research directions include adaptive time‑step granularity, multi‑channel graph construction, and lightweight rationale generators.

Impact
By shifting the focus from “what to say next” to “why to say it next,” the paper introduces a new paradigm for conversational AI that is both interpretable and causally grounded. The combination of hierarchical perception, causal graph reasoning, and a meticulously curated dataset provides a solid foundation for building real‑time, explainable full‑duplex agents. Potential applications span human‑AI collaborative dialogue, live customer support, and socially aware robots, where understanding and articulating the reasoning behind a response is as crucial as the response itself.

