FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training


Full-duplex dialog models aim to listen and speak simultaneously, delivering rapid responses to dynamic user input. Among different solutions to full-duplexity, a native solution merges multiple channels in each time step, achieving the lowest latency. However, prevailing designs break down the textual monologue sentences for word-level alignment with audio streams, which degrades language modeling abilities. To help address this issue, we introduce “contiguous monologues”, which are composed of continuous sentences and “waiting” intervals, mimicking human-like cognitive behavior in dialogs. We find a proper training paradigm to be critical for semantically aligning contiguous monologues with audio. To this end, we develop a “dual” training paradigm that alternates the position of the monologues, either leading or trailing the audio, across different training stages. A combination of our contiguous monologue and dual training strategy is applied in developing FLM-Audio, our 7B spoken dialog chatbot with native full-duplexity. As confirmed by experimental results, FLM-Audio achieves superior response qualities and chatting experiences while requiring significantly less training data.


💡 Research Summary

The paper tackles the core challenge of full‑duplex spoken dialogue systems: synchronizing textual monologue generation with audio output while keeping latency low. Traditional full‑duplex approaches fall into two categories. Time‑Division Multiplexing (TDM) interleaves listening, speaking, and monologue tokens in time slices, but the quadratic attention cost of Transformers leads to latencies up to two seconds and limits the maximum generation length, especially as model sizes grow. The alternative, native full‑duplex (as exemplified by Moshi), merges all channels at each time step, avoiding the context‑size explosion and achieving latencies as low as 80 ms. However, Moshi aligns text and audio at the word level, inserting special tokens so that each textual token coincides with its spoken counterpart. This design has two major drawbacks: it requires fine‑grained word‑level timestamps, inflating data‑preprocessing cost and introducing cascading alignment errors; and it does not reflect human conversational behavior, where internal monologues are continuous streams that typically precede spoken output.
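The context-size contrast above can be made concrete with a back-of-envelope sketch. The 12.5 fps frame rate and three channels (listening, speaking, monologue) come from the summary; modeling TDM as costing one sequence position per channel per frame is a deliberate simplification for illustration, not the paper's exact accounting.

```python
# Illustrative comparison of sequence-length growth for TDM-style
# interleaving vs. native channel merging. Numbers are simplified.

FRAMES_PER_SEC = 12.5  # audio frame rate from the summary
CHANNELS = 3           # listening, speaking, monologue

def tdm_positions(seconds):
    """TDM interleaves channels in time slices, so (in this simplified
    model) each frame consumes one position per channel."""
    return int(seconds * FRAMES_PER_SEC) * CHANNELS

def native_positions(seconds):
    """Native full-duplex merges all channels into a single position
    per frame, so context grows linearly with audio duration only."""
    return int(seconds * FRAMES_PER_SEC)

# A 60-second exchange: the merged representation needs 3x fewer
# positions here, and attention cost grows quadratically in that length.
print(tdm_positions(60), native_positions(60))
```

Because Transformer attention scales quadratically with sequence length, even a constant-factor reduction in positions per second translates into a substantially larger cost gap as conversations lengthen.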

To overcome these limitations, the authors introduce “contiguous monologues.” Instead of fragmenting sentences into word‑level tokens, a monologue is represented as an uninterrupted sequence of text tokens (a full sentence or paragraph). The audio channel generates 8 tokens per frame (12.5 fps) using a depth‑Transformer that operates locally on the hidden state of the backbone. When the textual stream finishes before the audio stream, the model emits special waiting tokens until the speech concludes or an interruption occurs. This strategy (1) reduces annotation effort to sentence‑level timestamps only, and (2) preserves the language‑modeling strength of the pretrained LLM, enabling both natural dialog generation and responsive speech.
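The channel layout this describes can be sketched as follows. The `<wait>` marker name, the token placeholders, and the helper function are illustrative assumptions; only the 8-tokens-per-frame rate and the "text finishes first, then waiting tokens" behavior come from the summary.

```python
# Hypothetical sketch of the contiguous-monologue layout: one text token
# per 80 ms frame, eight audio tokens per frame, and <wait> markers on
# the text channel once the (shorter) monologue is fully emitted.

WAIT = "<wait>"              # illustrative name for the waiting token
AUDIO_TOKENS_PER_FRAME = 8   # depth-Transformer outputs per 12.5-fps frame

def layout_frames(text_tokens, num_audio_frames):
    """Pair each audio frame with one text-channel token, padding the
    text channel with <wait> after the monologue ends."""
    frames = []
    for t in range(num_audio_frames):
        text = text_tokens[t] if t < len(text_tokens) else WAIT
        # Placeholder audio token IDs; a real model samples these.
        audio = [f"a{t}_{k}" for k in range(AUDIO_TOKENS_PER_FRAME)]
        frames.append((text, audio))
    return frames

frames = layout_frames(["Hello", ",", "world", "!"], num_audio_frames=6)
# The last two frames carry <wait> on the text channel while speech continues.
```

The key property is that the text channel stays a coherent sentence rather than word fragments pinned to word-level timestamps, so only sentence-level alignment is needed.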

A second key contribution is the “dual training” paradigm. Training proceeds in stages where the monologue either leads the audio (TTS‑style) or trails it (ASR‑style). In the post‑training stage, roughly one million hours of automatically transcribed speech (Chinese via FunASR, English via Whisper) are mixed with high‑quality human‑annotated ASR datasets (Aishell3, Magicdata, Primewords, THCHS‑30). The automatically transcribed portion is down‑sampled by 50 % while the human‑annotated data is up‑sampled fivefold to emphasize accuracy. Each (audio, sentence) pair is tokenized into two formats: (a) TTS‑style, where the listening channel is empty, the monologue occupies the text channel, and speech tokens start two steps after the text; (b) ASR‑style, where speech tokens occupy the listening channel, followed by the monologue in the text channel. Both formats are concatenated with random silences and padded to a uniform length of 8192 tokens.
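The two tokenization formats above can be sketched as three parallel channels per step. The `<pad>`/`<empty>` markers and helper names are assumptions; the two-step text-to-speech offset and the 8192-token target length come from the summary (the random-silence concatenation step is omitted for brevity).

```python
# Sketch of the dual-training formats: (a) TTS-style, monologue leading;
# (b) ASR-style, monologue trailing. Each step is a (listen, text, speak)
# triple. Marker names and shapes are illustrative simplifications.

PAD, EMPTY = "<pad>", "<empty>"
MAX_LEN = 8192  # uniform padded sequence length from the summary

def make_tts_style(text_tokens, audio_tokens):
    """Monologue leads: text starts at step 0, speech two steps later,
    and the listening channel stays empty."""
    steps = max(len(text_tokens), 2 + len(audio_tokens))
    listen = [EMPTY] * steps
    text = (text_tokens + [PAD] * steps)[:steps]
    speak = ([PAD] * 2 + audio_tokens + [PAD] * steps)[:steps]
    return list(zip(listen, text, speak))

def make_asr_style(text_tokens, audio_tokens):
    """Monologue trails: speech fills the listening channel first, then
    the transcript follows on the text channel."""
    steps = len(audio_tokens) + len(text_tokens)
    listen = (audio_tokens + [PAD] * steps)[:steps]
    text = ([PAD] * len(audio_tokens) + text_tokens)[:steps]
    speak = [EMPTY] * steps
    return list(zip(listen, text, speak))

def pad_to_max(rows):
    """Pad a tokenized example to the uniform training length."""
    return rows + [(PAD, PAD, PAD)] * (MAX_LEN - len(rows))
```

Alternating between the two formats is what lets the same model learn both directions of text-audio alignment: text-then-speech (generation) and speech-then-text (transcription).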

The loss function combines weighted cross‑entropy terms for (i) semantic audio tokens, (ii) acoustic audio tokens, (iii) monologue tokens, and (iv) waiting tokens. Empirically effective weights are α₁ = 1 (semantic), α₂ = 0.5 (acoustic), β = 1 (monologue), γ = 0.01 (waiting), a stark contrast to Moshi’s γ = 0.5 and much larger α values. This weighting prevents over‑emphasis on waiting tokens while still encouraging the model to learn asynchronous alignment.
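A minimal sketch of this weighted objective, using the summary's weights. The cross-entropy helper and the per-stream input format are assumptions made so the example is self-contained; a real implementation would operate on batched logit tensors.

```python
import math

# Weights reported in the summary: semantic audio, acoustic audio,
# monologue tokens, and waiting tokens respectively.
ALPHA_SEM, ALPHA_AC, BETA, GAMMA = 1.0, 0.5, 1.0, 0.01

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class (scalar)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def flm_audio_loss(sem, ac, mono, wait):
    """Weighted sum of mean cross-entropies over four token streams.
    Each argument is a list of (logits, target_index) pairs."""
    def mean_ce(pairs):
        return sum(cross_entropy(l, t) for l, t in pairs) / len(pairs)
    return (ALPHA_SEM * mean_ce(sem) + ALPHA_AC * mean_ce(ac)
            + BETA * mean_ce(mono) + GAMMA * mean_ce(wait))
```

With γ two orders of magnitude below β, gradients from the (abundant) waiting tokens barely compete with those from monologue tokens, which is the stated rationale for the small γ.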

Architecturally, FLM‑Audio uses a 7‑billion‑parameter autoregressive LLM (the language component of Qwen‑2) as the backbone, initialized from a multilingual checkpoint (English/Chinese). An RQ‑Transformer depth module processes the concatenated embeddings of text, listening, and speaking tokens at each step, generating eight audio tokens locally without recomputing full‑sequence attention, thereby keeping computational cost linear in time.

Experimental results show that FLM‑Audio matches or exceeds the performance of larger native‑full‑duplex baselines despite being trained on an order of magnitude less audio data (≈1 M h vs. >8 M h). Automatic metrics (MOS for speech naturalness, BLEU for textual quality, WER for ASR accuracy) improve across the board, and human evaluations confirm better perceived responsiveness and coherence. Latency remains at the 80 ms level, confirming that the contiguous monologue representation does not sacrifice real‑time capabilities.

In summary, the paper proposes a novel framework for native full‑duplex spoken dialogue agents that (1) aligns modalities at the sentence level via contiguous monologues, (2) employs a dual training schedule to cover both TTS‑like and ASR‑like scenarios, and (3) demonstrates that these design choices yield data‑efficient, high‑quality, low‑latency conversational agents. The authors also release the FLM‑Audio model and code, inviting the community to build upon this approach for more human‑like, real‑time multimodal AI systems.

