Recent advancements in joint speech-text models have demonstrated great potential for seamless voice interactions. However, existing models face critical challenges: the temporal resolution mismatch between speech tokens (typically 25Hz) and text tokens (approximately 3Hz) dilutes semantic information, high computational costs limit practical deployment, and multimodal training leads to catastrophic forgetting of the text LLM's knowledge. In this work, we introduce Fun-Audio-Chat, a Large Audio Language Model (LALM) that addresses these limitations by adopting two key innovations from our previous work DrVoice. First, we employ the Dual-Resolution Speech Representations (DRSR) architecture: the Shared LLM backbone processes audio at an efficient 5Hz frame rate (achieved through speech token grouping), while the Speech Refined Head (SRH) generates high-quality speech tokens at 25Hz resolution. This dual-resolution design effectively balances computational efficiency (reducing GPU hours by nearly 50%) and speech generation quality. Second, we adopt the Core-Cocktail Training strategy in full supervised fine-tuning, a two-stage training approach with intermediate model merging that mitigates catastrophic forgetting. After Core-Cocktail Training, we introduce Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy capabilities. This multi-stage post-training paradigm enables Fun-Audio-Chat to effectively retain the knowledge of the original text LLM while gaining powerful audio understanding, reasoning, and generation skills. Unlike the majority of recent LALMs, which rely on both large-scale audio-text pre-training and post-training to develop audio capabilities, Fun-Audio-Chat only leverages existing pre-trained models and utilizes extensive post-training. The Fun-Audio-Chat dense 8B and MoE 30B-A3B models achieve competitive performance on Speech-to-Text and Speech-to-Speech generation tasks, ranking top among models of similar scales across multiple Spoken Question Answering benchmarks. Fun-Audio-Chat also achieves competitive to superior performance on Audio Understanding, Speech Function Calling, Speech Instruction-Following and Voice Empathy benchmarks. We further develop Fun-Audio-Chat-Duplex, a full-duplex variant that achieves strong performance on Spoken Question Answering benchmarks and full-duplex interactions. We open-source the Fun-Audio-Chat-8B model checkpoint with its training and inference code, and provide an interactive demo.
The development of spoken dialogue systems is critical to human-computer interaction, as natural human communication inherently relies on verbal exchanges. Recently, Large Language Model (LLM) based spoken dialogue systems, exemplified by systems like GPT-4o (OpenAI, 2024b), demonstrate great potential for seamless and natural voice interactions with users. LLM-based spoken dialogue systems can be generally categorized into cascaded and end-to-end (E2E) systems, with the distinction lying in whether the backbone LLM can directly comprehend speech representations and generate speech outputs.
Many recent E2E models focus on Joint Speech-Text Models (Défossez et al., 2024; Chen et al., 2024a; KimiTeam et al., 2025), where LLMs take speech representations as input and generate both text tokens and speech tokens simultaneously. However, existing joint speech-text models face critical challenges: (1) the temporal resolution mismatch between speech tokens (typically 25Hz) and text tokens (approximately 3Hz) (Chen et al., 2024a) dilutes semantic information and hinders the full utilization of the LLM's core capabilities; (2) continual pre-training and post-training the text-LLM backbone into multimodal models often lead to catastrophic forgetting of the text LLM's knowledge; (3) high computational costs due to high audio frame rates (typically 12.5Hz or 25Hz) limit practical deployment.
In this work, we present Fun-Audio-Chat, a parallel large audio language model (LALM) that extends our previous work DrVoice (Tan et al., 2025) by adopting the Dual-Resolution Speech Representations (DRSR) architecture and scaling it up to significantly larger training datasets (millions of hours of diverse audio data) and larger model scales (dense 8B and MoE 30B-A3B, where 30B-A3B denotes a Mixture-of-Experts model with 30B total parameters and 3B active parameters).
For speech comprehension in Fun-Audio-Chat, we employ a grouping mechanism that maps 25Hz audio tokens to 5Hz speech representations, enabling the shared LLM backbone to process audio at an efficient 5Hz frame rate. During generation, the hidden states from the shared LLM layer are passed in parallel to a Text Head for text token prediction and a Speech Refined Head (SRH) to generate high-quality speech tokens at 25Hz resolution. This dual-resolution design effectively balances computational efficiency (reducing GPU hours by nearly 50%) and speech generation quality.
The majority of recent open-source LALMs and Omni-language-models rely on both large-scale audio-text pre-training (e.g., audio/text unimodal pre-training, audio-text mapping and interleaving pre-training tasks) and post-training to develop strong audio capabilities, such as Kimi-Audio (KimiTeam et al., 2025),
Step-Audio 2 (Wu et al., 2025), MiMo-Audio (Xiaomi, 2025), and Longcat-Flash-Omni (Team, 2025b). In contrast, Fun-Audio-Chat leverages pre-trained models and is trained with a multi-stage post-training paradigm, without large-scale audio-text pre-training (similarly, Audio-Flamingo-3 (Goel et al., 2025) also does not use large-scale audio-text pre-training). After initialization from text-based or vision-language LLMs, the Pre-alignment stage updates the audio encoder, the adapter, and the Speech Refined Head using large-scale speech-text paired data. We then adopt the Core-Cocktail Training strategy proposed in our earlier work DrVoice (Tan et al., 2025), a two-stage supervised fine-tuning procedure with intermediate model merging that mitigates catastrophic forgetting, followed by Multi-Task DPO Training.
• Comprehensive Evaluation and Strong Performance. Our evaluations used benchmarks including OpenAudioBench, VoiceBench (Chen et al., 2024b), UltraEval-Audio, MMAU (Sakshi et al., 2025), MMAU-Pro (Kumar et al., 2025), MMSU (Wang et al., 2025a), multiple speech function calling benchmarks, and VStyle (Zhan et al., 2025). Detailed evaluation results are presented in Section 3.
• Full-Duplex Voice Interaction. We extend Fun-Audio-Chat to a full-duplex variant, Fun-Audio-Chat-Duplex, which supports simultaneous two-way communication. This model achieves competitive performance on Spoken Question Answering benchmarks, suggesting excellent intelligence, and strong performance on full-duplex interaction metrics (Section 3), demonstrating superior capabilities in natural conversation and turn-taking.
To achieve robust audio comprehension, Fun-Audio-Chat employs Whisper-Large-v3 (Radford et al., 2022) as the Speech Encoder to derive continuous representations from user audio inputs. An Adapter module is then applied to reduce the temporal resolution of these features and match their dimensionality to the LLM's hidden space. Given the demonstrated effectiveness of semantic tokens for speech representations (Zhang et al., 2023a; Borsos et al., 2023), particularly their strong correspondence with textual content (Zhang et al., 2023b), we adopt S3Tokenizer (Du et al., 2024a;b; 2025) as the Speech Tokenizer to transform audio waveforms into discrete semantic token sequences $S = [s_0, s_1, \cdots, s_{T-1}]$ ($T$ denotes the sequence length) for the assistant's output. In the reverse process, the Speech Detokenizer leverages speaker-specific embeddings that encode acoustic characteristics like timbre. The Flow Matching model (Lipman et al., 2023) generates Mel-spectrogram representations from these tokens, which are then converted back to audio waveforms using the HiFi-GAN vocoder (Kong et al., 2020).
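To make the role of the Adapter concrete, the following is a minimal sketch (not the released implementation) of a strided-convolution Adapter that downsamples Whisper-Large-v3 features in time and projects them to the LLM hidden size; the downsampling factor and layer choices here are illustrative assumptions.

```python
# Illustrative Adapter sketch: temporal downsampling + projection to LLM hidden size.
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    def __init__(self, enc_dim: int = 1280, llm_dim: int = 4096, downsample: int = 4):
        super().__init__()
        # Strided convolution reduces the temporal resolution of encoder features.
        self.down = nn.Conv1d(enc_dim, enc_dim, kernel_size=downsample, stride=downsample)
        # Linear projection matches the feature dimensionality to the LLM hidden space.
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, enc_dim) continuous features from the speech encoder
        x = self.down(feats.transpose(1, 2)).transpose(1, 2)  # (batch, time/downsample, enc_dim)
        return self.proj(x)                                   # (batch, time/downsample, llm_dim)

adapter = SpeechAdapter()
print(adapter(torch.randn(1, 100, 1280)).shape)  # torch.Size([1, 25, 4096])
```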
To maintain the text capabilities of pretrained text LLMs while supporting cross-modal functionality, Fun-Audio-Chat adopts the Dual-Resolution Speech Representations (DRSR) architecture from our earlier work DrVoice (Tan et al., 2025). This architecture effectively addresses the temporal resolution mismatch between speech tokens (typically 25Hz) and text tokens (approximately 3Hz), improves computational efficiency, and achieves high-quality speech generation.
Speech Token Grouping. To bridge the temporal resolution discrepancy, we apply a grouping technique from DrVoice (Tan et al., 2025) that reduces 25Hz speech tokens to 5Hz representations for the Shared LLM backbone. The grouping transformation is expressed as follows:
$$g_i = \mathrm{Concat}\big(s_{ik}, s_{ik+1}, \cdots, s_{(i+1)k-1}\big), \quad 0 \le i < T/k,$$
where $s_j$ represents individual speech tokens, Concat indicates concatenation, and $k = 5$ is the grouping factor based on the ratio of the speech token frequency (25Hz) to the desired LLM processing frequency (5Hz). This mechanism reduces the sequence length from $T$ to $T/k$, allowing the Shared LLM to operate at a 5Hz frame rate, which substantially reduces computational overhead (yielding approximately 50% reduction in training GPU hours) while retaining the semantic reasoning abilities of the LLM.
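A minimal sketch of this grouping step, assuming token embeddings are concatenated within each group of $k = 5$ and projected back to the LLM hidden size; the projection layer, hidden size, and codebook size are illustrative assumptions.

```python
# Grouping sketch: 25 Hz speech-token embeddings -> 5 Hz inputs for the Shared LLM.
import torch
import torch.nn as nn

k, d_model, vocab = 5, 1024, 6561          # grouping factor; hidden size and codebook size are illustrative
embed = nn.Embedding(vocab, d_model)
group_proj = nn.Linear(k * d_model, d_model)

def group_speech_tokens(speech_tokens: torch.Tensor) -> torch.Tensor:
    """speech_tokens: (batch, T) discrete 25 Hz tokens, T padded to a multiple of k."""
    b, t = speech_tokens.shape
    e = embed(speech_tokens)                  # (b, T, d_model)
    e = e.reshape(b, t // k, k * d_model)     # concatenate k consecutive token embeddings
    return group_proj(e)                      # (b, T/k, d_model): 5 Hz representations

tokens = torch.randint(0, vocab, (2, 125))    # 5 s of 25 Hz tokens
print(group_speech_tokens(tokens).shape)      # torch.Size([2, 25, 1024])
```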
Although grouping facilitates efficient processing, it sacrifices fine-grained acoustic information essential for natural speech synthesis. To compensate for this limitation, Fun-Audio-Chat integrates a specialized Speech Refined Head (SRH) that generates speech tokens at the complete 25Hz resolution. The SRH executes an ungrouping operation: the final hidden state from the Shared LLM, $h_L^{[\mathrm{SLLM}]}$, is first transformed into group-sized embeddings through linear projection:
$$\tilde{h} = \mathrm{Linear}\big(h_L^{[\mathrm{SLLM}]}\big) \in \mathbb{R}^{k d},$$
which is followed by decomposition into $k$ segments:
$$H = \big[\tilde{h}_1, \tilde{h}_2, \cdots, \tilde{h}_k\big],$$
where $\tilde{h}_i \in \mathbb{R}^{d}$ denotes the $i$-th segment and $d$ is the hidden dimension. The resulting $H$ provides conditional context for the SRH, which generates speech tokens autoregressively at 25Hz. The training objective optimizes speech token prediction:
$$\mathcal{L}_{\mathrm{SRH}} = -\sum_{i} \log P\big(s_i \mid s_{<i}, H\big),$$
where $s_i$ denotes the $i$-th speech token. This dual-resolution framework allows Fun-Audio-Chat to simultaneously achieve computational efficiency (5Hz processing in the Shared LLM Layer) and high-fidelity speech synthesis (25Hz generation through the SRH), following the design principles established in DrVoice (Tan et al., 2025).
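A corresponding sketch of the ungrouping step and the SRH objective, under the same illustrative assumptions (layer sizes and codebook size are placeholders, and the SRH logits are stubbed out).

```python
# Ungrouping sketch: 5 Hz Shared-LLM hidden states -> 25 Hz conditioning for the SRH.
import torch
import torch.nn as nn
import torch.nn.functional as F

k, d_model, vocab = 5, 1024, 6561
ungroup = nn.Linear(d_model, k * d_model)

def ungroup_hidden(h_sllm: torch.Tensor) -> torch.Tensor:
    """h_sllm: (batch, T/k, d_model) final hidden states of the Shared LLM."""
    b, n, _ = h_sllm.shape
    return ungroup(h_sllm).reshape(b, n * k, d_model)   # (batch, T, d_model): conditioning H at 25 Hz

def srh_loss(logits: torch.Tensor, speech_tokens: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood over 25 Hz speech tokens, i.e. -sum_i log P(s_i | s_<i, H)."""
    return F.cross_entropy(logits.reshape(-1, vocab), speech_tokens.reshape(-1))

h = torch.randn(2, 25, d_model)                 # 5 s of 5 Hz hidden states
cond = ungroup_hidden(h)                        # (2, 125, 1024) conditioning for the SRH
logits = torch.randn(2, 125, vocab)             # placeholder SRH outputs
print(srh_loss(logits, torch.randint(0, vocab, (2, 125))))
```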
The MLLM architecture extends pretrained text-LLMs to support unified audio-text processing, enabling the model to handle either speech or text inputs and generate simultaneous speech and text outputs.
Fun-Audio-Chat is a Parallel Joint Speech-Text Model. Following the approach in Moshi (Défossez et al., 2024), we integrate explicit text streams to provide semantic guidance for speech generation. Our design concentrates modality alignment solely on the assistant side, reflecting the inherent asymmetry in human-computer dialogue: users typically provide single-modality inputs (text or speech), while assistants can deliver coordinated multimodal responses (that is, a joint speech-text response or a text-only response).
The model exploits the autoregressive nature of LLMs by iteratively incorporating both speech tokens $s_t$ and text tokens $t_t$ into the Shared LLM Layer at each step. These token embeddings are combined through addition to create a unified input representation. The composite embedding $c_t$ at step $t$ is formulated as:
$$c_t = E_{\mathrm{speech}}(s_t) + E_{\mathrm{text}}(t_t),$$
where $E_{\mathrm{speech}}$ and $E_{\mathrm{text}}$ represent the embedding functions for speech and text tokens, respectively. To handle the length mismatch between speech and text sequences, we pad the shorter sequence with a special silence token <|SIL|> for each utterance.
The generation follows an autoregressive pattern:
$$P(y \mid x) = \prod_{t} P\big(y_t \mid y_{<t}, x\big),$$
where $x$ denotes the input and $y_t = (s_t, t_t)$ represents the combined speech-text output at step $t$. This formulation unifies speech and text generation within one autoregressive process.
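The sketch below illustrates this parallel formulation with a composite embedding and two output heads; the Shared-LLM interface, vocabulary sizes, and greedy decoding are illustrative assumptions rather than the released implementation.

```python
# Parallel joint speech-text sketch: summed embeddings in, (speech, text) pair out per step.
import torch
import torch.nn as nn

d_model, text_vocab, speech_vocab = 1024, 32_000, 6561   # illustrative sizes
E_text = nn.Embedding(text_vocab, d_model)
E_speech = nn.Embedding(speech_vocab, d_model)
text_head = nn.Linear(d_model, text_vocab)
speech_head = nn.Linear(d_model, speech_vocab)           # stands in for the SRH interface

def composite_embedding(s_t: torch.Tensor, t_t: torch.Tensor) -> torch.Tensor:
    """c_t = E_speech(s_t) + E_text(t_t); the shorter stream is padded with <|SIL|> upstream."""
    return E_speech(s_t) + E_text(t_t)

def decode_step(shared_llm, c_t: torch.Tensor, cache=None):
    """One autoregressive step producing y_t = (s_t, t_t) from the shared hidden state."""
    h, cache = shared_llm(c_t, cache)        # hypothetical Shared-LLM forward returning (hidden, kv-cache)
    s_next = speech_head(h).argmax(-1)       # greedy choice, for illustration only
    t_next = text_head(h).argmax(-1)
    return (s_next, t_next), cache
```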
Fun-Audio-Chat leverages existing pre-trained models and is trained with a multi-stage post-training pipeline, utilizing millions of hours of speech data spanning diverse domains and tasks, including conversational and multilingual speech and audio for understanding tasks, ensuring comprehensive coverage of various scenarios and use cases. The training data combines open-source data following the training setups of DrVoice (Tan et al., 2025) and Audio-Flamingo-3 (Goel et al., 2025), along with in-house text, ASR, TTS, audio understanding, speech instruction-following, and voice empathy data.
The multi-stage training pipeline includes: (1) Pre-alignment uses large-scale speech-text paired data for aligning the Speech Encoder, the Adapter, and the Speech Refined Head;
(2) Core-Cocktail Training, for supervised full fine-tuning, employs high-quality speech data synthesized from billions of text tokens using CosyVoice 3 (Du et al., 2025) and selected by thresholding on the synthesis Word Error Rate (WER); a minimal filtering sketch is shown after this list;
(3) Multi-Task DPO Training employs diverse real speech data for robustness enhancement, audio understanding and ASR data for comprehension capabilities, instruction-following data (including emotion, style, and prosody control) for speech instruction-following capabilities, and voice empathy data for emotion understanding and empathetic response generation capabilities. This training pipeline is carefully designed to progressively enhance the model’s audio comprehension, reasoning, and generation capabilities, while retaining the text capabilities of the backbone LLM.
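As referenced in step (2), a minimal sketch of WER-threshold filtering, assuming an ASR transcript of each synthesized utterance is available; the jiwer-based scoring and the threshold value are illustrative, not the authors' exact tooling.

```python
# Data selection sketch: keep a synthesized (text, audio) pair only if the
# ASR transcript of the synthesized audio matches the source text closely enough.
import jiwer

def keep_sample(source_text: str, asr_transcript: str, max_wer: float = 0.05) -> bool:
    """Return True if the synthesis WER is at or below the acceptance threshold."""
    return jiwer.wer(source_text, asr_transcript) <= max_wer

print(keep_sample("turn on the kitchen lights", "turn on the kitchen lights"))  # True
print(keep_sample("turn on the kitchen lights", "turn off the kitchen light"))  # False at a 5% threshold
```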
The training process begins with proper initialization of model components. The Speech Encoder is initialized with the weights of Whisper-Large-v3 (Radford et al., 2022;Xu et al., 2025), providing robust voice understanding capabilities. The Shared LLM Layer is initialized using Qwen3-30B-A3B (Yang et al., 2025) or alternatively from Vision-language base models Qwen3-VL-8B (Bai et al., 2025), leveraging the strong semantic understanding capabilities of the pre-trained text LLMs. The pre-trained Speech Tokenizer and Detokenizer from CosyVoice 3 (Du et al., 2025) are employed and kept frozen throughout the entire training process of Fun-Audio-Chat. To establish effective alignment between audio and text modalities, we perform Pre-alignment training using large-scale speech-text pair data to align the Speech Encoder, the Adapter, and the Speech Refined Head before the main training stages. During this pre-alignment stage, the Shared LLM Layer is kept frozen to preserve its pre-trained capabilities.
Core-Cocktail Training. We find that multimodal model training faces a fundamental learning rate trade-off: high learning rates risk degrading the MLLM performance and exacerbating catastrophic forgetting of the base text-LLM’s knowledge, while low learning rates cause slow convergence and training stagnation. To address this optimization dilemma and prevent knowledge loss, we utilize the Core-Cocktail Training methodology introduced in our earlier work DrVoice (Tan et al., 2025), which employs a two-phase training procedure.
Stage 1: Fine-tuning with High Learning Rate. In this initial phase, we perform full fine-tuning on all MLLM parameters, the Audio Encoder, and the Adapter using an elevated learning rate. For Fun-Audio-Chat, the learning rate is decayed from $1 \times 10^{-4}$ to $1 \times 10^{-5}$ in Stage 1, utilizing a cosine annealing schedule. This stage aims to quickly shift model parameters toward regions of the loss surface that are more conducive to multimodal learning, facilitating rapid task adaptation.
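For concreteness, a cosine-annealing schedule matching the Stage 1 setting (decay from $1 \times 10^{-4}$ to $1 \times 10^{-5}$) can be written as below; the total step count is a placeholder.

```python
# Cosine annealing sketch: lr_max at step 0, lr_min at the final step.
import math

def cosine_lr(step: int, total_steps: int, lr_max: float = 1e-4, lr_min: float = 1e-5) -> float:
    """Cosine decay of the learning rate from lr_max to lr_min over total_steps."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos

print(cosine_lr(0, 10_000), cosine_lr(10_000, 10_000))  # 1e-04 ... 1e-05
```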
Intermediate Model Merging. After Stage 1, the fine-tuned model is merged with the original base text LLM through parameter interpolation:
$$M_r = \alpha \, M_{\mathrm{ft}} + (1 - \alpha) \, M_{\mathrm{base}},$$
where $\alpha$ controls the interpolation balance. This merging operation reintroduces the foundational knowledge from the base LLM, safeguarding the original text understanding capabilities. Lower $\alpha$ values favor stronger retention of the base LLM's knowledge. In our implementation, $\alpha$ is set to 0.5.
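A minimal sketch of this merging step as parameter-wise interpolation with $\alpha = 0.5$; restricting the merge to parameters shared with the base LLM (newly added audio modules are kept from the fine-tuned model) is an assumption for illustration.

```python
# Model merging sketch: M_r = alpha * M_ft + (1 - alpha) * M_base, parameter by parameter.
import torch

def merge_checkpoints(ft_state: dict, base_state: dict, alpha: float = 0.5) -> dict:
    """Interpolate the Stage 1 fine-tuned state dict with the base text-LLM state dict."""
    merged = dict(ft_state)  # modules absent from the base model are kept from the fine-tuned model
    for name, base_param in base_state.items():
        if name in ft_state and ft_state[name].shape == base_param.shape:
            merged[name] = alpha * ft_state[name] + (1.0 - alpha) * base_param
    return merged
```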
Stage 2: Refinement with Low Learning Rate. Stage 2 applies full fine-tuning to the merged model $M_r$ with a reduced learning rate. For Fun-Audio-Chat, the learning rate is decayed from $1 \times 10^{-5}$ to $1 \times 10^{-6}$ in Stage 2, also utilizing a cosine annealing schedule. This enables stable, precise optimization that improves model performance without the instability associated with high learning rates. The Core-Cocktail Training strategy successfully reconciles fast adaptation with knowledge retention, substantially mitigating catastrophic forgetting while promoting effective multimodal learning. Fun-Audio-Chat supports a maximum context length of 2048 tokens (approximately 6 minutes of speech), sufficiently facilitating typical conversational interactions.

Multi-Task DPO Training. After Core-Cocktail Training, we apply Direct Preference Optimization (DPO) (Rafailov et al., 2023) to enhance the model's robustness to real speech data, audio understanding abilities, speech instruction-following and voice empathy capabilities. The Multi-Task DPO Training stage incorporates multiple preference learning objectives: (1) robustness preference: preferring responses that maintain quality under noisy or diverse speech inputs; (2) instruction-following preference: preferring responses that accurately follow voice instructions, including emotion, style, and prosody control; (3) audio understanding preference: preferring responses that demonstrate accurate comprehension of audio content; and (4) voice empathy preference: preferring responses that show appropriate emotional understanding and empathetic responses. The DPO training loss is computed across these multiple preference dimensions, allowing the model to learn a unified preference signal that balances all these capabilities. This multi-task DPO training stage enables the model to better align with human preferences and improve performance on real-world conversational scenarios, distinguishing Fun-Audio-Chat from previous works that primarily rely on supervised fine-tuning.
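A hedged sketch of such a multi-task preference objective, using the standard DPO loss (Rafailov et al., 2023) averaged over the four preference dimensions; the per-task batching and equal weighting are assumptions, not the authors' exact recipe.

```python
# Multi-task DPO sketch: pool the standard DPO loss over several preference dimensions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO loss on sequence log-probs of chosen vs. rejected responses."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()

def multi_task_dpo_loss(batches_by_task: dict) -> torch.Tensor:
    """Average the DPO loss over robustness / instruction-following / understanding / empathy batches."""
    losses = [dpo_loss(*batch) for batch in batches_by_task.values()]
    return torch.stack(losses).mean()

b = tuple(torch.randn(4) for _ in range(4))  # placeholder log-prob tensors for one task batch
print(multi_task_dpo_loss({"robustness": b, "instruction_following": b,
                           "audio_understanding": b, "voice_empathy": b}))
```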
To enable real-time full-duplex voice interaction, we introduce a parallel speech-text input stream architecture and extend Fun-Audio-Chat to a full-duplex variant, Fun-Audio-Chat-Duplex, which supports natural, human-like conversations with seamless two-way communication. Specifically, the parallel speech-text input stream architecture allows the model to accept user speech while the assistant is generating speech, effectively utilizing time slots that would otherwise be idle. The parallel input stream is designed to handle both user and assistant speech inputs simultaneously, enabling the model to process overlapping speech segments and maintain conversation context. Full-duplex Interaction Training continues from the checkpoint resulting from the Core-Cocktail Training stage, building upon the multimodal capabilities that the model has already acquired. Full-duplex training uses full-duplex conversation data synthesized by augmenting high-quality half-duplex dialogue datasets with simulated full-duplex interaction behaviors, following the data synthesis approach in OmniFlatten (Zhang et al., 2025). This approach transforms traditional turn-based text dialogues into concurrent dual-stream interactions, in which both user and assistant can speak simultaneously. Full-duplex training allows the model to learn natural turn-taking, interruption handling, and backchanneling behaviors.
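As an illustration only (the interface below is an assumption, not the released design), a parallel input stream can be realized by summing, at each step, the embedding of the incoming user speech token with the embeddings of the assistant's own speech and text tokens, so that listening and speaking can overlap.

```python
# Full-duplex input sketch: combine a user speech stream with the assistant's output streams per step.
import torch
import torch.nn as nn

d_model, speech_vocab, text_vocab = 1024, 6561, 32_000   # illustrative sizes
E_user_speech = nn.Embedding(speech_vocab, d_model)
E_asst_speech = nn.Embedding(speech_vocab, d_model)
E_asst_text = nn.Embedding(text_vocab, d_model)

def duplex_step_input(user_s: torch.Tensor, asst_s: torch.Tensor, asst_t: torch.Tensor) -> torch.Tensor:
    """One-step composite input: user speech token + assistant speech token + assistant text token."""
    return E_user_speech(user_s) + E_asst_speech(asst_s) + E_asst_text(asst_t)

print(duplex_step_input(torch.tensor([1]), torch.tensor([2]), torch.tensor([3])).shape)  # (1, 1024)
```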
Evaluation Datasets. Following prior works (Yao et al., 2024;KimiTeam et al., 2025), we evaluate the performance of Fun-Audio-Chat comprehensively on the widely used benchmarks:
• Speech-to-Text (S→T) Evaluation. We use Spoken Question Answering benchmarks to evaluate the model's ability to understand speech inputs and generate both text and speech responses, covering both S→T and S→S evaluation. For S→T evaluation, we use VoiceBench (Chen et al., 2024b) and OpenAudioBench. VoiceBench encompasses AlpacaEval, CommonEval, SD-QA, MMSU, OpenBookQA, IFEval, and AdvBench, providing comprehensive evaluation across instruction-following, general knowledge, safety alignment, and robustness to real-world variations. In contrast, OpenAudioBench includes multiple sub-tasks including AlpacaEval (Li et al., 2023), Llama Q., Reasoning QA, TriviaQA, and Web Q., covering diverse Spoken Question Answering scenarios, with more focus on general knowledge and reasoning and less emphasis on robustness.
• Speech-to-Speech (S→S) Evaluation. We use UltraEval-Audio, which includes AlpacaEval, Llama Q., TriviaQA, and Web Q., for end-to-end Speech-to-Speech Question Answering evaluation.
• Audio Understanding. We evaluate on audio understanding benchmarks including MMAU (Sakshi et al., 2025), MMAU-Pro (Kumar et al., 2025), and MMSU (Wang et al., 2025a) for comprehensive audio comprehension capabilities. These benchmarks focus on different aspects of audio understanding: MMAU is a generalist benchmark covering the “Big Three” audio domains (Speech, Music, Sound) with a focus on complex reasoning; MMAU-Pro is an advanced-scenario benchmark that stresses models with “wild” conditions like long-form audio, spatial audio, and overlapping sounds; MMSU is a speech specialist benchmark grounded in linguistic theory, focusing deeply on the nuances of spoken language (intonation, emotion, prosody) rather than general environmental sounds or music.
• Speech Recognition. We evaluate ASR performance on the widely used Librispeech (Panayotov et al., 2015) for English (EN) ASR and Common Voice (Ardila et al., 2020) for English and Mandarin (ZH) ASR.
• Speech Function Calling. We evaluate on Speech-ACEBench, Speech-BFCL, and Speech-SmartInteract to assess the model's ability to execute function calls based on speech instructions. These three benchmarks focus on different aspects of speech function calling: Speech-ACEBench is derived from the text-based ACEBench (Chen et al., 2025) and contains Mandarin speech recorded by human speakers; it covers both single and parallel function calling scenarios, with particular emphasis on cases where functions take nested (deep) object-type arguments. Speech-BFCL is derived from BFCL (Patil et al., 2025) and consists of English data synthesized with TTS; it also targets single and parallel function calling, focusing on TTS-generated English interactions. Speech-SmartInteract is a purpose-built, TTS-synthesized Mandarin speech dataset designed specifically for speech-first interactive use; rather than merely voicing a text-based benchmark, it better reflects the characteristics of real spoken interactions in practical voice assistant settings.
• Speech Instruction-Following and Voice Empathy. We use the VStyle benchmark (Zhan et al., 2025) to evaluate the model's ability to understand and execute voice instructions for controlling speech generation attributes such as emotion, speaking style, speed, pitch, and volume. We also use an internal test set to assess the model's speech instruction-following and voice empathy capabilities, including understanding emotional context and responding with appropriate empathetic expressions.
Evaluation Metrics. Evaluations adhere to the established protocols for each respective benchmark.
For S โ T and S โ S evaluations on Spoken Question Answering benchmarks, we use different metrics depending on the task type: (1) Accuracy is used for close-ended QA tasks including Llama Q., Reasoning QA, TriviaQA, Web Q., SD-QA, MMSU, OpenBookQA, and IFEval;
(2) G-Eval (Liu et al., 2023) is used for open-ended QA tasks including AlpacaEval (Li et al., 2023) and CommonEval, which employs LLM-based evaluation to assess response quality; (3) Refusal Rate is reported for AdvBench to measure safety compliance.
Additionally, for Speech Quality evaluation, the generated speech is transcribed using the Whisper-Large-v3 model (Radford et al., 2022), and ASR-WER (Word Error Rate of the ASR transcripts against the model-generated text) is used to assess the alignment between the generated speech and text. UTMOS (Saeki et al., 2022) is used to evaluate the overall speech quality. For Audio Understanding tasks (MMAU, MMAU-Pro, MMSU), Accuracy is used to measure the model's comprehension capabilities across diverse audio understanding scenarios. For Speech Recognition tasks (Librispeech, Common Voice), Word Error Rate (WER) is reported.
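A minimal sketch of the ASR-WER computation, assuming the generated speech has already been transcribed by an ASR model (Whisper-Large-v3 in the setup above); the text normalization here is illustrative rather than the exact protocol.

```python
# ASR-WER sketch: score the ASR transcript of generated speech against the model's own text output.
import jiwer

def normalize(text: str) -> str:
    """Simple lowercasing and punctuation stripping, for illustration only."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())

def asr_wer(model_text: str, asr_transcript: str) -> float:
    """Word Error Rate between the (normalized) model text and ASR transcript."""
    return jiwer.wer(normalize(model_text), normalize(asr_transcript))

print(asr_wer("The capital of France is Paris.", "the capital of france is paris"))  # 0.0
```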
For Speech Function Calling tasks (Speech-ACEBench, Speech-BFCL, Speech-SmartInteract), Accuracy is used to measure the percentage of correctly executed function calls.
For Speech Instruction-Following and Voice Empathy tasks, on the VStyle benchmark we use Large Audio Language Model (LALM) evaluation scores on a 1-5 scale across multiple dimensions: acoustic attributes (age, speed, gender, emotion, pitch, volume), instruction following (emotion, style, variation), role-play (scenario, character), and empathy (anger, sadness, anxiety, joy). For evaluations on our internal test set, similar to the VStyle benchmark, we use an LALM as judge to evaluate the model's performance on Speech Instruction-Following, Semantics-based Empathy, and Paralinguistic-Cue-based Empathy.
Semantics-based Empathy refers to the empathy capability that can be judged solely based on text semantics, while Paralinguistic-Cue-based Empathy refers to the empathy capability that requires using Paralinguistic Cues to judge and cannot be judged solely from text semantics.
For Full-Duplex Interaction evaluation, we use S2M-T (the text output accuracy in multimodal response) and S2M-S (the speech output accuracy in multimodal response) to measure the knowledge understanding performance, and the Turn-taking Success Rate to measure the percentage of interactions where the model correctly handles turn-taking in full-duplex scenarios.
Baselines. We select representative and competitive models as baselines to ensure comprehensive comparisons across different model sizes and model architectures. For around-8B dense models, we compare Fun-Audio-Chat-8B with open-source Large Audio Language Models (LALMs) including GLM-4-Voice (9B) (Zeng et al., 2024), MiniCPM-o 2.6 (7B) (Yao et al., 2024), Baichuan-Omni-1.5 (7B) (Li et al., 2025b), Kimi-Audio (7B) (KimiTeam et al., 2025), Step-Audio2-Mini (7B) (Wu et al., 2025), and MiMo-Audio (7B) (Xiaomi, 2025). For large-scale models, we compare Fun-Audio-Chat-30B-A3B with the open-source Longcat-Flash-Omni-Instruct (560B-A27B) (Team, 2025b) and the closed-source GPT-Audio (OpenAI, 2024b) and Gemini-2.5-Pro (Team, 2025a). For audio understanding tasks, we additionally compare with the open-source Audio-Flamingo-3 (Goel et al., 2025) alongside Kimi-Audio, Step-Audio2-Mini, and MiMo-Audio. For speech instruction-following and voice empathy tasks (VStyle benchmark), we compare with the open-source Baichuan-Audio (Li et al., 2025a) and Kimi-Audio, and add the closed-source GPT-4o (OpenAI, 2024b) and Doubao into the baselines for comprehensive comparison with both open-source and commercial models. For full-duplex interaction evaluation, we compare Fun-Audio-Chat-Duplex with the open-source Moshi (Défossez et al., 2024) and FreezeOmni (Wang et al., 2025b).
In summary, the selected baselines cover diverse modeling paradigms (Text-Driven vs. Joint Speech-Text, interleaved vs. parallel architectures) and model scales, enabling systematic comparisons across mainstream speech-text modeling strategies and providing comprehensive evaluation of Fun-Audio-Chat’s capabilities across different task categories.
Accuracy. Fun-Audio-Chat demonstrates strong performance on spoken question answering tasks. Table 1 compares Fun-Audio-Chat-30B-A3B with large-scale baselines, including GPT-Audio, Gemini-2.5-Pro, and Longcat-Flash-Omni-Instruct. Table 2 compares Fun-Audio-Chat-8B with Kimi-Audio,
Step-Audio2-Mini, and other similarly-scaled open-source models. As shown in Table 1 and Table 2, Fun-Audio-Chat achieves competitive performance among similarly-scaled models (8B and 30B-A3B parameters). Specifically, Fun-Audio-Chat-8B achieves the best overall performance on OpenAudioBench (76.61%) and VoiceBench (83.21%) among ~8B-scale models, while Fun-Audio-Chat-30B-A3B achieves competitive results compared to large-scale baselines, including top-tier closed-source models.
Speech Quality. We evaluate the speech quality of Fun-Audio-Chat-8B on UltraEval-Audio using UTMOS for the overall speech quality and ASR-WER for alignment between the generated speech and text. On the Llama Q. test set, Fun-Audio-Chat-8B achieves a UTMOS score of 4.37, indicating excellent overall speech quality, and an ASR-WER of 4.32%, demonstrating strong alignment between the generated speech and the corresponding text outputs. These results demonstrate that the dual-resolution architecture maintains high-quality speech generation despite operating at the efficient 5Hz frame rate, validating the effectiveness of the Dual-Resolution Speech Representations (DRSR) architecture in balancing efficiency and speech quality.

For Audio Understanding, we compare Fun-Audio-Chat with baselines including Kimi-Audio (KimiTeam et al., 2025), Audio-Flamingo-3 (Goel et al., 2025), MiMo-Audio (Xiaomi, 2025), and Step-Audio2-Mini (Wu et al., 2025). On MMAU, Fun-Audio-Chat-30B-A3B achieves the best performance (77.9%) among all evaluated models, followed by Fun-Audio-Chat-8B (76.6%). On MMAU-Pro, Fun-Audio-Chat-30B-A3B achieves the best result (59.9%), with Fun-Audio-Chat-8B achieving the second-best performance (58.0%). On MMSU, Fun-Audio-Chat-30B-A3B achieves 70.1%, the highest result among all models, followed by Fun-Audio-Chat-8B (67.8%). For speech recognition tasks, Fun-Audio-Chat achieves competitive WERs across multiple datasets in both English (EN) and Mandarin (ZH), demonstrating robust audio comprehension capabilities across diverse domains and languages.
Table 4 presents the performance of Fun-Audio-Chat on speech function calling benchmarks. Fun-Audio-Chat-30B-A3B achieves the highest overall score (79.63%) among all evaluated models, with particularly strong performance on Speech-ACEBench (Single: 76.40%) and Speech-SmartInteract (84.13%). The model demonstrates strong capabilities in understanding speech-based function calling instructions and executing them accurately, which is crucial for building practical voice-controlled applications. The performance on parallel function calling scenarios further highlights Fun-Audio-Chat's ability to handle complex, multi-step instructions in voice interactions: Fun-Audio-Chat-8B reaches 54.50% on ACEBench-Parallel and 87.63% on BFCL-Parallel, outperforming the top-tier closed-source GPT-Audio and Gemini-2.5-Pro on the latter.
We evaluate the full-duplex variant Fun-Audio-Chat-Duplex on two key aspects: knowledge understanding in full-duplex scenarios and objective full-duplex interaction metrics.
Full-Duplex Knowledge Understanding. Table 7 shows the full-duplex knowledge understanding performance of Fun-Audio-Chat-Duplex. The results demonstrate that Fun-Audio-Chat-Duplex maintains strong knowledge understanding capabilities in full-duplex conversation scenarios. Fun-Audio-Chat-Duplex-30B-A3B achieves the highest average performance on both S2M-T (54.89%) and S2M-S (49.28%) metrics, significantly outperforming Moshi (33.17%/29.86%) and FreezeOmni (Wang et al., 2025b) (47.58%/34.49%). On individual benchmarks, Fun-Audio-Chat-Duplex-30B-A3B achieves the highest results on Llama Q. (81.00%/71.33%), AlpacaEval (68.23%/59.65%), and TriviaQA (41.70%/40.04%) for both text and speech outputs. This indicates that the full-duplex architecture successfully preserves the model’s knowledge comprehension abilities while enabling simultaneous two-way communication, allowing the system to maintain context and understanding even when processing overlapping speech inputs and outputs.
Full-Duplex Interaction. Table 7 also presents the turn-taking success rates for full-duplex voice interactions. Fun-Audio-Chat-Duplex-30B-A3B achieves a perfect turn-taking success rate (100.00%), outperforming both Moshi (99.77%) and FreezeOmni (Wang et al., 2025b) (93.87%). Fun-Audio-Chat-Duplex-8B achieves 99.94%, also demonstrating excellent turn-taking capabilities. These results indicate that Fun-Audio-Chat-Duplex enables natural and efficient full-duplex voice interactions, demonstrating the model's ability to handle simultaneous speech and maintain an appropriate conversation flow that closely mirrors the dynamics of human-human conversations.
A key advantage of Fun-Audio-Chat is its computational efficiency, highlighted in Table 1 and Table 2. As shown in the Frame Rate-In/Frame Rate-Out rows, Fun-Audio-Chat operates at a frame rate of 5/5 Hz, indicating that the LLM backbone processes only 5 audio tokens per second for both input and output. This represents a 1.25× to 5× reduction in input frame rate compared to other models, which operate at frame rates ranging from 6.25Hz (MiMo-Audio) to 25Hz (MiniCPM-o 2.6), with most models using 12.5Hz. For output frame rates, Fun-Audio-Chat's 5Hz is significantly lower than other models, which operate at rates of 12.5Hz, 16.67Hz, 25Hz, or higher when including text token generation, e.g., 12.5+ρ for GLM-4-Voice and Baichuan-Omni-1.5 and 25+ρ for Step-Audio2-Mini, where ρ denotes the average number of text tokens per second of speech. The dual-resolution design significantly reduces computational requirements and potential latency, with empirical measurements showing approximately 50% reduction in training GPU hours.
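As a back-of-the-envelope illustration of this gap, the snippet below compares per-minute audio token counts at the frame rates listed above (sequence-length figures only; attention cost grows faster than linearly with length).

```python
# Sequence-length comparison per minute of audio at different LLM frame rates.
for name, hz in [("Fun-Audio-Chat", 5.0), ("MiMo-Audio", 6.25),
                 ("typical 12.5 Hz models", 12.5), ("MiniCPM-o 2.6", 25.0)]:
    tokens_per_min = hz * 60
    print(f"{name:>22}: {tokens_per_min:6.0f} audio tokens/min ({hz / 5.0:.2f}x the 5 Hz budget)")
```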
This report introduces Fun-Audio-Chat, a large-scale Large Audio Language Model (LALM) designed to overcome the limitations of existing joint speech-text models for seamless voice interaction. Fun-Audio-Chat extends our previous work DrVoice (Tan et al., 2025) by adopting one key innovation, Dual-Resolution Speech Representations (DRSR) architecture, at significantly larger scales. The DRSR architecture enables the Shared LLM backbone to process audio at an efficient 5Hz frame rate (Frame Rate-In/Frame Rate-Out: 5/5 Hz) while the Speech Refined Head generates high-quality speech tokens at 25Hz resolution. This dual-resolution design effectively balances computational efficiency (reducing GPU hours by nearly 50%) and speech generation quality.
We open-source the Fun-Audio-Chat-8B model, including the model checkpoint and its training and inference code, and provide an interactive demo, encouraging researchers and practitioners to experience and build upon our work. We believe that Fun-Audio-Chat represents a significant advancement in the field of voice interaction systems, demonstrating that carefully designed large-scale post-training and architectural innovations can significantly enhance the audio comprehension, reasoning, and speech generation capabilities of LALMs while achieving high computational efficiency.
While Fun-Audio-Chat demonstrates strong performance across multiple benchmarks, several limitations remain to be addressed in future work. First, for complex question answering in multi-turn conversations, the model occasionally exhibits memory loss of context, where information from earlier turns may not be consistently retained. This limitation is particularly noticeable in scenarios requiring long-context comprehension and complex reasoning across multiple turns.
Second, speech instruction-following capabilities show some instability in expressiveness. While the model generally performs strongly on voice instruction tasks, there are cases where the generated speech may not fully capture the intended emotional nuances, speaking styles, or prosodic variations specified in the instructions. This variability in expressiveness can affect the naturalness and appropriateness of voice responses in certain contexts.
Third, the voice empathy capabilities demonstrate some instability in performance. Although Fun-Audio-Chat achieves competitive results on empathy evaluation benchmarks (including both Semantics-based Empathy and Paralinguistic-Cue-based Empathy), the model’s ability to consistently recognize and respond with appropriate emotional empathy can vary across different scenarios and emotional contexts. This inconsistency may impact the reliability of empathetic response generation in real-world applications where emotional understanding is critical.
These limitations highlight important directions for future research, including improving long-term context management in multi-turn conversations, enhancing the stability and expressiveness of speech instruction-following, and developing more robust and consistent voice empathy capabilities across diverse emotional scenarios.
All contributors of Fun-Audio-Chat are listed in alphabetical order by their last names.
This work verifies that the two key innovations in DrVoice demonstrate excellent scalability: DRSR, with its efficient 5Hz processing for the backbone LLM and 25Hz generation head, retains high computational efficiency (approximately 50% reduction in training GPU hours) at larger scales; and the Core-Cocktail Training strategy, with its two-stage training using different learning rates and intermediate model merging, effectively mitigates catastrophic forgetting in both the 8B and 30B-A3B models. The large-scale post-training enables Fun-Audio-Chat to achieve superior performance across multiple benchmarks while maintaining the computational efficiency advantages of the dual-resolution design.
• Multi-Task DPO Training. After Core-Cocktail Training, we introduce Multi-Task DPO Training to enhance the capabilities of Fun-Audio-Chat in multiple dimensions: robustness to real speech data, instruction-following, audio understanding, and voice empathy. This training approach enables the model to better align with human preferences and improve performance on real-world conversational scenarios, distinguishing Fun-Audio-Chat from previous works that primarily rely on supervised fine-tuning.
• Comprehensive Evaluation and Strong Performance. Extensive evaluations demonstrate that Fun-Audio-Chat 8B and 30B-A3B achieve superior performance on Spoken Question Answering (on both Speech-to-Text and Speech-to-Speech generation tasks), ranking top among models of similar scales. They also demonstrate competitive capabilities in Audio Understanding, Speech Function Calling, Speech Instruction-Following, and Voice Empathy across a wide variety of commonly used benchmarks.
Table 5 and Table 6 demonstrate that Fun-Audio-Chat achieves strong performance on Speech Instruction-Following and Voice Empathy tasks. As shown in Table 5, Fun-Audio-Chat-30B-A3B and Fun-Audio-Chat-8B demonstrate competitive performance on Speech Instruction-Following across multiple dimensions, including acoustic attributes, instruction following, role-play, and empathy capabilities, in comparison with both open-source and closed-source baselines.