PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original paper on arXiv.

Recent advances in duplex speech models have enabled natural, low-latency speech-to-speech interactions. However, existing models are restricted to a fixed role and voice, limiting their ability to support structured, role-driven real-world applications and personalized interactions. In this work, we introduce PersonaPlex, a duplex conversational speech model that incorporates hybrid system prompts, combining role conditioning with text prompts and voice cloning with speech samples. PersonaPlex is trained on a large-scale synthetic dataset of paired prompts and user-agent conversations, generated with open-source large language models (LLMs) and text-to-speech (TTS) models. To evaluate role conditioning in real-world settings, we extend the Full-Duplex-Bench benchmark beyond a single assistant role to multi-role customer service scenarios. Experiments show that PersonaPlex achieves strong role-conditioned behavior, voice-conditioned speech, and natural conversational responsiveness, surpassing state-of-the-art duplex speech models and hybrid large language model-based speech systems in role adherence, speaker similarity, latency, and naturalness.


💡 Research Summary

PersonaPlex introduces a novel approach to full‑duplex conversational speech modeling by combining role‑conditioning text prompts with voice‑cloning audio prompts in a single “Hybrid System Prompt”. Built on the Moshi architecture, the model receives three parallel streams—user audio, agent text, and agent audio—and uses the hybrid prompt to condition the agent on both a desired role (e.g., “customer‑service representative”) and a target speaker identity. The prompt consists of a short voice sample placed on the agent‑audio channel and a textual role description placed on the agent‑text channel, while the user‑audio channel is temporarily replaced by a 440 Hz sine wave to demarcate the prompt region. During training, loss on the prompt segment is masked, and token‑imbalance weighting (0.02 for non‑semantic audio tokens, 0.3 for padded text tokens) stabilizes learning.
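The prompt mechanics above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the 24 kHz sample rate, the token-type labels, and the function names are assumptions; only the 440 Hz frequency, the prompt-loss masking, and the 0.02/0.3 weights come from the summary.

```python
import numpy as np

SAMPLE_RATE = 24_000  # assumed audio rate; the model's actual codec rate may differ

def sine_placeholder(duration_s: float, freq_hz: float = 440.0) -> np.ndarray:
    """440 Hz sine wave filling the user-audio channel over the prompt region."""
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    return np.sin(2 * np.pi * freq_hz * t).astype(np.float32)

# Per-token loss weights from the summary: non-semantic (acoustic) audio
# tokens weighted 0.02, padded text tokens 0.3, everything else 1.0.
WEIGHTS = {"text": 1.0, "semantic_audio": 1.0,
           "acoustic_audio": 0.02, "padded_text": 0.3}

def token_loss_weights(token_types, in_prompt):
    """Weight vector for the training loss; prompt-segment tokens are masked."""
    w = np.array([WEIGHTS[t] for t in token_types], dtype=np.float32)
    w[np.asarray(in_prompt)] = 0.0  # no loss on the hybrid-prompt segment
    return w
```

Masking the prompt region means the model is conditioned on the voice sample and role text without being trained to reproduce them.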

To train the system, the authors generate a massive synthetic corpus. Service‑domain dialogs are created hierarchically: a domain (bank, restaurant, etc.) is sampled, a scenario (refund, information request) is chosen, and a high‑level description is expanded into a two‑speaker transcript using large language models (Qwen‑3‑32B, GPT‑OSS‑120B). For voice data, 26,000 single‑speaker recordings from VoxCeleb, LibriSpeech, LibriTTS, CommonAccent, and Fisher are used as voice prompts; a separate test set of 2,630 samples evaluates speaker similarity. Dialogue audio is synthesized with multi‑speaker TTS models (Dia for service dialogs, Chatterbox for QA) that can clone the provided voice samples and generate realistic turn‑taking, interruptions, and room tone. The final training set totals 1,840 h of service‑oriented dialogs (105,410 conversations) and 410 h of general QA (39,322 conversations).
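The hierarchical sampling step can be sketched as follows. The domain/scenario lists and the prompt template here are illustrative placeholders, not the paper's actual taxonomy, and `llm_generate` stands in for a call to an LLM such as Qwen‑3‑32B or GPT‑OSS‑120B.

```python
import random

# Hypothetical domain -> scenario taxonomy; the real corpus covers far more.
DOMAINS = {
    "bank": ["refund request", "account information"],
    "restaurant": ["reservation change", "menu question"],
}

def sample_dialog_spec(rng: random.Random) -> dict:
    """Hierarchical sampling: domain, then scenario, then an LLM prompt."""
    domain = rng.choice(sorted(DOMAINS))
    scenario = rng.choice(DOMAINS[domain])
    llm_prompt = (f"A customer contacts a {domain} about a {scenario}. "
                  f"Write a two-speaker transcript (AGENT / USER).")
    return {"domain": domain, "scenario": scenario, "llm_prompt": llm_prompt}

def generate_transcript(spec: dict, llm_generate) -> str:
    """llm_generate is any callable mapping a prompt string to a transcript."""
    return llm_generate(spec["llm_prompt"])
```

The resulting transcripts would then be passed to a voice-cloning TTS stage (Dia or Chatterbox in the paper) together with a sampled voice prompt.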

Evaluation is performed on two benchmarks. The existing Full‑Duplex‑Bench (400 QA items) measures knowledge, reasoning, and interaction dynamics such as pause handling, back‑channeling, smooth turn‑taking, and user interruption. To assess fine‑grained role conditioning, the authors introduce Service‑Duplex‑Bench, extending the benchmark with 50 service roles each containing seven targeted questions (total 350). These probe proper‑noun recall, context adherence, handling of unfulfillable requests, and responses to rude customers.
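The structure of Service‑Duplex‑Bench can be sketched as a simple data model. The field names and probe-category labels below are illustrative guesses from the summary's description; only the 50-roles-by-seven-questions arithmetic is from the text.

```python
from dataclasses import dataclass, field

# Probe categories paraphrased from the summary; exact labels may differ.
PROBE_TYPES = ("proper_noun_recall", "context_adherence",
               "unfulfillable_request", "rude_customer")

@dataclass
class ServiceRole:
    role_prompt: str  # textual role description given to the agent
    questions: list = field(default_factory=list)  # seven probes per role

def bench_size(roles) -> int:
    """Total probe count across roles: 50 roles x 7 questions = 350."""
    return sum(len(r.questions) for r in roles)
```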

Results show that PersonaPlex achieves state‑of‑the‑art performance. In human‑rated Dialogue MOS, it scores 3.90 ± 0.15 on Full‑Duplex‑Bench and 3.59 ± 0.12 on Service‑Duplex‑Bench, surpassing Gemini (3.72/3.22) and other baselines (Moshi, Freeze‑Omni, Qwen‑2.5‑Omni). Speaker similarity measured by WavLM‑TDNN cosine similarity reaches 0.57, far above the near‑zero scores of competing models. Turn‑taking metrics (TOR, latency, JSD) indicate lower latency and more natural interruption handling than prior duplex systems. On Service‑Duplex‑Bench, PersonaPlex's average GPT‑4o‑based rating is 4.48, matching the best open‑source model and only slightly below Gemini Live (4.73). An ablation on dataset size shows that even with 25% of the synthetic data performance remains competitive, though the full dataset yields the best voice‑cloning consistency.
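The speaker-similarity metric cited above is a standard cosine similarity between embeddings. A minimal sketch, assuming the embeddings have already been extracted by a speaker-verification model such as WavLM‑TDNN (the extraction itself is not shown):

```python
import numpy as np

def speaker_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings: 1.0 means identical
    direction (same speaker), values near 0 indicate unrelated voices."""
    return float(np.dot(emb_a, emb_b)
                 / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
```

In the paper's setup, one embedding would come from the voice prompt and the other from the generated agent speech; PersonaPlex's reported 0.57 versus near-zero baselines reflects this score averaged over the test set.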

The paper acknowledges limitations: reliance on synthetic data may not capture the full range of human emotional prosody, and the use of a sine‑wave placeholder for the user channel could add unnecessary complexity in production. Moreover, the Service‑Duplex‑Bench focuses on single‑turn probes, leaving multi‑turn, complex service interactions for future work.

In conclusion, PersonaPlex demonstrates that hybrid text‑audio prompting can be seamlessly integrated into a full‑duplex speech‑to‑speech pipeline, enabling zero‑shot voice cloning and fine‑grained role conditioning without sacrificing latency or naturalness. The released checkpoint incorporates additional real conversational data (Fisher corpus) and refined synthetic voice generation, positioning PersonaPlex as the first open model that rivals commercial closed‑source systems in both voice fidelity and role adherence. Future directions include fine‑tuning on real annotated dialogs, post‑training alignment, and integration with external tools such as retrieval or tool‑calling APIs to broaden real‑world applicability.

