VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings
Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in voice biometrics. Prior work on voice morphing is computationally expensive, non-scalable, and limited to acoustically similar identity pairs, constraining practical deployment. Moreover, existing sound-morphing methods target audio textures, music, or environmental sounds and are not transferable to voice identity manipulation. We propose VoxMorph, a zero-shot framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining. Our method disentangles vocal traits into prosody and timbre embeddings, enabling fine-grained interpolation of speaking style and identity. These embeddings are fused via Spherical Linear Interpolation (Slerp) and synthesized using an autoregressive language model coupled with a Conditional Flow Matching network. VoxMorph achieves state-of-the-art performance, delivering a 2.6x gain in audio quality, a 73% reduction in intelligibility errors, and a 67.8% morphing attack success rate on automated speaker verification systems under strict security thresholds. This work establishes a practical and scalable paradigm for voice morphing with significant implications for biometric security. The code and dataset are available on our project page: https://vcbsl.github.io/VoxMorph/
💡 Research Summary
VoxMorph introduces a scalable, zero‑shot framework for voice identity morphing (VIM) that can generate high‑fidelity morphed speech from as little as five seconds of audio per speaker without any model retraining. The authors first motivate the security relevance of VIM, noting that while morphing attacks are well studied in visual biometrics, voice morphing has received almost no attention beyond a single prior work (Pani et al.) that required extensive speaker‑specific fine‑tuning and was limited to acoustically similar speaker pairs.
The core technical contribution is the explicit disentanglement of a speaker’s voice into two separate embeddings: a prosody embedding that captures speaking style (rhythm, pitch, intonation) and a timbre embedding that encodes the speaker’s unique vocal identity. Prosody is extracted using a GE2E‑style encoder, while timbre is obtained via a CAM++ encoder. By representing each aspect on a hypersphere, the authors avoid the distortion introduced by naïve linear averaging and instead fuse the embeddings of two source speakers independently using Spherical Linear Interpolation (Slerp). The interpolation factor α controls the blend ratio, allowing fine‑grained manipulation of style and identity.
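The Slerp fusion step can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the random 256-dimensional vectors stand in for real GE2E prosody and CAM++ timbre embeddings, and the fallback-to-lerp threshold is an assumption.

```python
import numpy as np

def slerp(e1, e2, alpha):
    """Spherical linear interpolation between two embeddings.

    alpha=0 returns (normalized) e1, alpha=1 returns e2; intermediate
    values trace the great-circle arc on the hypersphere, so the fused
    embedding keeps unit norm instead of collapsing toward the origin
    as naive linear averaging would."""
    e1 = e1 / np.linalg.norm(e1)
    e2 = e2 / np.linalg.norm(e2)
    dot = np.clip(np.dot(e1, e2), -1.0, 1.0)
    omega = np.arccos(dot)            # angle between the embeddings
    if omega < 1e-8:                  # nearly parallel: lerp is safe
        return (1 - alpha) * e1 + alpha * e2
    return (np.sin((1 - alpha) * omega) * e1
            + np.sin(alpha * omega) * e2) / np.sin(omega)

# Prosody and timbre are fused independently; e.g. a 50/50 morph:
rng = np.random.default_rng(0)
p1, p2 = rng.normal(size=256), rng.normal(size=256)  # stand-in embeddings
morphed = slerp(p1, p2, alpha=0.5)
```

Because both source embeddings are projected onto the unit hypersphere, `np.linalg.norm(morphed)` remains 1 for any `alpha`, which is the distortion-avoidance property the authors cite.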
The fused prosody embedding conditions an autoregressive language model (LM) that generates a sequence of discrete acoustic tokens from the input text. Simultaneously, the fused timbre embedding conditions a Conditional Flow Matching (CFM) network that transforms these tokens into a mel‑spectrogram through a probability‑flow ODE. Both stages employ classifier‑free guidance (CFG) to strengthen adherence to the conditioning embeddings. Finally, a HiFTNet neural vocoder converts the mel‑spectrogram into a waveform, yielding the final morphed speech sample.
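A hedged sketch of how classifier-free guidance plugs into the probability-flow ODE of the CFM stage. The function names, Euler integrator, guidance scale, and toy model below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def cfg_velocity(model, x, t, cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    velocity toward the conditional one, strengthening adherence to
    the conditioning (timbre) embedding."""
    v_cond = model(x, t, cond)
    v_uncond = model(x, t, None)  # conditioning dropped (null embedding)
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def euler_flow(model, x0, cond, steps=100, guidance_scale=1.0):
    """Integrate the probability-flow ODE dx/dt = v(x, t | cond) from
    noise (t = 0) toward the mel-spectrogram (t = 1) with Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * cfg_velocity(model, x, i * dt, cond, guidance_scale)
    return x

# Toy stand-in model: the conditional velocity points at a fixed target,
# the unconditional velocity is zero.
target = np.ones(4)
def toy_model(x, t, cond):
    return (target - x) if cond is not None else np.zeros_like(x)

mel = euler_flow(toy_model, x0=np.zeros(4), cond="timbre")
```

With `guidance_scale > 1` the guided velocity overshoots the conditional prediction, which is how CFG trades sample diversity for stronger conditioning in both the LM and CFM stages.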
Experiments are conducted on the clean subset of LibriSpeech, using 500 randomly selected male‑female speaker pairs (no similarity pre‑selection). Two variants of VoxMorph are evaluated: v1, which uses a single 5‑20 s clip per speaker, and v2, which aggregates up to seven short clips (≈1‑2 min total) for a richer voice profile. Baselines include the original VIM method (ViM), the zero‑shot voice conversion model VEVO, and a general‑purpose sound‑morphing system (MorphFader).
Performance is measured along three dimensions: audio quality (Fréchet Audio Distance, FAD; Kullback‑Leibler Divergence, KLD), intelligibility (Word Error Rate, WER), and morphing effectiveness (Mated Morph Presentation Match Rate, MMPMR; Fully Mated Morph Presentation Match Rate, FMMPMR). VoxMorph‑v1 achieves FAD 0.24 and KLD 0.1404, outperforming ViM (FAD 1.52, KLD 0.3501) by more than sixfold in audio quality and roughly 2.5× in spectral fidelity. WER drops to 0.33 (v1) and 0.19 (v2), well below VEVO's 0.54; this underlies the paper's reported 73% reduction in intelligibility errors. Most strikingly, FMMPMR reaches 60.6% (v1) and 67.8% (v2) at a strict 0.01% FAR threshold, whereas ViM records 0% and VEVO only 9% under the same conditions. These results show that VoxMorph reliably produces morphed speech that is accepted as belonging to both source identities, even under stringent security settings.
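As a rough illustration of how the match-rate metrics are computed: a morph succeeds only if it verifies against *every* contributing speaker above the ASV acceptance threshold. The functions and toy scores below follow the standard MMPMR definition from the morphing-attack literature and are an assumption, not the paper's evaluation code:

```python
import numpy as np

def mmpmr(scores, threshold):
    """Mated Morph Presentation Match Rate.

    `scores` has shape (n_morphs, n_contributors): one ASV similarity
    score per (morph, contributor) pair. A morph counts as a success
    only if its weakest match still clears the threshold."""
    return float(np.mean(scores.min(axis=1) > threshold))

def fmmpmr(scores, threshold):
    """Fully mated variant: every probe attempt of every contributor
    must exceed the threshold. `scores` has shape
    (n_morphs, n_contributors, n_attempts)."""
    return float(np.mean(scores.min(axis=(1, 2)) > threshold))

# Toy check: 2 morphs x 2 contributors, with the threshold standing in
# for a strict operating point (e.g. the score at 0.01% FAR).
s = np.array([[0.82, 0.75],    # matches both speakers -> success
              [0.91, 0.40]])   # fails against speaker 2 -> failure
print(mmpmr(s, threshold=0.6))  # -> 0.5
```

Raising the threshold (a stricter FAR) can only lower these rates, which is why the reported 67.8% FMMPMR at 0.01% FAR is the paper's most security-relevant number.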
The authors discuss the broader security implications: the ability to generate convincing morphed voices with minimal data and computational cost transforms VIM into a realistic large‑scale threat. Existing ASV systems and morphing‑attack detection (MAD) solutions, which were designed for more limited attacks, may be insufficient. Consequently, the paper calls for the development of robust countermeasures, such as morphing‑aware ASV models and advanced detection algorithms.
Limitations are acknowledged: the study focuses on English read speech, leaving multilingual, dialectal, or emotional speech as future work; the perceptual impact of morphed voices on human listeners (e.g., trust, phishing risk) is not explored; and the current pipeline, while efficient, could be further optimized for real‑time deployment.
In summary, VoxMorph presents a novel, practical, and highly effective approach to voice identity morphing. By disentangling prosody and timbre, employing Slerp‑based interpolation, and leveraging state‑of‑the‑art generative components (autoregressive LM, Conditional Flow Matching, neural vocoder), the system achieves unprecedented audio quality and attack success rates without any speaker‑specific training. This work not only fills a critical gap in biometric security research but also establishes a new benchmark for both attackers and defenders in the evolving landscape of voice‑based authentication.