Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv paper.

Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high communication costs and tend to suppress stylistic heterogeneity, resulting in insufficient personalization. To address these issues, we propose Fed-PISA, which stands for Federated Personalized Identity-Style Adaptation. To minimize communication costs, Fed-PISA introduces a disentangled Low-Rank Adaptation (LoRA) mechanism: the speaker’s timbre is retained locally through a private ID-LoRA, while only a lightweight style-LoRA is transmitted to the server, thereby minimizing parameter exchange. To harness heterogeneity, our aggregation method, inspired by collaborative filtering, is introduced to create custom models for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, outperforming standard federated baselines with minimal communication costs.


💡 Research Summary

Fed‑PISA (Federated Personalized Identity‑Style Adaptation) tackles two fundamental challenges in federated text‑to‑speech (TTS) voice cloning: (1) the high communication overhead of transmitting large neural‑network backbones, and (2) the suppression of stylistic heterogeneity when trying to preserve each speaker’s timbre. The authors propose a dual‑adapter architecture based on Low‑Rank Adaptation (LoRA). An ID‑LoRA (rank 8) is trained locally on a client’s neutral speech to capture the speaker’s unique timbre; once trained, it is frozen and never uploaded, preserving privacy and eliminating any timbre‑related communication cost. A lightweight Style‑LoRA, also rank 8, is responsible for expressive aspects such as emotion, prosody, and speaking style. Only the Style‑LoRA parameters (the low‑rank matrices A and B) are exchanged with the server each communication round, reducing the transmitted data to a few megabytes per round (≈0.04 % of the full model).
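To make the communication split concrete, here is a minimal numpy sketch of a single linear layer wrapped with the two adapters. The layer sizes and rank are toy values chosen for illustration (the paper's ≈0.04 % figure comes from the full GPT‑SoVITS backbone, not from this toy layer); only the Style‑LoRA factors would ever leave the device.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_lora(d_out, d_in, rank=8):
    # Standard LoRA factorization: the adapter adds B @ A to the frozen
    # weight, with A (rank x d_in) and B (d_out x rank). B starts at zero
    # so the adapter is initially a no-op.
    return {"A": rng.standard_normal((rank, d_in)) * 0.01,
            "B": np.zeros((d_out, rank))}

d_in, d_out, rank = 1024, 1024, 8          # hypothetical layer size
W = rng.standard_normal((d_out, d_in))     # frozen backbone weight
id_lora = make_lora(d_out, d_in, rank)     # private: trained locally, never uploaded
style_lora = make_lora(d_out, d_in, rank)  # shared: sent to the server each round

def forward(x):
    # Frozen backbone output plus both low-rank adapter contributions.
    return (W @ x
            + id_lora["B"] @ (id_lora["A"] @ x)
            + style_lora["B"] @ (style_lora["A"] @ x))

# Only the Style-LoRA factors travel to the server.
sent = style_lora["A"].size + style_lora["B"].size
total = W.size + 2 * (rank * d_in + d_out * rank)
print(f"uploaded fraction of this layer: {sent / total:.4%}")
```

For this toy layer the uploaded fraction is around 1.5 %; on a multi‑hundred‑million‑parameter backbone where the frozen weights dominate, the same rank‑8 adapters shrink to the sub‑0.1 % regime the summary cites.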

The backbone (GPT‑SoVITS‑V4) remains frozen throughout training, which allows the method to leverage powerful foundation models without the need for full‑model fine‑tuning. Training proceeds in two phases on each client: (i) timbre cloning, where a frozen speaker encoder and the ID‑LoRA are optimized to maximize cosine similarity between speaker embeddings of synthesized and reference audio; (ii) stylization, where only the Style‑LoRA is updated using expressive (emotional) utterances while gradients to the ID‑LoRA and backbone are blocked. In the experiments, each client performs 80 timbre‑cloning steps followed by 20 stylization steps per local epoch, and participates in 50 federated rounds with a 20 % client sampling rate.
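The phase‑(i) objective can be stated compactly: with a frozen speaker encoder producing embeddings, the ID‑LoRA is trained to minimize one minus the cosine similarity between the embeddings of synthesized and reference audio. Below is a small numpy sketch of that loss; the 3‑dimensional vectors are hypothetical stand‑ins for real speaker embeddings (e.g. ECAPA‑TDNN outputs).

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def timbre_loss(synth_emb, ref_emb):
    # Phase (i) objective: maximizing cosine similarity between the
    # synthesized and reference speaker embeddings is equivalent to
    # minimizing 1 - cos(synth, ref).
    return 1.0 - cosine_similarity(synth_emb, ref_emb)

# Hypothetical embeddings standing in for speaker-encoder outputs.
same = np.array([1.0, 0.0, 1.0])
unrelated = np.array([0.0, 1.0, 0.0])
print(timbre_loss(same, same))       # ~0: identical voices
print(timbre_loss(same, unrelated))  # 1.0: orthogonal embeddings
```

In phase (ii) the same client switches to expressive utterances and a standard TTS reconstruction loss, updating only the Style‑LoRA while gradients into the ID‑LoRA and backbone are blocked, per the 80/20 step split described above.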

To exploit the rich stylistic diversity across clients, the server implements a personalized aggregation inspired by collaborative filtering. After receiving Style‑LoRA updates from all participating clients, the server computes pairwise cosine similarities between the A (and separately B) matrices, applies a softmax with temperature τ = 0.5, and obtains attention weights α_ij and β_ij. The personalized Style‑LoRA for client i is then a weighted sum of all received adapters: A′_i = ∑_j α_ij A_j and B′_i = ∑_j β_ij B_j. This mechanism ensures that updates from stylistically similar peers have a larger influence, effectively “recommending” style information while preserving each user’s identity.
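The aggregation rule above can be sketched in a few lines of numpy: flatten each client's A matrix, compute pairwise cosine similarities, turn each row into attention weights with a temperature‑scaled softmax, and mix. The same procedure is applied separately to the B matrices to obtain the β weights; the matrix shapes below are illustrative, not the paper's actual adapter dimensions.

```python
import numpy as np

def personalize(adapters, tau=0.5):
    """Collaborative-filtering-style aggregation of Style-LoRA factors.

    adapters: list of same-shaped low-rank matrices, one per client.
    Returns (personalized, alpha), where personalized[i] = sum_j alpha[i, j] * adapters[j]
    and alpha[i] = softmax_j(cos(A_i, A_j) / tau).
    """
    flat = np.stack([a.ravel() for a in adapters])
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = unit @ unit.T                             # pairwise cosine similarity
    logits = sim / tau
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)       # row-wise softmax
    mixed = alpha @ flat
    return [m.reshape(adapters[0].shape) for m in mixed], alpha

rng = np.random.default_rng(1)
client_As = [rng.standard_normal((8, 16)) for _ in range(4)]  # toy rank-8 factors
personalized, alpha = personalize(client_As)
```

Because cos(A_i, A_i) = 1 is the largest similarity in each row, every client's own update keeps the largest weight, while stylistically similar peers contribute more than dissimilar ones — the "recommendation" effect described above.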

The authors evaluate Fed‑PISA on four public emotional speech corpora (ESD, EmoV‑DB, RAVDESS, CREMA‑D), unifying their emotion labels into ten style categories via emotion2vec. Metrics include Word Error Rate (WER), Style Expressivity (SE) measured by the probability of the ground‑truth emotion from an emotion classifier, Speaker Similarity (SS) via ECAPA‑TDNN cosine similarity, naturalness via nMOS (22‑listener blind test), and cumulative communication cost (GiB). Fed‑PISA achieves a WER of 0.704 % (vs. 2.70 % for the best federated baseline), SS of 0.645 (vs. 0.556), SE of 0.704 (vs. 0.416 for FedSpeech), and nMOS of 4.08 (vs. 3.77). Communication is reduced to 45.8 GiB across all rounds, a 3‑10× saving compared with FedSpeech (145.28 GiB) and Federated Dynamic Transformer (456.35 GiB).

Ablation studies confirm the necessity of both adapters: removing ID‑LoRA collapses speaker similarity, while removing Style‑LoRA degrades style expressivity and naturalness. Varying the proportion of timbre‑cloning versus stylization steps shows a clear trade‑off: more timbre steps improve SS, whereas more stylization steps boost nMOS, guiding practitioners on how to balance identity preservation and expressive quality.

In summary, Fed‑PISA makes three key contributions: (1) it demonstrates that PEFT‑based LoRA can be seamlessly integrated into modern large TTS backbones for federated learning, drastically cutting communication overhead; (2) it cleanly separates identity and style into private and shared adapters, achieving strong privacy guarantees while still enabling cross‑client style transfer; (3) it introduces a collaborative‑filtering‑style personalized aggregation that leverages stylistic heterogeneity rather than suppressing it. These advances open the door to privacy‑preserving, highly expressive voice cloning on edge devices, with potential applications in personalized virtual assistants, multilingual dubbing, and custom audio content generation.

