Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis
Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speaker identity with pathological articulation, limiting controllability and robustness. In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement TTS framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable and controllable representations of healthy and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. Experiments on the TORGO dataset demonstrate that this design enables bidirectional transformation between healthy and dysarthric speech, leading to consistent ASR performance gains and robust, speaker-aware speech reconstruction.
💡 Research Summary
ProtoDisent‑TTS tackles two persistent challenges in dysarthric speech technology: (1) the entanglement of speaker identity with pathological articulation, which hampers controllability and limits the usefulness of synthetic data, and (2) the scarcity of accurately transcribed dysarthric recordings, which restricts robust ASR training. The authors build on the large‑scale pre‑trained Index‑TTS model and introduce two complementary mechanisms.

First, a learnable pathology prototype codebook P = {p₀, …, pₙ} stores a healthy prototype (p₀) and one prototype per dysarthric patient or severity level (p₁ … pₙ). At synthesis time, a short prompt utterance from the target speaker is encoded into a timbre embedding s, and the desired prototype p_k is selected by index. The two vectors are summed (z = s + p_k) and fed together with the input text to the TTS decoder, yielding speech that preserves the speaker's voice while exhibiting the articulation pattern defined by the prototype. Simply swapping the prototype index performs bidirectional conversion between healthy and dysarthric speech, enabling both data augmentation (healthy‑to‑dysarthric) and reconstruction (dysarthric‑to‑healthy) within a single pipeline.

Second, to enforce a clean factorization of speaker and pathology information, the authors employ a dual‑classifier scheme. A dysarthria condition classifier C_dys operates on the combined representation z and learns to distinguish healthy from pathological speech, encouraging the prototype embedding to capture the pathology. Simultaneously, an adversarial classifier C_adv is attached directly to the speaker embedding s through a Gradient Reversal Layer (GRL). C_adv is trained to predict the same pathology label, but the GRL inverts its gradient (scaled by λ = 0.1), forcing s to become invariant to dysarthric cues.
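The two mechanisms above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the module and attribute names (`ProtoDisentHead`, `c_dys`, `c_adv`) and the linear-classifier heads are assumptions; only the prototype-sum `z = s + p_k`, the dual classifiers, and the GRL with λ = 0.1 come from the paper.

```python
import torch
import torch.nn as nn

# Gradient Reversal Layer: identity in the forward pass; in the backward
# pass it flips and scales the gradient, so the classifier it feeds is
# trained normally while the embedding upstream is pushed to be invariant.
class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Hypothetical head combining the prototype codebook with the dual classifiers.
class ProtoDisentHead(nn.Module):
    def __init__(self, dim, n_prototypes, n_classes, lam=0.1):
        super().__init__()
        # p_0 = healthy prototype, p_1..p_n = per-patient/severity prototypes
        self.prototypes = nn.Parameter(0.02 * torch.randn(n_prototypes, dim))
        self.c_dys = nn.Linear(dim, n_classes)  # condition classifier on z
        self.c_adv = nn.Linear(dim, n_classes)  # adversarial classifier on s
        self.lam = lam

    def forward(self, s, proto_idx):
        p_k = self.prototypes[proto_idx]   # select prototype by index
        z = s + p_k                        # z = s + p_k, fed to the TTS decoder
        logits_dys = self.c_dys(z)         # learns pathology from z
        # GRL reverses gradients into s, removing pathology cues from the timbre
        logits_adv = self.c_adv(GradReverse.apply(s, self.lam))
        return z, logits_dys, logits_adv
```

Swapping `proto_idx` between 0 (healthy) and a patient index is what realizes the bidirectional healthy↔dysarthric conversion described above.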
The total loss L_total = L_TTS + α·L_Cdys + β·L_Cadv (α = β = 1) balances standard TTS reconstruction with the two disentanglement objectives. To further strengthen separation, the authors generate cross‑condition timbre‑shifted pairs by converting dysarthric utterances to the timbre of random healthy speakers and vice versa, exposing the model to diverse speaker‑pathology combinations during training.

For implementation, LoRA adapters (rank = 16) are applied only to the attention and MLP layers of Index‑TTS, while the rest of the backbone remains frozen. Only the perceiver (speaker encoder), the prototype codebook, and the dual classifiers are fine‑tuned, with learning rates of 2.5 × 10⁻³, 2.5 × 10⁻⁴, and 2.5 × 10⁻⁴, respectively. The prototype set size equals the number of dysarthric patients plus one healthy prototype.

Experiments are conducted on the TORGO corpus (8 dysarthric speakers spanning four severity levels and 7 healthy controls). Three evaluation scenarios are explored using Whisper‑Medium and Whisper‑Large ASR back‑ends under a leave‑one‑speaker‑out protocol:

1. Synthetic data substitution: training ASR solely on fully synthetic dysarthric speech generated by ProtoDisent‑TTS yields substantial WER reductions compared with the unadapted pre‑trained models, and the gap to models trained on real dysarthric data is modest (≈2‑3 % absolute WER).
2. Healthy‑to‑dysarthric augmentation: mixing varying proportions of synthetic dysarthric speech (generated from healthy timbres plus diverse prototypes) with real data progressively improves ASR performance; the best gains appear when synthetic data constitute 60‑80 % of the training set.
3. Dysarthric‑to‑healthy reconstruction: when the system reconstructs intelligible healthy speech from dysarthric inputs while preserving speaker identity, cosine similarity between original and reconstructed speaker embeddings rises from ~0.16 (baseline) to ~0.35‑0.37, demonstrating effective disentanglement.
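The training objective and the per-module learning rates can be sketched as follows. This is a hedged illustration: the function names and the choice of AdamW are placeholders, while the loss form L_total = L_TTS + α·L_Cdys + β·L_Cadv and the learning rates (2.5 × 10⁻³ for the perceiver, 2.5 × 10⁻⁴ for the codebook and classifiers) are taken from the paper.

```python
import torch
import torch.nn.functional as F

def total_loss(l_tts, logits_dys, logits_adv, labels, alpha=1.0, beta=1.0):
    """L_total = L_TTS + alpha * L_Cdys + beta * L_Cadv (alpha = beta = 1)."""
    l_cdys = F.cross_entropy(logits_dys, labels)
    # The adversarial term uses the same pathology label; the sign flip
    # happens inside the model via the gradient reversal layer, not here.
    l_cadv = F.cross_entropy(logits_adv, labels)
    return l_tts + alpha * l_cdys + beta * l_cadv

def build_optimizer(perceiver, prototypes, classifiers):
    # Per-module learning rates from the paper; the optimizer choice and
    # module names are assumptions for illustration.
    return torch.optim.AdamW([
        {"params": perceiver.parameters(), "lr": 2.5e-3},
        {"params": [prototypes], "lr": 2.5e-4},
        {"params": classifiers.parameters(), "lr": 2.5e-4},
    ])
```

Keeping the backbone frozen and routing only these three small parameter groups (plus the rank-16 LoRA adapters) through the optimizer is what makes fine-tuning tractable on a corpus as small as TORGO.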
Across all experiments, ProtoDisent‑TTS outperforms prior state‑of‑the‑art methods (FS2D, FMLLR‑DNN, SD‑CTL, TTDS) on TORGO WER, especially for severe and moderate‑severe cases. In summary, the paper delivers a novel, controllable, and interpretable TTS framework that cleanly separates speaker timbre from dysarthric articulation via prototype embeddings and adversarial training. This enables scalable, speaker‑consistent data augmentation and high‑fidelity speech reconstruction, addressing the data scarcity and variability inherent to dysarthric speech. Future work may explore expanding the prototype set to capture finer-grained pathological nuances, multilingual extensions, and integration with end‑to‑end ASR‑TTS joint training.