Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Synthesizing second-language (L2) speech is potentially of high value for L2 language learning and pronunciation feedback. However, due to the lack of L2 speech synthesis datasets, it is difficult to synthesize L2 speech for low-resourced languages. In this paper, we provide a practical solution for editing native speech to approximate L2 speech and present PPG2Speech, a diffusion-based multispeaker Phonetic-Posteriorgrams-to-Speech model that is capable of editing a single phoneme without text alignment. We use Matcha-TTS's flow-matching decoder as the backbone, transforming Phonetic Posteriorgrams (PPGs) into mel-spectrograms conditioned on external speaker embeddings and pitch. PPG2Speech strengthens Matcha-TTS's flow-matching decoder with Classifier-free Guidance (CFG) and Sway Sampling. We also propose a new task-specific objective metric, Phonetic Aligned Consistency (PAC), computed between the edited PPGs and the PPGs extracted from the synthetic speech, to measure editing effectiveness. We validate the effectiveness of our method on Finnish, a low-resourced, nearly phonetic language, using approximately 60 hours of data, and conduct objective and subjective evaluations comparing the naturalness, speaker similarity, and editing effectiveness of our approach with TTS-based editing. Our source code is published at https://github.com/aalto-speech/PPG2Speech.


💡 Research Summary

The paper addresses the challenge of synthesizing second‑language (L2) speech for low‑resource languages, using Finnish as a test case. Because L2 speech corpora are scarce, the authors propose a practical alternative: edit native speech to approximate typical L2 pronunciation errors. Their solution, PPG2Speech, is a diffusion‑based, multi‑speaker model that converts Phonetic Posteriorgrams (PPGs) directly into mel‑spectrograms, without requiring any textual input or hard alignment.

The backbone is the flow‑matching decoder from Matcha‑TTS, which transforms a simple Gaussian prior into realistic mel‑spectrograms by learning a deterministic vector field. To improve fidelity, the authors augment this decoder with Classifier‑Free Guidance (CFG) and Sway Sampling. CFG is implemented by jointly training conditional and unconditional versions of the model and then linearly combining their score estimates at inference with a guidance weight w = 3. Sway Sampling (s = −1) uses smaller steps in the early diffusion stages and larger steps later, which empirically reduces noise and yields higher‑quality audio.
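The two inference-time tricks above can be sketched in a few lines. The sway formula follows the F5-TTS formulation, and the specific linear-combination convention for CFG is an assumption (the summary only states that conditional and unconditional estimates are combined with weight w = 3):

```python
import numpy as np

def sway_schedule(n_steps, s=-1.0):
    """Sway-sampled timesteps on [0, 1] (F5-TTS-style formulation):
    t = u + s * (cos(pi/2 * u) - 1 + u).
    With s = -1 the points cluster near t = 0, i.e. smaller steps early
    in the flow and larger steps later."""
    u = np.linspace(0.0, 1.0, n_steps + 1)
    return u + s * (np.cos(np.pi / 2.0 * u) - 1.0 + u)

def cfg_vector_field(v_cond, v_uncond, w=3.0):
    """Classifier-free guidance: extrapolate the conditional estimate away
    from the unconditional one (one common convention; with w = 0 this
    reduces to the unconditional estimate, with w = 1 to the conditional)."""
    return v_uncond + w * (v_cond - v_uncond)
```

At each ODE solver step, the vector field is evaluated twice (with and without conditioning) and the two estimates are blended before taking the step sized by the sway schedule.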

The model architecture consists of a PPG encoder (3‑layer convolutional prenet → Conformer layers → nearest‑neighbor upsampling → Transformer layers) that maps 32‑dim PPGs to a time resolution matching the target mel‑spectrogram. The decoder is a 1‑D U‑Net conditioned on external speaker embeddings (256‑dim, extracted with a pre‑trained SimAMResNet34) and pitch embeddings (256‑bin quantized pitch from PENN, concatenated with a voiced/unvoiced flag). During training, 10 % of batches drop the PPG latent and conditioning to train the unconditional branch needed for CFG. At synthesis time, the model generates mel‑spectrograms which are rendered into waveforms by a pre‑trained HiFi‑GAN vocoder.
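The conditioning dropout used to obtain the unconditional branch can be sketched as follows. Zero-masking is an assumption; the summary only states that the PPG latent and conditioning are dropped for 10 % of batches:

```python
import numpy as np

def maybe_drop_conditioning(ppg_latent, spk_emb, pitch_emb,
                            p_uncond=0.1, rng=np.random.default_rng()):
    """For a fraction p_uncond of training batches, zero out the PPG latent
    and the speaker/pitch conditioning so the model also learns the
    unconditional branch required for CFG at inference."""
    if rng.random() < p_uncond:
        return (np.zeros_like(ppg_latent),
                np.zeros_like(spk_emb),
                np.zeros_like(pitch_emb))
    return ppg_latent, spk_emb, pitch_emb
```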

To evaluate how well the edited speech matches the intended phonetic changes, the authors introduce Phonetic Aligned Consistency (PAC). PAC aligns the edited region of the source PPG with the corresponding region extracted from the synthesized audio using Dynamic Time Warping (DTW), then computes the Jensen‑Shannon distance between the aligned frames and averages the result, normalizing by the length of the edited segment. This metric captures both phonetic fidelity and timing consistency.
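A minimal sketch of PAC under the description above, with a plain NumPy DTW and a hand-rolled Jensen-Shannon distance (the exact alignment constraints and normalization are assumptions based on the summary):

```python
import numpy as np

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance (base 2) between two probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > eps
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def pac(edited_ppg, synth_ppg):
    """Phonetic Aligned Consistency: DTW-align the edited PPG segment with
    the segment extracted from the synthesized audio, then average the
    Jensen-Shannon distance over aligned frame pairs, normalized by the
    edited-segment length. Lower means the edit was realized more faithfully."""
    n, m = len(edited_ppg), len(synth_ppg)
    d = np.array([[js_distance(a, b) for b in synth_ppg] for a in edited_ppg])
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = d[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda c: cost[c])
    return sum(d[i, j] for i, j in path) / n
```

Because the JS distance is bounded and the alignment absorbs timing differences, PAC penalizes both wrong phonetic content and badly warped durations in the edited region.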

Experiments use two Finnish corpora—Perso Synteesi (≈18 h, 50 speakers) and Finsyn (≈44 h, 2 speakers)—totaling about 62 hours. The data are split into training (≈53 h, 48 speakers), validation, test (≈2.7 h), and an “unseen” test set containing four speakers excluded from training. PPGs are extracted with a Kaldi HMM‑TDNN model (29 phonemes + 3 special symbols). Speaker embeddings and pitch are extracted as described above.

Objective evaluation includes Speaker Encoder Cosine Similarity (SECS) for speaker similarity, Character Error Rate (CER) from a fine‑tuned wav2vec2‑large model for intelligibility, Mel‑Cepstral Distortion (MCD) and Pitch Mean Absolute Error (Pitch MAE) for unseen‑speaker handling, and PAC for editing effectiveness. Results show that the baseline Matcha‑TTS with CFG matches ground‑truth speaker similarity (SECS ≈ 0.89) and achieves low CER (≈ 3.9 %). PPG2Speech without CFG lags behind (SECS ≈ 0.83, CER ≈ 5.4 %). Adding CFG to PPG2Speech narrows the gap (SECS ≈ 0.86, MCD ≈ 3.69 dB, Pitch MAE ≈ 7.2 cents) and improves PAC (0.709 vs. 0.804 for Matcha‑TTS‑CFG; since PAC is an averaged Jensen‑Shannon distance, lower is better). Subjective listening tests corroborate that PPG2Speech‑CFG produces natural‑sounding speech with reasonable speaker identity, while offering the advantage of text‑free phoneme‑level editing.
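Two of the objective metrics have compact standard definitions, sketched below. The MCD formulation (excluding the 0th cepstral coefficient, frames assumed pre-aligned) is the conventional one and is an assumption about how the paper computes it:

```python
import numpy as np

def secs(e1, e2):
    """Speaker Encoder Cosine Similarity between two speaker embeddings."""
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def mcd(mc_ref, mc_syn):
    """Mel-Cepstral Distortion in dB over time-aligned cepstral frames
    of shape (T, D); the 0th coefficient (energy) is conventionally excluded."""
    diff = np.asarray(mc_ref)[:, 1:] - np.asarray(mc_syn)[:, 1:]
    return float((10.0 / np.log(10.0))
                 * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))
```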

The paper’s contributions are threefold: (1) a novel PPG‑based, text‑free speech editing pipeline that can modify single phonemes without external aligners; (2) the integration of flow‑matching diffusion, CFG, and Sway Sampling to achieve high‑quality multi‑speaker synthesis; (3) the introduction of the PAC metric for quantitative assessment of phonetic editing. Limitations include dependence on the quality of the upstream PPG extractor and the relatively modest PAC scores compared to text‑based baselines, suggesting room for improvement in phonetic precision. Future work may explore more accurate PPG extraction, adaptive CFG weighting, multilingual extensions, and real‑time editing capabilities. Overall, the study demonstrates that diffusion‑based, PPG‑driven models are a viable solution for low‑resource L2 speech synthesis and pronunciation feedback.

