Pairwise Evaluation of Accent Similarity in Speech Synthesis
Despite growing interest in generating high-fidelity accents, evaluating accent similarity in speech synthesis has been underexplored. We aim to enhance both subjective and objective evaluation methods for accent similarity. Subjectively, we refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. Our method involves providing listeners with transcriptions, having them highlight perceived accent differences, and implementing meticulous screening for reliability. Objectively, we utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation. Comparative experiments show that these metrics, alongside accent-embedding similarity, speaker similarity, and Mel Cepstral Distortion, can be used to assess accent generation. Moreover, our findings underscore significant limitations of common metrics like Word Error Rate in assessing underrepresented accents.
💡 Research Summary
This paper addresses the largely unexplored problem of evaluating accent similarity in speech synthesis. The authors improve both subjective listening tests and objective acoustic metrics to provide reliable, cost‑effective assessment tools.
On the subjective side, the authors start from the widely used XAB test, where listeners choose which of two candidate utterances sounds more like a reference in terms of accent. They add three enhancements: (1) display the transcript of the utterances so listeners can focus on pronunciation differences, (2) ask listeners to highlight the exact words or phonetic segments that influenced their decision, and (3) implement a two‑step screening procedure consisting of attention‑check trials and an open‑ended accent‑identification question. These additions dramatically increase statistical power: with as few as ten to fifteen valid submissions, the preference for the better‑accented system becomes significant (p < 0.05), whereas the plain XAB test fails to discriminate even with more listeners.
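A quick way to see why so few listeners can suffice is to treat each valid XAB vote as a Bernoulli trial and test the observed preference against chance. The sketch below is an illustrative assumption, not the paper's prescribed analysis: it uses a one‑sided binomial test, and the vote counts are hypothetical.

```python
# Minimal sketch: one-sided binomial test of XAB votes against the 50% chance level.
# The counts are hypothetical; substitute the real tallies from a listening test.
from scipy.stats import binomtest

def xab_significance(votes_for_preferred: int, total_valid_votes: int) -> float:
    """p-value that the preferred system is chosen more often than chance."""
    return binomtest(votes_for_preferred, total_valid_votes,
                     p=0.5, alternative="greater").pvalue

# e.g. 12 of 15 screened listeners prefer the reference-matched system
print(xab_significance(12, 15))  # ≈ 0.018, i.e. significant at p < 0.05
```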
On the objective side, the authors propose pronunciation‑focused metrics. First, they extract vowel formants (F1, F2) using a forced aligner and the Fast Track formant estimator, then compute a root‑mean‑square error between corresponding vowels of the two systems (VF RMSE). Second, they extract phonetic posteriorgrams (PPGs), align them with dynamic time warping, and compute either the cosine distance or the Jensen‑Shannon divergence averaged over the alignment path. Both metrics directly capture phonetic identity, which is the core of accent. In addition, they evaluate a broad set of existing measures: cosine similarity of accent‑identification (AID) embeddings (GenAID, CommonAccent), speaker‑verification embeddings (WavLM‑SV), mel‑cepstral distortion (MCD), Whisper‑based word/character error rates, UTMOS for naturalness, and F0‑related errors.
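The sketch below illustrates the two pronunciation‑centric distances in plain NumPy. It is an assumed implementation of the description above, not the authors' code: the function names, the DTW step pattern, and the ε‑smoothing are choices made here for illustration.

```python
# Minimal sketch: vowel-formant RMSE and a DTW-aligned PPG distance
# (cosine distance or Jensen-Shannon divergence averaged over the alignment path).
import numpy as np

def vowel_formant_rmse(formants_a, formants_b):
    """formants_*: (N, 2) arrays of paired F1/F2 values for matching vowels."""
    a, b = np.asarray(formants_a, float), np.asarray(formants_b, float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def cosine_distance(p, q):
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12)

def js_divergence(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def ppg_distance(ppg_a, ppg_b, frame_dist=cosine_distance):
    """ppg_a: (Ta, D), ppg_b: (Tb, D) frame-wise phone posterior distributions."""
    Ta, Tb = len(ppg_a), len(ppg_b)
    cost = np.array([[frame_dist(a, b) for b in ppg_b] for a in ppg_a])
    # Cumulative DTW cost with the standard match / insertion / deletion steps.
    acc = np.full((Ta + 1, Tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                                 acc[i - 1, j],
                                                 acc[i, j - 1])
    # Backtrack to recover the alignment path; average frame distances along it.
    i, j, path = Ta, Tb, []
    while i > 0 and j > 0:
        path.append(cost[i - 1, j - 1])
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(np.mean(path))
```

Calling `ppg_distance(ppg_ref, ppg_syn, frame_dist=js_divergence)` gives the JS variant; averaging over the warping path keeps the score comparable across utterances of different lengths.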
Experiments focus on the under‑represented Edinburgh English accent. The reference is a female VCTK speaker; the test speaker is a male VCTK speaker with a similar accent. Three systems are compared: (a) “copysyn”, which simply vocodes the ground‑truth mel‑spectrogram (high accent fidelity, minor vocoder artefacts), (b) XTTS, a zero‑shot TTS model without explicit accent control (pronounced accent errors), and (c) a series of “corrupt” models obtained by fine‑tuning XTTS on a General‑American corpus (LJ Speech) for 30 k–150 k steps, hypothesised to cause catastrophic forgetting of non‑GA accents.
Subjective results show that the plain XAB test does not reveal a clear preference (≈50 % for copysyn). Adding the transcript (+trans) pushes the preference to ~57 % and, when combined with screening (+trans+screen), yields a statistically significant advantage (p < 0.05). Adding the highlighting task (+highlight) further strengthens the effect, achieving significance with only ~10 valid listeners. Thus, providing textual context and an auxiliary annotation task makes listeners far more sensitive to subtle accent differences.
Objective correlations are reported using Spearman rank correlation with the hypothesised quality ranking (copysyn > XTTS > corrupt). VF RMSE, PPG‑CosSim, and PPG‑JS all achieve very high correlations (≥0.93) and low p‑values, confirming that these pronunciation‑based distances track perceived accent similarity. Cosine similarity of AID embeddings also correlates, though less strongly, while speaker‑verification similarity shows moderate correlation. Traditional intelligibility metrics (WER, CER) perform poorly: they are biased toward the majority General‑American accent and thus cannot reliably assess under‑represented accents. MCD and F0‑related measures correlate more with audio quality and naturalness than with accent.
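As a concrete illustration of this protocol, the snippet below computes a Spearman correlation between one objective metric and the hypothesised ranking. The metric values are made up to demonstrate the mechanics only; they are not the paper's numbers.

```python
# Minimal sketch: Spearman rank correlation of a metric against the hypothesised
# quality ranking copysyn > XTTS > corrupt (more fine-tuning = worse accent).
from scipy.stats import spearmanr

hypothesised_rank = [1, 2, 3, 4, 5]                  # copysyn, XTTS, then corrupt checkpoints
ppg_distance_vals = [0.08, 0.15, 0.22, 0.21, 0.29]   # hypothetical distances (higher = worse)

rho, p_value = spearmanr(hypothesised_rank, ppg_distance_vals)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```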
The authors conclude that (1) a refined XAB test with transcript, highlighting, and rigorous screening provides a cheap yet statistically powerful method for human evaluation of accent similarity, and (2) pronunciation‑centric objective metrics (vowel formant RMSE, PPG‑based distances) are effective proxies for accent similarity, whereas WER‑based metrics are unsuitable for low‑resource or minority accents. They suggest future work on extending these methods to multilingual settings, building tighter mappings between human judgments and automatic scores, and developing real‑time, fully automatic accent similarity assessment tools.