Bayesian Speech Synthesizers Can Learn from Multiple Teachers
Text-to-Speech (TTS) is inherently a “one-to-many” mapping characterized by intrinsic uncertainty, yet current paradigms often oversimplify it into a deterministic regression task. While continuous-valued autoregressive (AR) models have recently emerged as a promising alternative to discrete codec-based approaches, they typically rely on a fixed-variance prior, fundamentally constraining generation to a static point estimate that ignores the dynamic variability of natural speech. To bridge this gap, we propose BELLE (Bayesian evidential learning with language modelling), a framework that shifts from deterministic prediction to principled Bayesian inference without increasing model parameters or inference latency. By modeling the acoustic target as a Normal-Inverse-Gamma distribution, BELLE captures data-dependent aleatoric uncertainty. To enable accurate variance estimation on standard single-reference datasets, we introduce a “one-to-many” training strategy that leverages synthetic samples as a statistical support set, allowing the model to learn robust distributional properties rather than merely imitating teacher artifacts. Experiments demonstrate that BELLE, trained on only ~5k hours of data, outperforms leading open-source models trained on 50k hours (achieving a 25.8% relative WER reduction) and naturally supports high-quality streaming generation. Audio samples are available at https://belletts.github.io/Belle/.
💡 Research Summary
The paper introduces BELLE (Bayesian Evidential Learning with Language Modeling), a novel framework that brings principled Bayesian inference to continuous‑valued autoregressive (AR) text‑to‑speech (TTS) synthesis without adding parameters or inference latency. Traditional TTS models treat synthesis as a deterministic regression problem, often assuming a fixed variance prior. This simplification ignores the intrinsic “one‑to‑many” nature of speech, where the same text can be realized with many prosodic and acoustic variations. BELLE addresses this by modeling each mel‑spectrogram frame as a Normal‑Inverse‑Gamma (NIG) distribution, predicting its four hyper‑parameters (γ, ν, α, β) directly from the AR language model’s hidden state. The NIG formulation enables data‑dependent (heteroscedastic) uncertainty modeling: the mean and variance are jointly inferred, and the predictive posterior becomes a multivariate Student‑t distribution.
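The NIG head described above can be sketched in a few lines. The following NumPy snippet is illustrative only: the weight names, the plain linear projection, and the exact choice of activations are assumptions consistent with the summary (softplus for ν and β, shifted softplus so that α > 1), not the paper's actual implementation. The predictive variance formula is the standard NIG/Student-t marginal.

```python
import numpy as np

def softplus(x):
    """Numerically safe softplus: log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def nig_params(hidden, W, b):
    """Project an AR hidden state to the four NIG hyper-parameters.
    W and b are hypothetical weights of the added linear layer; the
    activations enforce nu > 0, alpha > 1, beta > 0 as in the summary."""
    raw = hidden @ W + b                       # (..., 4 * mel_dim)
    g, v, a, bb = np.split(raw, 4, axis=-1)
    gamma = g                                  # location: unconstrained
    nu    = softplus(v)                        # nu > 0
    alpha = softplus(a) + 1.0                  # shifted softplus: alpha > 1
    beta  = softplus(bb)                       # beta > 0
    return gamma, nu, alpha, beta

def predictive_moments(gamma, nu, alpha, beta):
    """Mean and variance of the Student-t predictive posterior
    implied by NIG(gamma, nu, alpha, beta)."""
    mean = gamma
    var = beta * (1.0 + nu) / (nu * (alpha - 1.0))
    return mean, var
```

The shifted softplus for α is what makes the predictive variance finite: the Student-t marginal has 2α degrees of freedom, and its variance exists only when α > 1.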
Training uses an evidential loss composed of a negative log‑likelihood term and a regularization term that enforces the mathematical constraints of the NIG parameters (α > 1, β > 0, ν > 0). This loss allows the network to learn both accurate point predictions and reliable uncertainty estimates in a single forward pass, unlike Bayesian neural networks that require multiple stochastic forward passes or Monte‑Carlo dropout. However, estimating variance reliably typically requires multiple observations per input, which standard TTS corpora lack: each text is paired with a single recording. To overcome this, the authors propose a “one‑to‑many” training strategy: they generate synthetic utterances for each text using a diverse set of pre‑trained TTS teachers (e.g., MELLE, VALL‑E, other open‑source models). These synthetic samples are not used as targets for imitation; instead, they serve as a statistical support set that provides multiple observations of the same linguistic content, enabling the model to learn the distributional shape of the acoustic space. By aggregating these diverse realizations, BELLE learns robust variance estimates while mitigating the influence of any single teacher’s artifacts.
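The summary does not reproduce the paper's exact loss, but the standard deep evidential regression objective (Amini et al.-style) matches the description: a Student-t negative log-likelihood over the NIG marginal plus a regularizer that discounts evidence on large errors. The sketch below uses that standard form with a hypothetical weighting `lam`; treat it as an assumption, not BELLE's published loss.

```python
import math

def evidential_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of target y under the Student-t marginal
    of NIG(gamma, nu, alpha, beta) (standard evidential regression NLL)."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

def evidential_reg(y, gamma, nu, alpha):
    """Regularizer: penalize high evidence (large nu, alpha) on wrong
    predictions, keeping the uncertainty estimates calibrated."""
    return abs(y - gamma) * (2.0 * nu + alpha)

def evidential_loss(y, gamma, nu, alpha, beta, lam=0.01):
    """Total per-frame loss: NLL plus weighted evidence regularizer.
    lam is an illustrative trade-off coefficient."""
    return evidential_nll(y, gamma, nu, alpha, beta) + lam * evidential_reg(y, gamma, nu, alpha)
```

Both terms are computed from a single forward pass, which is what lets evidential training avoid the repeated stochastic passes of MC-dropout or deep ensembles.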
Architecturally, BELLE retains the standard components of modern mel‑based AR TTS: a Prenet that embeds text tokens and previous mel frames, a Transformer‑style AR language model, a Sampling Module (now evidential), a Postnet for refinement, a Stop‑Prediction module, and a neural vocoder. The only modification is the addition of a linear layer that outputs the NIG parameters, followed by appropriate activations (softplus, shifted softplus) to satisfy positivity constraints. During inference, the model draws a variance sample from an inverse‑Gamma distribution, then a mean sample from a Gaussian conditioned on that variance, and finally a mel frame from the resulting Gaussian. This ancestral‑sampling chain requires only a single forward pass, so BELLE incurs no extra latency compared to conventional Gaussian‑based AR models, making it suitable for real‑time streaming synthesis.
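The three-step sampling chain above can be sketched directly in NumPy. The function name and the vectorization over `n` draws are illustrative (in the model this would run once per frame inside the AR loop); the distributional structure follows the summary: σ² from an inverse-Gamma, μ from a Gaussian scaled by σ²/ν, and the mel value from the resulting Gaussian.

```python
import numpy as np

def sample_frames(gamma, nu, alpha, beta, n, rng):
    """Ancestral sampling from the NIG posterior:
      sigma^2 ~ Inv-Gamma(alpha, beta)
      mu      ~ N(gamma, sigma^2 / nu)
      frame   ~ N(mu, sigma^2)
    Vectorized over n independent draws for illustration."""
    # Inverse-Gamma(alpha, beta) via the reciprocal of a Gamma(alpha, 1) draw
    sigma2 = beta / rng.gamma(alpha, 1.0, size=n)
    mu = rng.normal(gamma, np.sqrt(sigma2 / nu))   # mean conditioned on variance
    return rng.normal(mu, np.sqrt(sigma2))         # mel value from that Gaussian
```

Marginally these draws follow the Student-t predictive posterior, so sampling adds no extra network evaluations: the uncertainty is read off the same hidden state that produces the point estimate.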
Experiments were conducted on a ~5k‑hour multilingual dataset (Chinese and English). BELLE was compared against a strong baseline (MELLE) that uses a fixed‑variance Gaussian prior, as well as several open‑source models trained on ~50k hours. Results show a 25.8% relative reduction in word error rate (WER) over the baseline and consistent improvements in mean opinion score (MOS) by 0.15–0.2 points. Streaming evaluations using chunk‑based generation demonstrated smooth transitions and no perceptible quality drop, confirming that the evidential sampling does not hinder low‑latency operation. Ablation studies revealed that (i) removing the synthetic support set degrades variance estimation and raises WER by ~12%, and (ii) omitting the regularization term leads to pathological α/β values, causing overly noisy samples.
The authors acknowledge limitations: the quality of the synthetic support set depends on the teacher models, and the current NIG formulation assumes diagonal covariance, ignoring inter‑frame correlations. Future work may explore full‑covariance Bayesian models, alternative priors, or automated generation and curation of synthetic support data to further improve data efficiency and uncertainty modeling.
In summary, BELLE demonstrates that Bayesian evidential learning can be seamlessly integrated into continuous‑valued AR TTS, providing principled uncertainty quantification, diversity in generated speech, and streaming‑ready performance without additional computational cost. This represents a significant step toward more realistic, robust, and flexible speech synthesis systems.