Emotion-Aligned Generation in Diffusion Text-to-Speech Models via Preference-Guided Optimization

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Emotional text-to-speech seeks to convey affect while preserving intelligibility and prosody, yet existing methods rely on coarse labels or proxy classifiers and receive only utterance-level feedback. We introduce Emotion-Aware Stepwise Preference Optimization (EASPO), a post-training framework that aligns diffusion TTS with fine-grained emotional preferences at intermediate denoising steps. Central to our approach is EASPM, a time-conditioned model that scores noisy intermediate speech states and enables automatic preference pair construction. EASPO optimizes generation to match these stepwise preferences, enabling controllable emotional shaping. Experiments show superior performance over existing methods in both expressiveness and naturalness.


💡 Research Summary

The paper tackles a central challenge in emotional text‑to‑speech (TTS): delivering fine‑grained, temporally consistent affect while preserving intelligibility and natural prosody. Existing approaches either rely on coarse emotion labels or proxy classifiers, and they typically receive feedback only at the utterance level. Consequently, they cannot adequately supervise the dynamic evolution of emotion throughout a generated utterance.

To address this gap, the authors introduce Emotion‑Aware Stepwise Preference Optimization (EASPO), a post‑training framework that aligns diffusion‑based TTS models with dense, stepwise emotional preferences. The core idea is to treat each denoising step of a diffusion model as a small Markov decision process (MDP) and to provide a preference signal specific to that step. At step t, given the current latent xₜ, the model samples a small set (k = 4 in the main experiments) of candidate mel‑spectrograms x^{i}_{t−1} from the diffusion transition distribution p_θ(x_{t−1} | xₜ, c).
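The per-step candidate sampling can be sketched as follows. This is a minimal illustration, assuming the denoiser's posterior mean and noise scale are already computed; the function name and arguments are invented for the sketch and are not from the paper:

```python
import numpy as np

def sample_candidates(mu, sigma_t, k=4, rng=None):
    """Draw k candidate x_{t-1} states from the Gaussian diffusion
    transition p_theta(x_{t-1} | x_t, c) = N(mu, sigma_t^2 I).

    mu is a stand-in for the denoiser's predicted posterior mean
    mu_theta(x_t, c, t); sigma_t is the step's noise scale.
    """
    rng = np.random.default_rng() if rng is None else rng
    return [mu + sigma_t * rng.standard_normal(mu.shape) for _ in range(k)]
```

Because each candidate shares the mean but uses an independent noise draw, the pool stays close to the model's own distribution while still differing enough for a preference model to rank.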

These candidates are evaluated by an Emotion‑Aware Stepwise Preference Model (EASPM), a time‑conditioned scorer built on top of CLEP, a CLAP‑style contrastive audio‑language encoder pre‑trained on large‑scale speech data. EASPM computes the cosine similarity between the audio embedding of each candidate (conditioned on the current timestep) and the text embedding of the target emotion prompt. The highest‑scoring candidate is designated the "win" and the lowest‑scoring the "lose." A logistic model with temperature τ converts the score difference Δₜ = s_w − s_l into a win probability p̂_w = σ(τΔₜ).
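A minimal sketch of this scoring step, with the CLEP audio/text embeddings abstracted as precomputed vectors; the function names here are illustrative, not the paper's:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between an audio embedding and the emotion-prompt
    text embedding (stand-in for the EASPM score s_i)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def win_probability(scores, tau=1.0):
    """Pick win/lose candidates by score and convert the gap
    delta = s_w - s_l into a win probability p_w = sigmoid(tau * delta)."""
    w = int(np.argmax(scores))
    l = int(np.argmin(scores))
    delta = scores[w] - scores[l]
    p_w = 1.0 / (1.0 + np.exp(-tau * delta))
    return w, l, p_w
```

Note that p̂_w is always at least 0.5 by construction, since the win candidate is defined as the higher-scoring one.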

The preference loss for the pair is L_pref = −log ˆp_w, which encourages the model to increase the score gap between win and lose. Crucially, after the win/lose pair is identified, the next denoising step does not continue with the win sample; instead, a candidate is chosen uniformly at random from the pool. This random continuation prevents the policy from collapsing onto a narrow trajectory and encourages exploration of diverse emotional paths.
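The pairing, preference loss, and random continuation described above can be combined into one toy step like this (a sketch only; `preference_step` and its inputs are assumptions, and the EASPM scores are taken as given):

```python
import math
import random

def preference_step(scores, tau=1.0, rng=None):
    """One stepwise preference update (sketch):
    1) identify win/lose candidates from EASPM scores,
    2) compute L_pref = -log p_w with p_w = sigmoid(tau * (s_w - s_l)),
    3) continue denoising from a uniformly random candidate, not the win,
       to avoid collapsing onto a narrow trajectory.
    """
    rng = rng or random.Random()
    w = max(range(len(scores)), key=lambda i: scores[i])
    l = min(range(len(scores)), key=lambda i: scores[i])
    p_w = 1.0 / (1.0 + math.exp(-tau * (scores[w] - scores[l])))
    loss = -math.log(p_w)
    next_idx = rng.randrange(len(scores))  # random continuation for exploration
    return loss, (w, l), next_idx
```

Since p̂_w ∈ (0.5, 1), the loss is bounded above by log 2 and shrinks as the score gap grows, which is exactly the pressure toward a larger win/lose margin.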

EASPO’s training objective aligns the policy’s log‑likelihood ratio between win and lose samples with the dense emotional reward difference supplied by EASPM. For each candidate j ∈ {w, l}, the log‑likelihood ratio is ρ^j_t(θ) = log π_θ(x^j_{t−1} | sₜ) − log π_ref(x^j_{t−1} | sₜ), where π_ref is a frozen reference diffusion policy (the original pre‑trained model). The difference Δρₜ = ρ^w_t − ρ^l_t is then matched to the reward difference ΔR̂ₜ = R̂^w_t − R̂^l_t via a mean‑squared error weighted by a time‑dependent factor βₜ = λ^{T−t−1}/η. This weighting places greater emphasis on earlier denoising steps, where the latent still contains substantial noise, while still allowing later steps to fine‑tune prosodic details. The final loss is

L(θ) = E_{c, x_T, t, (x^w, x^l)} [ βₜ · (Δρₜ(θ) − ΔR̂ₜ)² ]
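Assuming scalar per-candidate log-probabilities and rewards, the stepwise objective for a single win/lose pair could look like this sketch (λ and η defaults are illustrative, not the paper's reported values):

```python
def easpo_loss(logp_w, logp_l, logp_ref_w, logp_ref_l,
               r_w, r_l, t, T, lam=0.9, eta=1.0):
    """Single-pair stepwise alignment loss (sketch): match the
    policy-vs-reference log-likelihood-ratio gap to the EASPM reward gap,
    weighted by beta_t = lam**(T - t - 1) / eta.

    With lam < 1, beta_t is largest at the earliest (noisiest) denoising
    steps, i.e. t close to T, and decays for later steps.
    """
    rho_w = logp_w - logp_ref_w          # rho^w_t(theta)
    rho_l = logp_l - logp_ref_l          # rho^l_t(theta)
    delta_rho = rho_w - rho_l            # Delta rho_t
    delta_r = r_w - r_l                  # Delta R_hat_t
    beta_t = lam ** (T - t - 1) / eta
    return beta_t * (delta_rho - delta_r) ** 2
```

In a full training loop this quantity would be averaged over conditions, initial noise, timesteps, and sampled pairs, matching the expectation in the loss above.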

