Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please refer to the original arXiv source.

Imperceptible text-based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content-style entanglement, leading to generation instability and boundary artifacts. In this paper, we propose a novel framework grounded in the principle of “Edit Content, Preserve Acoustics”. Our approach relies on two core components: (1) Structural Foundations, which decouples editing into a stable semantic space while delegating acoustic reconstruction to a Flow Matching decoder; and (2) Perceptual Alignment, which employs a novel Self-Consistency Rewards Group Relative Policy Optimization. By leveraging a pre-trained Text-to-Speech model as an implicit critic – complemented by strict intelligibility and duration constraints – we effectively align the edited semantic token sequence with the original context. Empirical evaluations demonstrate that our method significantly outperforms state-of-the-art autoregressive and non-autoregressive baselines, achieving superior intelligibility, robustness, and perceptual quality.


💡 Research Summary

The paper tackles the challenging problem of “imperceptible” text‑based speech editing, where a user modifies the transcript of an utterance (inserting, deleting, or substituting words) and expects the edited audio to blend seamlessly with the surrounding context. Existing approaches operate either directly on acoustic tokens (autoregressive neural‑codec language models) or on acoustic features (non‑autoregressive diffusion models). Both suffer from entanglement of linguistic content and speaker/style information, which leads to hallucinations, boundary artifacts, and unstable prosody when the content is altered.

To overcome these limitations, the authors propose a two‑stage framework grounded in the principle “Edit Content, Preserve Acoustics”.

  1. Structural Foundations – Semantic Space Editing

    • The raw audio is first encoded into a discrete sequence of semantic tokens using a pretrained tokenizer (the same tokenizer used in CosyVoice).
    • Editing is performed entirely in this semantic space. The input to the editing model follows a Prefix‑Suffix‑Middle (PSM) format: the target text is concatenated with the semantic tokens preceding and following the region to be edited, while the middle segment (the region to be changed) is masked.
    • A decoder‑only transformer (treated as a policy πθ) is trained in a supervised fashion to predict the missing middle tokens, minimizing the standard negative log‑likelihood. Because only linguistic information (and coarse prosody) is represented in the semantic tokens, the model can modify content without disturbing timbre, room acoustics, or speaker identity.
    • After the semantic tokens are generated, a Flow Matching decoder reconstructs a high‑fidelity mel‑spectrogram, and a frozen HiFi‑GAN vocoder converts it to waveform. The acoustic reconstruction module is kept frozen, ensuring that the acoustic manifold remains stable and that any artifacts arise only from the semantic stage.
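The Prefix‑Suffix‑Middle arrangement described above can be sketched as follows. The token values, special symbols, and helper name are illustrative assumptions for this summary, not the paper's actual implementation:

```python
# Illustrative sketch of the Prefix-Suffix-Middle (PSM) editing input.
# Special-token IDs are hypothetical placeholders, not the paper's.
MASK, SEP = -1, -2

def build_psm_input(target_text_tokens, semantic_tokens, edit_start, edit_end):
    """Concatenate the target text with the prefix/suffix semantic context;
    the edited middle region is replaced by a mask slot that the
    decoder-only transformer (the policy) must fill in."""
    prefix = semantic_tokens[:edit_start]   # semantic tokens before the edit
    suffix = semantic_tokens[edit_end:]     # semantic tokens after the edit
    return target_text_tokens + [SEP] + prefix + [MASK] + suffix

psm = build_psm_input([101, 102], [1, 2, 3, 4, 5, 6], edit_start=2, edit_end=4)
# psm == [101, 102, SEP, 1, 2, MASK, 5, 6]
```

The predicted middle tokens would then be spliced back between the prefix and suffix before the frozen Flow Matching decoder and vocoder reconstruct the waveform.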
  2. Perceptual Alignment – Self‑Consistency Rewards GRPO

    • Even with a clean semantic representation, the generated tokens may still produce unnatural rhythm, prosodic mismatches, or unintelligible speech. To enforce “imperceptibility”, the authors introduce a reinforcement‑learning stage that aligns the edited tokens with the distribution of natural speech.
    • The key novelty is the use of a large, pretrained Text‑to‑Speech (TTS) model as an implicit critic. For a candidate edited token sequence Ŝmid, the average log‑probability under the frozen TTS model (r_sc) is computed. Maximizing this reward is mathematically equivalent to minimizing the KL divergence between the policy distribution and the TTS model’s distribution, thereby keeping the policy inside the high‑probability manifold of natural speech.
    • To prevent the policy from exploiting the likelihood reward (e.g., by producing silence or repetitive patterns), an Intelligibility Reward (r_wer) is added. The edited waveform is synthesized, fed to a strong ASR system (Whisper), and the word error rate against the target transcript is penalized.
    • A gated reward aggregation scheme further filters out samples that violate minimal quality thresholds (WER > τ_wer or duration mismatch > τ_len). Only valid samples receive the combined reward R = R_base·r_sc·r_wer; invalid samples receive zero, which stabilizes training.
    • For policy optimization, the authors adopt Group Relative Policy Optimization (GRPO), a variant of PPO that does not require a separate value network. For each editing query Q, G candidate sequences are sampled from the current policy. The relative advantage of each candidate is computed as (R_i – mean(R))/std(R), providing a baseline derived from the group itself. This group‑wise normalization reduces variance and encourages the policy to improve relative to its peers.
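The gated reward aggregation and group‑relative advantage from the bullets above can be sketched as follows. The threshold values, the reward shaping `r_wer = 1 - WER`, and the function names are assumptions for illustration, not the paper's exact formulation:

```python
import statistics

def gated_reward(r_sc, wer, dur_ratio, tau_wer=0.3, tau_len=0.2, r_base=1.0):
    """Gated aggregation: samples violating the intelligibility or duration
    thresholds receive zero reward; valid samples receive the product
    R = R_base * r_sc * r_wer. Thresholds here are illustrative."""
    if wer > tau_wer or abs(dur_ratio - 1.0) > tau_len:
        return 0.0
    r_wer = 1.0 - wer  # one simple way to turn WER into a reward
    return r_base * r_sc * r_wer

def group_relative_advantages(rewards):
    """GRPO advantage: normalize each candidate's reward by the group's own
    mean and standard deviation, so no separate value network is needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]
```

With G candidates sampled per query, a candidate gated to zero (e.g., one that fails the WER threshold) ends up with a negative advantage relative to its peers, pushing the policy away from degenerate outputs.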
  3. Training and Evaluation

    • The semantic LLM is pretrained on the massive Libriheavy corpus (≈50k h of English speech). The Flow Matching decoder and HiFi‑GAN vocoder are borrowed from CosyVoice and kept frozen throughout.
    • RL fine‑tuning uses a learning rate of 5e‑6, batch size 32, and G=8 candidates per query.
    • Evaluation is performed on two benchmarks: one derived from Ming‑Freeform‑Audio‑Edit and a custom set with random edit masks 0.5 s to 2.5 s long. Baselines include:
      • FluentSpeech (non‑autoregressive diffusion on mel‑spectrograms),
      • VoiceCraft (autoregressive neural‑codec language model on acoustic tokens),
      • Ming‑UniAudio (unified LLM handling speech understanding, editing, and generation).
    • Metrics: Word Error Rate (WER) on the full edited utterance (using Whisper), speaker similarity (cosine similarity of WavLM embeddings), DNS‑MOS (neural perceptual quality), and human Mean Opinion Score (MOS) collected from 8 listeners on 90 samples.
  4. Results

    • Across insertion, deletion, and substitution tasks, the proposed method consistently achieves lower WER (≈4.5 % vs. 10–12 % for baselines), higher speaker similarity (≈0.82 vs. 0.60–0.79), and comparable or better DNS‑MOS.
    • The GRPO‑enhanced version further reduces WER to 4.5 % and raises MOS to 4.08 / 5, indicating that the self‑consistency reward and group‑relative optimization effectively improve both intelligibility and naturalness.
    • Ablation studies (not fully detailed in the excerpt) likely show that removing the TTS‑based log‑probability reward or the gated validity filter leads to higher hallucination rates and degraded MOS, confirming the importance of each component.
  5. Contributions and Impact

    • Semantic‑Space Decoupling: By editing in a disentangled token space, the method eliminates the content‑style entanglement that plagues acoustic‑token approaches.
    • Self‑Consistency Rewards: Introducing a pretrained TTS model as a statistical critic is a novel way to align generated speech with natural distributions without requiring paired ground‑truth edited audio.
    • GRPO: The group‑relative policy optimization offers a lightweight, variance‑reduced RL scheme suitable for discrete token generation.
    • Empirical Validation: Extensive experiments on large‑scale data demonstrate state‑of‑the‑art performance, bringing text‑based speech editing closer to production‑grade quality.

In summary, the paper presents a well‑structured, theoretically motivated, and empirically validated solution for imperceptible text‑based speech editing. By separating content manipulation from acoustic reconstruction and by guiding the policy with self‑consistency rewards derived from a powerful TTS model, the authors achieve a harmonious blend of edited and original speech, setting a new benchmark for future research in editable speech synthesis.

