FlowSE-GRPO: Training Flow Matching Speech Enhancement via Online Reinforcement Learning
Generative speech enhancement offers a promising alternative to traditional discriminative methods by modeling the distribution of clean speech conditioned on noisy inputs. Post-training alignment via reinforcement learning (RL) effectively aligns generative models with human preferences and downstream metrics in domains such as natural language processing, but its use in speech enhancement remains limited, especially for online RL. Prior work explores offline methods like Direct Preference Optimization (DPO); online methods such as Group Relative Policy Optimization (GRPO) remain largely uninvestigated. In this paper, we present the first successful integration of online GRPO into a flow-matching speech enhancement framework, enabling efficient post-training alignment to perceptual and task-oriented metrics with few update steps. Unlike prior GRPO work on Large Language Models, we adapt the algorithm to the continuous, time-series nature of speech and to the dynamics of flow-matching generative models. We show that optimizing a single reward yields rapid metric gains but often induces reward hacking that degrades audio fidelity despite higher scores. To mitigate this, we propose a multi-metric reward optimization strategy that balances competing objectives, substantially reducing overfitting and improving overall performance. Our experiments validate online GRPO for speech enhancement and provide practical guidance for RL-based post-training of generative audio models.
💡 Research Summary
This paper introduces the first successful integration of online reinforcement learning (RL) into a flow‑matching speech‑enhancement (SE) framework, leveraging Group Relative Policy Optimization (GRPO) to align a generative SE model with perceptual and downstream metrics. Traditional generative SE models learn a conditional distribution of clean speech given noisy input, typically by maximizing likelihood via flow‑matching, which predicts a velocity field that transports a standard Gaussian to the clean mel‑spectrogram. While such models can produce high‑quality audio, they are not directly optimized for human‑centric metrics such as DNSMOS, speaker similarity, or SpeechBERTScore.
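The flow‑matching objective above can be sketched as one training step on the linear path from a Gaussian sample to the clean mel. This is a minimal illustration, not the paper's implementation: `model_velocity` is a hypothetical stand‑in for the conditioned velocity network, and the spectrogram shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_velocity(x_t, t, cond):
    # Hypothetical stand-in for the neural velocity field; the real model
    # is a network conditioned on the noisy mel-spectrogram.
    return np.zeros_like(x_t)

def cfm_training_step(x1, noisy_cond):
    """One conditional flow-matching training step (sketch).

    x1: clean mel-spectrogram target; noisy_cond: noisy mel condition.
    The model learns a velocity field that transports N(0, I) samples
    to x1 along the straight path x_t = (1 - t) * x0 + t * x1.
    """
    x0 = rng.standard_normal(x1.shape)           # Gaussian source sample
    t = rng.uniform()                            # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1                # point on the linear path
    v_target = x1 - x0                           # constant velocity of that path
    v_pred = model_velocity(x_t, t, noisy_cond)  # network prediction
    loss = np.mean((v_pred - v_target) ** 2)     # regression objective
    return loss
```

In practice the loss is averaged over a minibatch of (clean, noisy) pairs and backpropagated through the velocity network; with the zero stand‑in model the loss simply measures the target velocity's magnitude.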
The authors adapt GRPO—originally proposed for large language models—to the continuous, time‑series domain of speech. They first convert the deterministic ODE sampler of the flow‑matching model into an equivalent stochastic differential equation (SDE) to introduce the randomness required for on‑policy RL. Because SE uses the opposite time direction (0 → 1) compared with the original Flow‑GRPO (1 → 0), they carefully re‑parameterize the equations and introduce a “window training” scheme: only a small early subset of the denoising steps (e.g., two out of ten) is sampled stochastically, while the remaining steps follow the deterministic ODE. This dramatically reduces computational load while preserving the ability to explore policy space.
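The mixed ODE/SDE sampling with a stochastic window might look roughly as follows. This is a simplified sketch under stated assumptions: the paper's exact SDE drift correction is omitted, the Euler–Maruyama-style noise injection with level `a` is illustrative, and `model_velocity` is a hypothetical stand‑in for the conditioned network.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_velocity(x, t, cond):
    # Hypothetical stand-in: pull the sample toward the conditioning signal.
    return cond - x

def sample_with_window(x0, cond, n_steps=10, window=(0, 1), a=0.2):
    """Mixed ODE/SDE sampling with 'window training' (illustrative sketch).

    Integrates from t = 0 to t = 1 (the SE time direction). Only steps whose
    indices appear in `window` use a stochastic update that enables
    on-policy exploration; all other steps take deterministic Euler ODE steps.
    """
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = model_velocity(x, t, cond)
        if i in window:
            # Stochastic step: inject noise scaled by the SDE noise level a.
            noise = rng.standard_normal(x.shape)
            x = x + v * dt + a * np.sqrt(dt) * noise
        else:
            # Deterministic Euler ODE step.
            x = x + v * dt
    return x
```

With `window=(0, 1)` only the first two of ten steps are stochastic, matching the configuration described above; log‑probabilities for the policy update are then computed only on those stochastic steps.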
In the GRPO formulation, each noisy prompt c defines a Markov decision process (MDP) where the state consists of the noisy mel, the current time step, and the current latent sample x_t. The action is the next latent sample x_{t‑1} generated by the model’s conditional Gaussian policy π_θ. A reward is given only at the final step after the mel is decoded to waveform; the reward can be any downstream metric. For each prompt, the policy generates a group of G candidate outputs (G=10 in the experiments) using the mixed ODE‑SDE sampler, and the group‑wise advantage A_i is computed as the z‑score of each candidate’s reward within the group. The GRPO loss combines a clipped importance‑weight term with a KL‑regularization toward a reference policy, encouraging stable updates.
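The group‑wise z‑score advantage and the clipped‑surrogate objective with KL regularization can be sketched as below. The clipping range and KL coefficient values here are illustrative assumptions, not the paper's hyperparameters.

```python
import numpy as np

def group_advantages(rewards):
    """Group-relative advantage: z-score of each candidate's reward
    within its group of G samples for the same noisy prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_loss(log_probs, old_log_probs, ref_log_probs, advantages,
              clip_eps=0.2, kl_coeff=0.01):
    """Clipped-surrogate GRPO objective with KL regularization (sketch).

    log_probs: current-policy log-likelihoods of the sampled actions;
    old_log_probs: behavior policy; ref_log_probs: frozen reference policy.
    clip_eps and kl_coeff are placeholder values.
    """
    ratio = np.exp(log_probs - old_log_probs)                    # importance weights
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_term = -np.mean(np.minimum(unclipped, clipped))       # maximize surrogate
    kl_term = np.mean(log_probs - ref_log_probs)                 # simple KL estimate
    return policy_term + kl_coeff * kl_term
```

Because the advantages are z‑scored within each group, no learned value function is needed: candidates that beat their group mean get positive advantage, the rest negative, which is what makes GRPO attractive for expensive audio rollouts.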
The authors first explore single‑metric optimization. Optimizing only DNSMOS quickly raises the non‑intrusive quality score from ~3.36 to ~3.59 within ~1.5 k steps, but introduces “reward hacking”: the model learns to inflate DNSMOS by adding artifacts that do not improve perceived quality. Similar phenomena appear when optimizing only speaker similarity or SpeechBERTScore; the other metrics degrade. To mitigate this, they propose a multi‑metric reward: a weighted sum of normalized DNSMOS, speaker similarity, and SpeechBERTScore, with λ₁=0.6, λ₂=λ₃=1. This balances the competing objectives and prevents any single metric from dominating the learning signal.
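The weighted multi‑metric reward is a one‑liner once the metrics are normalized. The λ weights below follow the values quoted above; the normalization ranges are illustrative assumptions (the paper's exact normalization is not specified here).

```python
def multi_metric_reward(dnsmos, spk_sim, sbert, lam=(0.6, 1.0, 1.0)):
    """Weighted multi-metric reward with λ1 = 0.6, λ2 = λ3 = 1.

    Normalization ranges are assumptions for illustration:
    DNSMOS OVRL is mapped from [1, 5] to [0, 1]; speaker similarity
    and SpeechBERTScore are assumed to lie in [0, 1] already.
    """
    dnsmos_n = (dnsmos - 1.0) / 4.0
    spk_n = spk_sim
    sbert_n = sbert
    return lam[0] * dnsmos_n + lam[1] * spk_n + lam[2] * sbert_n
```

Because DNSMOS carries the smallest weight, a rollout cannot win its group purely by inflating the non‑intrusive quality score while degrading speaker identity or content fidelity, which is the reward‑hacking failure mode observed in single‑metric training.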
Experiments are conducted on the DNS‑2020 challenge test sets (No‑Reverb, With‑Reverb, Real‑Recording). The base flow‑matching model (FlowSE‑FM) already matches or slightly exceeds prior flow‑based SE baselines (Flow‑SR). After 5 k GRPO steps with the multi‑metric reward, the model improves overall DNSMOS (OVRL) from 3.373 to 3.549 (+0.176), speaker similarity from 88.88 % to 90.43 % (+1.55 pp), and SpeechBERTScore from 86.35 % to 86.72 % (+0.37 pp) on the No‑Reverb set. Similar gains are observed on With‑Reverb (+0.276 OVRL) and Real‑Recording (+0.241 OVRL). Compared with offline Direct Preference Optimization (DPO), which requires 20 k steps, GRPO achieves larger metric improvements with only 5 k steps, demonstrating the efficiency of on‑policy learning.
Ablation studies examine the impact of the SDE noise level a and the window‑training strategy. Larger a (e.g., 0.4) expands exploration and speeds up early learning but can exacerbate reward hacking if too high. The window‑training (sampling only the first two steps) reduces training time by ~80 % without sacrificing performance, confirming that stochasticity is most needed early in the diffusion trajectory.
In summary, the paper makes several key contributions: (1) adapts GRPO to continuous speech enhancement, handling reversed time ordering and flow‑matching dynamics; (2) introduces a practical multi‑metric reward to avoid reward hacking; (3) demonstrates that online on‑policy RL can outperform offline preference optimization with far fewer updates; and (4) provides extensive empirical analysis on standard SE benchmarks. The work opens the door for rapid post‑training alignment of generative audio models to human‑centric criteria, which is crucial for real‑world deployment where perceptual quality and downstream task performance matter more than raw likelihood.