UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching
In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
💡 Research Summary
UniverSR introduces a vocoder‑free, flow‑matching based framework for audio super‑resolution (ASR) that directly models the conditional distribution of complex‑valued spectral coefficients. Traditional two‑stage pipelines first upsample a low‑resolution mel‑spectrogram and then rely on a pre‑trained neural vocoder to synthesize waveforms, which introduces a hard quality ceiling due to the vocoder’s limited phase reconstruction capability. In contrast, UniverSR bypasses the mel representation entirely: a low‑resolution waveform is first sinc‑interpolated to the target length, transformed with STFT, and split into a low‑frequency band (Xₗ) and a missing high‑frequency band (Xₕ). The high‑frequency band is treated as a spectrum‑inpainting problem and generated by a conditional flow‑matching network.
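The preprocessing above (bandlimited upsampling, STFT, band split at the input's Nyquist frequency) can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the FFT zero-padding stands in for sinc interpolation, and the STFT parameters (n_fft=512, hop=128) and function names are assumptions.

```python
import numpy as np

def fft_upsample(x, factor):
    """Bandlimited upsampling via zero-padding in the frequency domain;
    a stand-in for the paper's sinc interpolation (assumes no energy at
    the original Nyquist bin)."""
    n = len(x)
    X = np.fft.rfft(x)
    X_pad = np.zeros(n * factor // 2 + 1, dtype=complex)
    X_pad[: len(X)] = X
    return np.fft.irfft(X_pad, n=n * factor) * factor  # rescale amplitude

def stft(x, n_fft=512, hop=128):
    """Naive complex STFT (Hann window, no centering)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.stack([np.fft.rfft(f) for f in frames], axis=1)  # (freq, time)

# Toy example: a 1 s, 1 kHz tone at 8 kHz, upsampled x6 to 48 kHz.
sr_lr, factor = 8000, 6
x_lr = np.sin(2 * np.pi * 1000 * np.arange(sr_lr) / sr_lr)
x_up = fft_upsample(x_lr, factor)

X = stft(x_up)
cutoff_hz = sr_lr / 2                              # Nyquist of the low-res input
cutoff_bin = int(512 * cutoff_hz / (sr_lr * factor))
X_l, X_h = X[:cutoff_bin], X[cutoff_bin:]          # observed vs. to-be-generated
```

Since the upsampled tone lies entirely below the cutoff, essentially all spectral energy lands in Xₗ; Xₕ is the empty band the flow-matching network must inpaint.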
The core component, the Vector Field Estimator (VFE), is a U‑Net built from ConvNeXt‑V2 blocks. It receives rich conditioning: (1) acoustic features extracted from Xₗ via a dedicated feature encoder, (2) sinusoidal positional embeddings for frequency bins, and (3) global context embeddings for time and sampling rate. FiLM modulation combines the acoustic and positional information, while the global embeddings are added inside each ConvNeXt block, enabling the model to adapt to varying input rates and temporal contexts.
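FiLM modulation itself is a feature-wise affine transform, h → γ(c)·h + β(c), where γ and β are learned projections of the conditioning vector. A minimal numpy sketch (shapes and projection matrices here are illustrative, not the VFE's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

def film(h, cond, W_gamma, W_beta):
    """FiLM: scale and shift each channel of h by projections of cond.
    In the paper's VFE this happens inside ConvNeXt-V2 blocks with
    learned linear layers; plain matrix products stand in for them here."""
    gamma = cond @ W_gamma                      # (batch, channels)
    beta = cond @ W_beta                        # (batch, channels)
    return gamma[:, :, None] * h + beta[:, :, None]

batch, channels, time, cond_dim = 2, 8, 16, 4
h = rng.standard_normal((batch, channels, time))   # hidden activations
c = rng.standard_normal((batch, cond_dim))         # acoustic + positional cond
W_g = rng.standard_normal((cond_dim, channels)) * 0.1
W_b = rng.standard_normal((cond_dim, channels)) * 0.1
out = film(h, c, W_g, W_b)                         # same shape as h
```

The modulation preserves the activation shape, so it can be dropped into any block without changing the surrounding architecture.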
Conditional Flow Matching (CFM) defines a Gaussian path pₜ(x|x₁)=𝒩(x; t·x₁, σₜ²I) with σₜ=1−(1−σ_min)·t. For a base noise sample x₀∼𝒩(0,I), the corresponding flow is ψₜ(x₀)=(1−(1−σ_min)·t)·x₀ + t·x₁, and the target vector field along it is uₜ(x₀|x₁)=x₁−(1−σ_min)·x₀. The network is trained by minimizing the L₂ distance between its prediction v_θ(ψₜ(x₀), t, c) and this target. During training, the low‑frequency conditioning cₗf is randomly replaced with a null embedding to enable classifier‑free guidance (CFG).
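The path and its regression target can be written out directly. A small numpy sketch, assuming σ_min = 10⁻⁴ (the exact value is not stated here) and treating the complex spectrum as a flattened real vector for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_min = 1e-4  # assumed value of the paper's sigma_min

def cfm_path_and_target(x0, x1, t):
    """Point psi_t(x0) on the Gaussian path and the regression target
    u_t = x1 - (1 - sigma_min) * x0, which is constant in t for this
    (straight-line) path."""
    psi_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1 - sigma_min) * x0
    return psi_t, u_t

# One training example: v_theta(psi_t, t, c) would regress u_t.
x1 = rng.standard_normal(16)   # "data": flattened hi-freq spectrum
x0 = rng.standard_normal(16)   # base noise
t = rng.uniform()
psi_t, u_t = cfm_path_and_target(x0, x1, t)
```

Note that dψₜ/dt = x₁ − (1−σ_min)·x₀ = uₜ, which is why a network that predicts uₜ can later be integrated as an ODE vector field.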
At inference, a sample x₀∼𝒩(0,I) is evolved through the ODE dxₜ/dt = v_θ(xₜ, t, c) using a numerical solver from t=0 to 1. CFG combines the unconditional and conditional vector fields as v = v_uncond + ω·(v_cond − v_uncond) with a guidance scale ω (default 1.5). The generated high‑frequency spectrum ˆXₕ is concatenated with Xₗ, inverse power‑law scaling is applied, and iSTFT reconstructs the high‑resolution waveform.
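The sampling loop reduces to a few lines with an Euler solver. A hedged numpy sketch: the step count, the toy vector field, and the function names are illustrative, while the CFG combination and ω = 1.5 follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cfg(v_theta, c, shape, steps=32, omega=1.5):
    """Euler integration of dx/dt = v_theta(x, t, c) from t=0 to 1 with
    classifier-free guidance: v = v_uncond + omega * (v_cond - v_uncond).
    v_theta is any callable; passing c=None means the null embedding."""
    x = rng.standard_normal(shape)        # x_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_c = v_theta(x, t, c)            # conditional vector field
        v_u = v_theta(x, t, None)         # null-conditioned field
        x = x + dt * (v_u + omega * (v_c - v_u))
    return x

# Toy stand-in for v_theta: pulls samples toward a fixed target only
# when conditioning is present.
target = np.full(8, 3.0)
def toy_v(x, t, c):
    goal = target if c is not None else np.zeros_like(x)
    return goal - x

x_hat = sample_cfg(toy_v, c="cond", shape=8)
```

With ω > 1 the guided field overshoots the conditional one, which is what sharpens conditioning in practice; in UniverSR the resulting x₁ is the predicted ˆXₕ fed into the iSTFT stage.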
Experiments train two models: a unified model on a large, diverse corpus (speech, music, sound effects; ~731 h) and a speech‑only model on VCTK. Input sampling rates of 8, 12, 16, and 24 kHz are randomly selected during training, covering upsampling factors from ×2 to ×6. The unified model contains only 57 M parameters, far fewer than AudioSR (672 M) and FlashSR (639 M).
Objective metrics (LSD‑HF and the perceptual 2f‑model) show that UniverSR consistently outperforms baselines on music and sound‑effect domains while achieving competitive scores on speech. Subjective MOS tests with 12 expert listeners reveal that UniverSR attains the highest average MOS across all domains, even surpassing vocoded ground truth, suggesting that bypassing the vocoder avoids the subtle pitch instabilities neural vocoders often introduce.
Key contributions are: (1) a fully end‑to‑end vocoder‑free ASR pipeline that directly predicts complex spectra, (2) the use of flow matching to drastically reduce sampling steps compared to diffusion models, and (3) a single, lightweight architecture capable of handling multiple audio domains and sampling rates. By preserving phase information intrinsically, UniverSR opens the door for high‑fidelity applications such as speech enhancement, music restoration, and broader generative audio tasks without the constraints of traditional vocoder‑based pipelines.