Latent Flow Matching for Expressive Singing Voice Synthesis
Reading time: 5 minutes
...
📝 Original Info
Title: Latent Flow Matching for Expressive Singing Voice Synthesis
ArXiv ID: 2601.00217
Date: 2026-01-01
Authors: Minhyeok Yun, Yong-Hoon Choi
📝 Abstract
Conditional variational autoencoder (cVAE)-based singing voice synthesis provides efficient inference and strong audio quality by learning a score-conditioned prior and a recording-conditioned posterior latent space. However, because synthesis relies on prior samples while training uses posterior latents inferred from real recordings, imperfect distribution matching can cause a prior-posterior mismatch that degrades fine-grained expressiveness such as vibrato and micro-prosody. We propose FM-Singer, which introduces conditional flow matching (CFM) in latent space to learn a continuous vector field transporting prior latents toward posterior latents along an optimal-transport-inspired path. At inference time, the learned latent flow refines a prior sample by solving an ordinary differential equation (ODE) before waveform generation, improving expressiveness while preserving the efficiency of parallel decoding. Experiments on Korean and Chinese singing datasets demonstrate consistent improvements over strong baselines, including lower mel-cepstral distortion and fundamental-frequency error and higher perceptual scores on the Korean dataset. Code, pretrained checkpoints, and audio demos are available at https://github.com/alsgur9368/FM-Singer
💡 Deep Analysis
📄 Full Content
Singing voice synthesis aims to generate natural and expressive singing waveforms from symbolic musical scores such as lyrics/phonemes, note pitch, and note durations.
Compared to text-to-speech (TTS), singing voice synthesis must model a broader range of expressive phenomena (vibrato, timing offsets relative to the beat, dynamic accents, breathiness, and singer-specific timbral traits) while remaining faithful to strict musical constraints such as pitch targets and note boundaries. Although neural singing voice synthesis has substantially improved pitch accuracy and audio fidelity, generating fine-grained expressiveness remains challenging because these attributes are highly variable across singers and musical contexts and appear as subtle, localized deviations in pitch and spectral envelope.
A common strategy for the one-to-many nature of singing expression is to introduce latent variables that capture performance-specific variability beyond the score. End-to-end architectures derived from efficient TTS have been adapted to singing voice synthesis, where a conditional variational autoencoder (cVAE) latent variable is combined with adversarial learning to enable parallel generation and high-quality waveform synthesis [1]. VISinger and VISinger2 adopt a variational framework with adversarial training and signal-processing-inspired components, achieving strong results with efficient inference [2], [3]. Period Singer further highlights the importance of latent representations for singing characteristics by modeling periodic and aperiodic components with variational variants [4]. Despite these advances, cVAE-based singing voice synthesis typically uses a relatively simple score-conditioned prior and encourages prior-posterior alignment through Kullback-Leibler (KL) regularization. In practice, posterior latents inferred from real recordings during training can encode rich and multi-modal expressive cues, whereas inference uses samples from the prior; any residual mismatch can weaken expressiveness, particularly for oscillatory pitch patterns (vibrato) and subtle timbral fluctuations.
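For reference, cVAE backbones in this family [1]-[3] typically tie the two latent distributions together with a KL term of the form

$$
\mathcal{L}_{\mathrm{KL}} = D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\big\|\, p_\theta(z \mid c)\big),
$$

where q_φ(z | x) is the recording-conditioned posterior, p_θ(z | c) is the score-conditioned prior, and both are usually modeled as diagonal Gaussians. (This is a sketch of the standard objective; the paper's exact conditioning and loss weighting may differ.) Because this term is only minimized on average over the training data, prior samples drawn at inference time can still miss parts of the posterior distribution, which is exactly the residual mismatch described above.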
Recent advances in diffusion and flow-based generative modeling have been explored to improve detail and stability, but diffusion-based approaches typically incur additional
inference cost due to multiple sampling steps [5]. In parallel, flow matching has emerged as a simulation-free method to train continuous normalizing flows by regressing a vector field along a chosen probability path, offering stable training and fewer numerical integration steps than many diffusion setups [6]. Flow matching has also been adopted for technique-controllable multilingual singing voice synthesis, indicating its potential for expressive generation [7]. Consistency-model-style approaches reduce the number of steps while maintaining quality, including work targeting speech and singing synthesis [8].
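To make the flow-matching recipe concrete, here is a minimal PyTorch sketch of the simulation-free CFM objective along an optimal-transport path in the spirit of [6]. The function name, tensor shapes, and the `vector_field(z_t, t, cond)` interface are illustrative assumptions, not the authors' implementation.

```python
import torch

def cfm_loss(vector_field, z0, z1, cond, sigma_min=1e-4):
    """Conditional flow matching loss along an OT-style path from z0 to z1.

    z0: source samples (here: prior latents), shape (batch, channels, frames)
    z1: target samples (here: posterior latents), same shape
    vector_field: callable (z_t, t, cond) -> predicted velocity
    """
    t = torch.rand(z0.size(0), 1, 1, device=z0.device)     # t ~ U[0, 1], broadcast over (C, T)
    z_t = (1.0 - (1.0 - sigma_min) * t) * z0 + t * z1      # interpolant on the OT path
    u_t = z1 - (1.0 - sigma_min) * z0                      # closed-form target velocity of that path
    v_t = vector_field(z_t, t.view(-1), cond)              # network prediction
    return torch.mean((v_t - u_t) ** 2)                    # simulation-free regression objective
```

Because the target velocity is available in closed form, no ODE simulation is needed during training, which is the stability and efficiency argument made above.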
This paper proposes FM-Singer, which combines the efficiency of cVAE-based singing voice synthesis [2], [3] with the expressiveness of flow matching [6] by introducing latent-space conditional flow matching. Rather than learning a flow over waveform or spectrogram space, we learn a continuous vector field that transports score-conditioned prior latents toward recording-conditioned posterior latents, targeting the mismatch at its source. At inference time, the learned latent flow refines a prior sample via ODE integration before waveform generation, improving expressiveness while preserving efficient parallel decoding. In summary, this work introduces an explicit latent transport mechanism that mitigates prior-posterior mismatch in cVAE-based singing voice synthesis, presents a lightweight CFM module compatible with fast parallel decoding, and provides an empirical evaluation on Korean and Chinese benchmarks (including OpenCpop [9]) demonstrating improvements in objective metrics and perceptual quality.
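At inference, the refinement described here amounts to integrating dz/dt = v(z, t, c) from t = 0 (a prior sample) to t = 1. A minimal sketch with a fixed-step Euler solver follows; the paper does not specify the solver or step count in this excerpt, so both are assumptions:

```python
import torch

@torch.no_grad()
def refine_latent(vector_field, z_prior, cond, n_steps=10):
    """Refine a prior latent by Euler-integrating the learned latent flow."""
    z = z_prior
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((z.size(0),), i * dt, device=z.device)
        z = z + dt * vector_field(z, t, cond)   # one explicit Euler step of dz/dt = v(z, t, c)
    return z                                    # refined latent, handed to the waveform generator
```

Fewer, coarser steps trade refinement quality for speed, consistent with the paper's emphasis on preserving efficient parallel decoding.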
The remainder of this paper is organized as follows. Section II presents the proposed architecture. Section III describes the training objective. Section IV reports experiments and analysis, and Section V concludes.
FM-Singer augments a conditional variational autoencoder (cVAE)-based singing voice synthesis backbone with a latent-space conditional flow matching (CFM) module. The overall training and inference pipeline is illustrated in Fig. 1, and the latent refinement process based on CFM and ordinary differential equation (ODE) integration is depicted in Fig. 2. The model consists of a prior encoder, a posterior encoder, a latent refinement module trained by CFM, and a waveform generator trained with adversarial learning.
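One way to picture how these four modules interact in a training step is the sketch below; the encoder/decoder interfaces and the equal loss weighting are placeholder assumptions, and the adversarial and reconstruction terms are elided:

```python
import torch

def train_step(prior_enc, post_enc, flow, decoder, c, x):
    """Illustrative wiring of the four FM-Singer modules during training.

    prior_enc(c) and post_enc(x) each return (mu, logvar) of a diagonal
    Gaussian; flow(z_t, t, c) predicts a velocity; decoder(z) emits audio.
    """
    mu_p, logvar_p = prior_enc(c)                                    # score-conditioned prior
    mu_q, logvar_q = post_enc(x)                                     # recording-conditioned posterior
    z0 = mu_p + torch.randn_like(mu_p) * torch.exp(0.5 * logvar_p)   # flow source
    z1 = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)   # flow target

    # cVAE KL regularizer, KL(q || p) for diagonal Gaussians
    kl = 0.5 * torch.mean(
        logvar_p - logvar_q
        + (torch.exp(logvar_q) + (mu_q - mu_p) ** 2) / torch.exp(logvar_p)
        - 1.0
    )

    # latent-space CFM loss (linear path shown for brevity; cf. cfm_loss above)
    t = torch.rand(z0.size(0), 1, 1, device=z0.device)
    z_t = (1.0 - t) * z0 + t * z1
    loss_cfm = torch.mean((flow(z_t, t.view(-1), c) - (z1 - z0)) ** 2)

    y_hat = decoder(z1)          # posterior latents feed the GAN-trained generator
    return kl + loss_cfm, y_hat  # adversarial / reconstruction losses omitted
```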
Let 𝑐 denote the music-score conditioning, including phoneme/lyric tokens, note pitch, and note duration (or duration-related alignment). Let 𝑦 be the ground-truth singing waveform and 𝑥 = Mel(𝑦) be the corresponding mel-spectrogram. The goal is to synthesize a natural and expressive singing waveform from 𝑐 alone.
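The mel analysis 𝑥 = Mel(𝑦) can be computed with standard tooling; a sketch using torchaudio, where the file path and STFT/mel hyperparameters are illustrative (the paper's analysis settings are not given in this excerpt):

```python
import torchaudio

# Illustrative analysis configuration; not the paper's actual settings.
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

y, sr = torchaudio.load("example_song.wav")   # hypothetical input, assumed to be at 22.05 kHz
x = mel_fn(y)                                 # shape: (channels, n_mels, frames)
```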