Tutti: Expressive Multi-Singer Synthesis via Structure-Level Timbre Control and Vocal Texture Modeling

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

While existing Singing Voice Synthesis systems achieve high-fidelity solo performances, they are constrained by global timbre control, failing to address dynamic multi-singer arrangement and vocal texture within a single song. To address this, we propose Tutti, a unified framework designed for structured multi-singer generation. Specifically, we introduce a Structure-Aware Singer Prompt to enable flexible singer scheduling evolving with musical structure, and propose Complementary Texture Learning via Condition-Guided VAE to capture implicit acoustic textures (e.g., spatial reverberation and spectral fusion) that are complementary to explicit controls. Experiments demonstrate that Tutti excels in precise multi-singer scheduling and significantly enhances the acoustic realism of choral generation, offering a novel paradigm for complex multi-singer arrangement. Audio samples are available at https://annoauth123-ctrl.github.io/Tutii_Demo/.


💡 Research Summary

The paper addresses a fundamental limitation of current Singing Voice Synthesis (SVS) systems: they treat timbre as a global, static condition and therefore cannot model dynamic multi‑singer arrangements or the acoustic interactions that arise when several voices sing together. To overcome this, the authors propose Tutti, a unified framework that introduces two novel components – a Structure‑Aware Singer Prompt and a Condition‑Guided Variational Auto‑Encoder (VAE) for complementary texture learning – and integrates them into a latent diffusion transformer (DiT) backbone.

Structure‑Aware Singer Prompt
The system first segments a song into non‑overlapping structural units (verse, chorus, bridge) using SongPrep. For each unit, a CAM++ model extracts a singer embedding. By measuring cosine similarity between embeddings of verse sections, the method identifies distinct singers and builds a global singer set for the whole track. In chorus or bridge sections, the algorithm matches the current embedding against the global set; if multiple matches exceed a predefined similarity threshold, the segment is labeled as a multi‑singer part. All identified singer embeddings for a segment are then fused by a self‑attention‑based Adaptive Singer Prompt Fuser, producing a single, structure‑aware timbre condition C_singer that reflects both lead‑voice and harmony‑voice relationships. This dynamic prompt enables precise scheduling of soloists, duets, and full‑choir ensembles according to the musical form.
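The matching logic described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the CAM++ embeddings are stood in for by plain NumPy vectors, and the 0.6 similarity threshold is an assumed placeholder (the paper only says "a predefined similarity threshold").

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two singer embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_singer_set(verse_embeddings, threshold=0.6):
    """Build the global singer set from verse-level embeddings:
    an embedding starts a new singer entry unless it matches an
    existing entry above the similarity threshold (assumed value)."""
    singers = []
    for emb in verse_embeddings:
        if not any(cosine_sim(emb, s) >= threshold for s in singers):
            singers.append(emb)
    return singers

def label_segment(segment_emb, singer_set, threshold=0.6):
    """Match a chorus/bridge embedding against the global set;
    more than one match flags the segment as multi-singer."""
    matches = [i for i, s in enumerate(singer_set)
               if cosine_sim(segment_emb, s) >= threshold]
    return matches, len(matches) > 1
```

In the full system, the embeddings of all matched singers for a segment would then be fused by the self-attention-based Adaptive Singer Prompt Fuser into the single condition C_singer.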

Condition‑Guided VAE for Texture Modeling
Standard VAEs compress all acoustic information—including pitch, content, timbre, and subtle texture—into a single latent vector, making it difficult to control texture independently. Tutti trains a vocal‑specific VAE (based on Stable Audio 2.0) on a dedicated singing dataset. During training, the encoder maps a clean vocal waveform y to a latent z, which is then corrupted with Gaussian noise to obtain \tilde{z}. The decoder receives explicit conditions (lyrics, structural labels, and the Structure‑Aware Singer Prompt) and is tasked with reconstructing the waveform from \tilde{z}. Because the explicit conditions already supply content and timbre, the decoder is forced to rely on the residual information in \tilde{z} to restore “texture” – aspects such as spatial reverberation, inter‑part spectral fusion, breathiness, and other implicit acoustic cues. The VAE loss combines multi‑resolution STFT, KL divergence, and an adversarial term to ensure high‑frequency fidelity and perceptual realism.
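Two pieces of this training recipe lend themselves to a concrete sketch: the Gaussian corruption of the clean latent, and the multi-resolution STFT term of the reconstruction loss. The sketch below assumes a simple spectral-convergence formulation with hop size n_fft/4 and a fixed noise scale sigma; the paper's exact loss weighting, KL term, and adversarial term are omitted.

```python
import numpy as np

def stft(x, n_fft, hop):
    """Hann-windowed STFT of a 1-D signal (frames stacked row-wise)."""
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * np.hanning(n_fft), axis=-1)

def corrupt_latent(z, sigma=0.1, rng=None):
    """Add Gaussian noise to the clean latent z so the decoder cannot
    recover content/timbre from z alone and must lean on the explicit
    conditions; the residual it still needs from z is texture.
    (sigma is an assumed value.)"""
    if rng is None:
        rng = np.random.default_rng()
    return z + sigma * rng.standard_normal(z.shape)

def multi_res_stft_loss(y_hat, y, fft_sizes=(512, 1024, 2048)):
    """Spectral-convergence loss averaged over several FFT resolutions,
    standing in for the multi-resolution STFT reconstruction term."""
    total = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4
        S_hat = np.abs(stft(y_hat, n_fft, hop))
        S = np.abs(stft(y, n_fft, hop))
        total += np.linalg.norm(S - S_hat) / (np.linalg.norm(S) + 1e-8)
    return total / len(fft_sizes)
```

A training step would encode y to z, corrupt it with `corrupt_latent`, decode under the explicit conditions, and apply this STFT term alongside the KL and adversarial losses.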

Overall Architecture
Tutti’s generation pipeline consists of: (1) a VAE encoder that extracts a texture latent Z_texture from a reference recording; (2) a DiT backbone (LLaMA‑style transformer) that predicts the target vocal latent z₀ conditioned on lyrics L, structure labels S, the fused singer prompt C_singer, and Z_texture. All conditions are aligned to the latent frame rate, broadcast or down‑sampled as needed, and concatenated with the noisy latent z_t at each diffusion step. The diffusion process follows a flow‑matching formulation, learning a velocity field v_θ(z_t, t, c) that transports noise to data. Timestep sampling uses a logit‑normal distribution, allowing flexible weighting between early‑noise and late‑data regimes. After diffusion, the VAE decoder reconstructs high‑fidelity waveforms from the predicted latent.
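The flow-matching objective with logit-normal timestep sampling can be sketched as below. The rectified-flow parameterization (linear interpolation z_t = (1-t) z0 + t·eps with velocity target eps - z0) and the mean/std of the logit-normal are assumptions for illustration; the summary only states that a velocity field v_θ(z_t, t, c) is learned and that timesteps follow a logit-normal distribution.

```python
import numpy as np

def sample_timesteps(n, mean=0.0, std=1.0, rng=None):
    """Logit-normal timestep sampling: t = sigmoid(u), u ~ N(mean, std).
    Shifting mean/std reweights the noise-heavy vs data-heavy regimes."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.normal(mean, std, size=n)
    return 1.0 / (1.0 + np.exp(-u))

def flow_matching_loss(v_theta, z0, cond, rng=None):
    """Rectified-flow training loss (assumed parameterization):
    interpolate z_t = (1 - t) * z0 + t * eps and regress the model's
    velocity prediction onto the target eps - z0."""
    if rng is None:
        rng = np.random.default_rng()
    t = sample_timesteps(z0.shape[0], rng=rng)[:, None]
    eps = rng.standard_normal(z0.shape)
    z_t = (1 - t) * z0 + t * eps
    target = eps - z0
    pred = v_theta(z_t, t, cond)
    return float(np.mean((pred - target) ** 2))
```

Here `cond` stands in for the concatenated conditions (lyrics L, structure labels S, C_singer, Z_texture) that the DiT backbone would receive at each step.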

Experiments
The authors evaluate Tutti on a multilingual multi‑singer dataset covering Korean, Chinese, and English choral pieces. Baselines include DiffSinger, VISinger, and MV‑voice (a zero‑shot multi‑singer model). Metrics assess (a) singer‑scheduling accuracy, (b) a multi‑speaker BLEU‑like similarity score, (c) Fréchet Audio Distance (FAD) for texture quality, and (d) subjective listening tests. Tutti achieves over 93% correct singer scheduling, reduces FAD by 0.12–0.18 relative to baselines, and receives an average 4.3/5 rating for “choral realism” in human evaluations. An ablation that removes the texture VAE shows a noticeable drop in spatial perception and harmonic blending, confirming the importance of the complementary texture module.

Conclusion and Future Work
Tutti demonstrates that by jointly modeling structure‑level timbre control and implicit vocal texture, a single SVS system can generate expressive, realistic multi‑singer performances within one song—a capability previously unavailable. Future directions include real‑time inference optimization, zero‑shot adaptation to new languages and musical styles, and richer joint modeling of lyrics, melody, and rhythm to capture higher‑order musical structures. Overall, Tutti marks a significant step toward moving SVS from a “solo‑paradigm” to a full “choral‑paradigm.”

