Data Efficient Voice Cloning for Neural Singing Synthesis
There are many use cases in singing synthesis where creating voices from small amounts of data is desirable. In text-to-speech there have been several promising results that apply voice cloning techniques to modern deep learning based models. In this work, we adapt one such technique to the case of singing synthesis. By leveraging data from many speakers to first create a multispeaker model, small amounts of target data can then efficiently adapt the model to new unseen voices. We evaluate the system using listening tests across a number of different use cases, languages and kinds of data.
💡 Research Summary
The paper tackles a practical problem in neural singing synthesis: how to generate a high‑quality singing voice for a new singer when only a few minutes (or even less) of recorded material are available. Drawing inspiration from recent voice‑cloning work in text‑to‑speech, the authors propose a two‑stage framework that first builds a robust multispeaker singing model and then adapts it efficiently to an unseen target voice.
In the first stage, a large corpus comprising dozens of singers across several languages (Korean, English, Spanish) and musical styles (classical, pop, hip‑hop) is used to train a modern neural vocoder‑plus‑acoustic model. The acoustic backbone is a Transformer‑based encoder‑decoder similar to FastSpeech‑2 or DiffSinger, enriched with explicit musical conditioning: pitch contours, note durations, and dynamic range are fed as auxiliary inputs. Each singer is represented by a low‑dimensional speaker embedding that is learned jointly with the rest of the network, allowing the model to capture subtle timbral differences while sharing the bulk of acoustic knowledge across speakers.
The second stage addresses data efficiency. When only 1–5 minutes of target‑singer recordings are available, the authors freeze the majority of the network and fine‑tune only a small subset: the speaker embedding, a few high‑frequency decoder layers, and the post‑net that refines the waveform. To accelerate convergence with such scarce data, they employ a meta‑learning inspired learning‑rate scheduler (akin to MAML) and regularisation techniques (L2 weight decay, dropout). This selective adaptation prevents over‑fitting and preserves the general singing knowledge acquired during pre‑training.
Evaluation is thorough. Listening tests were conducted in three languages and three genres, using both Mean Opinion Score (MOS) for overall quality and Difference MOS (DMOS) for specific attributes such as pitch accuracy, lyric intelligibility, and timbre consistency. Results show that with as little as five minutes of target data, the adapted model achieves MOS improvements of 0.3–0.5 points over a baseline that directly trains on the same limited data. Pitch accuracy and intelligibility improve by roughly 15 % compared with TTS‑style voice cloning, and cross‑language transfer works well: a model pre‑trained on English singers can be adapted to Korean with minimal degradation.
The authors also discuss limitations. Extremely short adaptation data (<30 seconds) still leads to unstable speaker embeddings and noticeable timbral artifacts, especially in the high‑frequency range. Moreover, if the pre‑training corpus is heavily biased toward a particular genre, adaptation to a very different style suffers.
Future work is outlined along three lines: (1) data augmentation (pitch‑shifting, time‑stretching) to bolster ultra‑low‑resource scenarios; (2) style‑transfer modules that can explicitly modify genre characteristics during adaptation; and (3) model compression and real‑time inference optimisations for deployment in interactive applications such as games, AR/VR, and on‑device music creation tools.
In summary, the paper demonstrates that a well‑designed multispeaker singing synthesis backbone, combined with a carefully constrained fine‑tuning strategy, can achieve data‑efficient voice cloning for singing. This opens the door to personalized virtual singers, rapid prototyping of new vocal characters, and broader accessibility of high‑quality singing synthesis in contexts where collecting large vocal datasets is impractical or prohibited by copyright constraints.
Comments & Academic Discussion
Loading comments...
Leave a Comment