GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse, controllable, and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large, global, multi-technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six commonly used singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompanied by manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion. The demos can be found at http://aaronz345.github.io/GTSingerDemo/. We provide the dataset and the code for processing data and conducting benchmarks at https://huggingface.co/datasets/AaronZ345/GTSinger and https://github.com/AaronZ345/GTSinger.
💡 Research Summary
The paper addresses a critical bottleneck in singing‑related AI research: the lack of a publicly available, high‑quality, multi‑task singing dataset. Existing corpora suffer from low audio fidelity, limited language and singer diversity, absence of detailed technique annotations, unrealistic musical scores, and poor suitability for emerging tasks such as technique‑controllable synthesis, technique recognition, style transfer, and speech‑to‑singing (STS) conversion. To overcome these limitations, the authors introduce GTSinger, a large‑scale, globally diverse, multi‑technique singing corpus accompanied by realistic music scores and extensive annotations.
Dataset Scale and Recording Protocol
GTSinger comprises 80.59 hours of clean, studio‑recorded vocal tracks (48 kHz, 24‑bit WAV) spanning 1,366 songs. Recordings were performed by 20 professional singers covering four vocal ranges (alto, soprano, tenor, bass) and nine widely spoken languages: Mandarin, English, Japanese, Korean, Russian, Spanish, French, German, and Italian. For each song, singers recorded two versions: a control group (natural singing without the targeted technique) and a technique group in which one of six commonly used singing techniques—mixed voice, falsetto, breathy, pharyngeal, vibrato, and glissando—is applied densely. Singers also recorded a spoken version of the lyrics, yielding 16.16 hours of paired speech for STS research. This dual‑recording design enables precise controlled comparisons while keeping lyrics, rhythm, and key identical across conditions.
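The dual-recording design can be exploited programmatically by pairing each song's control and technique versions. A minimal sketch, assuming a hypothetical metadata layout (the field names and values below are illustrative, not the released schema):

```python
# Hypothetical sketch: pairing control-group and technique-group recordings
# of the same song for controlled comparison. The metadata fields
# ("singer", "song", "group", "technique", "wav") are assumptions,
# not GTSinger's actual released schema.
from collections import defaultdict

def pair_recordings(entries):
    """Group recordings by (singer, song); keep songs with both versions.

    Each entry is a dict such as:
      {"singer": "EN-Tenor-1", "song": "song_001",
       "group": "control" or "technique",
       "technique": "falsetto" or None, "wav": "path/to/file.wav"}
    """
    by_song = defaultdict(dict)
    for e in entries:
        by_song[(e["singer"], e["song"])][e["group"]] = e
    # Only songs with both versions allow a controlled comparison,
    # since lyrics, rhythm, and key are identical across the pair.
    return {k: v for k, v in by_song.items()
            if "control" in v and "technique" in v}
```

Because the two versions differ only in technique usage, such pairs are directly usable as (input, target) examples for technique-controllable synthesis or as contrastive examples for technique recognition.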
Annotation Pipeline
- Phoneme‑level Alignment – A coarse alignment is generated using the Montreal Forced Aligner (MFA). Language‑specific phoneme inventories (pypinyin for Mandarin, ARPAbet for English, Epitran for Italian, MFA defaults for others) are employed. Trained annotators then manually correct word and phoneme boundaries, as well as unvoiced regions, using Praat, ensuring high‑precision alignment for both singing and speech.
- Technique and Style Labels – Annotators mark, at the phoneme level, the presence of each of the six techniques. While the technique group emphasizes the target technique, other techniques may appear naturally; the control group excludes the target technique but may contain incidental techniques. Global style labels (singing method: pop vs. bel canto; emotion: happy vs. sad; tempo: slow, moderate, fast; range: low, medium, high) are also attached to each recording.
- Realistic Music Scores – Unlike fine‑grained scores that fragment note durations, GTSinger provides realistic MusicXML scores. F0 contours are extracted with RMVPE, converted to MIDI notes via ROSVOT, and then refined by music experts who verify tempo, key, clef, and enforce regular note durations. The final scores contain explicit note types (rest, lyric, slur) and are suitable for direct use in composition software.
- Quality Assurance – For each language, a second expert inspects 25 % of the annotations, checking alignment accuracy, technique consistency, and score correctness. After validation, audio files are segmented into manageable chunks for model training.
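The F0-to-note step of the score pipeline rests on the standard mapping from fundamental frequency to MIDI pitch (A4 = 440 Hz = MIDI 69). A minimal sketch of that conversion, not the RMVPE/ROSVOT tooling itself:

```python
import math

def f0_to_midi(f0_hz: float) -> int:
    """Map a fundamental frequency in Hz to the nearest MIDI note number,
    using the standard A4 = 440 Hz = MIDI 69 reference."""
    return round(69 + 12 * math.log2(f0_hz / 440.0))

def midi_to_name(midi: int) -> str:
    """Human-readable note name for a MIDI number (sharps only)."""
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    return f"{names[midi % 12]}{midi // 12 - 1}"
```

In the actual pipeline, frame-level F0 must additionally be segmented into note events with onsets and durations (the role ROSVOT plays) before experts refine tempo, key, and note values.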
Benchmark Experiments
To demonstrate the dataset’s utility, the authors conduct four benchmark tasks using state‑of‑the‑art models:
- Technique‑controllable SVS – Models conditioned on technique labels generate singing with explicit control over mixed voice, falsetto, etc. Objective metrics (F0 RMSE, V/UV error) and subjective listening tests show substantial improvements over baselines trained on prior corpora.
- Technique Recognition – A phoneme‑level classifier trained on the annotated data achieves high precision/recall for each technique, confirming the usefulness of the fine‑grained labels.
- Style Transfer – Global style embeddings enable conversion of a source singer’s performance into another singer’s timbre, emotion, or tempo while preserving linguistic content. Human evaluation reports high naturalness and style fidelity.
- Speech‑to‑Singing (STS) – Using the paired speech‑singing data, a sequence‑to‑sequence model learns to map spoken lyrics to sung renditions, yielding better pitch accuracy and rhythmic alignment than models trained on smaller datasets (e.g., NHSS).
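The objective metrics cited for the SVS benchmark have common definitions, though conventions vary (e.g., F0 error in Hz vs. cents). A minimal pure-Python sketch under the assumption of frame-aligned F0 tracks where 0 marks an unvoiced frame:

```python
import math

def f0_rmse(ref, pred):
    """RMSE of F0 (Hz) over frames voiced in both tracks (0 = unvoiced).
    Note: some papers instead compute RMSE on log-F0 or in cents."""
    voiced = [(r, p) for r, p in zip(ref, pred) if r > 0 and p > 0]
    if not voiced:
        return 0.0
    return math.sqrt(sum((r - p) ** 2 for r, p in voiced) / len(voiced))

def vuv_error(ref, pred):
    """Fraction of frames whose voiced/unvoiced decision disagrees."""
    mismatches = sum((r > 0) != (p > 0) for r, p in zip(ref, pred))
    return mismatches / len(ref)
```

Subjective listening tests (e.g., MOS) complement these, since low F0 RMSE alone does not guarantee that a rendered technique sounds natural.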
All code, processing scripts, and the dataset are released under a CC BY‑NC‑SA 4.0 license on HuggingFace and GitHub, encouraging reproducibility and further research.
Impact and Limitations
GTSinger sets a new standard for singing corpora by simultaneously offering (1) the largest amount of high‑fidelity vocal data, (2) multilingual and multi‑singer diversity, (3) detailed technique annotations, (4) realistic, composer‑ready music scores, (5) global style metadata, and (6) paired speech for STS. These attributes make it suitable for virtually any current singing‑related task and open avenues for future work such as cross‑lingual synthesis, fine‑grained expressive control, and integration with music production pipelines. The authors acknowledge potential risks, including cultural bias, privacy concerns, and the non‑commercial license restriction, and suggest future extensions to cover more languages, musical genres, and acoustic environments.
In summary, GTSinger fills a long‑standing gap in the singing AI ecosystem, providing a comprehensive, well‑annotated, and openly accessible resource that is poised to accelerate research and practical applications across the full spectrum of singing technologies.