M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Machine-generated music (MGM) has emerged as a powerful tool with applications in music therapy, personalised editing, and creative inspiration for the music community. However, its unregulated use threatens the entertainment, education, and arts sectors by diminishing the value of high-quality human compositions. Machine-generated music detection (MGMD) is therefore critical to safeguarding these domains, yet the field lacks comprehensive datasets to support meaningful progress. To address this gap, we introduce M6, a large-scale benchmark dataset tailored for MGMD research. M6 is distinguished by its diversity, encompassing multiple generators, domains, languages, cultural contexts, genres, and instruments. We outline our methodology for data selection and collection, accompanied by detailed data analysis, and provide all music in WAV format. Additionally, we provide baseline performance scores using foundational binary classification models, illustrating the complexity of MGMD and the significant room for improvement. By offering a robust and multifaceted resource, we aim to empower future research to develop more effective detection methods for MGM. We believe M6 will serve as a critical step toward addressing this societal challenge. The dataset and code will be freely available to support open collaboration and innovation in this field.


💡 Research Summary

The paper addresses a critical gap in the field of machine‑generated music detection (MGMD): the lack of a comprehensive, diverse benchmark dataset. Existing resources such as FakeMusicCaps and SONICS either focus on text‑to‑music alignment or on end‑to‑end song synthesis, but they do not cover the breadth of generators, languages, cultures, genres, and instruments needed for robust MGMD research. To fill this void, the authors introduce M6, a large‑scale, multi‑dimensional dataset specifically designed for binary classification of music authenticity.

M6 comprises two main components. The human‑made portion draws from four well‑known public corpora—GTZAN (genre‑balanced), Free Music Archive (FMA), COSIAN (Japanese vocal tracks), and the Musical Instruments Sound Dataset (MISD). These sources provide a balanced representation of genres, instruments (piano, guitar, violin, flute), and durations (30 seconds to 3 minutes). All human tracks are released under non‑commercial licenses, and only metadata is shared for copyrighted material.
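The four human-made sources above could be catalogued roughly as follows. This is a hedged sketch only: the dictionary keys, `focus` strings, and the `sources_with_instrument_focus` helper are illustrative assumptions, not the dataset's released schema.

```python
# Hypothetical catalogue of M6's four human-made source corpora.
# Field names and descriptions are illustrative; track durations across the
# combined sources span roughly 30 seconds to 3 minutes.
HUMAN_SOURCES = {
    "GTZAN": {"focus": "genre-balanced clips"},
    "FMA": {"focus": "broad multi-genre archive"},
    "COSIAN": {"focus": "Japanese vocal tracks"},
    "MISD": {"focus": "solo instruments (piano, guitar, violin, flute)"},
}

def sources_with_instrument_focus(catalogue):
    """Return the names of corpora whose focus mentions instruments."""
    return [name for name, info in catalogue.items()
            if "instrument" in info["focus"]]
```

A quick lookup like `sources_with_instrument_focus(HUMAN_SOURCES)` would single out MISD as the instrument-focused corpus under this toy schema.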

The machine‑generated portion is built using three distinct generation pipelines: AMG (a melody‑and‑text conditioned model limited to 10‑second background pieces), MG (the large‑scale Facebook MusicGen model producing 60‑second clips), and MusicGPT (an extended interface that fine‑tunes MG with GPT‑generated prompts). Prompt engineering is automated via GPT‑3.5, ensuring consistent coverage of three target languages—Chinese (ZH), Japanese (JP), and English (EN)—and associated cultural contexts. The dataset standardises most clips to 60 seconds to accommodate current audio models, while a dedicated subset of longer tracks (2–3 minutes) tests long‑range sequence handling.
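The automated prompt-engineering step can be pictured as a templating sweep over languages and genres. The template strings and genre list below are stand-in assumptions (the actual prompts are produced by GPT-3.5); the sketch only shows the shape of a language × genre sweep.

```python
from itertools import product

# Hypothetical prompt templates for the three target languages (ZH, JP, EN).
# Real M6 prompts come from GPT-3.5; these stand-ins just illustrate coverage.
TEMPLATES = {
    "EN": "A {genre} piece rooted in English-language popular music",
    "ZH": "A {genre} piece drawing on Chinese musical traditions",
    "JP": "A {genre} piece drawing on Japanese musical traditions",
}
GENRES = ["pop", "jazz", "classical"]  # illustrative subset

def build_prompts(templates, genres):
    """Cross every language template with every genre, yielding (lang, genre, prompt)."""
    return [(lang, g, templates[lang].format(genre=g))
            for lang, g in product(templates, genres)]
```

Each resulting prompt would then be fed to one of the three generation pipelines, giving even coverage across languages and genres.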

In total, M6 contains roughly 12 TB of raw WAV audio (44.1 kHz) together with rich metadata: generator type, language, genre, instrument, length, and the exact textual prompt used. This granularity enables fine‑grained analyses of how specific factors influence detection performance.
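Per-clip metadata of this kind lends itself directly to grouped analysis. Below is a minimal sketch of such a record and a tally helper; the field names are assumed for illustration and may not match the released metadata files.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ClipMeta:
    """Hypothetical per-clip record mirroring the metadata fields listed above."""
    generator: str   # e.g. "AMG", "MG", "MusicGPT", or "human"
    language: str    # "ZH", "JP", or "EN"
    genre: str
    instrument: str
    length_s: int    # clip length in seconds
    prompt: str      # exact text prompt ("" for human-made clips)

def count_by(clips, field):
    """Tally clips by any metadata field, e.g. to check generator or language balance."""
    return Counter(getattr(c, field) for c in clips)
```

Grouping detection errors with a helper like `count_by` is one way to ask which generators, languages, or genres a detector finds hardest.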

Baseline experiments evaluate two families of models. A CNN‑based spectrogram classifier (ResNet‑18) achieves 78 % accuracy and 0.84 AUROC, outperforming a Transformer‑based architecture (SpecTTTra / Audio‑Transformer) that reaches 73 % accuracy. The authors note that Transformers struggle with the long‑range dependencies inherent in 60‑second audio, and that mixing multiple generators leads to over‑fitting on particular generator “styles.” These results highlight the current limitations of MGMD technology and underscore the need for better long‑sequence modeling, domain adaptation, and possibly watermarking techniques.
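The two reported metrics can be computed from raw classifier scores in a few lines. The sketch below assumes scores in [0, 1] where higher means "machine-generated" and uses the Mann-Whitney rank identity for AUROC; the 0.5 threshold is a common convention, not necessarily the authors' choice.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of clips classified correctly, predicting class 1
    (machine-generated) whenever score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive outscores a randomly chosen negative
    (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separating detector scores 1.0 on both metrics, while a detector that assigns every clip the same score gets an AUROC of 0.5, which makes the ResNet-18 baseline's 0.84 AUROC a meaningful but far-from-solved result.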

The paper concludes by emphasizing M6’s role as a foundational resource for the community. The dataset, code, and evaluation scripts are released on Hugging Face, and the authors commit to continual updates—adding new generators, languages, and cultural domains. They outline future research avenues, including multi‑modal detection (audio‑text), meta‑learning for cross‑generator generalisation, and robust watermarking schemes. By providing a truly diverse, high‑quality benchmark, M6 aims to accelerate the development of reliable MGMD methods, protect artistic value, and support responsible deployment of AI‑generated music.

