Rethinking Music Captioning with Music Metadata LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Music captioning, or the task of generating a natural language description of music, is useful for both music understanding and controllable music generation. Training captioning models, however, typically requires high-quality music caption data, which is scarce compared to metadata (e.g., genre, mood, etc.). As a result, it is common to use large language models (LLMs) to synthesize captions from metadata as training data for captioning models, though this process imposes a fixed stylization and entangles factual information with natural language style. As a more direct approach, we propose metadata-based captioning. We train a metadata prediction model to infer detailed music metadata from audio and then convert it into expressive captions via pre-trained LLMs at inference time. Compared to a strong end-to-end baseline trained on LLM-generated captions derived from metadata, our method: (1) achieves comparable performance with less training time than end-to-end captioners, (2) offers the flexibility to change stylization post-training, enabling output captions to be tailored to specific stylistic and quality requirements, and (3) can be prompted with audio and partial metadata to enable powerful metadata imputation or in-filling, a common task for organizing music data.


💡 Research Summary

The paper tackles the scarcity of high‑quality paired music‑audio and caption data by introducing a two‑stage “metadata‑first” approach to music captioning. In the first stage, the authors adapt a decoder‑only, text‑only large language model (LLM) to process audio. Audio is encoded into discrete tokens via a quantized audio encoder, and these tokens are mapped onto reserved text tokens in the LLM, effectively giving the model multimodal audio understanding. The model is then instruction‑fine‑tuned to predict structured music metadata (genre, mood, tempo, key, instruments, keywords, etc.) in JSON format directly from the audio. Crucially, the model can also accept partial metadata as input, allowing it to impute missing fields—a capability that is difficult for end‑to‑end caption generators.
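The first stage described above can be sketched in a few lines. This is an illustrative Python sketch, not the paper's actual code: the codebook size, the reserved-token offset, and the metadata schema below are all assumptions chosen for demonstration.

```python
# Sketch (assumed names/values, not the paper's implementation): mapping
# discrete audio codes from a quantized encoder onto reserved token IDs in
# a text-only LLM's vocabulary, then framing metadata prediction as JSON
# generation.

AUDIO_VOCAB_SIZE = 1024          # assumed codebook size of the audio encoder
RESERVED_TOKEN_OFFSET = 256000   # assumed start of reserved IDs in the text vocab


def audio_codes_to_llm_tokens(codes: list[int]) -> list[int]:
    """Map each discrete audio code to a reserved text-token ID."""
    assert all(0 <= c < AUDIO_VOCAB_SIZE for c in codes)
    return [RESERVED_TOKEN_OFFSET + c for c in codes]


# The instruction-tuned model is then trained to emit structured metadata
# of roughly this shape (hypothetical example record):
TARGET_SCHEMA = {
    "genre": "ambient",
    "mood": ["calm", "dreamy"],
    "tempo_bpm": 72,
    "key": "D minor",
    "instruments": ["piano", "synth pad"],
    "keywords": ["meditative", "slow build"],
}

tokens = audio_codes_to_llm_tokens([3, 17, 998])
print(tokens)  # [256003, 256017, 256998]
```

Because the audio codes live inside the text vocabulary, no architectural change to the LLM is needed; partial metadata can simply be prepended to the prompt, and the model completes the missing JSON fields.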

In the second stage, the same pretrained LLM is used as a pure text generator to convert the predicted metadata into natural‑language captions. By feeding a carefully crafted prompt (which can include in‑context examples, metadata tags, or style specifications), the system can produce expressive captions that reflect the metadata while avoiding hallucinations. Because the conversion happens at inference time, the caption style can be changed post‑training without retraining the audio‑to‑metadata model.

Experiments were conducted on a large internal dataset of ~25 k hours of instrumental music with 10+ metadata fields (23 % of entries incomplete) and on public caption datasets (MusicCaps and Song Describer). Two baseline end‑to‑end captioners were trained on synthetic captions generated by prompting Gemma‑3‑12B‑it with the same metadata, producing MusicCaps‑style and Song Describer‑style models. All models used the same underlying Gemma‑3‑1B‑it LLM for audio adaptation and for text generation.

Evaluation used SBERT embedding similarity for both metadata fields and captions, as well as BM25, length similarity, and POS histogram similarity for style analysis. Results show that the metadata-first model achieves comparable metadata prediction scores (e.g., genre 0.548 vs. 0.556 for the MusicCaps captioner, mood 0.711 vs. 0.673) and comparable caption quality (SBERT 0.443 vs. 0.478 for the best baseline). Style-prompt experiments demonstrate that adding fixed in-context examples or metadata tags can improve stylistic metrics, and that post-hoc stylization of baseline captions yields smaller gains than the flexible metadata-to-caption conversion. Partial-metadata experiments reveal a monotonic improvement in prediction quality as more fields are supplied (e.g., mood rises from 0.504 with 50 % input to 0.798 with full input).
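The embedding-similarity scores reduce to a cosine between sentence embeddings. The sketch below shows only that cosine step on placeholder vectors; in practice each caption or metadata field would first be embedded with a Sentence-BERT model (the vectors here are invented for illustration).

```python
import numpy as np

# Core of SBERT-style scoring: cosine similarity between two embeddings.
# The vectors below are placeholders standing in for sentence-transformer
# outputs, not real model embeddings.


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


pred_emb = np.array([0.2, 0.8, 0.1])   # embedding of a predicted caption
ref_emb = np.array([0.25, 0.75, 0.05]) # embedding of a reference caption
score = cosine_similarity(pred_emb, ref_emb)
print(round(score, 3))
```

A score near 1.0 means the predicted and reference texts are semantically close, which is why the metric can compare captions of different styles without penalizing surface wording.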

Training efficiency is a notable advantage: the metadata predictor converged after 161 k iterations, roughly half the 347 k iterations required for the end‑to‑end captioners, translating to lower GPU time and energy consumption. The approach also enables easy adaptation to new domains or stylistic requirements by simply changing the conversion prompt, without retraining the audio model.

In summary, the paper presents a compelling alternative to direct audio‑to‑caption models by inserting a structured metadata layer. This yields faster training, post‑hoc style control, and practical metadata imputation capabilities, all while maintaining competitive performance on both metadata prediction and caption generation tasks. Future work could explore richer metadata schemas, tighter audio‑text attention mechanisms, and user‑driven prompt engineering to further boost caption fidelity and controllability.

