MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

Recent advances in multimodal large language models (MLLMs) for audio music have demonstrated strong capabilities in music understanding, yet symbolic music, a fundamental representation of musical structure, remains unexplored. In this work, we introduce MIDI-LLaMA, the first instruction-following MLLM for symbolic music understanding. Our approach aligns the MIDI encoder MusicBERT and Llama-3-8B via a two-stage pipeline comprising feature alignment and instruction tuning. To support training, we design a scalable annotation pipeline that annotates GiantMIDI-Piano with fine-grained metadata, resulting in a MIDI-text dataset. Compared with a baseline trained on MIDI converted to ABC notation under the same instruction-tuning procedure, MIDI-LLaMA substantially outperforms it in captioning and in semantic alignment for question answering. Human evaluation further confirms the advantages of MIDI-LLaMA in music understanding, emotion recognition, creativity, and overall preference. These findings demonstrate that incorporating symbolic music into large language models enhances their capacity for musical understanding.


💡 Research Summary

MIDI‑LLaMA is the first instruction‑following multimodal large language model (MLLM) that directly processes symbolic music in the form of MIDI. The system aligns a frozen MusicBERT MIDI encoder with the open‑source Llama‑3‑8B language model through a two‑stage pipeline: (1) feature alignment, where only a linear projection layer that maps MusicBERT embeddings into the LLM’s text embedding space is trained, and (2) instruction tuning, where the projection is further refined while the language model is adapted via LoRA (rank = 8). This architecture enables the LLM to receive “musical tokens” concatenated with textual embeddings, allowing joint reasoning over symbolic music and natural language.
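The alignment step can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code: the class name `MidiProjector` and the feature dimensions (768 for a MusicBERT‑style encoder, 4096 for Llama‑3‑8B's hidden size) are assumptions for the example.

```python
import torch
import torch.nn as nn

# Assumed dimensions: MusicBERT-style encoders emit 768-d features;
# Llama-3-8B uses a 4096-d hidden size.
MIDI_DIM, LLM_DIM = 768, 4096

class MidiProjector(nn.Module):
    """Stage-1 trainable module (hypothetical sketch): maps frozen
    MIDI-encoder embeddings into the LLM's text-embedding space
    with a single linear layer."""
    def __init__(self, midi_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(midi_dim, llm_dim)

    def forward(self, midi_feats: torch.Tensor) -> torch.Tensor:
        # midi_feats: (batch, n_midi_tokens, midi_dim)
        return self.proj(midi_feats)

# Projected "musical tokens" are prepended to the text-token
# embeddings before the combined sequence is fed to the LLM.
batch, n_midi, n_text = 2, 50, 32
midi_feats = torch.randn(batch, n_midi, MIDI_DIM)   # frozen encoder output
text_embeds = torch.randn(batch, n_text, LLM_DIM)   # LLM token embeddings

projector = MidiProjector(MIDI_DIM, LLM_DIM)
musical_tokens = projector(midi_feats)              # (2, 50, 4096)
llm_input = torch.cat([musical_tokens, text_embeds], dim=1)  # (2, 82, 4096)
```

In stage 1 only `projector` receives gradients; in stage 2 the projector keeps training while LoRA adapters (rank 8) are attached to the otherwise frozen language model.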

A major bottleneck for multimodal music research is the lack of large‑scale symbolic‑music‑text datasets. To address this, the authors built an automated annotation pipeline that harvests contextual information from reputable classical‑music websites using the piece title and composer as queries, then prompts GPT‑4o to extract fine‑grained metadata: genre, style, compositional background, expressive intent, and perceived emotion. The pipeline includes a “Not Enough Information” safeguard and deterministic temperature settings to minimize hallucinations. Applied to the GiantMIDI‑Piano collection (10,855 piano MIDI files), the pipeline produced valid tags for 9,803 pieces, achieving 89 % acceptance for categorical tags and 93 % for free‑form descriptions after expert verification.
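The annotation step can be sketched as a function that prompts an LLM over retrieved web context and discards unsupported fields. Everything here — the field names, prompt wording, and helper `annotate_piece` — is a hypothetical reconstruction of the described pipeline, with the model call abstracted behind `chat_fn` (in practice a deterministic GPT‑4o request, e.g. temperature 0).

```python
import json

# Assumed tag schema, following the metadata categories described above.
TAG_FIELDS = ["genre", "style", "background", "expressive_intent", "emotion"]

PROMPT_TEMPLATE = (
    "Using only the web context below about '{title}' by {composer}, "
    "return a JSON object with the fields {fields}. If the context does "
    "not support a field, set it to 'Not Enough Information'.\n\n"
    "Context:\n{context}"
)

def annotate_piece(title, composer, context, chat_fn):
    """chat_fn is any LLM call returning a JSON string for a prompt
    (e.g. GPT-4o with temperature=0 for deterministic output)."""
    prompt = PROMPT_TEMPLATE.format(
        title=title, composer=composer,
        fields=", ".join(TAG_FIELDS), context=context,
    )
    tags = json.loads(chat_fn(prompt))
    # "Not Enough Information" safeguard: drop fields the model could
    # not ground in the retrieved context instead of letting it guess.
    return {k: v for k, v in tags.items() if v != "Not Enough Information"}

# Usage with a stub standing in for the real GPT-4o call:
stub = lambda prompt: json.dumps({
    "genre": "Nocturne", "style": "Romantic",
    "background": "Not Enough Information",
    "expressive_intent": "lyrical, introspective",
    "emotion": "melancholic",
})
tags = annotate_piece("Nocturne Op. 9 No. 2", "Chopin", "(web context)", stub)
```

The safeguard means a piece contributes only the tags its retrieved context actually supports, which is how the pipeline keeps hallucinated metadata out of the 9,803 accepted annotations.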

From these annotations, basic musical attributes (tempo, key, time signature) were extracted with music21 and incorporated as supplementary tags. GPT‑4o then generated natural‑language question‑answer pairs for each tag, yielding multiple Q&A pairs per piece. To increase data volume, three non‑overlapping 20‑second clips were sampled from each MIDI (aligned to bar lines when possible), resulting in 29,409 clips and roughly 2.3 million Q&A pairs—a sizable corpus for symbolic‑music multimodal learning.
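One plausible way to draw three non‑overlapping 20‑second clips is to split the piece into three equal segments and sample one window inside each. This is a sketch of that scheme, not the authors' exact procedure, and it omits the bar‑line alignment mentioned above; `sample_clips` and its parameters are illustrative names.

```python
import random

def sample_clips(duration_s, n_clips=3, clip_len=20.0, seed=0):
    """Sample n_clips non-overlapping (start, end) windows, in seconds.

    Hypothetical scheme: partition the piece into n_clips equal
    segments and pick one random clip_len window per segment, which
    guarantees the clips never overlap.
    """
    if duration_s < n_clips * clip_len:
        return []  # piece too short for non-overlapping clips
    rng = random.Random(seed)
    seg = duration_s / n_clips
    clips = []
    for i in range(n_clips):
        slack = seg - clip_len            # free room inside this segment
        start = i * seg + rng.uniform(0, slack)
        clips.append((start, start + clip_len))
    return clips

clips = sample_clips(180.0)  # a 3-minute piano piece -> 3 windows
```

Applied to the 9,803 annotated pieces, a scheme like this yields the 29,409 clips reported (pieces shorter than 60 seconds would contribute fewer windows).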

For comparison, a text‑only baseline (ABC‑LLaMA) was created by converting the same MIDI data into ABC notation and training an identical Llama‑3‑8B model with the same instruction‑tuning data and hyperparameters. This isolates the effect of using explicit symbolic embeddings versus a textual transcription.

Evaluation was performed on two downstream tasks: (1) music question answering, where the model must answer queries about style, emotion, etc., and (2) music captioning, where it must generate a concise description of a clip. Automatic metrics (BLEU, METEOR, ROUGE‑L, BERTScore) showed that MIDI‑LLaMA consistently outperformed ABC‑LLaMA, especially in semantic alignment (higher ROUGE‑L and BERTScore) for QA and across all metrics for captioning. Human evaluation on 100 clips (500 pairwise judgments) further confirmed the advantage: MIDI‑LLaMA was preferred for music understanding (63 vs 25), emotion recognition (60 vs 26), and creativity (47 vs 32), while text fluency differences were negligible.
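Of the automatic metrics above, ROUGE‑L is the one most directly tied to the semantic‑alignment claim; it scores the longest common subsequence (LCS) between a generated and a reference text. A minimal reimplementation, assuming plain whitespace tokenization (standard ROUGE toolkits add stemming and other preprocessing):

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens, via LCS length
    (a minimal sketch of the metric, not a full toolkit)."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table: dp[i][j] = LCS length of c[:i], r[:j].
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            if ct == rt:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)   # LCS precision and recall
    return 2 * prec * rec / (prec + rec)     # harmonic mean (F1)

score = rouge_l_f1("a calm lyrical nocturne in e flat",
                   "a lyrical nocturne in e flat major")
```

Here the LCS is "a lyrical nocturne in e flat" (6 tokens out of 7 on each side), so precision and recall are both 6/7 and the F1 is about 0.857; higher scores indicate captions whose content ordering matches the reference more closely.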

The results demonstrate that integrating a dedicated symbolic‑music encoder with an instruction‑tuned LLM yields substantially richer musical understanding than treating MIDI as plain text. Limitations include the focus on classical piano repertoire and the use of a simple linear projection, which may not capture more intricate musical relationships. Future work could expand to diverse instruments and genres, explore non‑linear or attention‑based alignment mechanisms, and optimize inference for real‑time interactive applications such as music tutoring or composition assistance.

