A Large-Scale Multimodal Benchmark for Musical Score Understanding

Reading time: 5 minutes

📝 Original Info

  • Title: A Large-Scale Multimodal Benchmark for Musical Score Understanding
  • ArXiv ID: 2511.20697
  • Date: 2025-11-27
  • Authors: Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Zhang Bo, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, KinHei Lee, Zhenxuan Zhang, Xiaobing Li, Maosong Sun

📝 Abstract

Understanding complete musical scores requires reasoning over symbolic structures such as pitch, rhythm, harmony, and form. Despite the rapid progress of Large Language Models (LLMs) and Vision-Language Models (VLMs) in natural language and multimodal tasks, their ability to comprehend musical notation remains underexplored. We introduce the Musical Score Understanding Benchmark (MSU-Bench), the first large-scale, human-curated benchmark for evaluating score-level musical understanding across both textual (ABC notation) and visual (PDF) modalities. MSU-Bench comprises 1,800 generative question-answer (QA) pairs drawn from works spanning Bach, Beethoven, Chopin, Debussy, and others, organised into four progressive levels of comprehension: Onset Information, Notation & Note, Chord & Harmony, and Texture & Form. Through extensive zero-shot and fine-tuned evaluations of more than fifteen state-of-the-art (SOTA) models, we reveal sharp modality gaps, fragile level-wise success rates, and the difficulty of sustaining multilevel correctness. Fine-tuning markedly improves performance in both modalities while preserving general knowledge, establishing MSU-Bench as a rigorous foundation for future research at the intersection of artificial intelligence (AI), musicology, and multimodal reasoning.


📄 Full Content

Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores

Congren Dai♩,*, Yue Yang*, Krinos Li*, Huichi Zhou, Shijie Liang♩, Zhang Bo♩, Enyang Liu♩, Ge Jin♩, Hongran An♩, Haosen Zhang, Peiyuan Jing, KinHei Lee, Zhenxuan Zhang, Xiaobing Li♩, Maosong Sun†

♩Central Conservatory of Music · Imperial College London · Tsinghua University
congren.dai@{mail.ccom.edu.cn,imperial.ac.uk} · sms@tsinghua.edu.cn

Abstract

Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision–Language Models to interpret full musical notation remains insufficiently examined. We introduce the Musical Score Understanding Benchmark (MSU-Bench), the first large-scale, human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative Question-Answering pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. To facilitate further research, we publicly release MSU-Bench and all associated resources.
1 Introduction

Large Language Models (LLMs) and Vision–Language Models (VLMs) have recently exhibited strong capabilities in natural language understanding and generation, driving substantial advances across a broad range of Natural Language Processing tasks (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, a,b). In contrast, their ability to reason over complete musical scores remains underexplored. Existing benchmarks for musical score understanding are typically narrow in scope, concentrating on isolated fragments, short excerpts, or multiple-choice formulations, rather than supporting holistic reasoning over entire scores. Moreover, most prior work focuses on monophonic music, which consists of a single melodic line without harmonic or rhythmic accompaniment. Such settings fail to reflect the structural complexity and expressive richness required for open-ended, real-world musicological analysis.

When applied to complete scores, VLMs encounter two persistent challenges. The first is localisation: models frequently struggle to correctly identify bar positions, which is a prerequisite for answering higher-level questions related to harmony, texture, or form. For instance, when asked "Which articulation is used in bar 7?", models often misalign the bar index and consequently return incorrect markings (see Figure 1a). The second challenge is hallucination, whereby models generate content that is not grounded in the score, often exacerbating errors introduced by incorrect localisation. Together, these issues lead to unreliable interpretations of complete scores and undermine confidence in model outputs relative to ideal, score-faithful answers (see Figure 1b).

(*Equal contribution. †Corresponding author.)
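The localisation challenge above can be made concrete in the textual modality: in ABC notation, bars are delimited by `|`, so an answer about "bar 7" can in principle be checked directly against the notation. The following sketch is purely illustrative (the tune and the `split_bars` helper are assumptions for this example, not part of MSU-Bench):

```python
# Minimal sketch: splitting the body of an ABC tune into bars so that a
# bar-indexed question ("what is in bar 3?") can be grounded in the score.
# The ABC fragment and helper below are illustrative, not from the benchmark.

def split_bars(abc_body: str) -> list[str]:
    """Split the music body of an ABC tune on bar lines."""
    # Normalise decorated bar lines (final, double, repeat) to plain '|',
    # then split and drop empty segments.
    cleaned = (abc_body.replace("|]", "|").replace("||", "|")
                       .replace("|:", "|").replace(":|", "|"))
    return [bar.strip() for bar in cleaned.split("|") if bar.strip()]

abc_body = "C2 E2 G2 c2 | B2 G2 E2 C2 | D2 F2 A2 d2 | c8 |]"
bars = split_bars(abc_body)
print(len(bars))   # 4
print(bars[2])     # D2 F2 A2 d2  (the third bar, 0-indexed)
```

With bars indexed this way, a model's claim about a specific bar can be compared against the actual segment of notation rather than taken on trust.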
To measure these limitations, we curate the Musical Score Understanding Benchmark (MSU-Bench), a benchmark designed to evaluate the reasoning capabilities of LLMs and VLMs on complete musical scores, with particular emphasis on bar identification and higher-level musical understanding. The benchmark comprises 150 complete scores and 1,800 human-curated question–answer (QA) pairs, drawn primarily from representative textbook material used in undergraduate conservatory curricula. This educational motivation reflects the premise that models capable of answering such score-based questions could function as effective instructional assistants for music students. The benchmark is organised into four hierarchical levels of musical comprehension: Onset Information, Notation and Note, Chord and Harmony, and Texture and Form.

(arXiv:2511.20697v2 [cs.SD], 6 Jan 2026)

[Figure 1: (a) Hallucination. When queried about specific score features in bars, VLMs often fabricate responses that are not grounded in the actual score. (b) Ideal scenario. Models should accurately localise and analyse bars, thereby supporting reliable higher-level musicological reasoning. Both panels show the opening Promenade from Mussorgsky's Pictures at an Exhibition.]

These levels range from basic recognition of notational elements to advanced har

…(Full text truncated)…

📸 Image Gallery

abc.webp comparison_chart.webp composer_frequency.webp data.webp full_abc.webp full_abc_1800.webp full_pdf.webp full_pdf_1800.webp genre.webp level.webp lora.webp period.webp problem.webp time.webp

Reference

This content is AI-processed based on ArXiv data.
