IncompeBench: A Permissively Licensed, Fine-Grained Benchmark for Music Information Retrieval


Multimodal information retrieval has made significant progress in recent years, leveraging the increasingly strong multimodal abilities of deep pre-trained models to represent information across modalities. Music Information Retrieval (MIR), in particular, has improved considerably in quality, with neural representations of music even making their way into everyday products. However, high-quality benchmarks for evaluating music retrieval performance are lacking. To address this gap, we introduce IncompeBench, a carefully annotated benchmark comprising 1,574 permissively licensed, high-quality music snippets, 500 diverse queries, and over 125,000 individual relevance judgements. The annotations were created with a multi-stage pipeline, resulting in high agreement between human annotators and the generated data. The resulting datasets are publicly available at https://huggingface.co/datasets/mixedbread-ai/incompebench-strict and https://huggingface.co/datasets/mixedbread-ai/incompebench-lenient, with the prompts available at https://github.com/mixedbread-ai/incompebench-programs.


💡 Research Summary

IncompeBench addresses a critical gap in Music Information Retrieval (MIR) by providing an openly licensed, fine‑grained benchmark that can be freely shared and reproduced. The authors start from the Incompetech collection, a set of over 2,000 CC‑BY tracks, and prune it to 1,574 high‑quality 30‑second audio chunks (removing tracks shorter than 90 seconds and selecting one random chunk per track).
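The pruning step described above can be sketched in a few lines. This is a minimal illustration of the stated rules (discard tracks under 90 seconds, keep one random 30-second chunk per track), not the authors' actual preprocessing code; the function name and input format are assumptions.

```python
import random

CHUNK_SECONDS = 30
MIN_TRACK_SECONDS = 90

def select_chunks(track_durations, seed=0):
    """For each sufficiently long track, pick one random 30-second chunk.

    track_durations: dict mapping track id -> duration in seconds.
    Returns a dict mapping track id -> (start, end) offsets in seconds.
    """
    rng = random.Random(seed)
    chunks = {}
    for track_id, duration in track_durations.items():
        if duration < MIN_TRACK_SECONDS:
            continue  # tracks shorter than 90 s are discarded
        start = rng.uniform(0.0, duration - CHUNK_SECONDS)
        chunks[track_id] = (start, start + CHUNK_SECONDS)
    return chunks

# Usage: a 45-second track is dropped; the others each yield one chunk.
chunks = select_chunks({"track_a": 200.0, "track_b": 45.0, "track_c": 120.0})
```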

To generate realistic, diverse natural‑language queries, the authors introduce a two‑stage pipeline. First, each audio chunk is processed by Gemini 3 Pro to produce a “song card” containing roughly 30 musical attributes (tempo, rhythm, genre, instrumentation, influences, etc.). These structured cards are then fed, together with the audio and a randomly chosen user persona (from Nvidia’s Nemotron Persona set), into a second LLM prompting step. The model selects which persona would be interested in the song, chooses a subset of attributes to target, and is given explicit constraints on the number of attributes (1‑4), query style (keyword, question, instruction, conversational, descriptive), length (3‑26 tokens), and whether to include negations. Two candidate queries are generated and one is randomly kept, yielding 500 queries with a balanced distribution across styles, attribute counts, and negation rates.
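The query constraints listed above (style, attribute count, length, negation) amount to a per-query sampling of generation parameters. The sketch below is a hypothetical reconstruction of such a sampler; the specific distributions (e.g. the negation rate of 25%) are assumptions, not values from the paper.

```python
import random

# The five query styles named in the paper's pipeline description.
STYLES = ["keyword", "question", "instruction", "conversational", "descriptive"]

def sample_query_constraints(rng):
    """Sample one set of generation constraints for an LLM query prompt.

    All distributions here are illustrative assumptions; only the value
    ranges (1-4 attributes, 3-26 tokens) come from the summary above.
    """
    return {
        "style": rng.choice(STYLES),
        "num_attributes": rng.randint(1, 4),   # attributes to target
        "max_tokens": rng.randint(3, 26),      # query length budget
        "use_negation": rng.random() < 0.25,   # assumed negation rate
    }

rng = random.Random(42)
constraints = sample_query_constraints(rng)
```

Sampling constraints before prompting, rather than letting the LLM choose freely, is what yields the balanced distribution across styles, attribute counts, and negation rates described above.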

Because annotating every possible query‑document pair would be prohibitive (over 850,000 pairs), the authors employ a candidate selection phase. For each query they retrieve the top‑500 documents using a heterogeneous ensemble of state‑of‑the‑art music‑text models (CLAP, TTM‑R++, CLaMP‑3, ColQwen‑Omni, a proprietary internal model, and mixedbread‑embed‑large). Reciprocal Rank Fusion merges these lists into a top‑250 candidate set per query. The final annotation step uses Gemini 3 Pro, prompted with the UMBRELA relevance framework, to assign a four‑level graded relevance score (0–3) to each (query, audio) pair. This results in more than 125,000 individual relevance judgments.
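Reciprocal Rank Fusion is a standard list-merging technique: each document receives a score of 1/(k + rank) from every list it appears in, and the sums are sorted. A minimal sketch (with the conventional k = 60; the paper's exact parameters are not given in the summary):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids via Reciprocal Rank Fusion.

    Each document accumulates 1 / (k + rank) from every list it appears in,
    with ranks starting at 1; documents are returned sorted by fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: "d1" is ranked first by both retrievers, so it tops the fused list.
fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],  # e.g. ranking from one embedding model
    ["d1", "d3", "d2"],  # e.g. ranking from another
])
```

The small constant k damps the influence of any single top rank, which is why RRF is robust to score-scale differences across a heterogeneous ensemble like the one described above.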

Human validation with expert annotators yields a quadratic weighted Cohen’s κ of 0.94, confirming that the automated labels closely match human judgments. The authors note that the LLM tends to be overly lenient, especially for partially relevant cases, and therefore release two evaluation variants: IncompeBench‑Lenient (all four grades retained) and IncompeBench‑Strict (only the highest relevance grade kept, discarding ambiguous partial matches).
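Quadratic weighted Cohen's κ penalizes disagreements by the squared distance between grades, which suits the 0–3 graded relevance scale. A self-contained sketch of the metric (not the authors' evaluation code):

```python
def quadratic_weighted_kappa(a, b, num_levels=4):
    """Quadratic weighted Cohen's kappa for graded labels 0..num_levels-1.

    Disagreement between grades i and j is weighted by
    (i - j)^2 / (num_levels - 1)^2, so far-apart grades cost more.
    """
    n = len(a)
    # Observed confusion matrix and per-rater label histograms.
    obs = [[0.0] * num_levels for _ in range(num_levels)]
    for x, y in zip(a, b):
        obs[x][y] += 1
    hist_a = [a.count(i) for i in range(num_levels)]
    hist_b = [b.count(i) for i in range(num_levels)]
    num = den = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            w = (i - j) ** 2 / (num_levels - 1) ** 2
            num += w * obs[i][j]
            den += w * hist_a[i] * hist_b[j] / n  # chance-expected counts
    return 1.0 - num / den  # den > 0 whenever either rater uses >1 grade
```

A κ of 0.94 on this scale indicates near-perfect agreement, since weighted κ reaches 1.0 only under exact grade-for-grade matching.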

Baseline experiments on both variants evaluate several recent text‑to‑music retrieval models, including CLAP, TTM‑R++, CLaMP‑3, and ColQwen‑Omni. Results show uniformly low absolute performance, highlighting the difficulty of the task and the need for better models. Importantly, performance differences between the two benchmark versions are non‑trivial, demonstrating that IncompeBench can discriminate between systems at a fine granularity.
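Graded relevance judgments like IncompeBench's 0–3 scale are typically evaluated with nDCG, which rewards placing highly relevant documents near the top. The sketch below shows the common exponential-gain formulation; the paper's actual metric choices are not specified in the summary.

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for a ranked list's graded relevances (gain = 2^rel - 1).

    relevances: relevance grades (e.g. 0-3) in the system's ranked order.
    Returns 1.0 for an ideally ordered list, 0.0 if nothing is relevant.
    """
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# Usage: a perfectly ordered list scores 1.0; an inverted one scores less.
score = ndcg_at_k([3, 2, 1, 0])
```

Under the Strict variant, where only grade-3 judgments are kept, the same ranking would instead be scored against binary labels, which is why the two variants can rank systems differently.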

All resources—audio files, queries, qrels, prompts, and the DSPy programs used for generation—are released under CC‑BY on HuggingFace, ensuring full reproducibility and encouraging community contributions. The paper concludes by emphasizing the benchmark’s role in advancing MIR research, suggesting future extensions such as larger corpora, inclusion of sheet‑music or MIDI modalities, and integration of real user interaction logs.

