ConceptCaps: a Distilled Concept Dataset for Interpretability in Music Models

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Concept-based interpretability methods like TCAV require clean, well-separated positive and negative examples for each concept. Existing music datasets lack this structure: tags are sparse, noisy, or ill-defined. We introduce ConceptCaps, a dataset of 21k audio-caption-tag triplets with explicit labels from a 200-attribute taxonomy. Our pipeline separates semantic modeling from text generation: a VAE learns plausible attribute co-occurrence patterns, a fine-tuned LLM converts attribute lists into professional descriptions, and MusicGen synthesizes corresponding audio. This separation improves coherence and controllability over end-to-end approaches. We validate the dataset through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns. Dataset and code are available online.


💡 Research Summary

The paper presents ConceptCaps, a newly curated dataset specifically designed to support concept‑based interpretability methods such as Testing with Concept Activation Vectors (TCAV) in the music domain. Existing music caption datasets (e.g., MusicCaps) suffer from sparse, noisy, and ambiguous tags, which makes it difficult to construct clean positive and negative example sets for each high‑level musical concept (instrumentation, genre, mood, tempo, etc.). To overcome this limitation, the authors propose a three‑stage generation pipeline that separates semantic modeling, text generation, and audio synthesis, thereby improving controllability, coherence, and computational efficiency.

Stage 1 – Semantic Modeling (VAE).
All tags from the original MusicCaps are first organized into a hierarchical taxonomy of 200 attributes covering instruments, genres, moods, and tempo characteristics. A β‑VAE (β = 0.25) is trained on multi‑hot vectors representing these attributes. The VAE learns the joint distribution of attribute co‑occurrences, enabling the sampling of plausible attribute lists that respect real‑world musical relationships while avoiding contradictory combinations (e.g., “quiet death metal”). The latent space is 128‑dimensional, and both unconditional sampling (z ∼ N(0,I)) and conditional sampling (partial attribute seeds) are supported.
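The Stage‑1 model can be sketched as a standard β‑VAE over multi‑hot attribute vectors. This is a minimal PyTorch illustration using the dimensions stated above (200 attributes, 128‑dim latent, β = 0.25); the hidden width and architecture details are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeBetaVAE(nn.Module):
    """Sketch of the Stage-1 beta-VAE over 200-dim multi-hot attribute vectors.
    Hidden width (256) is an assumption; latent_dim and beta follow the summary."""

    def __init__(self, n_attrs=200, latent_dim=128, hidden=256, beta=0.25):
        super().__init__()
        self.beta = beta
        self.enc = nn.Sequential(nn.Linear(n_attrs, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_attrs))  # attribute logits

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

    def loss(self, x):
        logits, mu, logvar = self(x)
        # Bernoulli reconstruction term for multi-hot targets
        recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum") / x.size(0)
        # KL(q(z|x) || N(0, I)), down-weighted by beta = 0.25
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
        return recon + self.beta * kl

    @torch.no_grad()
    def sample(self, n=1):
        """Unconditional sampling: z ~ N(0, I), decoded to attribute probabilities."""
        z = torch.randn(n, self.mu.out_features)
        return torch.sigmoid(self.dec(z))
```

Conditional sampling from partial attribute seeds would additionally encode the known attributes and resample only the unobserved ones; that step is omitted here for brevity.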

Stage 2 – Text Generation (Fine‑tuned LLM).
A pre‑trained instruction‑tuned Llama 3.1 8B model is fine‑tuned using QLoRA (4‑bit quantization + low‑rank adapters) on 1,890 curated (attribute list, professional caption) pairs. This fine‑tuning forces the model to embed the supplied attributes directly into fluent, music‑specific descriptions, reducing hallucination and verbosity that plague zero‑shot prompting. The resulting captions exhibit high lexical richness and precise semantic alignment with the attribute list.
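A QLoRA setup of this kind can be sketched with the Hugging Face `transformers` and `peft` libraries. This is a configuration fragment only; the LoRA rank, scaling, dropout, target modules, and prompt template below are assumptions for illustration, not values reported in the paper.

```python
# Configuration sketch of Stage-2 QLoRA fine-tuning (4-bit base + low-rank adapters).
# Rank/alpha/dropout, target_modules, and the prompt format are assumed, not from the paper.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize base weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed adapter hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)          # only adapter weights are trainable

def to_training_text(attributes: list[str], caption: str) -> str:
    """Hypothetical formatting of one (attribute list, caption) training pair."""
    return ("Describe a music clip with these attributes: "
            + ", ".join(attributes) + "\nCaption: " + caption)
```

Training then proceeds with a standard causal-LM objective over the 1,890 curated pairs, so that the model learns to realize every supplied attribute in the caption.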

Stage 3 – Audio Synthesis (MusicGen).
The generated captions are fed to MusicGen, a large‑scale music generation model trained on licensed, copyright‑free material. By increasing the guidance scale to 3.3, the authors ensure that the synthesized 30‑second audio clips follow the textual prompt closely, yielding well‑aligned audio‑text pairs. All audio is released under CC‑BY‑NC 4.0, making the final dataset fully reproducible for academic research.
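The synthesis step can be sketched with the MusicGen checkpoints available in `transformers`. The guidance scale of 3.3 comes from the summary above; the specific checkpoint name and token budget (MusicGen emits roughly 50 audio-codec frames per second, so ~1500 tokens for 30 s) are assumptions.

```python
# Sketch of Stage-3 audio synthesis; checkpoint and max_new_tokens are assumptions.
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

caption = "A melancholic acoustic guitar ballad with a slow tempo and soft vocals."
inputs = processor(text=[caption], padding=True, return_tensors="pt")

# guidance_scale = 3.3 strengthens classifier-free guidance toward the caption;
# ~1500 new tokens corresponds to roughly 30 seconds of audio at 50 frames/s.
audio = model.generate(**inputs, do_sample=True,
                       guidance_scale=3.3, max_new_tokens=1500)

sampling_rate = model.config.audio_encoder.sampling_rate  # for saving the waveform
```

In practice the generation loop would batch captions and write each waveform to disk alongside its caption and attribute list to form the final triplet.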

Dataset Statistics.
ConceptCaps comprises 21,000 triplets (audio, caption, attribute list). Each sample includes explicit positive concept labels and, where applicable, matched counter‑examples, enabling clean TCAV probe construction. Compared to MusicCaps, the long‑tail distribution of tags is dramatically reduced, and the average number of concepts per sample is increased, which improves model learnability and interpretability.

Evaluation.

  • Audio‑Text Alignment: CLAP scores show a 12 % improvement over the original MusicCaps alignment, confirming that the synthetic audio faithfully reflects the textual description.
  • Linguistic Quality: BERTScore, MAUVE (0.84 vs. 0.71), BLEU, and ROUGE indicate that the fine‑tuned LLM produces captions of near‑human quality.
  • Concept Probing: TCAV experiments on a downstream music classification model demonstrate that probes built from ConceptCaps recover meaningful directions for concepts such as “acoustic guitar,” “melancholic,” and “fast tempo.” The resulting TCAV scores are stable across random seeds, suggesting that the dataset provides reliable concept exemplars.
  • Efficiency: Compared to API‑based generation pipelines (e.g., Ouyang et al., 2022), the three‑stage approach reduces GPU hours by >70 % while still delivering comparable or superior quality.
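The TCAV probing described above follows the standard recipe: fit a linear probe separating concept-positive from concept-negative activations, take its normal as the concept activation vector (CAV), and report the fraction of examples whose class-logit gradient points along that direction. A minimal sketch with NumPy and scikit-learn (function names are illustrative, not from the paper's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(pos_acts, neg_acts):
    """Fit a linear probe on layer activations from concept-positive vs.
    concept-negative examples; the CAV is the unit normal of its boundary."""
    X = np.vstack([pos_acts, neg_acts])
    y = np.array([1] * len(pos_acts) + [0] * len(neg_acts))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_[0]
    return v / np.linalg.norm(v)

def tcav_score(gradients, cav):
    """TCAV score: fraction of examples whose class-logit gradient (w.r.t. the
    layer activations) has a positive directional derivative along the CAV."""
    return float(np.mean(gradients @ cav > 0))
```

The seed-stability check reported above amounts to recomputing the CAV under different random negative sets and probe initializations and verifying that the resulting scores agree.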

Ablation Studies.
Removing the VAE stage (random attribute sampling) leads to incoherent attribute combinations and a drop of ~15 % in CLAP alignment. Using a zero‑shot LLM instead of the fine‑tuned version degrades BERTScore by 0.08 and introduces frequent hallucinations. These results underline the necessity of each component.

Limitations and Future Work.
All audio is synthetic; no human listening study is reported, so perceptual realism remains unverified. The VAE inherits any bias present in the distilled MusicCaps subset, potentially limiting cultural diversity. Future directions include (1) crowd‑sourced validation of concept fidelity, (2) expansion of the taxonomy to cover non‑Western musical traditions, and (3) integration of real‑world recordings to assess domain transfer.

Impact.
ConceptCaps fills a critical gap in the music AI community by providing a large, clean, and controllable resource for concept‑level analysis. It enables researchers to construct well‑defined positive/negative example sets, leading to more trustworthy TCAV interpretations and facilitating systematic comparisons across models. The separation‑of‑concerns architecture also serves as a blueprint for other multimodal domains where semantic consistency and linguistic quality must be jointly optimized.

In summary, the paper delivers a novel dataset, a robust generation pipeline, thorough quantitative validation, and a clear roadmap for extending concept‑based interpretability in music modeling. ConceptCaps is poised to become a standard benchmark for future work in explainable music generation and retrieval.

