AutoSchA: Automatic Hierarchical Music Representations via Multi-Relational Node Isolation

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Hierarchical representations provide powerful and principled approaches for analyzing many musical genres. Such representations have been broadly studied in music theory, for instance via Schenkerian analysis (SchA). Hierarchical music analyses, however, are highly cost-intensive; the analysis of a single piece of music requires a great deal of time and effort from trained experts. The representation of hierarchical analyses in a computer-readable format is a further challenge. Given recent developments in hierarchical deep learning and increasing quantities of computer-readable data, there is great promise in extending such work for an automatic hierarchical representation framework. This paper thus introduces a novel approach, AutoSchA, which extends recent developments in graph neural networks (GNNs) for hierarchical music analysis. AutoSchA features three key contributions: 1) a new graph learning framework for hierarchical music representation, 2) a new graph pooling mechanism based on node isolation that directly optimizes learned pooling assignments, and 3) a state-of-the-art architecture that integrates such developments for automatic hierarchical music analysis. We show, in a suite of experiments, that AutoSchA performs comparably to human experts when analyzing Baroque fugue subjects.


💡 Research Summary

The paper introduces AutoSchA, the first deep‑learning framework that automatically performs Schenkerian analysis—a hierarchical music‑theoretic representation—on symbolic scores. The authors first convert a musical excerpt into a multi‑relational graph where each note becomes a node equipped with six features (pitch class, normalized MIDI pitch, scale degree, normalized duration, offset, and metric strength). Six conventional edge types (onset, forward, voice, rest, sustain, slur) capture temporal and voice‑specific relationships, and a novel set of interval edges (e.g., “next‑2nd‑up”, “next‑3rd‑down”) encodes diatonic distance, thereby enriching the graph with harmonic context.
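The graph construction above can be sketched as follows. This is a minimal illustration, not the authors' code: the feature encodings, edge-type labels, and function names (`note_to_features`, `build_graph`) are assumptions based on the summary.

```python
import numpy as np

# Hypothetical sketch of the note-graph construction described above.
NOTE_FEATURES = ["pitch_class", "midi_norm", "scale_degree",
                 "duration_norm", "offset", "metric_strength"]
EDGE_TYPES = ["onset", "forward", "voice", "rest", "sustain", "slur",
              "next_2nd_up", "next_3rd_down"]  # plus further interval edges

def note_to_features(pitch_midi, scale_degree, duration, offset, strength):
    """Encode one note as the six-dimensional feature vector."""
    return np.array([
        pitch_midi % 12,      # pitch class
        pitch_midi / 127.0,   # normalized MIDI pitch
        scale_degree,         # diatonic scale degree (1-7)
        duration,             # duration, normalized (e.g. to the beat)
        offset,               # onset position within the measure
        strength,             # metric strength (e.g. downbeat = 1.0)
    ], dtype=float)

def build_graph(notes, typed_edges, n_types=len(EDGE_TYPES)):
    """Return node features X and one adjacency matrix per edge type.

    typed_edges is a list of (edge-type index, source note, target note).
    """
    X = np.stack([note_to_features(*n) for n in notes])
    n = len(notes)
    A = np.zeros((n_types, n, n))
    for t, i, j in typed_edges:
        A[t, i, j] = A[t, j, i] = 1.0   # undirected for simplicity
    return X, A
```

Keeping one adjacency matrix per edge type (rather than a single merged graph) is what lets the downstream network weight each relation separately.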

The core of the model is a multi‑relational graph convolutional network (GCN). For each layer, adjacency matrices of all edge types are summed (with learnable relation weights) and combined with the standard normalized Laplacian propagation to produce updated node embeddings. This allows the network to integrate pitch, rhythm, and relational information across all musical dimensions.
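A single layer of this kind can be written compactly. The sketch below assumes a standard GCN-style symmetric normalization with self-loops and a ReLU nonlinearity; the paper's exact propagation rule may differ in detail.

```python
import numpy as np

def multi_relational_gcn_layer(X, A, W, rel_weights):
    """One propagation step over a multi-relational graph.

    X: (n, f) node features; A: (r, n, n) per-type adjacencies;
    W: (f, f') linear map; rel_weights: (r,) learnable relation weights.
    """
    n = X.shape[0]
    # Relation-weighted sum of the per-type adjacencies, plus self-loops.
    A_sum = np.tensordot(rel_weights, A, axes=1) + np.eye(n)
    # Symmetric degree normalization, D^{-1/2} A D^{-1/2}.
    deg = np.maximum(A_sum.sum(axis=1), 1e-12)
    A_hat = A_sum / np.sqrt(np.outer(deg, deg))
    return np.maximum(A_hat @ X @ W, 0.0)   # ReLU
```

Because the relation weights multiply whole adjacency matrices, the model can learn, for instance, to propagate more strongly along voice edges than along rest edges.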

Schenkerian analysis is naturally expressed as a sequence of nested binary masks (bit arrays) indicating which notes survive at each depth. To learn these masks, the authors propose a novel “node isolation” pooling layer. A scoring sub‑network takes the current node embeddings and a global graph feature vector, outputting a scalar score for each node. The scores are trained with a cross‑entropy loss against expert‑annotated depth masks. Nodes whose scores fall below a learned threshold are “isolated” (masked out) and do not participate in the next GNN layer. A monotonicity regularizer penalizes cases where a node is kept at a deeper level but removed at a shallower one, enforcing the hierarchical nesting property of Schenkerian analysis. Unlike traditional top‑k or DiffPool methods, node isolation does not require a preset pooling ratio; the number of retained nodes can vary across pieces and depths, matching the variable nature of musical structure.
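The scoring, thresholding, and monotonicity pieces can be sketched as below. Parameter shapes and names (`Ws`, `ws`) are illustrative, and a fixed threshold stands in for the learned one; only the overall mechanism follows the description above.

```python
import numpy as np

def node_scores(H, g, Ws, ws):
    """Scoring sub-network: concatenate each node embedding with the
    global graph vector g, apply one hidden ReLU layer, and emit a
    scalar score per node."""
    Z = np.concatenate(
        [H, np.broadcast_to(g, (H.shape[0], g.shape[0]))], axis=1)
    return (np.maximum(Z @ Ws, 0.0) @ ws).ravel()

def isolate(scores, threshold=0.0):
    """Binary keep-mask: nodes scoring below the threshold are isolated."""
    return (scores >= threshold).astype(float)

def monotonicity_penalty(masks):
    """Penalize a note kept at a deeper level but removed at a shallower
    one. masks[d] is the keep-mask at depth d, ordered shallow to deep;
    nesting requires mask_deep <= mask_shallow elementwise."""
    penalty = 0.0
    for shallow, deep in zip(masks[:-1], masks[1:]):
        penalty += np.maximum(deep - shallow, 0.0).sum()
    return penalty
```

Note how the penalty is zero exactly when each deeper mask is a subset of the shallower one, which is the nesting property Schenkerian levels require.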

The full architecture stacks several multi‑relational GCN layers interleaved with node‑isolation pooling. After the final pooling stage, both global graph‑level features (via readout) and the learned depth masks are combined to produce the final hierarchical representation. The model is trained end‑to‑end, with gradients flowing through both the convolutional and pooling components.
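An end-to-end forward pass might look like the following sketch, which stacks one convolution and one isolation step per depth. All parameter shapes are illustrative, and isolation is implemented here by cutting inactive nodes out of message passing entirely; the paper's architecture is certainly more elaborate.

```python
import numpy as np

def forward(X, A, layer_weights, rel_weights, score_vecs, n_depths=4):
    """Sketch of the stacked model: GCN layer -> score -> isolate,
    repeated once per depth, followed by a mean readout."""
    n = X.shape[0]
    H, masks = X, []
    active = np.ones(n)                       # 1 = node still retained
    for d in range(n_depths):
        # Isolated nodes no longer send or receive messages.
        keep = np.outer(active, active)
        A_sum = (np.tensordot(rel_weights[d], A, axes=1) * keep
                 + np.diag(active))           # self-loops for live nodes
        deg = np.maximum(A_sum.sum(axis=1), 1e-12)
        A_hat = A_sum / np.sqrt(np.outer(deg, deg))
        H = np.maximum(A_hat @ H @ layer_weights[d], 0.0)
        scores = H @ score_vecs[d]
        active = active * (scores > 0.0)      # isolate low-scoring nodes
        masks.append(active.copy())
        H = H * active[:, None]
    g = H.mean(axis=0)                        # global readout
    return g, masks
```

Multiplying the running `active` mask in at every depth makes the predicted masks nested by construction in this sketch; in the actual model that property is instead encouraged softly by the monotonicity regularizer.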

Experiments focus on a curated dataset of 30 Baroque fugue subjects (primarily Bach), each annotated by music theorists with full Schenkerian analyses up to depth four. The dataset is split into training, validation, and test sets (70/15/15). Evaluation metrics include per‑depth accuracy, overall F1 score of the binary depth masks, and the proportion of notes retained at each depth. AutoSchA achieves an average accuracy of 87.3 % and an F1 of 0.84, outperforming three baselines: (i) the probabilistic random‑forest model of Kirlin‑Jensen, (ii) a DiffPool‑based GNN, and (iii) a standard top‑k pooling GNN. Notably, at deeper levels (depths 3 and 4) the model’s predictions align with human experts over 92 % of the time.
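The mask-based metrics can be computed straightforwardly. The definitions below (per-depth accuracy over notes, micro-averaged F1 over all mask entries) are plausible readings of the summary, not the paper's confirmed formulas.

```python
import numpy as np

def depth_mask_metrics(pred_masks, true_masks):
    """Per-depth accuracy and overall F1 over binary depth masks.

    Both inputs are (n_depths, n_notes) arrays of 0/1 retention flags.
    """
    pred = np.asarray(pred_masks, dtype=bool)
    true = np.asarray(true_masks, dtype=bool)
    per_depth_acc = (pred == true).mean(axis=1)
    tp = (pred & true).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(true.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return per_depth_acc, f1
```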

Ablation studies reveal the importance of each component. Removing interval edges reduces overall accuracy by 3.9 %, confirming that harmonic distance information is crucial. Replacing node isolation with top‑k pooling drops performance by 6.2 %, highlighting the need for adaptive, depth‑specific pooling ratios. Omitting the monotonicity regularizer leads to inconsistent depth masks, violating the nested structure required by Schenkerian theory.

The authors discuss several limitations. The current evaluation is confined to Baroque fugues; generalization to later styles, jazz, or non‑Western music remains untested. Expert annotations are still required for supervised training, which limits scalability. Future work may explore semi‑supervised or self‑supervised learning, data augmentation techniques, and real‑time score following applications.

In summary, AutoSchA contributes (1) a systematic pipeline for converting symbolic music into rich multi‑relational graphs, (2) a multi‑relational GCN that captures both melodic and harmonic context, and (3) a novel node‑isolation pooling mechanism that directly learns hierarchical depth assignments, achieving human‑level performance on Schenkerian analysis of Baroque fugue subjects.

