Evaluating language models of tonal harmony
Authors: David R. W. Sears, Filip Korzeniowski, Gerhard Widmer
David R. W. Sears (1), Filip Korzeniowski (2), Gerhard Widmer (2)
(1) College of Visual & Performing Arts, Texas Tech University, Lubbock, USA
(2) Institute of Computational Perception, Johannes Kepler University, Linz, Austria
david.sears@ttu.edu

(c) Sears, Korzeniowski, Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Sears, Korzeniowski, Widmer. "Evaluating language models of tonal harmony", 19th International Society for Music Information Retrieval Conference, Paris, France, 2018.

ABSTRACT

This study borrows and extends probabilistic language models from natural language processing to discover the syntactic properties of tonal harmony. Language models come in many shapes and sizes, but their central purpose is always the same: to predict the next event in a sequence of letters, words, notes, or chords. However, few studies employing such models have evaluated state-of-the-art architectures using a large-scale corpus of Western tonal music, instead preferring relatively small datasets containing chord annotations from contemporary genres like jazz, pop, and rock.

Using symbolic representations of prominent instrumental genres from the common-practice period, this study applies a flexible, data-driven encoding scheme to (1) evaluate Finite Context (or n-gram) models and Recurrent Neural Networks (RNNs) in a chord prediction task; (2) compare predictive accuracy from the best-performing models for chord onsets from each of the selected datasets; and (3) explain differences between the two model architectures in a regression analysis. We find that Finite Context models using the Prediction by Partial Match (PPM) algorithm outperform RNNs, particularly for the piano datasets, with the regression model suggesting that RNNs struggle with particularly rare chord types.

1. INTRODUCTION

For over two centuries, scholars have observed that tonal harmony, like language, is characterized by the logical ordering of successive events, what has commonly been called harmonic syntax. In Western music of the common-practice period (1700-1900), pitch events group (or cohere) into discrete, primarily tertian sonorities, and the succession of these sonorities over time produces meaningful syntactic progressions. To characterize the passage from the first two measures of Bach's "Aus meines Herzens Grunde", for example, theorists and composers developed a chord typology that specifies both the scale steps on which tertian sonorities are built (Stufentheorie) and the functional (i.e., temporal) relations that bind them (Funktionstheorie). Shown beneath the staff in Figure 1, this Roman numeral system allows the analyst to recognize and describe these relations using a simple lexicon of symbols.

[Figure 1. Bach, "Aus meines Herzens Grunde", mm. 1-2; from the Riemenschneider edition, No. 1. Key and Roman numeral annotations (G: I, IV6, V6, I, V, vi) appear below the staff.]

In the presence of such language-like design features, music scholars have increasingly turned to string-based methods from the natural language processing (NLP) community for the purposes of pattern discovery [6], classification [7], similarity estimation [18], and prediction [19]. In sequential prediction tasks, for example, probabilistic language models have been developed to predict the next event in a sequence, whether it consists of letters, words, DNA sequences, or, in our case, chords.
Although corpus studies of tonal harmony have become increasingly commonplace in the music research community, applications of language models for chord prediction remain somewhat rare. This is likely because language models take as their starting point a sequence of chords, but the musical surface is often a dense web of chordal and nonchordal tones, making automatic harmonic analysis a tremendous challenge. Indeed, such is the scope of the computational problem that a number of researchers have instead elected to start with a particular chord typology right from the outset (e.g., Roman numerals, figured bass nomenclature, or pop chord symbols), and then identify chord events using either human annotators [3] or rule-based computational classifiers [25]. As a consequence, language models for tonal harmony frequently train on relatively small, heavily curated datasets (< 200,000 chords) [3], or use data augmentation methods to increase the size of the corpus [15]. And since the majority of these corpora reflect pop, rock, or jazz idioms, vocabulary reduction is a frequent preliminary step to ensure improved model performance, with the researcher typically including only specific chord types (e.g., major, minor, seventh, etc.), thus ignoring properties of tonal harmony relating to inversion [15] or chordal extension [11].

Given the state of the annotation bottleneck, we propose a complementary method for the implementation and evaluation of language models for chord prediction. Rather than assume a particular chord typology a priori and train our models on the chord classes found therein, we instead propose a data-driven method for the construction of harmonic corpora using chord onsets derived from the musical surface. It is our hope that such a bottom-up approach to chord prediction could provide a springboard for the implementation of chord class models in future studies [2], the central purpose of which is to use predictive methods to reduce the musical surface to a sequence of syntactic progressions by discovering a small vocabulary of chord types.

We begin in Section 2 by describing the datasets used in the present research, and then present the tonal encoding scheme that reduces the combinatoric explosion of potential chord types to a vocabulary of roughly two hundred types for each scale degree in the lowest instrumental part. Next, Section 3 describes two state-of-the-art architectures employed in the NLP community: Finite Context (or n-gram) models and Recurrent Neural Networks (RNNs). Section 4 presents the experiments, which (1) evaluate the two aforementioned model architectures in a chord prediction task; (2) compare predictive accuracy from the best-performing models for each dataset; and (3) attempt to explain the differences between the two models using a regression analysis. We conclude in Section 5 by considering limitations of the present approach and offering avenues for future research.

2. CORPUS

This section presents the datasets used in the present research and then describes the chord representation scheme that permits model comparison across datasets.
2.1 Datasets

Shown in Table 1, this study includes nine datasets of Western tonal music (1710-1910) featuring symbolic representations of the notated score (e.g., metric position, rhythmic duration, pitch, etc.). The Chopin dataset consists of 155 works for piano that were encoded in MusicXML format [10]. The Assorted symphonies dataset consists of symphonic movements by Beethoven, Berlioz, Bruckner, and Mahler that were encoded in MATCH format [26]. All other datasets were downloaded from the KernScores database (http://kern.ccarh.org/) in MIDI format. In total, the composite corpus includes the complete catalogues of Beethoven's string quartets and piano sonatas, Joplin's rags, and Chopin's piano works, and consists of over 1,000 compositions containing more than 1 million chord tokens.

Composer    Genre      N pieces   N tokens     N types
Bach        Chorale    370        35,237       786
Haydn       Quartet    210        159,579      1472
Mozart      Quartet    82         78,201       1289
Beethoven   Quartet    70*        132,896      1699
Mozart      Piano      51         92,279       833
Beethoven   Piano      102*       176,370      1332
Chopin      Piano      155*       147,827      1790
Joplin      Piano      47*        43,848       854
Assorted    Symphony   29         147,549      2420
Total                  1116       1,013,786    2590

Note: * denotes the complete catalogue.

Table 1. Datasets and descriptive statistics for the corpus.

2.2 Chord Representation Scheme

To derive chord progressions from symbolic corpora using data-driven methods, music analysis software frameworks typically perform a full expansion of the symbolic encoding, which duplicates overlapping note events at every unique onset time. Shown in Figure 2, expansion identifies 9 unique onset times in the first two measures of Bach's chorale harmonization, "Aus meines Herzens Grunde."

[Figure 2. Full expansion of Bach, "Aus meines Herzens Grunde", mm. 1-2. Three chord onsets are annotated with the tonal encoding scheme described in Section 2.2 for illustrative purposes: ⟨0, 4, 7, ⊥⟩, ⟨11, 3, 8, ⊥⟩, and ⟨9, 3, 7, ⊥⟩.]

Previous studies have represented each chord according to the simultaneous relations between its note-event members (e.g., vertical intervals) [23], the sequential relations between its chord-event neighbors (e.g., melodic intervals) [6], or some combination of the two [22]. For the purposes of this study, we have adopted a chord typology that models every possible combination of note events in the corpus. The encoding scheme consists of an ordered tuple (S, I) for each chord onset in the sequence, where S is a set of up to three intervals above the bass in semitones modulo the octave (i.e., 12), resulting in 13^3 (or 2197) possible combinations; and I is the chromatic scale degree (again modulo the octave) of the bass, where 0 represents the tonic, 7 the dominant, and so on. The value of each vertical interval in S is either undefined (denoted by ⊥) or represents one of twelve possible interval classes, where 0 denotes a perfect unison or octave, 7 denotes a perfect fifth, and so on.

Because this encoding scheme makes no distinction between chord tones and non-chord tones, the syntactic domain of chord types is still very large. To reduce the domain to a more reasonable number, we have excluded pitch-class repetitions in S (i.e., voice doublings), and we have allowed permutations. Following [22], the assumption here is that the precise location and repeated appearance of a given interval are inconsequential to the identity of the chord. By allowing permutations, the major triads ⟨4, 7, 0⟩ and ⟨7, 4, 0⟩ therefore reduce to ⟨4, 7, ⊥⟩. Similarly, by eliminating repetitions, the chords ⟨4, 4, 10⟩ and ⟨4, 10, 10⟩ reduce to ⟨4, 10, ⊥⟩. This procedure restricts the domain to 233 unique chord types in S (i.e., when I is undefined).
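To make the encoding concrete, the sketch below derives the (S, I) tuple for a single chord onset from its MIDI note numbers and a tonic pitch class (as produced by the key-finding step described below). It is a minimal illustration in Python under our own naming and interface assumptions, not the authors' implementation; in particular, keeping only the lowest three intervals when more than three are present is our assumption, since the paper does not specify that case.

```python
def encode_chord(midi_pitches, tonic_pc):
    """Encode one chord onset as the (S, I) tuple of Section 2.2.

    midi_pitches: MIDI note numbers sounding at this onset (lowest = bass).
    tonic_pc:     pitch class (0-11) of the prevailing tonic, e.g. 7 for G major.
    Returns (S, I): S is a tuple of up to three distinct vertical intervals
    above the bass (mod 12, doublings removed, order normalised, None = ⊥),
    and I is the chromatic scale degree of the bass.
    """
    bass = min(midi_pitches)
    # Vertical intervals above the bass, mod 12, excluding pitch-class
    # repetitions of the bass and of each other (voice doublings).
    intervals = sorted({(p - bass) % 12 for p in midi_pitches} - {0})
    # Keep at most three intervals (our assumption) and pad with None (⊥)
    # so that every chord type has the same arity.
    intervals = intervals[:3]
    S = tuple(intervals) + (None,) * (3 - len(intervals))
    # Chromatic scale degree of the bass relative to the tonic.
    I = (bass - tonic_pc) % 12
    return S, I

# First onset of Figure 2: G major triad over G (G3, B3, D4, G4) in G major.
print(encode_chord([55, 59, 62, 67], tonic_pc=7))   # ((4, 7, None), 0)
```

The example reproduces the first annotated onset in Figure 2, ⟨0, 4, 7, ⊥⟩: a tonic bass (I = 0) supporting the intervals 4 and 7.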
To determine the underlying tonal context of each chord onset, we employ the key-finding algorithm in [1], which tends to outperform other distributional methods (with an accuracy of around 90% for both major and minor keys). Since the movements in this dataset typically feature modulations, we compute the Pearson correlation between the distributional weights in the selected key-finding algorithm and the pitch-class distribution identified in a moving window of 16 quarter-note beats centered around each chord onset in the sequence. The algorithm interprets the passage in Figure 2 in G major, for example, so the bass note of the first harmony is 0 (i.e., the tonic).

3. LANGUAGE MODELS

The goal of language models is to estimate the probability of event e_i given a preceding sequence of events e_1 to e_{i-1}, notated here as e_1^{i-1}. In principle, these models predict e_i by acquiring knowledge through unsupervised statistical learning of a training corpus, with the model architecture determining how this learning process takes place. For this study we examine two of the most common and best-performing language models in the NLP community: (1) Markovian finite-context (or n-gram) models using the PPM algorithm, and (2) recurrent neural networks (RNNs) using both long short-term memory (LSTM) layers and gated recurrent units (GRUs).

3.1 Finite Context Models

Context models estimate the probability of each event in a sequence by stipulating a global order bound (or deterministic context) such that p(e_i) depends only on the previous n - 1 events, or p(e_i | e_{(i-n)+1}^{i-1}). For this reason, context models are also sometimes called n-gram models, since the sequence e_{(i-n)+1}^{i} is an n-gram consisting of a context e_{(i-n)+1}^{i-1} and a single-event prediction e_i. These models first acquire the frequency counts for a collection of sequences from a training set, and then apply these counts to estimate the probability distribution governing the identity of e_i in a test sequence using maximum likelihood (ML) estimation.

Unfortunately, the number of potential n-grams increases dramatically as the value of n increases, so high-order models often suffer from the zero-frequency problem, in which n-grams encountered in the test set do not appear in the training set [27]. The most common solution to this problem has been the Prediction by Partial Match (PPM) algorithm, which adjusts the ML estimate for e_i by combining (or smoothing) predictions generated at higher orders with less sparsely estimated predictions from lower orders [5]. Specifically, PPM assigns some portion of the probability mass to accommodate predictions that do not appear in the training set using an escape method. The best-performing smoothing method is called mixtures (or interpolated smoothing), which computes a weighted combination of higher-order and lower-order models for every event in the sequence.
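The escape mechanism is easiest to see in code. The sketch below is a deliberately small n-gram model with escape method C and simple back-off to lower orders; it is our own illustration, not IDyOM or the PPM* configuration described in Section 3.1.1, and it omits refinements such as exclusion, interpolated smoothing, and unbounded context lengths. The class and method names (ToyPPM, train, prob) are ours.

```python
from collections import defaultdict

class ToyPPM:
    """Toy n-gram chord model with escape-method-C back-off (illustrative only)."""

    def __init__(self, order=3):
        self.order = order                               # longest context length
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, sequence):
        # Count every (context, symbol) pair for context lengths 0..order.
        for i, symbol in enumerate(sequence):
            self.vocab.add(symbol)
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[tuple(sequence[i - k:i])][symbol] += 1

    def prob(self, symbol, context):
        return self._prob(symbol, tuple(context[-self.order:]))

    def _prob(self, symbol, context):
        seen = self.counts[context]
        total, distinct = sum(seen.values()), len(seen)
        if total == 0:                                   # unseen context: escape at once
            return self._backoff(symbol, context)
        if symbol in seen:                               # escape method C for seen symbols
            return seen[symbol] / (total + distinct)
        # Reserve distinct / (total + distinct) of the mass for the escape,
        # and spend it on the lower-order estimate (no exclusion here).
        return (distinct / (total + distinct)) * self._backoff(symbol, context)

    def _backoff(self, symbol, context):
        if not context:                                  # order -1: uniform over the vocabulary
            return 1.0 / max(len(self.vocab), 1)
        return self._prob(symbol, context[1:])           # drop the oldest event

model = ToyPPM(order=2)
model.train(["I", "IV6", "V6", "I", "V", "vi"])          # tokens shown as Roman numerals for readability
print(model.prob("V", ["V6", "I"]))                      # 0.5: one of one observations, minus escape mass
```

In the full PPM* model the context length adapts for each prediction and the higher- and lower-order estimates are interpolated rather than backed off, but the escape bookkeeping follows the same pattern.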
3.1.1 Model Selection

To implement this model architecture, we apply the variable-order Markov model (called IDyOM) developed in [19], which is available for download at http://code.soundsoftware.ac.uk/projects/idyom-project. The model accommodates many possible configurations based on the selected global order bound, escape method, and training type. Rather than select a global order bound, researchers typically prefer an extension to PPM called PPM*, which uses simple heuristics to determine the optimal high-order context length for e_i, and which has been shown to outperform the traditional PPM scheme in several prediction tasks (e.g., [21]), so we apply that extension here. Regarding the escape method, recent studies have demonstrated the potential of method C to minimize model uncertainty in melodic and harmonic prediction tasks [12, 21], so we also employ that method here.

To improve model performance, Finite Context models often separately estimate and then combine two subordinate models trained on different subsets of the corpus: a long-term model (LTM+), which is trained on the entire corpus; and a short-term (or cache) model (STM), which is initially empty for each individual composition and then is trained incrementally (e.g., [8]). As a result, the LTM+ reflects inter-opus statistics from a large corpus of compositions, whereas the STM only reflects intra-opus statistics, some of which may be specific to that composition. Finally, the implementation used here also includes a configuration that combines the LTM+ and STM models using a weighted geometric mean (BOTH+) [20]. Thus, we report the LTM+, STM, and BOTH+ models for the analyses that follow. (Models featuring the + symbol incorporate both the statistics from the training set and the statistics from that portion of the test set that has already been predicted.)

3.2 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are powerful models designed for sequential modelling tasks. RNNs transform an input sequence x_1^N to an output sequence o_1^N through a non-linear projection into a hidden layer h_1^N, parameterised by weight matrices W_{hx}, W_{hh} and W_{oh}:

    h_i = \sigma_h(W_{hx} x_i + W_{hh} h_{i-1})    (1)
    o_i = \sigma_o(W_{oh} h_i),                    (2)

where \sigma_h and \sigma_o are the activation functions for the hidden layer (e.g. the sigmoid function) and the output layer (e.g. the softmax), respectively. We excluded bias terms for simplicity.

RNNs have become popular models for natural language processing due to their superior performance compared to Finite Context models [17]. Here, the input at each time step i is a (learnable) vector representation of the preceding symbol, v(e_{i-1}). The network's output o_i ∈ R^{N_types} is interpreted as the conditional probability over the next symbol, p(e_i | e_1^{i-1}). As outlined in Figure 3, this probability depends on all preceding symbols through the recurrent connection in the hidden layer.

[Figure 3. The basic architecture for an RNN-based language model. This model can easily accommodate more recurrent hidden layers or include additional skip-connections between the input and each hidden layer or the output. The first input, e_0, is a dummy symbol without an associated chord.]

During training, the categorical cross-entropy between the output o_i and the true chord symbol is minimised by adapting the weight matrices in Eqs. (1) and (2) using stochastic gradient descent and back-propagation through time.
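Eqs. (1) and (2) are compact enough to write out directly. The following NumPy sketch runs one time step of the plain RNN described above, with a sigmoid hidden layer and a softmax output over the chord vocabulary. The dimensions mirror the vocabulary size and layer widths reported in this paper, but the random initialisation and the names are ours, and the trained models in Section 3.2.1 use stacked LSTM/GRU layers rather than this vanilla cell.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_types, embed_dim, hidden_dim = 2590, 64, 128           # vocabulary and layer sizes

V    = rng.normal(scale=0.1, size=(n_types, embed_dim))  # chord embeddings v(e)
W_hx = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_oh = rng.normal(scale=0.1, size=(n_types, hidden_dim))

def rnn_step(prev_chord_id, h_prev):
    """One step of Eqs. (1)-(2): o is p(e_i | e_1^{i-1}) over all chord types."""
    x = V[prev_chord_id]                                            # v(e_{i-1})
    h = 1.0 / (1.0 + np.exp(-(W_hx @ x + W_hh @ h_prev)))           # Eq. (1), sigmoid
    o = softmax(W_oh @ h)                                           # Eq. (2), softmax
    return o, h

h = np.zeros(hidden_dim)
o, h = rnn_step(prev_chord_id=0, h_prev=h)      # e_0 is the dummy start symbol
print(o.shape, round(o.sum(), 6))               # (2590,) 1.0
```

Training wraps repeated applications of this step in a categorical cross-entropy loss over the true chord at each onset and back-propagates the error through time, as described above.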
However, this training procedure suffers from vanishing and exploding gradients because of the recursive dot product in Eq. (1). The latter problem can be averted by clipping the gradient values; the former, however, is trickier to prevent, and necessitates more complex recurrent structures such as the long short-term memory unit (LSTM) [13] or the gated recurrent unit (GRU) [4]. These units have become standard features of RNN-based language modelling architectures [16].

3.2.1 Model Selection

Selecting good hyper-parameters is crucial for neural networks to perform well. To this end, we performed a number of preliminary experiments to tune the networks. Our final architecture comprises two layers of 128 recurrent units each (either LSTM or GRU), a learnable input embedding of 64 dimensions (i.e. v(·) maps each chord class to a vector in R^64), and skip connections between the input and all other layers.

RNNs are prone to over-fit the training data. We use the network's performance on held-out data to identify this issue. Since we employ 4-fold cross-validation (see Sec. 4 for details), we hold out one of the three training folds as a validation set. If the results on these data do not improve for 10 epochs, we stop training and select the model with the lowest cross-entropy on the validation data.

We trained the networks for a maximum of 200 epochs, using stochastic gradient descent with a mini-batch size of 4. Each of these 4 data points is a sequence of at most 300 chords. The gradient updates are scaled using the Adam update rule [14] with standard parameters. To prevent exploding gradients, we clip gradient values larger than 1.

4. EXPERIMENTS

4.1 Evaluation

To evaluate performance using a more refined method than one simply based on the accuracy of the model's predictions, we use a statistic called corpus cross-entropy, denoted by H_m:

    H_m(p_m, e_1^j) = -\frac{1}{j} \sum_{i=1}^{j} \log_2 p_m(e_i | e_1^{i-1})    (3)

H_m represents the average information content for the model probabilities estimated by p_m over all events in the sequence e_1^j. That is, cross-entropy provides an estimate of how uncertain a model is, on average, when predicting a given sequence of events [21], regardless of whether the correct symbol for each event was assigned the highest probability in the distribution. Finally, we employ 4-fold cross-validation stratified by dataset for both model architectures, using cross-entropy as a measure of performance.

4.2 Results

We first compare the average cross-entropy estimates across the entire corpus using Finite Context models and RNNs, and then examine the estimates across datasets for the best-performing model configuration from each architecture. We conclude by examining the differences between these models in a regression analysis.

4.2.1 Comparing Models

Table 2 presents the average cross-entropy estimates for each model configuration. For the purposes of statistical inference, we also include the 95% bootstrap confidence interval using the bias-corrected and accelerated percentile method [9].

Model                         H_m      95% CI*
Finite Context
  LTM+                        4.895    4.811-4.978
  STM                         6.710    6.600-6.820
  BOTH+                       4.893    4.800-4.966
Recurrent Neural Network
  LSTM                        5.583    5.539-5.626
  GRU                         5.600    5.551-5.645

* CI refers to the 95% bootstrap confidence interval of H_m using the bias-corrected and accelerated percentile method with 1000 replicates.

Table 2. Model comparison using cross-entropy as an evaluation metric.
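The H_m values in Table 2 follow directly from Eq. (3), averaged over the held-out sequences in each fold. A minimal sketch of the per-sequence computation, assuming a model object that exposes a per-event predictive probability (like the toy n-gram sketch in Section 3.1), is shown below; the function name and interface are ours, not IDyOM's or the authors' evaluation code.

```python
import math

def corpus_cross_entropy(model, sequence):
    """Eq. (3): average information content, in bits, of a model's
    per-event predictions over one chord sequence."""
    total_bits = 0.0
    for i, event in enumerate(sequence):
        p = model.prob(event, sequence[:i])     # p_m(e_i | e_1^{i-1})
        total_bits -= math.log2(p)
    return total_bits / len(sequence)

# Example with the toy n-gram model sketched earlier:
# model = ToyPPM(order=2); model.train(train_sequence)
# print(corpus_cross_entropy(model, test_sequence))
```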
For the Finite Context models, BOTH+ produced the lowest cross-entropy estimates on average, though the difference between BOTH+ and LTM+ was negligible. STM was the worst-performing model overall, which is unsurprising given the restrictions placed on the model's training parameters (i.e., that it only trains on the already-predicted portion of the test set). Of the RNN models, LSTM slightly outperformed GRU, but again this difference was negligible.

What is more, the long-term Finite Context models (BOTH+ and LTM+) significantly outperformed both RNNs. This finding could suggest that context models are better suited to music corpora, since the datasets for melodic and harmonic prediction are generally minuscule relative to those in the NLP community [15]. The encoding scheme for this study also produced a large vocabulary (2590 symbols), so the PPM* algorithm might be useful when the model is forced to predict particularly rare types in the corpus.

4.2.2 Comparing Datasets

To identify the differences between these models for each of the datasets in the corpus, Figure 4 presents bar plots for the best-performing model configurations from each model architecture: BOTH+ from the Finite Context model, and LSTM from the RNN model.

[Figure 4. Bar plots of H_m (bits) for the best-performing Finite Context (BOTH+) and RNN (LSTM) configurations, grouped by dataset (Bach Chorale; Haydn, Mozart, and Beethoven Quartets; Mozart, Beethoven, Chopin, and Joplin Piano; Assorted Symphony). Whiskers represent the 95% bootstrap confidence interval of the mean using the bias-corrected and accelerated percentile method with 1000 replicates.]

On average, BOTH+ produced the lowest cross-entropy estimates for the piano datasets (Mozart, Beethoven, Joplin), but much higher estimates for the other datasets. This effect was not observed for LSTM, however, with the datasets' genre (chorale, piano work, quartet, or symphony) apparently playing no role in the model's overall performance.

The difference between these two model architectures for the Joplin and Mozart piano datasets is particularly striking. Given that piano works generally feature fewer homorhythmic textures relative to the other genres in this corpus, it could be the case that the piano datasets contain a larger proportion of rare, monophonic chord types relative to the other datasets. The next section examines this hypothesis using a regression model.

4.2.3 A Regression Model

Given the complexity of the corpus, a number of factors might explain the performance of these models. Thus, we have included the following five predictors in a multiple linear regression (MLR) model to explain the average cross-entropy estimates for the compositions in the corpus (N = 1136). (Four of the 1116 compositions were further subdivided in the selected datasets, producing an additional 20 sequences in the analyses: Beethoven, Quartet No. 6, Op. 18, iv (2); Chopin, Op. 12 (2); Mozart, Piano Sonata No. 6, K. 284, iii (13); Mozart, Piano Sonata No. 11, K. 331, i (7).)

N tokens: Cache (i.e., STM) and RNN-based language models often benefit from datasets that feature longer sequences by exploiting statistical regularities in the portion of the test sequence that was already predicted. Thus, N tokens represents the number of tokens in each sequence. Compositions featuring more tokens should receive lower cross-entropy estimates on average.

N types: Language models struggle with data sparsity as n increases (i.e., the zero-frequency problem). One solution is to select corpora for which the vocabulary of possible distinct types is relatively small. Thus, N types represents the number of types in each sequence. Compositions with larger vocabularies should receive higher cross-entropy estimates on average.
Improbable: Events that occur with low probability in the zeroth-order distribution are particularly difficult to predict due to the data sparsity problem just mentioned. Thus, Improbable represents the proportion of tokens in each sequence that appear in the bottom 10% of types in the zeroth-order probability distribution. Compositions with a large proportion of these particularly rare types should receive higher cross-entropy estimates on average.

Monophonic: Chorales feature homorhythmic textures in which each temporal onset includes multiple coincident pitch events. The chord types representing these tokens should be particularly common in this corpus, but some genres might also feature polyphonic textures in which the number of coincident events is potentially quite low (e.g., piano). Thus, Monophonic represents the proportion of tokens in each sequence that consist of only one pitch event. Compositions with a large proportion of these monophonic events should receive higher cross-entropy estimates on average.

Repetition: Compared to chord-class corpora, data-driven corpora are far more likely to feature adjacent repetitions of tokens. Thus, Repetition represents the proportion of tokens in each sequence that feature adjacent repetitions. Compositions with a large proportion of repetitions should receive lower cross-entropy estimates on average.

Table 3 presents the results of a stepwise regression analysis predicting the average cross-entropy estimates with the aforementioned predictors. R^2 refers to the fit of the model, where a value of 1 indicates that the model accounts for all of the variance in the outcome variable (i.e., a perfectly linear relationship between the predictors and the cross-entropy estimates). The slope measured for each predictor, denoted by β, represents the change in the outcome resulting from a unit change in the predictor.

Model    Predictor      β        R^2
BOTH+    N tokens       -2.079   .212
         N types         1.860   .506
         Monophonic      0.233   .506
         Improbable      0.076   .530
LSTM     Improbable      0.463   .318
         Repetition     -0.558   .375
         N types         0.817   .504
         Monophonic      0.452   .568
         N tokens       -0.554   .591

Note: Each predictor appears in the order specified by stepwise selection, with R^2 estimated at each step. However, β presents the standardized betas estimated in the model's final step.

Table 3. Stepwise regression analysis predicting the average H_m estimated for each composition from the best-performing model configurations with characteristic features of the corpus.

For the Finite Context model (BOTH+), four of the five predictors explained 53% of the variance in the cross-entropy estimates. As predicted, cross-entropy decreased as the number of tokens increased, suggesting that the model learned from past tokens in the sequence. What is more, cross-entropy increased as the vocabulary increased, as well as when the proportion of monophonic or improbable tokens increased, though the latter two predictors had little effect on the model.

For the RNN model, the effect of these predictors was strikingly different. In this case, cross-entropy increased with the proportion of improbable events. Note that this predictor played only a minor role for the Finite Context model, which suggests PPM* may be responsible for that model's superior performance. For the remaining predictors, cross-entropy estimates decreased when the proportion of adjacent repeated tokens increased. Like the Finite Context model, the RNN model also struggled when the proportion of monophonic tokens increased, but benefited from longer sequences featuring smaller vocabularies.
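The five predictors above can be computed directly from each encoded chord sequence. The sketch below gives one plausible reading of their definitions; it is our illustration rather than the authors' analysis code, and the treatment of the "bottom 10% of types" and of adjacent repetitions reflects our assumptions about details the paper leaves open.

```python
def regression_predictors(sequence, zeroth_order_counts):
    """Compute the five predictors of Section 4.2.3 for one composition.

    sequence:             list of (S, I) chord tokens, with S a 3-tuple of
                          intervals above the bass (None = ⊥).
    zeroth_order_counts:  corpus-wide token frequencies (e.g. a dict or
                          collections.Counter) defining the zeroth-order
                          probability distribution.
    """
    n_tokens = len(sequence)
    n_types = len(set(sequence))

    # Improbable: proportion of tokens whose type falls in the bottom 10% of
    # types in the zeroth-order distribution (our reading of the definition).
    ranked = sorted(zeroth_order_counts, key=zeroth_order_counts.get)
    rare_types = set(ranked[:max(1, int(0.10 * len(ranked)))])
    improbable = sum(tok in rare_types for tok in sequence) / n_tokens

    # Monophonic: proportion of tokens consisting of a single pitch event,
    # i.e. no vertical intervals above the bass.
    monophonic = sum(all(iv is None for iv in S) for S, _ in sequence) / n_tokens

    # Repetition: proportion of tokens that simply repeat the preceding token.
    repetition = sum(a == b for a, b in zip(sequence, sequence[1:])) / n_tokens

    return dict(n_tokens=n_tokens, n_types=n_types, improbable=improbable,
                monophonic=monophonic, repetition=repetition)
```

The resulting per-composition values, standardized and paired with the corresponding H_m estimates, could then be entered into an ordinary least-squares regression with stepwise predictor selection to produce an analysis of the kind reported in Table 3.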
5. CONCLUSION

This study examined the potential for language models to predict chords in a large-scale corpus of tonal compositions from the common-practice period. To that end, we developed a flexible chord representation scheme that (1) made minimal a priori assumptions about the chord typology underlying tonal music, and (2) allowed us to create a much larger corpus relative to those based on chord annotations. Our findings demonstrate that Finite Context models outperform RNNs, particularly in the piano datasets, which suggests PPM* is responsible for the superior performance, since it assigns a portion of the probability mass to potentially rare, as-yet-unseen types. A regression analysis generally confirmed this hypothesis, with LSTM struggling to predict the improbable types from the piano datasets.

To our knowledge, this is the first language-modelling study to use such a large vocabulary of chord types, though this approach is far more common in the NLP community, where the selected corpus can sometimes contain millions of distinct word types. Our goal in doing so was to bridge the gulf between the most current data-driven methods for melodic and harmonic prediction on the one hand [24], and applications of chord typologies for the creation of corpora using expert analysts on the other [3]. Indeed, despite recent efforts to determine the efficacy of language models for annotated corpora [11, 15], relatively little has been done to develop unsupervised methods for the discovery of tonal harmony in predictive contexts.

One serious limitation of the architectures examined in this study is their unwavering commitment to the surface. Rather than skipping seemingly inconsequential onsets, such as those containing embellishing tones or repetitions, these models predict every onset in their path. As a result, the model configurations examined here attempted to predict tonal (pitch) content rather than tonal harmonic progressions per se. In our view, word class models could provide the necessary bridge between the bottom-up and top-down approaches just described by reducing the vocabulary of surface simultaneities to its most essential harmonies [2]. Along with prediction tasks, these models could then be adapted for sequence generation and automatic harmonic analysis, and in so doing, provide converging evidence that the statistical regularities characterizing a tonal corpus also reflect the order in which its constituent harmonies occur.

6. ACKNOWLEDGMENTS

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 670035).

7. REFERENCES

[1] J. Albrecht and D. Shanahan. The use of large corpora to train a new type of key-finding algorithm: An improved treatment of the minor mode. Music Perception, 31(1):59-67, 2013.
[2] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.

[3] J. A. Burgoyne, J. Wild, and I. Fujinaga. An expert ground truth set for audio chord recognition and music analysis. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), Miami, USA, 2011.

[4] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259 [cs, stat], September 2014.

[5] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396-402, 1984.

[6] D. Conklin. Representation and discovery of vertical patterns in music. In C. Anagnostopoulou, M. Ferrand, and A. Smaill, editors, Music and Artificial Intelligence: Lecture Notes in Artificial Intelligence 2445, volume 2445, pages 32-42. Springer-Verlag, 2002.

[7] D. Conklin. Multiple viewpoint systems for music classification. Journal of New Music Research, 42(1):19-26, 2013.

[8] D. Conklin and I. H. Witten. Multiple viewpoint systems for music prediction. Journal of New Music Research, 24(1):51-73, 1995.

[9] T. J. DiCiccio and B. Efron. Bootstrap confidence intervals. Statistical Science, 11(3):189-228, 1996.

[10] S. Flossmann, W. Goebl, M. Grachten, B. Niedermayer, and G. Widmer. The Magaloff project: An interim report. Journal of New Music Research, 39(4):363-377, 2010.

[11] B. Di Giorgi, S. Dixon, M. Zanoni, and A. Sarti. A data-driven model of tonal chord sequence complexity. IEEE/ACM Transactions on Audio, Speech and Language Processing, 25(11):2237-2250, 2017.

[12] T. Hedges and G. A. Wiggins. The prediction of merged attributes with multiple viewpoint systems. Journal of New Music Research, 2016.

[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, November 1997.

[14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[15] F. Korzeniowski, D. R. W. Sears, and G. Widmer. A large-scale study of language models for chord prediction. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018.

[16] G. Melis, C. Dyer, and P. Blunsom. On the state of the art of evaluation in neural language models. In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada, April 2018.

[17] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pages 1045-1048, Chiba, Japan, 2010.

[18] D. Müllensiefen and M. Pendzich. Court decisions on music plagiarism and the predictive value of similarity algorithms. Musicæ Scientiæ, Discussion Forum 4B:257-295, 2009.

[19] M. T. Pearce. The Construction and Evaluation of Statistical Models of Melodic Structure in Music Perception and Composition. PhD thesis, City University, London, 2005.

[20] M. T. Pearce, D. Conklin, and G. A. Wiggins. Methods for combining statistical models of music, pages 295-312. Springer Verlag, Heidelberg, Germany, 2005.

[21] M. T. Pearce and G. A. Wiggins. Improved methods for statistical modelling of monophonic music. Journal of New Music Research, 33(4):367-385, 2004.
[22] I. Quinn. Are pitch-class profiles really "key for key"? Zeitschrift der Gesellschaft für Musiktheorie, 7:151-163, 2010.

[23] D. R. W. Sears. The Classical Cadence as a Closing Schema: Learning, Memory, and Perception. PhD thesis, McGill University, Montreal, Canada, 2016.

[24] D. R. W. Sears, M. T. Pearce, W. E. Caplin, and S. McAdams. Simulating melodic and harmonic expectations for tonal cadences using probabilistic models. Journal of New Music Research, 47(1):29-52, 2018.

[25] D. Temperley and D. Sleator. Modeling meter and harmony: A preference-rule approach. Computer Music Journal, 23(1):10-27, 1999.

[26] G. Widmer. Using AI and machine learning to study expressive music performance: Project survey and first report. AI Communications, 14(3):149-162, 2001.

[27] I. H. Witten and T. C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085-1094, 1991.