
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.22621
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Progress in automatic chord recognition has been slow since the advent of deep learning in the field. To understand why, I conduct experiments on existing methods and test hypotheses enabled by recent developments in generative models. Findings show that chord classifiers perform poorly on rare chords and that pitch augmentation boosts accuracy. Features extracted from generative models do not help and synthetic data presents an exciting avenue for future work. I conclude by improving the interpretability of model outputs with beat detection, reporting some of the best results in the field and providing qualitative analysis. Much work remains to solve automatic chord recognition, but I hope this thesis will chart a path for others to try.

📄 Full Content

This project was planned in accordance with the Informatics Research Ethics policy. It did not involve any aspects that required approval from the Informatics Research Ethics committee.

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Pierre Lardet)

Chords form an integral part of music. Part of how musicians understand music is through harmonic structure. Chord annotations are a symbolic representation of the chords in a piece of music. They allow music to be easily shared, performed, improvised and analysed. However, chord annotations available online are not always free or of sufficient quality, because creating high-quality chord annotations requires a trained musician.

To this end, I investigate the use of deep learning in automatic chord recognition to create chord annotations of music. Data-driven methods have dominated the field for over a decade. However, the significant progress of early models has not continued in recent years; the problem remains far from solved.

In this work, I first aim to understand why performance improvements have stagnated. I implement a standard benchmark model and conduct a thorough analysis of its behaviour. This involves looking at the model’s common mistakes, performance on rarer chords and how its predictions relate to time. I then use these observations to study different methods of improving these models.

I also conduct novel research on using generative models as both feature extractors and a source of new data. This is enabled by chord-conditioned generative models developed in recent years. I conclude by rethinking how the model predicts chords in time by incorporating beat estimation. This work goes towards enabling software which can be used to better understand, create and learn music. Easily accessible and accurate chord recognition models would allow producers to better understand their work and musicologists to study larger datasets. Musicians and hobbyists could access chord annotations for their favourite songs or analyse their performances and improvisations.

The analysis of existing models, exploration of improvements and discussion of new research directions constitute a novel contribution to the field of automatic chord recognition. Despite the lack of performance improvements in recent years, I hope this work motivates others to continue pursuing research aiming to solve the problem posed by automatic chord recognition.

The thesis is structured as follows:

• Chapter 2 provides background information on harmony, chord recognition and musical data. I then discuss existing literature on the subject, pointing out trends in the field and the most exciting avenues for research.

• Chapter 3 describes the datasets, evaluation metrics and training procedure used.

• Chapter 4 contains the implementation of a convolutional neural network from the literature, followed by an analysis of its properties and predictions. I observe behaviours which provide opportunities to improve the model.

• Chapter 5 extends this work by studying various methods of improvement. Some of these experiments analyse existing improvements, while others present novel avenues of research.

• Chapter 6 concludes the thesis and provides suggestions for future work.

All code is available on GitHub.¹ Data can be made available upon request.²

Chapter 2

In this chapter, I first introduce harmony and chords and their role in music. I then discuss how music can be represented as input to a machine learning model. This is followed by an overview of the field of automatic chord recognition (ACR). This includes the datasets and models that are commonly used in ACR, the challenges that are faced in this field and future directions.

Harmony is the combination of simultaneously sounded notes. A common interpretation of such sounds is as a chord. Chords can be thought of as a collection of at least two notes built from a root note and scale. Any notes from the scale can be present, but the most common are the third, fifth and seventh. A chord’s quality is determined by the intervals between notes in the chord irrespective of the root note. The most common qualities are major and minor. Many other qualities exist, such as diminished and suspended. Chords can be played in inversion, where the root note is not the lowest. In this work, chords are represented using Harte notation [1].

Chords can be closely related. C:maj7 is very close to C:maj. The only difference is an added major seventh. An important relation in music theory is between relative major/minor chords. These pairs of chords are built from the same scale and share many notes. For example, G:maj and E:min are related in this way. It is possible for different chords to share the same set of pitch classes like G:maj6 and E:min7.

Chords are an important part of music. They provide harmonic context for a melody and can be used to convey emotion, tension and release [2]. They are also crucial for improvisation, where musicians will play notes that fit the chord progression [3]. Contemporary guitar music is often represented by a chord sequence. Chords are also important for songwriting and production, where a chord progression can form the basis of a song. Music analysis also makes heavy use of chords. Musicologists can analyse the harmonic structure of a piece to understand the composer’s intentions and why we enjoy the music [4].

Chord recognition is the task of identifying which chord is playing at any moment in a piece of music. This can be useful for creating notated versions of songs for musicians, musicologists and music recommendation. Those wishing to learn a song may visit websites such as Ultimate Guitar¹ where users submit chord annotations for songs. Musicologists may wish to analyse the harmonic structure of a piece of music. Music recommendation systems can recommend songs based on their harmonic content, as similar music will often have similar harmonic content [5]. For example, modern pop music famously uses many similar chords², while contemporary jazz music is known for its complex and rich exploration of harmony.

All of the above motivate the need for accurate chord annotations. However, annotations from online sources can be of varying quality and may not be available for all songs [6].

The task of annotating chords is time-consuming and requires a trained musician [7].

Automatic chord recognition systems have the potential to alleviate these problems by providing a fast, accurate and scalable solution.

Unfortunately, chord recognition is a non-trivial task. Which chord is playing is inherently ambiguous. Different chords can share the same notes. The same chord can be played on different instruments with unique timbres. Precisely when a chord starts and ends can be ambiguous. Whether a melody note is part of a chord, and whether a melody alone is enough to imply harmonic content, are both open to interpretation. In order to identify a chord, data across time must be considered: for example, a chord may be vamped or arpeggiated. Audio also contains many elements unhelpful for chord recognition, such as reverb, distortion and unpitched percussion.

Recorded music can be represented in a variety of ways on a computer. The simplest is as a raw time series of amplitudes, referred to as the audio’s waveform. Data in the raw audio domain has been applied in generative models such as Jukebox [8] and MusicGen [9] and autoencoders [10].

Spectrogram: A spectrogram is a transformation of a waveform into the time-frequency domain calculated via a short-time Fourier transform (STFT). Spectrograms are commonly used in many audio processing tasks such as audio search [11] and music transcription [12]. As of yet, linear spectrograms computed using the STFT have not been used in ACR tasks [13]. Other transformations relate better to how humans understand pitch and harmony.

CQT: A common version of the spectrogram used in music transcription is the constant-Q transform (CQT) [14]. The frequency bins of a CQT are logarithmically spaced and have widths proportional to frequency. This is motivated by the logarithmic nature of how humans perceive pitch intervals in music: a sine wave with double the frequency is perceived as one octave higher. As such, CQTs are used in many music transcription tasks and are very popular for ACR [15,16]. The hop length of the CQT is the number of waveform samples between successive analysis frames; it determines the time resolution of the CQT, with a shorter hop length giving a higher time resolution. An example CQT from the dataset used in this work is shown in Figure 2.1. As Korzeniowski and Widmer [17] note, CQTs are preferred to other spectrograms for ACR due to their finer resolution at lower frequencies and for the ease with which pitch can be studied and manipulated. For example, CQTs make pitch shifting possible through a simple shift of the CQT bins. Another logarithmic variant of the spectrogram is the mel-spectrogram, based on the mel scale [18]. It is intended to mimic the human ear's perception of sound and is commonly used in speech recognition [19] but has also been used in music transcription tasks [20].

Chroma Vectors: Chroma vectors are a 12-dimensional time-series representation where each dimension corresponds to a pitch class. Each element represents the strength of a pitch class in the Western chromatic scale. Such features have been generated by deep learning methods [21] or by hand-crafted methods [22,23] and have seen use in recent ACR models [24]. A representation of a song as a chroma vector over time can be thought of as another type of spectrogram, referred to as a chromagram.

Generative Features: More recently, features extracted from generative music models have been used as input, referred to in this work as generative features. The proposed benefit is that the vast quantities of data used to train these models lead to rich representations of the music. These features have been shown to contain useful information for music information retrieval (MIR) tasks [25]. Donahue and Liang [20] use features from Jukebox [8] to train a transformer [26] for both melody transcription and chord recognition. They found that these features outperformed mel-spectrograms in melody transcription but did not report results for ACR nor comparisons with CQTs.

The field of ACR has seen considerable research since the seminal work of Fujishima [27] in 1999. Below, I provide a brief overview of the field including the datasets, metrics and models that are commonly used. I conclude by discussing some of the common challenges faced and motivating the research carried out in this project.

Sources of data that have seen common use in ACR relevant to this work include:

• McGill Billboard: over 1000 chord annotations of songs randomly selected from the Billboard 'Hot 100' chart between 1958 and 1991 [7].

• Isophonics: 300 annotations of songs from albums by The Beatles, Carole King and Zweieck [28].

• RWC-Pop: 100 pop songs with chord annotations available³ [29].

• USPop: 195 annotations of songs chosen for popularity [30].

• JAAH: 113 annotations of a collection of jazz recordings [31].

• HookTheory: 50 hours of labelled audio in the form of short musical segments, crowdsourced from an online forum called HookTheory⁴ [20].

Many of these have been compiled into the Chord Corpus by de Berardinis et al. [6] with standardised annotation formats. However, audio is scarce due to copyright issues. The most common dataset is comprised of 1217 songs compiled from the first four of the above collections. This dataset is dominated by pop songs.

Another problem is that the existing data is imbalanced, with a large number of major and minor chords and fewer instances of chords with more obscure qualities. This can lead to models that are biased towards predicting major and minor chords. Attempts to address such an imbalanced distribution have been made by weighting the loss function [32], adding ‘structure’ to the chord targets [16,32], re-sampling training examples to balance chord classes [21] and curriculum learning [33].

Pitch Augmentation: Due to the lack of labelled data, data augmentation via pitch shifting has been applied to ACR. Input features are pitch shifted while chords are transposed. McFee and Bello [16] note the large increase in performance using pitch shifting directly on the audio. Other works have since used pitch shifting directly on the CQT [32]. No work has compared the two methods.

Synthetic Data Generation: Data has been scaled up using augmentation and semi-supervised learning with some success [34]. Research has been done into the use of synthetic data [35,36] and self-supervised learning [37] for other MIR tasks but not for ACR. With the advent of new models which accept chord-conditioned input [38], the possibility of generating synthetic data for ACR is an exciting avenue of research.

Model Architectures: Since the work of Humphrey and Bello [39], chord recognition has been predominantly tackled by deep learning architectures. The authors used a convolutional neural network (CNN) to classify chords from a CQT. CNNs have been combined with recurrent neural networks (RNNs) [40,32,16] with a CNN performing feature extraction from a spectrogram and an RNN sharing information across frames. More recently, transformers have been applied in place of the RNN [20,24,41,33,42].

Despite increasingly complex models being proposed, performance has not improved by much. In fact, Park et al. [42] found that their transformer performed marginally worse than a CNN. Humphrey and Bello [43] talk of a ‘glass ceiling’ with increases in performance stagnating after the advent of deep learning in ACR. This was 10 years ago and the situation has not changed significantly. Despite this, continued efforts have been made to develop complex models with the sole motivation of improving performance. This has led to overly complex ACR models seeing use in other MIR tasks such as chord-conditioned generation where Lan et al. [38] use the model developed by Park et al. [42] despite its lack of improvement over simpler predecessors. Furthermore, there is little comparison to simple baselines to provide context for the performance gain associated with increasing model complexity.

Decoding: A decoding step is often performed on the probabilities outputted by the neural network. This can smooth predictions and share information across frames. Miller et al. [21] use a hidden Markov model (HMM), treating the probability distributions over chords generated by the model as emission probabilities and constructing a handcrafted transition function. Other works have used conditional random fields (CRF) to model the dependencies between chords [32]. Both HMMs and CRFs can use either learned transition matrices or homogeneous penalties for transitions to different chords.

It is unclear whether or not learning transitions is better. In both cases, self-transition probabilities are very large and Cho et al. [44] argue that increases in performance can be mostly attributed to the reduction in the number of transitions. However, a more recent analysis of such behaviour is missing from the literature.

Model Analysis: Korzeniowski and Widmer [45] visualise the outputs of layers of the CNN and find that some feature maps correspond to the presence of specific pitches and intervals. Korzeniowski and Widmer [17] visualise the importance of different parts of an input CQT using saliency maps, noting the clear correlation between pitch classes present in a chord and the saliency maps. Confusion matrices over chord roots and qualities are also commonly used to analyse the performance of models. For example, McFee and Bello [16] found that similar qualities are often confused with each other and that the model favours the most common chord qualities. Park et al. [42] attempt to interpret attention maps produced by their transformer as musically meaningful.

Regardless of such analyses, too much effort is spent on motivating complex model architectures with a focus on minor improvements in performance. In this work, I will conduct a thorough analysis of an existing model. I will take inspiration from some of the analyses above while adding a more nuanced understanding of the model’s behaviour and failure modes by way of example.

Chords exist in time. How the time dimension is processed prior to being fed into the model matters. When audio is transformed into a spectrogram, each vector of frequencies represents a fixed length of time, called a frame. The frame length is determined by the hop length used when calculating the CQT. Constant frame lengths can be made short enough such that the constraint imposed on the model to output chord predictions on a per-frame basis is not limiting. However, different hop lengths have been used, varying from 512 [32] up to 4096 [16]. Which hop length works best remains unclear.

More recently, Donahue and Liang [20] used a frame length determined by beats detected from the audio. Because they focus primarily on melody transcription, they define frames to be a 1/16th note, ≈ 125ms at 120 beats per minute (BPM). Such beat synchronicity has been proposed for chord recognition. The underlying assumption is that chords tend to change on the beat. This reduces the computational cost of running the model due to a decreased frame rate and, more importantly, leads to a more musically meaningful interpretation of the output. However, Cho et al. [44] and Cho and Bello [46] argue that because beat detection is far from perfect, restricting frames to beats can hurt performance. Beat detection models have improved since then, and a proper analysis of beat-synchronous chord recognition in the modern setting is lacking in the literature. Durán and de la Cuadra [47] jointly estimate beats and chords but use a different jazz-specific dataset and do not analyse how beat-wise predictions affect performance.

Pauwels et al. [13] provide an overview of ACR from the seminal work of Fujishima [27] in 1999 up to 2019 and suggest future avenues of research. These include the use of different representations for both audio and chords, addressing the mismatch between chord changes and the discretised frames fed to a model, considering larger structures in music such as verses and choruses, incorporating other elements of the music such as melody and genre, methods of handling the subjectivity of chords, and the imbalance present in chord datasets. Since then, different works have addressed some of these problems in various ways, with the focus primarily on addressing the imbalance in chord datasets.

In this work, I will implement a simple model that remains competitive with the state-of-the-art [16]. I will then conduct a thorough analysis of the model and its architecture. I will look at common methods for improving ACR models with more detailed analyses than have previously been conducted. This analysis will provide insight into the strengths and weaknesses of such models and may also guide further improvements. I will also look at novel methods of improvement made possible by generative and beat detection models. This includes the use of generative features and synthetic data as input to the model, as well as beat-synchronous frames. Finally, I will evaluate the improved models in terms of their performance and as a tool for musicians and musicologists.

In this chapter, I outline the datasets used in this work, the preprocessing applied to the audio and chord annotations, the evaluation metrics used to compare the models and details of the training process.

I spent the initial period of this project finding a suitable dataset for training and testing. I use the approach of Humphrey and Bello [43] to combine some of the known datasets for chord recognition. The dataset consists of subsets of the above sources, filtered for duplicates and selected for those with annotations available. In total, there are 1,213 songs. The dataset was provided with obfuscated filenames, audio as mp3 files and annotations as JAMS files [48].

Several possible sources of error in the dataset are investigated below.

Duplicates: Files are renamed using provided metadata identifying them by artist and song title. This is done to identify duplicates in the dataset. There is only one: Blondie's One Way or Another, which has two different recordings; it is removed from the dataset. Automatic duplicate detection is conducted by embedding each track using mel-frequency cepstral coefficients (MFCCs) [49]. This representation is commonly used to embed audio into a low-dimensional space and provides a fast and easy way of quantifying similarity. Audio is passed through the mfcc function provided in librosa [23] with 20 coefficients. A song's embedding is calculated as the mean MFCC over all frames. Cosine similarities are then calculated for all pairs of tracks. None of the top 50 similarity scores showed any sign of duplication. I proceed with the assumption that there are no further duplicates in the dataset.
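A minimal sketch of this duplicate check, assuming a list of audio file paths (the function names are my own):

```python
import numpy as np
import librosa

def song_embedding(path, n_mfcc=20):
    """Embed a track as the mean MFCC vector over all frames."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    return mfcc.mean(axis=1)

def top_similar_pairs(paths, k=50):
    """Return the k most similar pairs of tracks by cosine similarity."""
    embs = np.stack([song_embedding(p) for p in paths])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    sims = embs @ embs.T
    pairs = [(sims[i, j], paths[i], paths[j])
             for i in range(len(paths)) for j in range(i + 1, len(paths))]
    return sorted(pairs, reverse=True)[:k]
```

The top pairs returned by such a check are then inspected manually, as described above.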

Chord-Audio Alignment: It is pertinent to verify that the chord annotations align with the audio. Misaligned annotations could make training impossible. Ten songs are manually investigated for alignment issues by listening to the audio and comparing it to the annotations directly. The annotations are all well-timed with detailed chord labels.

Automatic analysis of the alignment of the audio and chord annotations is also done using the cross-correlation between the derivative of the CQT over time and the chord annotation. A maximum correlation at a lag of zero would indicate good alignment as the audio changes at the same time as the annotation. The derivative of the CQT in the time dimension is estimated using the delta function from librosa. The chord annotations are converted to a binary vector, where each element corresponds to a frame in the CQT and is 1 if a chord change occurs at that frame and 0 otherwise. Both the CQT derivatives and binary vectors are normalised by subtracting the mean and dividing by the standard deviation. Finally, cross-correlation is computed using the correlate function from numpy. A typical cross-correlation for a song is visualised in Appendix A.3. The cross-correlation is periodic and repeats every 20 frames or so.

Listening to the song, the period of repetition is a fraction of a bar length.
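The per-song check can be sketched as follows. Here `cqt_db` is the dB-scaled CQT and `change_times` the annotated chord-change times in seconds; collapsing the per-bin derivatives into a single change-strength signal is my own simplification of the procedure described above:

```python
import numpy as np
import librosa

SR, HOP = 44100, 4096

def alignment_lag(cqt_db, change_times):
    """Lag (in frames) at which CQT change strength best matches the annotation."""
    # Rate of spectral change per frame, summed over frequency bins.
    onset = np.abs(librosa.feature.delta(cqt_db)).sum(axis=0)
    # Binary vector: 1 in frames where an annotated chord change occurs.
    changes = np.zeros(cqt_db.shape[1])
    idx = (np.asarray(change_times) * SR / HOP).astype(int)
    changes[idx[idx < len(changes)]] = 1.0
    # Normalise both signals, then cross-correlate.
    onset = (onset - onset.mean()) / onset.std()
    changes = (changes - changes.mean()) / changes.std()
    xcorr = np.correlate(onset, changes, mode="full")
    lags = np.arange(-len(changes) + 1, len(onset))
    return lags[np.argmax(xcorr)]  # a lag of 0 indicates good alignment
```

The lag at which the cross-correlation peaks is what the histogram below summarises across songs.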

To check alignment across the dataset, I plot a histogram of the lag of the maximum cross-correlation over songs in Figure 3.1. Under the assumption that the annotations are not misaligned by more than 5 seconds, the region of possible maximal lags is restricted to a window of 50 frames on either side of 0. This restriction does not change the shape of the distribution; instead, focusing on a reduced set of lags allows more detail to be visible. The majority of songs have a maximum lag close to 0, with a few outliers that can be attributed to noise. A final check is done by looking at the difference in length between the audio files and chord annotations. A histogram of differences in length is also shown in the figure. The majority of songs have a difference in length of 0, with a few outliers almost all less than a second. This evidence, combined with the qualitative analysis, is convincing enough to leave the annotations as they are for training.

Incorrect and Subjective Annotations: Throughout manual listening, no obviously wrong annotations were found. However, looking at the songs on which the preliminary models perform worst under the mirex metric, three songs stick out. 'Lovely Rita' by The Beatles, 'Let Me Get to Know You' by Paul Anka and 'Nowhere to Run' by Martha Reeves and the Vandellas all had scores below 0.05. In these songs, the model consistently guessed chords one semitone off, as if it thought the song was in a different key. Upon listening, it became clear that the tuning was not standard A440Hz for the first two songs and that the key of the annotation was wrong for the third. These songs were removed from the dataset; all reported results exclude them. No other songs were found to have such issues.

Chord annotations are inherently subjective to some extent. Detailed examples in pop are given by Humphrey and Bello [43]. They also note the presence of several songs in the dataset of questionable relevance to ACR, as the music itself is not well-explained by chord annotations. However, these are kept in for consistency with other works as this dataset is often used in the literature. Some works decide to use the median as opposed to the mean accuracy in their evaluations in order to counteract the effect of such songs on performance [16]. We think that this is unnecessary as the effect of these songs is likely to be small and we do not wish to inflate our results inadvertently. Further evidence for the use of the mean is given in Section 3.2.

I first convert the audio to the constant-Q transform (CQT) representation introduced in Section 2.1.3. CQTs are common in ACR and are used as the starting point for this work.

The CQT was computed using librosa, using the built-in cqt function. A sampling rate of 44100Hz was used, with a hop size of 4096, 36 bins per octave, 6 octaves and a fundamental frequency corresponding to the note C1. These parameters were chosen to be consistent with previous works [16] and with standard distribution formats. The CQT is returned as a complex-valued matrix containing phase, frequency and amplitude information. Phase information was discarded by taking the absolute value before being converted from amplitude to decibels (dB), equivalent to taking the logarithm.

The CQT matrix of a song has size 216 × F, where 216 is the number of frequency bins and F is the number of frames in the song. The number of frames can be calculated as F = ⌈(44100/4096) · L⌉, where L is the length of the song in seconds, 44100 is the sampling rate in Hertz (Hz) and 4096 is the hop length in samples. A 3-minute song has just under 2000 frames. To save on computational cost, the CQT was pre-computed into a cached dataset rather than re-computing each CQT on every run.
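A sketch of this preprocessing step using librosa's built-in functions (the file paths are illustrative):

```python
import numpy as np
import librosa

SR, HOP = 44100, 4096

def compute_cqt(path):
    """Return the (216, F) dB-scaled CQT described above."""
    y, _ = librosa.load(path, sr=SR)
    cqt = librosa.cqt(y, sr=SR, hop_length=HOP,
                      fmin=librosa.note_to_hz("C1"),
                      n_bins=6 * 36, bins_per_octave=36)
    # Discard phase by taking the magnitude, then convert to decibels.
    return librosa.amplitude_to_db(np.abs(cqt))

# A 3-minute song: F = ceil(180 * 44100 / 4096) = 1938 frames.
np.save("cqt_cache/some_song.npy", compute_cqt("audio/some_song.mp3"))
```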

The chord annotation of a song is represented as a sorted dictionary, where each entry contains the chord label, the start time and the duration. The chord label is represented as a string in Harte notation [1]. For example, C major 7 is C:maj7 and A half diminished 7th in its second inversion is A:hdim7/5. The notation also includes N, a no chord symbol, and X, an unknown chord symbol.

Such a chord vocabulary is too flexible to be used directly as a target for a machine learning classifier trained on limited data. It would contain thousands of classes, many of which would appear only once. Instead, I define a restricted chord vocabulary. This contains 14 qualities: major, minor, diminished, augmented, minor 6, major 6, minor 7, minor-major 7, major 7, dominant 7, diminished 7, half-diminished 7, suspended 2, and suspended 4. N denotes no chord playing and chords outside the vocabulary are mapped to X, a dedicated unknown symbol. Letting C denote the size of the chord vocabulary, C = 12 × 14 + 2 = 170. This vocabulary is consistent with much of the literature [16,43,32]. Jiang et al. [32] use a more detailed vocabulary by including inversions, but I decide to remain consistent with other previous works. As McFee and Bello [16] note, C = 170 is sufficient for the dataset to exhibit a significant imbalance in the chord distribution, and their methodology is easily extensible to larger vocabularies. If performance is not yet satisfactory with C = 170, it is unlikely to improve with a larger vocabulary.
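A minimal sketch of how such a vocabulary can be enumerated (the Harte-style quality strings are my assumed spellings; the project's exact mapping is detailed in Appendix A.7):

```python
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
QUALITIES = ["maj", "min", "dim", "aug", "min6", "maj6", "min7", "minmaj7",
             "maj7", "7", "dim7", "hdim7", "sus2", "sus4"]

# 12 roots x 14 qualities, plus the no-chord (N) and unknown (X) symbols.
VOCAB = [f"{root}:{quality}" for root in ROOTS for quality in QUALITIES] + ["N", "X"]
assert len(VOCAB) == 12 * 14 + 2 == 170

CHORD_TO_ID = {chord: i for i, chord in enumerate(VOCAB)}
```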

Both training labels and evaluation labels are converted to this vocabulary. If the evaluation labels were kept in the original Harte notation, the model would be unable to identify them. The method for converting from Harte notation to a symbol in the chord vocabulary is similar to that used by McFee and Bello [16] and is detailed in Appendix A.7.

A simpler chord vocabulary is also sometimes used. It contains only the major and minor quality for each root, and the N and X symbols. For example, C:maj7 is mapped to C:maj while A:hdim7/5 is mapped to X. For this vocabulary, C = 26. I did some preliminary tests with this vocabulary but quickly found that model performance was similar over the two vocabularies. Results and analysis can be found in Appendix A.6. Additionally, the majmin evaluation metric compares chords over this smaller vocabulary and is mentioned in Section 3.2. The smaller vocabulary is not used in the rest of this work as there seems to be no advantage over the larger vocabulary.

Frames are allocated a chord symbol based on which chord is playing in the middle of the frame. While this may not be a perfect solution, frames are ≈ 93ms long, which is shorter than the minimum duration of a chord in the dataset. This guarantees that the chord label for every frame plays for at least half the frame. Furthermore, only 4.4% of frames include a chord transition.
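A sketch of this mid-frame assignment rule, assuming the annotation is available as a time-sorted list of (start, end, label) tuples:

```python
import bisect

SR, HOP = 44100, 4096  # one frame is HOP / SR, roughly 93 ms

def frame_labels(annotation, n_frames):
    """Label each frame with the chord playing at the frame's midpoint."""
    starts = [start for start, _, _ in annotation]
    labels = []
    for i in range(n_frames):
        mid = (i + 0.5) * HOP / SR
        j = bisect.bisect_right(starts, mid) - 1  # last chord starting before the midpoint
        labels.append(annotation[j][2] if j >= 0 else "N")
    return labels
```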

Much of the recent literature has focused on the long tail of the chord distribution, using various methods to address the issue. It is first helpful to understand the distribution of chords in the dataset, illustrated in Figure 3.2. The distribution is broken down both by root and by quality, using the chord vocabulary with C = 170. The plots show that the distribution over qualities is highly skewed, with major and minor chords making up the majority of the dataset and qualities like minor-major 7 and diminished 7 being two or three orders of magnitude less common. The distribution over roots is far less skewed, although there is a preference for chords with common roots like A, C and D.

As is standard for ACR, I use weighted chord symbol recall (WCSR) to evaluate chord classifiers. Simply put, WCSR measures the fraction of time that a classifier’s predictions are correct. I include a formal definition in Appendix A.2. Correctness can be measured in a variety of ways, such as root, third and seventh, which compare along roots, thirds, or sevenths respectively. I also use the mirex score, where a prediction is correct if it shares at least three notes with the label. This allows for errors like mistaking C:7 for C:maj. Finally, I use acc, or simply accuracy, to denote the overall accuracy where symbols must match exactly.

Other measures of correctness are sometimes used. These include majmin, a measure of correctness over only major and minor qualities. I use this measure only to substantiate the use of the larger vocabulary in Appendix A.6. Measures of correctness over triads and tetrads are also sometimes used, but these are highly correlated with third and seventh, respectively. This correlation is to be expected as the third and seventh are strong indicators of the triad and tetrad of the chord. This was verified empirically on preliminary experiments which are omitted.

All metrics are implemented in the mir_eval library [50], which also provides utilities for calculating WCSR from frame-wise chord predictions. The mean WCSR is computed over all songs in the evaluation set. Some other works report the median; empirically, I found the median to be ≈ 2% greater than the mean. This may be due to those songs identified as being unsuitable for chordal analysis by Humphrey and Bello [43]. I report only the mean throughout this work, as it is more commonly used in recent literature and because it is important for a metric to detect if the model performs poorly over certain genres.
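A sketch of how a single song's WCSR can be computed with mir_eval; this mirrors the steps inside mir_eval.chord.evaluate, and the helper name is my own:

```python
import mir_eval

def wcsr(ref_intervals, ref_labels, est_intervals, est_labels, metric="sevenths"):
    """Weighted chord symbol recall for one song under a given comparison rule.

    Intervals are (n, 2) arrays of start/end times, e.g. from
    mir_eval.io.load_labeled_intervals; labels are Harte-notation strings.
    """
    # Pad/trim the estimate so it spans the same time range as the reference.
    est_intervals, est_labels = mir_eval.util.adjust_intervals(
        est_intervals, est_labels, ref_intervals.min(), ref_intervals.max(),
        mir_eval.chord.NO_CHORD, mir_eval.chord.NO_CHORD)
    intervals, ref_labels, est_labels = mir_eval.util.merge_labeled_intervals(
        ref_intervals, ref_labels, est_intervals, est_labels)
    durations = mir_eval.util.intervals_to_durations(intervals)
    compare = getattr(mir_eval.chord, metric)  # e.g. root, thirds, sevenths, mirex
    return mir_eval.chord.weighted_accuracy(compare(ref_labels, est_labels), durations)
```

Averaging this quantity over the evaluation set gives the mean WCSR reported throughout.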

For some experiments, two further metrics are calculated: the mean and median class-wise accuracies, called acc_class and median_class respectively. acc_class has previously been defined in terms of discrete frames by Jiang et al. [32]. I redefine acc_class here in terms of WCSR and introduce median_class. The definitions can be found in Equations 3.1.

C denotes the number of chord classes. WCSR(c) is the WCSR considering only time when chord c is playing. A formal definition of WCSR(c) is also included in Appendix A.2.
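Equations 3.1 are not reproduced above; a reconstruction consistent with these definitions is:

```latex
\operatorname{acc}_{\text{class}} = \frac{1}{C} \sum_{c=1}^{C} \operatorname{WCSR}(c),
\qquad
\operatorname{median}_{\text{class}} = \operatorname*{median}_{c \in \{1, \dots, C\}} \operatorname{WCSR}(c)
\tag{3.1}
```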

These metrics are intended to measure the model’s performance on the long tail of the chord distribution. Measuring both the mean and median is informative as it provides a sense of the skew in performance over classes. While the metric can be defined for any measure of correctness, I report only the acc as I found it to be the most informative. For example, the mean class-wise root score is harder to interpret.

The justification for redefining acc_class this way is that metrics calculated over discrete frames are not comparable across different frame lengths and depend on the method of allocating chords to frames. Instead, continuous measures evaluate models based on the percentage of time that they are correct, which more closely reflects what we truly desire from the model. To illustrate this, imagine a very long frame length: the model could have perfect scores on these frames but be making terrible predictions for much of the song. Through preliminary experiments, it became clear that there are negligible differences in rankings between models under discrete and continuous measures for sufficiently small hop lengths. Nevertheless, I propose that the field of ACR adopt a continuous measure of class-wise accuracy.

I do not also compute quality-wise accuracies as computed by Rowe and Tzanetakis [33]. Quality-wise metrics only ensure that each root is equally weighted. As roots are fairly balanced, this would not add much information. I therefore do not evaluate using quality-wise metrics.

For most experiments, the metrics on the validation set are used to compare performance. The test set is used only to compare the final accuracies of select models in Section 5.7.

Other evaluation tools are used such as confusion matrices and the number of chord transitions per song. Note that confusion matrices are calculated using discrete frames for ease of computation. In an ideal setting, these would also be calculated using continuous measures. Given the small differences between the two for short frame lengths, I decided it was not worth the additional engineering effort and computational cost.


In this chapter, I implement a convolutional recurrent neural network (CRNN) from the literature, train it on the pop dataset and compare it to two baselines. I then conduct a thorough analysis of the behaviour and failure modes of the CRNN and provide motivation for improvements.

I implement a convolutional recurrent neural network (CRNN) as described in McFee and Bello [16], referred to as the CRNN. It remains competitive with state-of-the-art, is often used as a comparative baseline and is fast and easy to train.

The model receives input of size B × F, where B = 216 is the number of bins in the CQT and F is the number of frames in the song. The input is passed through a layer of batch normalisation [53] before being fed through two convolutional layers with a rectified linear unit (ReLU) after each one. The first convolutional layer has a 5 × 5 kernel and outputs one channel. It is intended to smooth out noise and spread information across adjacent frames. The second layer has a kernel of size 1 × I and outputs 36 channels, intended to collapse the information over all frequencies. The output is passed through a bi-directional gated recurrent unit (GRU) [54] with hidden size initially set to 256, and a final fully connected layer with softmax activation. This produces a vector of length C for each frame. The chord with the maximum probability is taken as the model's prediction for each frame.
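A sketch of this architecture in PyTorch. This is my reading of the description above rather than the reference implementation; the frequency-collapsing kernel is written as spanning all 216 bins, and the softmax is folded into the cross-entropy loss at training time:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Batch norm -> 5x5 conv -> full-frequency conv -> BiGRU -> per-frame chord logits."""

    def __init__(self, n_bins=216, n_chords=170, hidden=256):
        super().__init__()
        self.bn = nn.BatchNorm2d(1)
        # 5x5 kernel, one output channel: smooths noise and spreads information
        # across adjacent frames and bins.
        self.conv1 = nn.Conv2d(1, 1, kernel_size=5, padding=2)
        # Kernel spanning the whole frequency axis, 36 output channels:
        # collapses the information over all frequencies.
        self.conv2 = nn.Conv2d(1, 36, kernel_size=(n_bins, 1))
        self.gru = nn.GRU(36, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_chords)

    def forward(self, cqt):               # cqt: (batch, n_bins, n_frames)
        x = self.bn(cqt.unsqueeze(1))     # (batch, 1, n_bins, n_frames)
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))     # (batch, 36, 1, n_frames)
        x = x.squeeze(2).transpose(1, 2)  # (batch, n_frames, 36)
        x, _ = self.gru(x)                # (batch, n_frames, 2 * hidden)
        return self.fc(x)                 # logits over the chord vocabulary

# Example: CRNN()(torch.randn(2, 216, 500)) has shape (2, 500, 170).
```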

The authors of the model propose using a second GRU as a decoder before the final fully connected layer, called ‘CR2’. In brief empirical tests, the results with and without ‘CR2’ were very similar. Therefore, I do not include this in the model. Results are left to Appendix A.8 as they are neither relevant nor interesting.

To ensure that the training hyperparameters are set to reasonable values, I conduct a grid search over learning rates and learning rate schedulers. This is followed by a random search over model hyperparameters. Results from Reddi et al. [55] suggest that stochastic gradient descent (SGD) can find better minima with a stable learning rate over many epochs. To test this, I trained a CRNN over 2000 epochs with a learning rate of 0.001, the cosine scheduler and momentum set to 0.9. While the model did converge, it did not perform any better than the models trained with Adam. Results are left to Appendix A.9 for lack of interest.

With this learning rate and learning rate scheduler fixed, I perform a random search over the number of layers in the GRU, the hidden size of the layers in the GRU, the training patch segment length, the number of convolutional layers prior to the GRU, the kernel size of these layers and the number of channels outputted by each of these layers. The search is performed by independently and uniformly at random sampling 50 points over discrete sets of possible hyperparameter values. These sets can be found in Appendix A.10.

I consider two models as baselines. First, I train a single-layer neural network with softmax activation, which treats each frame of each song independently. The layer receives an input of size B = 216 and outputs a C = 170-dimensional vector for each frame. This model is called logistic as it can be viewed as a logistic regression model trained using SGD. I could have used the logistic regression implementation in sklearn, but the neural network version was fast and easy to implement and unlikely to yield significantly different results. Saving the model only when the validation loss improves also provides a regularisation effect.

Secondly, I train a convolutional neural network (CNN). The number of convolutional layers, kernel size and number of channels are left as hyperparameters. The convolutional layers operate on the CQT similarly to how a convolution operates on an image. A ReLU is placed between each layer. These are followed by a 36-channel 1 × I convolutional layer and a fully connected layer as in the CRNN.

I test models of increasing depths, kernel sizes and channels. In general, the deeper models perform better. Two of these models serve as baselines in reported results. The first model has a single layer and channel and a kernel size of 5. It serves as an ablation on the GRU part of the CRNN. This configuration is referred to as CNN1. A second model with 5 layers of kernel size 9, each with 10 channels, is referred to as CNN5.

I perform a grid search over learning rates and schedulers for these baselines to ensure that convergence is reached. Convergence results are not meaningfully different than those obtained with the CRNN and are hence omitted. I use the best-performing results in each case. This was with a learning rate of 0.001 for both models and with schedulers of plateau and cosine for logistic and CNN1/CNN5 respectively.

Table 4.2 shows the results of the CRNN compared with the baseline models. The CRNN performs the best out of these models. The GRU layer improves accuracy by 5.2%. However, similar performance increases can be achieved by adding convolutional layers as in CNN5 as opposed to an RNN. Combined with the lack of performance improvement from increasing the audio patch length observed in Section 4.1.1.2, there is strong evidence that the model does not share information across time very far.

We also observe diminishing performance increases with increased model complexity.

Performance begins to level out with accuracies of roughly 60%. Indeed, the best models trained by Park et al. [42] and by Akram et al. [56] never achieve an accuracy of more than 66%. Humphrey and Bello [43] refer to this as the ‘glass ceiling’ which the field of ACR is still struggling to break through. The problem posed by ACR remains far from solved.

While quantitative metrics summarise how well a model performs over songs, they do not tell us much about the predictions the model makes and where it goes wrong. In this section, I seek to understand the behaviour of the model by answering a series of questions.

How does the model deal with imbalanced chord distribution? The class-wise metrics in Table 4.2 give strong indication that the performance is poor. I use a confusion matrix over qualities of chords to provide more granular detail.

The confusion matrix is illustrated in Figure 4.1. The model struggles with rarer chords.

On the rarest quality of majorminor7, the model has a recall of 0. Recall is 0.86 on the major chord but the model consistently predicts major for similar chord qualities like major7 and sus4. A similar effect is observed with minor chords and qualities like minor7. The model frequently confuses diminished 7 chords for diminished chords. This explains the median class-wise accuracies of nearly 0 for all models. I also produce a confusion matrix over roots. This is left to Appendix A.11 as it is less insightful. The model performs similarly over all roots with a recall of between 0.73 and 0.81. This is not surprising as the distribution over roots is relatively uniform. Recall on the no chord symbol N is 0.73. Many of the N chords are at the beginning and end of the piece. The model may struggle with understanding when the music begins and ends. An example of the model erroneously predicting that chords are playing part-way through a song is discussed in Section 4.4.6.

Performance is worse on the unknown chord symbol with a recall of 0.18. The low performance on X is to be expected. It is a highly ambiguous class with many completely different sounds mapped to it. All of the chords mapped to X will share many notes with at least one class in the known portion of the vocabulary. It is therefore unreasonable to expect the model to be able to predict this class well. This supports the case for ignoring this class during evaluation as is standard in the literature.

Are predictions worse on frames where the chord changes? Such transition frames are present because frames are calculated based on hop length irrespective of the tempo and time signature of the song. Thus, some frames will contain a chord transition.

To test this, I compute accuracies for transition and non-transition frames separately.

The model achieves only 37% accuracy on transition frames compared with 61% on non-transition frames. Therefore, the model is certainly worse at predicting chords on transition frames. Nonetheless, the CRNN achieves an overall accuracy of 60%. This is because only 4.4% of frames are transition frames with a hop length of 4096. Improving performance on these frames to the level of non-transition frames would increase the overall frame-wise accuracy by at most about 1% (4.4% of frames multiplied by a 24 percentage-point gain is roughly 1 percentage point).

Through qualitative evaluation discussed in Section 4.4.6, the model was found to struggle with identifying the boundary of a chord change on some songs. This would not be captured by the above metrics if the boundary is ambiguous enough to span multiple frames. Thus, there may be a larger impact in accuracy than a single frame. Furthermore, the ambiguity of chord transition timing will vary over songs. For some songs, this may be the main limiting factor in performance.

Are the model's outputs smooth? There are over 10 frames per second. If the model outputs rapid fluctuations in chord probability, it will over-predict chord transitions. I use two crude measures of smoothness to answer this question.

Firstly, I look at the number and length of incorrect regions. Such a region is defined as a sequence of incorrectly predicted frames with the same prediction. 26.7% of all incorrect regions are one frame wide and 3.7% of incorrect frames have different predictions on either side. This suggests that at least 3.7% of errors are caused by rapidly changing chord predictions. A histogram over incorrect region lengths can be found in Appendix A.12. This plot shows that the distribution of lengths of incorrect regions is long-tailed, with the vast majority very short.

Secondly, I compare the mean number of chord transitions per song predicted by the model with the true number of transitions per song in the validation set. The model predicts 168 transitions per song while the true number is 104. This is convincing evidence that smoothing the outputs of the model could help.

With these two observations combined, I conclude that further work to improve the smoothness of the model's outputs might help. Although we might hope to improve on at least 3.7% of errors, this would not improve overall accuracy very much. While rapid changes may be smoothed out, there is no guarantee that smoothing will result in correct predictions; indeed, it may even render some previously correct predictions erroneous. Nonetheless, the model predicts too many chord transitions. When the model is used by a musician or researcher, smoothed predictions would be valuable in making the chords more interpretable.

How much does the model rely on context? I hypothesise that the model is worse at predicting chords at the beginning and end of a patch of audio as it has less contextual information close to these frames.

To test this, I evaluate the model using the same fixed-length validation conducted during training as described in Section 3.3. Average frame-wise accuracies over the context are then calculated. A plot can be found in Appendix A.13. I use a segment length of 10 seconds corresponding to L = 107 frames. We observe that performance is worst at the beginning and end of the patch but not by much. Performance only dips by 0.05 at either extreme, perhaps because the model still does have significant context on one side. We can also see that performance starts decreasing 5 or 6 frames from either end, suggesting that this is the extent to which bidirectional context is helpful.

I conduct a further experiment measuring overall accuracy with increasing segment lengths used during evaluation. Results can be found in Appendix A.14. The plots show that accuracy increases by about 0.5 percentage points when the segment length is increased from 5 seconds to 60 seconds. Although this is not much of an increase, it confirms that it is better to evaluate over the entire song at once.

Does the model have consistent performance over different songs? The set of accuracies over songs of the CRNN has a standard deviation of 13.5. This suggests that performance is not stable over songs. To provide further insight, I plot a histogram of accuracies and mirex scores over the validation set in Figure 4.2. We observe that the model has mixed performance with accuracy, with 15% of songs scoring below 40%.

When we use the more generous mirex metric, there are very few songs below 40% and only 7% are below 0.6. This large discrepancy between accuracy and mirex suggests that many of the mistakes that the model makes are small. These mistakes are a good guess in the sense that the prediction may have omitted a seventh or mistaken a major 7 for its relative minor. Examples of such mistakes are discussed in Section 4.4.6.

I conclude that many of the model's predictions are reasonable but often lack the detail contained in good annotations, such as correct upper extensions. Whether these reasonable guesses are correct can vary widely over songs. Accuracies are mixed, with 15% of songs below 40% and 69% between 40% and 80%. However, with the more generous mirex metric, we find that there are almost no songs with a score below 40% and only 7% below 60%. This suggests that many of the mistakes the model makes are small, like predicting C:maj instead of C:maj7. The very low outliers in the mirex score were found to be the songs with incorrect annotations identified in Section 3.1.1.

Now, let us inspect predictions for a few songs to see how the model performs. In Mr. Moonlight, there are few differences between the accuracy and mirex score. There are regular, repeated errors, many of which mistake F:sus2 for F:maj. This is an understandable mistake to make, especially after hearing the song and looking at the annotation, where the main guitar riff rapidly alternates between F:maj and F:sus2. The confusion matrix in Figure 4.1 suggests this mistake is fairly common on qualities like sus2 which are similar to maj.

In Ain't No Sunshine, the mirex score is significantly higher than the accuracy. This is because the majority of the mistakes the model makes are missing a seventh. For example, the model predicts A:min for the true label of A:min7, or G:maj for G:7.

Other mistakes that mirex allows for include confusing the relative minor or major such as predicting E:min7 when the chord is G:maj. All of these mistakes occur frequently in this song. The mean difference between the accuracy and mirex is 18.7%, with one song reaching a difference of over 70%. Hence, we can attribute many of the model’s mistakes to such behaviour. ‘Ain’t no Sunshine’ also contains a long incorrect section in the middle. This is a section with only voice and drums, which the annotation interprets as N symbols, but the model continues to predict harmonic content. The model guesses A:min throughout this section. This is a sensible label as when this melody is sung elsewhere in the song, it is labelled as A:min7.

In the next two songs, Brandy and September, the model's mistakes are less interpretable. While performance is acceptable on Brandy, with a mirex score of 75.6%, the model struggles with the boundaries of chord changes, resulting in sporadic short incorrect regions. In September by Earth, Wind and Fire, the model struggles with the boundaries of chord changes and also sometimes predicts completely wrong chords which are harder to explain. Listening to the song and inspecting the annotation makes it apparent that this is a difficult song for even a human to annotate well, and similarly, the model does not fare well.

Below, I summarise the main takeaways from this section and motivate further improvements to the model.

Performance on rare chord classes is poor. There are few instances of chord classes with complex qualities and upper extensions. The model ends up predicting major and minor classes for these rare chords. There are many methods of addressing an imbalanced distribution in machine learning. The simplest is to add a weighting to the loss function which I explore in Section 5.2.1. I also look at a ‘structured’ loss function which exploits similarity between chords in Section 5.2.2. Performance might also be improved through better data. I explore the use of data augmentation in Section 5.4 and synthetic data generation in Section 5.5.

Predictions are not smooth. While it is unclear whether or not smoothness will improve performance, a good chord recognition model’s predictions would be smooth. Musicians do not expect chords to change every 93ms. This motivates the exploration of a ‘decoding’ step in Section 5.1.

The model does not use long-range context. CNN1 only shares context a maximum of 5 frames either side as this is its kernel size. It achieves an accuracy of 54.5%, just 5% less than the CRNN. This suggests that most of the performance gain associated with including contextual information is neither complex nor far-reaching. I conclude that while a little context improves performance, the CRNN does not use context in a complex manner.

The model is simple. The analysis of feature maps by Korzeniowski and Widmer [45] corroborates this idea. Their analysis suggests that the model detects the presence of individual notes and decides which chord is present based on these notes. This is why more parameters do not help. Unfortunately, this results in many similar chords being confused: the root note can be wrong, similar qualities are often mistaken and predictions often miss upper extensions. This offers an explanation for the large discrepancy between accuracy and mirex score, with average values of 60% and 79% over the validation set, respectively. This problem is exacerbated by the imbalance in the dataset discouraging the model from being sensitive to indicators of rare chord classes.

Performance is song-dependent. Accuracies over songs vary widely. mirex scores are more consistent but still vary. A detailed analysis of the properties of songs with poorer performance would be valuable work. I will not explore this further here beyond further qualitative analysis.

The model struggles a little on transition frames. Solely improving performance on such frames is unlikely to improve metrics by much.

A much better reason to segment chords is to give the output of the model a far more interpretable meaning. The frame-wise correctness plots illustrated in Figure 4.3 are not musically interpretable. Even if chord symbols were added, this would not constitute good musical notation. Musicians do not operate over 93ms frames; they think of music as existing in beat space. Poltronieri et al. [57] explore the related task of finding chord boundaries in audio given the chord sequence. Instead, I take inspiration from Donahue and Liang [20] and use a beat detection model to task the model with predicting chords over beats rather than frames in Section 5.6.

In this chapter, I use the insights from Chapter 4 to improve the CRNN and address questions raised by the literature. I perform a series of experiments to test improvements to the model, evaluate a selection of models on the test set and perform a qualitative analysis of the model’s outputs.

Many of these experiments introduce new hyperparameters. I choose these hyperparameters in a greedy fashion and keep them as specified unless stated otherwise.

While the assumption of independence between hyperparameters is undoubtedly wrong, performing a full hyperparameter search is computationally infeasible.

I first conduct experiments verifying that CQTs are the best features for ACR and that a hop length of 4096 is appropriate. These experiments are detailed in Appendix A.5. To summarise, CQTs achieve 10% greater accuracy than other spectrogram variants and any hop length less than 4096 achieves similar results. Thus, I proceed with CQTs and a hop length of 4096.

As observed in Section 4.4.3, the CRNN predicts 168 transitions per song as opposed to the 104 seen in the ground truth data. I implement a decoding step over the frame-wise probability vectors to smooth predicted labels. Common choices for decoding models include a conditional random field (CRF) [32,42] and a hidden Markov model (HMM) [21].

I first implement an HMM. The HMM treats the frame-wise probabilities as emission probabilities and the chord labels as hidden states. O'Hanlon and Sandler [58] note that a transition matrix with homogeneous off-diagonal entries performs similarly to a learned transition matrix. I adopt such a transition matrix for this HMM, with a parameter β denoting the probability of self-transition and all other transition probabilities equal to (1 − β)/(C − 1). Decoding then follows the Viterbi algorithm [59] over the summed forward and backward pass.
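A sketch of this smoothing step as a plain Viterbi pass with the homogeneous transition matrix (the forward-backward variant mentioned above is omitted; `log_probs` is the frame-wise log-probability matrix output by the network):

```python
import numpy as np

def viterbi_smooth(log_probs, beta=0.15):
    """Viterbi decoding with a homogeneous transition matrix.

    log_probs: (n_frames, C) array of frame-wise log-probabilities,
               treated as emission scores.
    beta:      self-transition probability; every other transition
               shares probability (1 - beta) / (C - 1).
    Assumes beta > 1/C, so staying is never cheaper via the 'switch' branch.
    """
    n_frames, C = log_probs.shape
    log_self = np.log(beta)
    log_switch = np.log((1.0 - beta) / (C - 1))
    delta = log_probs[0].copy()                # best score ending in each chord
    back = np.zeros((n_frames, C), dtype=int)  # backpointers
    for t in range(1, n_frames):
        stay = delta + log_self                # remain on the same chord
        switch = delta.max() + log_switch      # come from the best other chord
        back[t] = np.where(stay >= switch, np.arange(C), delta.argmax())
        delta = np.maximum(stay, switch) + log_probs[t]
    path = np.empty(n_frames, dtype=int)
    path[-1] = delta.argmax()
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path                                # smoothed chord index per frame
```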

A plot of the effect of β on the model's performance and the number of transitions per song is shown in Figure 5.1. From this plot, we conclude that smoothing has little effect on the performance of the model while successfully reducing the number of transitions per song to that of the true labels. I choose β = 0.15 for the remainder of the experiments as it results in 102 transitions per song while maintaining high performance. The effect of the HMM on the incorrect regions previously discussed in Section 4.4.3 can be found in Appendix A.12. The HMM reduces the percentage of incorrect regions which are a single frame long from 26.7% to 16.7%. A more intuitive way to see the effect of the HMM is to look at a section of a song for which the model previously predicted many chord transitions. This is illustrated in Appendix A.15.

I also implement a linear chain CRF using pytorch-crf. 1 In contrast to the HMM, the CRF uses a learned transition matrix. Results comparing the HMM, CRF and no smoothing can be found in Table 5.1. Both the CRF and HMM reduce the number of transitions per song to a similar level. The HMM outperforms the CRF with 3.5% greater accuracy. The HMM has almost identical performance to the model with no smoothing. I hypothesise that the learned transition matrix allows the model to overfit to the chord sequences in the training set. Regardless of the explanation, I proceed with HMM smoothing.

One of the most significant issues with the CRNN is the low recall on rarer chord qualities. Two common ways of addressing such class imbalance are weighting the loss function and over-sampling. Rowe and Tzanetakis [33] also explore the use of curriculum learning as a form of re-sampling, which I do not explore here because they report only minor performance gains. Sampling is explored by Miller et al. [21], but they use a different model based on pre-computing chroma vectors and re-sampling these chroma vectors for use in training a random forest for frame-wise decoding.

In our setting, re-sampling training patches of audio may be interesting to explore but is left as future work. It would require a complex sampling scheme as frames cannot be sampled independently.

Weighting has been explored by Jiang et al. [32]. We employ a similar but simpler implementation here. A standard method of weighting is to multiply the loss function by the inverse frequency of the class of the current training sample, with a parameter controlling the strength of the weighting. This is defined in Equation 5.1.

Where w_c is the weight for chord c, count(c) is the number of frames with chord c in the dataset and α is a hyperparameter controlling the strength of weighting. α = 0 results in no weighting and increasing α increases the severity of weighting. I add 10 in the denominator to avoid dividing by 0 and to diminish the dominating effect of chords with very few occurrences. I then define normalised weights w*_c in Equation 5.2 so that the learning rate can remain the same.

Where C is the set of all chords in the vocabulary. This keeps the expected weight over samples at 1 such that the effective learning rate remains the same. These values are calculated over the training set. I test values of α in the set {0, 0.05, 0.1, ..., 0.95, 1}. The plot in Figure 5.2 illustrates the effect of the weighting on the model's performance. I find that increasing α improves acc class but decreases root accuracy.
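As a sketch, the weights could be computed as follows. This assumes Equation 5.1 takes the form w_c = 1/(count(c) + 10)^α, reconstructed from the description above; the exact expressions in Equations 5.1 and 5.2 may differ in detail.

import numpy as np

def chord_class_weights(counts: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Inverse-frequency class weights with strength alpha, normalised so that
    the expected weight over training frames equals 1.

    counts: number of training frames per chord class, shape (C,).
    """
    raw = 1.0 / (counts + 10.0) ** alpha      # alpha = 0 gives uniform weights
    freq = counts / counts.sum()              # empirical class frequencies
    expected = (freq * raw).sum()             # expected raw weight per training frame
    return raw / expected                     # normalised weights w*_c

The resulting vector can then be passed, for example, as the weight argument of a cross-entropy loss in PyTorch.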

Choosing α = 0.3 maximises acc class without hurting root accuracy, which I carry forward to subsequent experiments. For further insight, a plot of the differences between confusion matrices with and without weighted loss can be found in Appendix A.16. Notably, recall on most qualities increases, with recall on major7 doubling to 0.34. The weighted model predicts 2.2 times fewer X symbols, which may explain how it increases recall on these rarer qualities without sacrificing accuracy.

Weighting the loss function also slightly increases the number of transitions predicted per song. This may be because occasional sharp gradient updates cause more extreme probability outputs. I increase the HMM smoothing parameter β to 0.2 to bring the number of transitions per song to 104.

McFee and Bello [16] propose a structured loss function, which they claim improves performance of the CRNN model. They introduce additional targets for the root, bass and pitch classes. I follow a similar method but do not include the bass, as the current chord vocabulary does not consider inversions. The idea behind this loss term is to explicitly task the model with identifying the components of a chord we care about. This allows the model to exploit structure in the chord vocabulary, such as shared roots and pitch classes, rather than predicting all symbols independently.

The root can be any of the 12 notes in the Western chromatic scale, N or X, creating a 14-dimensional classification problem. The 12 pitch classes each represent a single binary classification problem. Two fully connected layers calculate a 14-dimensional vector and 12-dimensional vector from the hidden representation outputted from the GRU for the root and pitch classes, respectively. Finally, these representations are concatenated with the GRU representations and fed into the final fully connected layer to predict the chord symbol.

The mean cross-entropy loss is calculated in each case, and the root and pitch losses are summed to form the structured loss. The final loss is a convex combination of the original chord loss and the structured loss, as defined in Equation 5.3.

Where L is the overall loss, L_chord is the cross-entropy loss over chord symbols, L_root is the cross-entropy loss targeting the root and L_pitch is the mean binary cross-entropy over the pitch classes. γ is a hyperparameter controlling the weight placed on the original loss.
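A minimal sketch of such a structured head and loss in PyTorch is shown below. It assumes Equation 5.3 takes the form L = γ L_chord + (1 − γ)(L_root + L_pitch), reconstructed from the description above; layer names and the omission of padded-frame handling are illustrative assumptions, not the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredChordHead(nn.Module):
    """Predicts root and pitch classes as auxiliary targets alongside the chord symbol."""

    def __init__(self, hidden_size: int, num_chords: int = 170):
        super().__init__()
        self.root_fc = nn.Linear(hidden_size, 14)    # 12 roots + N + X
        self.pitch_fc = nn.Linear(hidden_size, 12)   # one logit per pitch class
        self.chord_fc = nn.Linear(hidden_size + 14 + 12, num_chords)

    def forward(self, h):                            # h: (batch, frames, hidden)
        root_logits = self.root_fc(h)
        pitch_logits = self.pitch_fc(h)
        combined = torch.cat([h, root_logits, pitch_logits], dim=-1)
        return self.chord_fc(combined), root_logits, pitch_logits

def structured_loss(chord_logits, root_logits, pitch_logits,
                    chord_targets, root_targets, pitch_targets, gamma=0.7):
    """Convex combination of the chord loss and the structured (root + pitch) loss."""
    l_chord = F.cross_entropy(chord_logits.flatten(0, 1), chord_targets.flatten())
    l_root = F.cross_entropy(root_logits.flatten(0, 1), root_targets.flatten())
    l_pitch = F.binary_cross_entropy_with_logits(pitch_logits, pitch_targets.float())
    return gamma * l_chord + (1.0 - gamma) * (l_root + l_pitch)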

I test models with γ ∈ {0, 0.1, ..., 0.9}. Choosing γ = 0.7 improves accuracy by 1.3% while mirex worsens by 0.3%. Accuracy with third increases by 1.7% and with seventh by 1.3%. Generally, greater γ improves accuracy metrics while mirex results are noisy. A plot of the trend can be found in Appendix A.10 but does not provide further insight. I keep γ = 0.7 from now on based on peak accuracy.

Donahue and Liang [20] use generative features extracted from Jukebox [8] to improve performance for melody transcription. They also produce a chord transcription model using the same methodology but do not report results. I test generative features using MusicGen [9] as a feature extractor, for several reasons.

MusicGen is a newer model with several sizes that can be tested against each other as an experiment on the complexity of the model. It has a fine-tuned variant called MusiConGen [38], which is used for synthetic data generation in Section 5.5, and all model weights are available on the HuggingFace Hub.

MusicGen outputs representations in four 'codebooks' per frame; I compare averaging these codebook representations against concatenating them. Surprisingly, the concatenated representation performs worse than the averaged representation, even though it contains at least as much information. However, if the information provided by each codebook is essentially the same, then there is no reason the concatenated representation should perform better, and training with Adam may simply find a worse minimum. Training on 8192-dimensional vectors is also computationally expensive.

To test whether or not these features help when compared with a CQT, I test with the CQT only, generative features only and a concatenation of the two. The results are shown in Table 5.2. Although the generative features perform worse than the CQT, they contain information useful for chord recognition with an accuracy of 58.7%. Performance remains largely the same when the CQT and generative features are used together. This experiment was run multiple times, with similar results each time. There is no clear evidence that the generative features provide any benefit over just using the CQT.

This conclusion is surprising, as Donahue and Liang [20] claim that generative features are better than hand-crafted features for the related task of melody recognition. However, they only compare to mel-spectrograms, which may not perform as well as CQTs, as they certainly do not for chord recognition. Observations here cast doubt on how well their claims generalise to chord recognition. Features extracted from other generative models such as Jukebox [8] or MusicLM [60] may perform better. The comparison is left for future work.

Given the lack of improvement and the drastically increased computational cost associated with extracting features and training the model, I do not proceed with training on generative features.

Pitch augmentation has been used in other work on chord recognition, either on the CQT [32] or on the audio [42,16]. Although similar, these are not identical transformations. Shifting the CQT takes place after phase information has been discarded and leaves empty bins behind, whereas audio pitch shifting maintains phase information and aims to preserve harmonic structure, though it can introduce artefacts of its own. I implement both methods and compare them.

When a sample is drawn from the training set, it is shifted with probability p. The shift is measured in semitones, drawn from the set {−5, −4, ..., −1, 1, ..., 6} with equal probability of each shift. This results in 12 times as much training data. Convergence is still reached in 150 epochs. Shifting the CQT matrix is done by moving all bins up or down by the number of bins corresponding to the number of semitones in the shift; the bins left behind are filled with a value of −80dB. Audio shifting is done with pyrubberband.3 CQTs are then calculated on the shifted audio. A plot of the effect of the shift probability p on the model's performance can be found in Figure 5.3.
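A minimal sketch of the CQT shifting operation is given below, assuming a CQT in decibels with shape (bins, frames); the number of bins per semitone depends on the bins-per-octave setting used to compute the CQT and is left as a parameter here.

import numpy as np

def shift_cqt(cqt_db: np.ndarray, semitones: int, bins_per_semitone: int = 1,
              fill_value: float = -80.0) -> np.ndarray:
    """Pitch-shift a CQT (in dB, shape (n_bins, n_frames)) by whole semitones.

    Content is rolled towards higher bins for positive shifts; bins left behind
    at the edge are filled with fill_value (treated as silence).
    """
    n_bins = cqt_db.shape[0]
    shift = semitones * bins_per_semitone
    out = np.full_like(cqt_db, fill_value)
    if shift > 0:
        out[shift:] = cqt_db[:n_bins - shift]
    elif shift < 0:
        out[:n_bins + shift] = cqt_db[-shift:]
    else:
        out = cqt_db.copy()
    return out

The chord labels are transposed by the same number of semitones so that the annotations still match the shifted features.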

Results show a clear trend that increasing p improves performance. Shifting the audio has a very similar effect to simply shifting the CQT. Choosing p = 0.9 results in a 2.1% increase in accuracy. The mirex score breaks the trend, with performance varying over different values of p. acc class also improves by more than 2%. This can be explained by the model becoming root-invariant: with p = 0.9, all potential roots become close to equally likely. I proceed with pitch shifting on the CQT with p = 0.9 for the remainder of the experiments as it is computationally cheaper than shifting the audio.

McFee and Bello [16] claim an increase of 5% on the median across most metrics. I do not find such a large effect here. Nonetheless, pitch shifting is a useful augmentation.

Note that the weights for the weighted loss are calculated based on expected counts, taking into account the shift probability p. I also test shifting on both the CQT and audio, but the results are no different from shifting with either method alone. Unfortunately, it was not computationally feasible to test pitch shifting with generative features, as feature extraction over 12 × 1210 = 14,520 songs is too expensive.

Given the success of pitch augmentation for ACR, it is sensible to look for other sources of data. Further augmentation is possible by adding noise and time-stretching. However, these do not provide new harmonic structures or create new instances of rare chords, so I do not explore them here. Instead, I look to generate new data, taking into account our understanding of harmonic structure. Generation would be possible through automatic arrangement, production and synthesis software. However, this is a complex task, requires a lot of human input and is unlikely to produce sufficient variety of timbres, instrumentation and arrangement. Instead, I use a recent chord-conditioned generative model called MusiConGen [38]. I use this over CocoMulla [61] as its method of chord-conditioning has a more straightforward interface, does not require reference audio and the authors claim it adheres more closely to its conditions. Indeed, they feed the outputted audio through BTC [42] and find a triads score of 71% using the chord conditions as the ground truth. While far from perfect, this suggests the model can generate audio that mostly adheres to the chord conditions.

I generate 1210 songs, each 30 seconds long, to mimic the size of the pop dataset. I refer to this dataset as synth. It is split into train, validation and test splits in the same fashion as the pop dataset. Generating an order of magnitude more songs would require considerably more compute. While the model supports auto-regressive generation of longer audio, its outputs become incoherent using the provided generation functions. It also sometimes produces incoherent output even for 30-second generations, but this is much less common.

To generate a song, I sample a BPM from a normal distribution with mean 117 and standard deviation 27, clipped to lie in the range [60, 220]. These values were calculated from the training set. I then sample a song description from a set of 20 generated by ChatGPT. The descriptions outline a genre, mood and instrumentation. Descriptions cover only jazz, funk, pop and rock, which were all part of the fine-tuning training set for MusiConGen. The model does not output melodic vocals, owing to the lack of vocal music in the pre-training and fine-tuning data. Finally, I generate a jazz chord progression using the theory of functional harmony, following Steedman [62]. Details of this generation process can be found in Appendix A.23. The process generates a very different chord distribution from the one in pop, with many more instances of upper extensions and rare qualities. This is intended to provide the model with many more examples of rare chords.

To offset this distribution shift, I calibrate the probabilities outputted by the model. To encourage root-invariance, calibration terms are averaged over roots for each quality. The details of calculating calibration terms are described in Appendix A.24 with a figure showing the calibration terms for each quality. This figure shows that the rarer qualities are much more common in the synthetic data.

I manually inspected outputs from MusiConGen. In general, the outputs are good. They consistently stick to the provided BPM and usually stick to the chord conditions. Outputs are occasionally musically strange, with jarring drum transients and unrealistic chord transitions, and they do not have an enormous variety in timbre and instrumentation. Nonetheless, most examples have sensible annotations that one would expect a human to be able to annotate well.

I compare a model trained on only pop, a model trained on only synthetic data and a model trained on both. While the latter results in more training data per epoch, convergence is always reached, so these are fair comparisons. I test the models on the pop and synthetic data validation splits. Given the increased instances of rare chords in the synthetic data, I remove the weighting on the loss function for the models trained on synthetic data. The model trained only on pop achieves an accuracy of just 24.2% on synth, much lower than the chord adherence reported by Lan et al. [38]. However, this may be due to the unrealistic distribution of chords in the generated sequences.

For further insight, the difference in confusion matrices over qualities is plotted in Appendix A.25. There are few clear trends. Recall increases for some rare qualities and decreases for others. Notably, recall on min7 improves by 13% and the model is much better at predicting the third in dominant and minor 7 chords. This is likely why the third and seventh accuracies increase slightly on the pop validation set.

While performance does not meaningfully improve on pop, the lack of overfitting and the gain on the synthetic data provide hope that synthetic data could prove a useful tool for ACR with further work and improved generative models. Furthermore, producing hand-annotated data is error-prone and subject to human interpretation. Synthetic data may produce non-identifiable data points but is consistent and error-free. Poor performance on synthetic data can only be explained by failures of the model or indeterminacy of the generated samples, not incorrect labels. Regardless, given the lack of performance increase, I do not continue to train on synthetic data.

Chords exist in time. Musicians interpret chords in songs as lasting for a certain number of beats, not a fixed length in time. In its current form, the model outputs frame-wise predictions. While these could be stitched together to produce a predicted chord progression, or beat-wise predictions could be made as a post-processing step, I decide to implement a model that outputs beat-wise predictions directly. This allows the model to use information from the entire duration of the beat to make its prediction.

Following the methodology of Donahue and Liang [20], I first detect beats using madmom [63]. This returns a list of time steps where beats have been detected. I first verify that these beats are plausible. I perform a cross-correlation analysis with the chord transitions, similarly to Section 3.1.1. A histogram of maximum lags within a window of 0.3 seconds can be found in Appendix A.26. Almost all maximum lags occur within a window of 0.1 seconds. To provide further evidence, I compute the maximum accuracy a model could attain if predicting chords at the beat level. This is done by iterating over each beat interval and assigning the chord with maximum overlap with the ground truth. This yields an accuracy of 97.1%. With these observations combined, I am satisfied that the estimated beats are accurate enough to be used.
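A minimal sketch of this upper-bound computation, assuming ground-truth chords are given as [start, end) intervals with labels and that accuracy is measured as the time-weighted proportion of correctly labelled audio:

import numpy as np

def beat_level_ceiling(beat_times, chord_intervals, chord_labels):
    """Upper bound on beat-wise accuracy: each beat interval is assigned the
    ground-truth chord with maximum temporal overlap, and agreement is measured
    as the fraction of time covered by that chord within the interval."""
    correct_time, total_time = 0.0, 0.0
    for start, end in zip(beat_times[:-1], beat_times[1:]):
        overlaps = {}
        for (c_start, c_end), label in zip(chord_intervals, chord_labels):
            overlap = max(0.0, min(end, c_end) - max(start, c_start))
            if overlap > 0:
                overlaps[label] = overlaps.get(label, 0.0) + overlap
        if overlaps:
            correct_time += max(overlaps.values())
            total_time += end - start
    return correct_time / total_time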

To calculate features for a beat interval, I average all CQT features whose centre is contained within the interval. The CQT is calculated using a hop length of 1024. The shorter hop length is used to minimise the effect of CQT frames with partial overlap with two beat intervals and ensures that each beat has many CQT frames associated with it. These representations have the added benefit of decreased computational cost as beats have a lower frequency than frames.
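A minimal sketch of this pooling step, assuming a CQT of shape (bins, frames) computed with hop length 1024 and frame centres at multiples of the hop; the sample rate is an assumption and should match the value used to compute the CQT.

import numpy as np

def beat_pooled_features(cqt: np.ndarray, beat_times: np.ndarray,
                         sr: int = 22050, hop_length: int = 1024) -> np.ndarray:
    """Average CQT frames over beat intervals.

    cqt: (n_bins, n_frames) matrix; beat_times: sorted beat boundaries in seconds.
    Frames whose centres fall inside [beat_times[i], beat_times[i+1]) are pooled.
    """
    n_frames = cqt.shape[1]
    frame_centres = np.arange(n_frames) * hop_length / sr
    pooled = []
    for start, end in zip(beat_times[:-1], beat_times[1:]):
        mask = (frame_centres >= start) & (frame_centres < end)
        if mask.any():
            pooled.append(cqt[:, mask].mean(axis=1))
        else:
            # Very short interval: fall back to the single nearest frame.
            pooled.append(cqt[:, np.abs(frame_centres - start).argmin()])
    return np.stack(pooled, axis=0)               # (n_beats, n_bins)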

I test the model with different divisions and groupings of beats. This tests the assumption that whole beats are fine-grained enough for the model to make good predictions. I include tests where beat intervals are sub-divided into two or four, or beat intervals are joined into groups of two. I refer to this as the beat division. I also test a perfect beat division where the true chord transitions are taken as the beats. This is not a fair comparison as the model should not have access to true transition timings. However, it does provide an idea of how the model would fare if it could perfectly predict chord transitions. HMM smoothing is removed for beat-wise models as it is no longer necessary.

Results are shown in Table 5.4. Using beat-wise predictions does not affect performance compared to frame-wise predictions. The model performs just as well with a beat division of 1 as with a beat division of 1/2 or 1/4. The model performs worse with a beat division of 2, though it only loses 6% accuracy. This suggests that chord transitions rarely occur faster than the beats produced by madmom, but that the two are sometimes misaligned.

The model with ‘perfect’ beat intervals performs slightly worse on accuracy metrics but attains a very high mirex score of 90.4%, the highest of any seen in the literature. This is a very promising result: that CQT features can yield a mirex score of 90.4% provides hope for significant improvements in the field. Why the model performs worse on accuracy metrics is not clear. It may be because averaged features from longer time periods provide better information as to the pitch classes present but dampen the signal regarding the root note. Indeed, the mean chord duration is 1.68 seconds while a CQT frame is 0.093 seconds. Why this effect is not observed when predicting at the beat level is also not clear. Further analysis may reduce the gap between mirex and accuracy and yield improvements in ACR.

Table 5.4: Model performance for different beat divisions (columns: beat division, acc, root, third, seventh, mirex).

For final results, I retrain select models on the combined training and validation splits and test on the held-out test split. This is an 80/20% train/test split. I consider the original CRNN with no improvements, the CRNN with a weighted and structured loss and HMM smoothing, concatenating generative features with CQTs, pitch augmentation, beat-wise predictions, ‘perfect’ beat-wise predictions and training on synthetic data.

Results are shown in Table 5.5. Observations are largely similar to those found previously. Weighted and structured loss with smoothing improves accuracy by 1.2% and pitch shifting improves accuracy by a further 1%. Generative features do not help and synthetic data improves performance by an additional 0.6%. This alone is not a clear enough signal that training on synthetic data is better, but it provides hope for further work. Finally, beat-wise predictions maintain the same performance. The ‘perfect’ model achieves the highest performance across all metrics. The mirex of 90% found on the validation set has reduced to 88.7%, and the gap with accuracy has narrowed compared to previous results in Table 5.4. This is still a significant result and suggests that there is hope for breaking through the ‘glass ceiling’.

The second song is Roxanne by The Police, illustrated in Figure 5.5. Syncopation, ambiguous bass, and sliding vocals make this song harder to annotate. Root recall is almost 80% but mirex is only 64%. Sevenths are often omitted and thirds are sometimes wrong as well. The model confuses major and minor, as well as sus4 and major qualities.

There are also some predicted chord transitions that are not present in the ground truth and chord transitions are not all well-timed.

Overall, outputs are much smoother than found previously in Section 4.4.6, with more cohesive predictions of the same chord across a series of frames. However, the problem of identifying rarer chord qualities remains. Performance is also still highly song-dependent: the model's accuracies over songs in the test set have a standard deviation of 20%.

As a final example, frame-wise and beat-wise predictions are compared on two songs not part of the dataset in Figure 5.6: Someone Like You by Adele and Misty by Ella Fitzgerald. In both cases, similar chord information is visualised differently. From a musician's perspective, frame-wise predictions lead to ambiguity over how long a chord lasts in beats. Frame-wise block lengths hint at beats but are not uniform; with time changes, this would become more problematic. In regions with rapid chord transitions, chords are not musically interpretable. In contrast, beat-wise predictions resemble musical notation more closely, and longer beat intervals prevent rapid chord changes. However, beat-wise output can still be hard to interpret when predictions occur part-way through bars.

Conclusions and Further Work

In this thesis, I have presented a thorough analysis of deep learning in automatic chord recognition. There are a few key takeaways.

ACR models are not complex. Good performance relative to state-of-the-art can be achieved with few parameters. It is likely that the task of determining which pitch classes are present from a CQT is a relatively simple operation for a neural network to learn. Performance does not increase with model size past a low threshold.

There are several explanations for the low ceiling on performance and the gap between mirex score and accuracy. First, annotations are too ambiguous or inconsistent for classifiers to learn the upper extensions of chords. Further research on inter-annotator agreement on this dataset is required to assess whether or not this is the case. Second, there are too few instances of rare chord classes. This leads to the current models failing to learn signals indicating the presence of such classes. Third, the current models are unable to use information from a wider context to discern chord qualities. Different genres or repetitions of the same chord within a song may give further clues about the musically coherent chord quality. Whatever the reason, chord recognition models are unlikely to become more useful than crowd-sourced annotations without addressing this issue.

Without smoothing, frame-wise predictions result in too many chord transitions. Of the smoothing methods tested, fixed transition matrices are preferred. Weighting the loss function allows control over performance on rare qualities but requires sacrificing overall accuracy. Introducing structured representations of chords as additional targets provides a small performance gain. Features extracted from MusicGen contain information relevant to ACR but not any more than is already contained in the CQT.

Pitch augmentation works well to encourage root-invariance and improve accuracy. The use of synthetic data provides an exciting avenue for future research. Results presented here show signs that with newer models and more careful construction, synthetic data could provide many new training examples with a customisable chord distribution.

Predicting chords over beats instead of frames improves the interpretation of the model’s outputs while performance is unaffected. Predicting chords over the true chord intervals results in the highest mirex score seen in the literature, suggesting that there are gains to be had through accurately detecting chord transitions.

While deep learning models are powerful chord recognisers, much work remains before the problem is solved. The ‘glass ceiling’ has yet to be broken but the work presented here provides a solid foundation for future research and hope that the true ceiling is much higher.

Many of the experiments conducted would benefit from further analysis. Implementing a sampling method which prioritises rare qualities may yield improved results over a weighted loss function. Looking at alternative methods of structuring chords beyond the pitch classes present may improve results, like the work of Jiang et al. [32]. Larger generative models trained on a broader variety of songs may produce better representations for ACR. The work presented here also highlights new avenues of research.

Multiple Data Sources. Results on synthetic data show enough promise to continue this line of research. A more closely controlled chord sequence generation process may help. For example, one could construct examples designed to teach the differences between different seventh qualities and look at the effect on recall on seventh qualities to see if they improve. Other datasets also exist such as HookTheory. I was not able to obtain audio from this source. However, results here suggest that gathering more data from the same distribution may not help. A better data source might be JAAH, which would enable comparisons across genres and chord distributions.

Finding better chord transitions than beats. The high mirex score found in Section 5.6 suggests two things. First, targeting the problem of identifying chord transitions rather than beats may yield better results. Durรกn and de la Cuadra [47] jointly estimate beats and chords, but to the best of my knowledge, no modern work has jointly estimated chord transitions and chord symbols. Second, current models are missing information regarding the presence of pitch classes that are present in the CQT. Perhaps this information is spread out in time or obscured by nearby frames that are irrelevant to the current chord. Understanding this effect may lead to new insights.

Subjective annotations. Inter-annotator agreement of the root of a chord is estimated at lying between 76% [64] and 94% [65] but these metrics are calculated using only four and two annotators, respectively. Humphrey and Bello [43] posit that agreement between annotations can be far lower than that for some songs. Analysis of such an effect on commonly used datasets would provide a valuable contribution to the field. Such analysis could be used to inform the design of more subtle chord annotations that take multiple annotations and uncertainty into account.

A statement regarding the limitations of the conclusions presented here and ethics of musical machine learning models can be found in Appendix A.1.

spectrograms and 2048 fast Fourier transform (FFT) bins for the linear spectrogram with a hop length of 4096 for all.

Results show that CQTs are the best choice. This raises questions as to the validity of the conclusions drawn by Donahue and Liang [20], who claim that their generative features are better than hand-crafted features. However, they only compare to mel-spectrograms, which may not perform as well as CQTs for the related task of melody recognition. The CQT also outperforms the chroma-CQT, so we can be confident that the model uses information from multiple octaves more effectively than simply summing across octaves.

Different hop lengths have been used to calculate the CQT, ranging from 512 [32] up to 4096 [16]. In previous experiments I have used a hop length of 4096, as used by the authors of the CRNN [16]. Shorter frames would reduce the number of transition frames but increase computational cost. If frame lengths are too short, the Fourier transform may not be able to capture the harmonic structure of the audio.

In Table A

Some initial experiments were conducted over a smaller vocabulary with C = 26. This vocabulary includes a symbol for major and minor over each root and two special symbols, N and X, for no chord and unknown chord respectively. This contrasts with the much larger vocabulary with 14 chord qualities for each root, which is used for the majority of the experiments; with this larger vocabulary, C = 170. Labels are all mapped to the small vocabulary before being evaluated in the same way as described in Section 3.2. This test was to verify that the model trained on the larger vocabulary performs competitively on the smaller vocabulary. If the model trained on the larger vocabulary performed poorly on the smaller vocabulary, it might be prudent to first try to improve performance on this smaller vocabulary. It might also be a sign that the larger vocabulary is too complex or that the more detailed annotations are inconsistent.

However, the table shows very similar performance between both models. This allows us to proceed with the larger vocabulary for the rest of the experiments. The larger vocabulary is also more consistent with the literature and allows for a model to produce far more interesting chord predictions than simply minor, major and root.

Vocab    C     acc    root
small    26    76.7   80.1
large    170   76.0   79.1

Table A.4: CRNN with a small and large vocabulary. Metrics show similar performance between the two. Training on the large vocabulary does not prevent the model from learning how to classify the smaller vocabulary. Thus, I proceed with the larger vocabulary.

Note that the mir_eval package also includes a majmin evaluation metric that compares chords over just the major and minor qualities. However, this is not quite the same as the test above due to subtleties in how mir_eval chooses whether or not a chord is major or minor; it ends up ignoring many chords that could be mapped to these qualities in the smaller vocabulary. Coincidentally, the CRNN with the default parameters attains a majmin accuracy of 76.0% over the larger vocabulary. This further confirms that we need not continue to test on the smaller vocabulary. The majmin metric is not used in the rest of the thesis as it is not as informative as the other metrics and the third metric is highly correlated with it.

Chords in Harte notation were mapped to the vocabulary with C = 170 by first converting them to a tuple of integers using the Harte library. These integers represent pitch classes and are in the range 0 to 11 inclusive. They are transposed such that 0 is the root pitch. These pitch classes were then matched to the pitch classes of a quality in the vocabulary, similar to the work by McFee and Bello [16]. However, for some chords, this was not sufficient. For example, a C:maj6(9) chord would not fit perfectly with any of these templates due to the added 9th. Therefore, the chord was also passed through Music21's [66] chord quality function, which matches chords such as the one above to major. This function would not work alone as its list of qualities is not as rich as the one defined above. If the chord was still not matched, it was mapped to X. This additional step is not done by McFee and Bello [16] but gives more meaningful labels to roughly one third of the chords previously mapped to X.

A.12 Incorrect Region Lengths With/Without Smoothing

However, the differences are only 0.05. We propose that the context on one side is enough for the model to attain the vast majority of the performance attained with bi-directional context. This plot supports our procedure of evaluating over the entire song at once.

Overlapping 5-second chunks of audio were fed through MusicGen in a batched fashion. This first requires passing the audio through the pre-trained Encodec audio tokeniser [10]. These tokens are then fed through the language model, and I take the output logits as the representation for each frame. The model outputs logits in four ‘codebooks’, each a 2048-dimensional vector, intended to represent different granularities of detail in the audio. Audio segments are overlapped such that every frame has context from both directions, and the multiple representations for each frame are averaged. Finally, these representations are upsampled. The model operates at a frame rate of 50Hz. To compute a representation with the same frame length as the CQT, I take the mean over the frames outputted by the model closest to the centre of the CQT frame. In case averaging over frames dampened the signal, I also tried linearly interpolating between the two closest frames outputted by the model; however, this was empirically found to perform slightly worse. Results are left to Appendix A.19. This feature extraction required the use of NVIDIA RTX A6000 GPUs and takes 4 hours for each model over the entire dataset.
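A minimal sketch of the alignment step, pooling 50Hz model representations onto CQT frame times; the sample rate and hop length are assumptions and should match the values used to compute the CQT.

import numpy as np

def align_to_cqt_frames(feats_50hz: np.ndarray, n_cqt_frames: int,
                        sr: int = 22050, hop_length: int = 4096) -> np.ndarray:
    """Pool 50 Hz features (T50, D) onto CQT frame times by averaging the model
    frames assigned to each CQT frame centre."""
    t50 = np.arange(feats_50hz.shape[0]) / 50.0           # model frame times (s)
    cqt_times = np.arange(n_cqt_frames) * hop_length / sr
    # Assign each model frame to its nearest CQT frame, then average per CQT frame.
    assignment = np.abs(cqt_times[None, :] - t50[:, None]).argmin(axis=1)
    out = np.zeros((n_cqt_frames, feats_50hz.shape[1]))
    counts = np.zeros(n_cqt_frames)
    np.add.at(out, assignment, feats_50hz)
    np.add.at(counts, assignment, 1)
    counts[counts == 0] = 1                                # avoid division by zero
    return out / counts[:, None]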

to create them. The rules are based on the relationships between the chords and the keys they are in. For example, a chord progression that moves from a tonic chord to a dominant chord is said to be following the rule of dominant function. This is a common rule in jazz music and is often used to create tension and resolution in a piece of music.

I first decide whether the progression is in a major or minor key, each with probability 0.5. I then uniformly sample a tonic from the set of notes in the Western chromatic scale. From this tonic, seven functional chords are decided before sequence generation. These are all probabilistic; for example, the tonic chord is always the tonic, but the dominant chord can be of maj, 7, sus4, aug or dim7 qualities. The probabilities are user-tuned but do not matter very much to the functionality of the synthetic dataset.

For chord sequence generation, various rules are followed in a probabilistic manner.

Progressions have a random length, uniformly sampled in the range [4,10].

• Tonic (I) may move to predominant chords (ii, IV, vi) or occasionally mediant (iii).
• Predominant chords (ii, IV) resolve to the dominant (V).
• Dominant (V) usually cadences back to tonic (I) or sometimes moves to vi.
• Tonic substitute (vi) leads to ii or iii.
• Mediants (iii) feed into vi.
• Unspecified or fallback transitions are routed toward ii to maintain forward motion.

A time-aligned chord sequence is then calculated in a similar format to that provided by the jams package. This assumes that each chord is played for one bar, that the BPM is always followed, and that MusiConGen simply loops over the chord progression if the end is reached. These assumptions were found to hold on manually inspected examples.

For further details and exact probabilities used, please refer to the provided code.1
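As a concrete illustration of these rules, the following is a simplified sketch of such a generator. The transition lists and their implied probabilities are illustrative only; the exact probabilities and the mapping from degrees to chord qualities are in the linked code.

import random

# Simplified functional-harmony transitions in Roman-numeral degrees.
# Repeated entries stand in for higher probabilities; the real values differ.
TRANSITIONS = {
    "I":   ["ii", "ii", "IV", "IV", "vi", "iii"],
    "ii":  ["V"],
    "IV":  ["V"],
    "V":   ["I", "I", "I", "vi"],
    "vi":  ["ii", "iii"],
    "iii": ["vi"],
}

def generate_progression(min_len: int = 4, max_len: int = 10) -> list:
    """Generate a functional chord progression as a list of Roman-numeral degrees."""
    length = random.randint(min_len, max_len)
    progression = ["I"]
    while len(progression) < length:
        current = progression[-1]
        progression.append(random.choice(TRANSITIONS.get(current, ["ii"])))
    return progression

print(generate_progression())   # e.g. ['I', 'IV', 'V', 'I', 'ii', 'V']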

To correct for the distribution shift between the synthetic training data and the pop train split, I estimate the empirical class probabilities in each domain, P_train(y) and P_pop(y), and rescale the model's logits by the ratio

r(y) = P_pop(y) / P_train(y).    (A.5)

In order for calibration to be root invariant, I take the mean ratio over chords that share the same quality, and a single calibration factor r q is applied to every chord with that quality.
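A minimal sketch of this calibration, assuming empirical chord probabilities are available as dictionaries and that the ratio is applied by adding its logarithm to the corresponding logits (equivalent to multiplying the predicted probabilities before renormalisation); the exact application in the thesis may differ.

import numpy as np

def quality_calibration(p_train: dict, p_pop: dict, chord_quality: dict) -> dict:
    """Per-quality calibration factors r_q, averaged over roots.

    p_train / p_pop map chord symbols to empirical probabilities in each domain;
    chord_quality maps chord symbols to their quality (e.g. 'maj7')."""
    ratios = {}
    for chord, quality in chord_quality.items():
        r = p_pop.get(chord, 0.0) / max(p_train.get(chord, 0.0), 1e-8)
        ratios.setdefault(quality, []).append(r)
    return {q: float(np.mean(rs)) for q, rs in ratios.items()}

def calibrate_logits(logits: np.ndarray, quality_index: np.ndarray,
                     r_by_quality: np.ndarray) -> np.ndarray:
    """Add the log calibration factor of each chord's quality to its logit.

    quality_index[c] gives the quality id of chord c; r_by_quality[q] is r_q."""
    return logits + np.log(r_by_quality[quality_index] + 1e-8)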

Three variants of the dataset are used for training, validation and testing. For training, an epoch consists of randomly sampling a patch of audio from each song in the training set. The length of this sample is kept as a hyperparameter, set to 10 seconds for the majority of experiments, based on values provided by McFee and Bello [16] and a hyperparameter search found in Section 4.1.1.2. For evaluation, the entire song is used as performance was found to be marginally better. This is later discussed in Section 4.4.4. When validating mid-way through training, songs are split into patches of the same length as the training patches to save on computation time. Samples in a batch are padded to the maximum length of a sample in the batch and padded frames are ignored for loss and metric calculation.

Experiments are run on two clusters, with some further evaluation taking place locally. The first is The University of Edinburgh's ML Teaching Cluster. Here, NVIDIA GPUs are used, mostly GTX Titan Xs (12GB VRAM) and RTX A6000s (48GB VRAM) depending on availability on the cluster. Resources have inconsistent availability; therefore, some experiments are run on Eddie, The University of Edinburgh's research compute cluster, on CPUs due to the lack of availability of GPUs. Typical experiments take 1 hour 30 minutes of CPU time, varying up to 10 hours of CPU time for experiments with more expensive computations and larger input.


Of the learning rates tested, the best was found to be 0.001. If any lower, the model does not converge fast enough. If any higher, large gradient updates cause the validation accuracy to be noisy. Figures supporting this conclusion can be found in Appendix A.4. These figures also show that the validation loss does not increase after convergence. I conclude that the model does not have a propensity to overfit within 150 epochs, perhaps due to the random sampling of audio patches during training. Combined with the fact that training is relatively quick and that the model is only saved on improved validation loss, I proceed with 150 epochs of training without early stopping.

1: A subset of CRNN model results on the large vocabulary with different hyperparameters. The best results for each metric are in boldface. L is the length of training patches of audio in seconds, h and g are the hidden size and number of layers in the GRU respectively and k, c and ch are the kernel sizes, number of layers and number of channels in the CNN respectively. Models are ordered by their ‘Rank’, calculated by adding the model's rank order over each metric and ordering by this total. Results across most hyperparameters are very similar. Comparing with the best results from the learning rate search in Table A.1, it seems that the parameters suggested by McFee and Bello [16] are good choices. Models with more parameters and longer input tend to perform worse, perhaps due to overfitting. This suggests that the model is learning something simple.


Korzeniowski and Widmer [45] train a deep CNN which remains competitive with state-of-the-art to this day. It contains 8 layers. Park et al. [42] find that the performance of this deep CNN is very similar to that of the CRNN, both reaching accuracies of around 65%. Training much deeper convolutional networks was found to be more computationally expensive than training the CRNN with little performance gain to be had. Therefore, I proceed with the CRNN for further experiments.


Finally, its training data was properly licensed, unlike Jukebox. I leave results with Jukebox for future work. Details of the feature extraction process are left to Appendix A.18. As the representations are 2048-dimensional, it is computationally infeasible to feed them directly into the GRU. Instead, using fully connected layers, I project these vectors down to a power of 2, from 16 to 1024. The best representation has 64 dimensions, although results show no clear trend. Results are left to Appendix A.20.


https://www.ultimate-guitar.com/

https://www.youtube.com/watch?v=oOlDewpCfZQ accessed 25th February 2025

https://github.com/tmc323/Chord-Annotations

https://www.hooktheory.com/

https://github.com/kmkurn/pytorch-crf

https://huggingface.co/docs/hub/en/index

https://github.com/bmcfee/pyrubberband

https://github.com/PierreRL/LeadSheetTranscription/blob/main/src/data/synthetic_data/chord_sequence.py
