Learning Transposition-Invariant Interval Features from Symbolic Music and Audio
Authors: Stefan Lattner, Maarten Grachten, Gerhard Widmer
Stefan Lattner¹﹐², Maarten Grachten¹﹐², Gerhard Widmer¹
¹ Institute of Computational Perception, JKU Linz
² Sony Computer Science Laboratories (CSL), Paris, France

ABSTRACT

Many music theoretical constructs (such as scale types, modes, cadences, and chord types) are defined in terms of pitch intervals, i.e., relative distances between pitches. Therefore, when computer models are employed in music tasks, it can be useful to operate on interval representations rather than on the raw musical surface. Moreover, interval representations are transposition-invariant, which is valuable for tasks like audio alignment, cover song detection, and music structure analysis. We employ a gated autoencoder to learn fixed-length, invertible, and transposition-invariant interval representations from polyphonic music in the symbolic domain and in audio. An unsupervised training method is proposed that yields an organization of intervals in the representation space which is musically plausible. Based on the representations, a transposition-invariant self-similarity matrix is constructed and used to determine repeated sections in symbolic music and in audio, yielding competitive results in the MIREX task "Discovery of Repeated Themes and Sections".

1. INTRODUCTION

The notion of relative pitch is important in music understanding. Many music theoretical concepts, such as scale types, modes, chord types, and cadences, are defined in terms of relations between pitches or pitch classes. But relative pitch is not only a music theoretical construct. It is common for people to perceive and memorize melodies in terms of pitch intervals (or in terms of contours, the upward or downward direction of pitch intervals) rather than as sequences of absolute pitches.
This characteristic of music perception also has ramifications for the perception of form in musical works, since it implies that transposition of some musical fragment along the pitch dimension (such that the relative distances between pitches remain the same) does not alter the perceived identity of the musical material, or at least establishes a sense of similarity between the original and the transposed material. As such, adequate detection of musical form in terms of (approximately) repeated structures presupposes the ability to account for pitch transposition, one of the most common types of transformations found in music.

© Stefan Lattner, Maarten Grachten, Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Stefan Lattner, Maarten Grachten, Gerhard Widmer. "Learning transposition-invariant interval features from symbolic music and audio", 19th International Society for Music Information Retrieval Conference, Paris, France, 2018.

Relative pitch perception in humans is currently not well understood [13]. For example, there are no established theories on how the human brain derives a relative representation of pitch from the tonotopic representations formed in the cochlea, nor is it clear whether there is a connection between the perception of pitch relations in simultaneous versus consecutive pitches.

Computational approaches to tasks of music understanding (such as detecting patterns and form in music) often circumvent this issue by representing musical stimuli as sequences of monophonic pitches, after which simply differencing consecutive pitches yields a relative pitch representation. This approach also works for polyphonic music, to the extent that the music can be meaningfully segregated into monophonic pitch streams.
A drawback of this approach is that it presupposes the ability to segregate musical streams, which is often far from trivial due to the ambiguity of musical contexts. Taking an analogous approach on acoustical representations of musical stimuli is even more challenging, since it further depends on the ability to detect pitches and onsets in sound.

In this paper we take a different approach altogether. We train a neural network model to learn representations of the relation between the music at some time point t and the preceding musical context. During training, these representations are adapted to minimize the reconstruction error of the music at t given the preceding context and the representation itself.

A crucial aspect of the model is its bilinear architecture (more specifically, a gated autoencoder, or GAE, architecture) involving multiplicative connections, which facilitates the formation of relative pitch representations. We stimulate such representations more explicitly using an altered training procedure in which we transpose the training data by arbitrary transpositions.

The result is two models (for symbolic music and audio) that can map both monophonic and polyphonic music to a sequence of points in a vector space, the mapping space, in a way that is invariant to pitch transpositions. This means that a musical fragment will be projected to the same mapping-space trajectory independently of how it is transposed.

We validate our approach experimentally in several ways. First we show that musical fragments that are nearest neighbors in the mapping space have many pitch intervals in common (as opposed to nearest neighbors in the input space). Then we show that the topology of the learned mapping space reflects musically meaningful relations between intervals (such as the tritone being dissimilar to other intervals).
Lastly, we use mapping-space representations to detect musical form both for symbolic and audio representations of music, showing that they yield competitive results and, in the case of audio, even improve the state of the art. A re-implementation of the transposition-invariant GAE for audio is publicly available¹.

The paper is structured as follows. Section 2 provides an overview of relation learning using GAEs and reviews work on creating interval representations from music. In Section 3, the architecture used is described, and in Section 4, the data on which the GAE is trained is introduced. The training procedure, including the novel method to support the emergence of transposition-invariance, is proposed in Section 5. The experiments conducted to examine the properties of the learned mappings are described in Section 6, and results are presented and discussed in Section 7. Section 8 wraps the paper up with conclusions and prospects of future work.

2. RELATED WORK

GAEs utilize multiplicative interactions to learn correlations between or within data instances. The method was inspired by the correlation theory of the brain [32], which pointed out that some cognitive phenomena cannot be explained with the conventional brain theory, and which proposed an extension involving the correlation of neural patterns.

In machine learning, this principle was deployed in bilinear models, for example to separate person and pose in face images [30]. Bilinear models, like the GAE, are two-factor models whose outputs are linear in either factor when the other is held constant. [26] proposed another variant of a bilinear model in order to learn objects and their optical flow. Due to its similar architecture, the gated Boltzmann machine (GBM) [17, 18] can be seen as a direct predecessor of the GAE.
The GAE was introduced by [14] as a derivative of the GBM, as standard learning criteria became applicable through the development of denoising autoencoders [31]. GAEs have further been used to learn transformation-invariant representations for classification tasks [15], for parent-offspring resemblance [5], for learning to negate adjectives in linguistics [27], for activity recognition with the Kinect sensor [22], in robotics to learn to write numbers [6], and for learning multi-modal mappings between action, sound, and visual stimuli [7].

In music, bilinear models have been applied to learn covariances within spectrogram data for music similarity estimation [28], and to learn musical transformations in the symbolic domain [9]. In sequence modeling, the GAE has been utilized to learn covariances between subsequent frames in movies of rotated 3D objects [16] and to predict accelerated motion by stacking more layers in order to learn higher-order derivatives [21], a method similar to the one proposed here.

¹ See https://github.com/SonyCSLParis/cgae-invar

Transposition-invariance in music is achieved in [20] by transforming symbolic pitch-time representations into point sets, in which translatable patterns are identified. Other methods in the symbolic domain are those in [2], where a general interval representation for polyphonic music is put forward; in [24], where specific pitch-class intervals in polyphonic music are used for characterizing music styles; and in [23], where transposition-invariant self-similarity matrices are computed. In [12], an approach to calculating transposition-invariant mid-level representations from audio is introduced, based on the 2-D power spectrum of melodic fragments. Similarly, a method to calculate interpretable interval representations from audio is proposed in [33], where chromagrams that are close in time are cross-correlated to obtain local pitch-invariance.
3. MODEL

Let x_j be a vector representing the pitches of currently sounding notes (in the symbolic domain) or the energy distributed over frequency bands (in the audio domain) in a fixed-length time interval. Given a temporal context x_{t-n}^t = x_{t-n} ... x_t as the input and the next time step x_{t+1} as the target, the goal is to learn a mapping m_t which does not change when shifting x_{t-n}^{t+1} up- or downwards in the pitch dimension. A gated autoencoder (GAE, depicted in Figure 1) is well suited for this task, modeling the intervals between reference pitches in the input and pitches in the target, encoded in the latent variables of the GAE as mapping codes m_j.

Unlike in common prediction tasks, the targets are known when training a GAE. The goal of the training is to find a mapping m_j for any input/target pair which transforms the input into the given target. The mapping at time t is calculated as

m_t = \sigma_h(W_1 \sigma_h(W_0 (U x_{t-n}^t \cdot V x_{t+1}))),   (1)

where U, V and W_k are weight matrices and \sigma_h is the hyperbolic tangent non-linearity; we will refer to the learnt mappings m_j as the mapping space of the input/target pairs. The operator \cdot (depicted as a triangle in Figure 1) denotes the Hadamard (element-wise) product of the filter responses U x_{t-n}^t and V x_{t+1}, called factors. This operation allows the model to relate its inputs, making it possible to learn interval representations.

The target of the GAE can be reconstructed as a function of the input x_{t-n}^t and a mapping m_t:

\tilde{x}_{t+1} = \sigma_g(V^\top (W_0^\top W_1^\top m_t \cdot U x_{t-n}^t)),   (2)

where \sigma_g is the sigmoid non-linearity for binary input and the identity function for real-valued input.

Figure 1: Schematic illustration of the gated autoencoder architecture used in the experiments.

The cost function is defined to penalize the error of reconstructing the target x_{t+1} given the input x_{t-n}^t and the
mapping m_t, as

L_c = c(x_{t+1}, \tilde{x}_{t+1}),   (3)

where c(\cdot) is the mean-square error for real-valued sequences and the cross-entropy loss for binary sequences.

4. DATA

We train the model both on symbolic music representations and on audio spectrograms. For the symbolic data, the Mozart/Batik data set [35] is used, consisting of 13 piano sonatas containing more than 106,000 notes. The dataset is encoded as successive 60-dimensional binary vectors (encoding MIDI note numbers 36 to 96), each representing a single time step of 1/16th note duration. The pitch of an active note is encoded as a corresponding on-bit, and as multiple voices are encoded simultaneously, a vector may have multiple active bits. The result is a pianoroll-like representation.

The audio dataset consists of 100 random piano pieces from the MAPS dataset [8] (subset MUS), at a sampling rate of 22.05 kHz. We compute a constant-Q transformed spectrogram using a hop size of 1984, and Hann windows with different sizes depending on the frequency bin. The range comprises 120 frequency bins (24 per octave), starting from a minimal frequency of 65.4 Hz. Each time step is contrast-normalized to zero mean and unit variance.

5. TRAINING

The model is trained with stochastic gradient descent to minimize the cost function (cf. Equation 3) using the data described in Section 4. However, rather than using the data as is, we use data augmentation in combination with an altered training procedure to explicitly aim at transposition-invariance of the mapping codes.

5.1 Enforcing Transposition-Invariance

As described in Section 3, the classical GAE training procedure derives a mapping code from an input/target pair, and subsequently penalizes the reconstruction error of the target given the input and the derived mapping code.
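To make Equations 1-3 concrete, the following sketch implements the GAE forward pass and reconstruction. All dimensions and the random weights are illustrative only (far smaller than the layer sizes given in Section 5.2), and the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sizes: concatenated context x_{t-n}^t, target x_{t+1},
# factor layer, and the two mapping layers.
d_ctx, d_tgt, n_fac, n_map1, n_map2 = 12, 4, 8, 6, 3
U = rng.normal(0.0, 0.1, (n_fac, d_ctx))     # context filters
V = rng.normal(0.0, 0.1, (n_fac, d_tgt))     # target filters
W0 = rng.normal(0.0, 0.1, (n_map1, n_fac))   # first mapping layer
W1 = rng.normal(0.0, 0.1, (n_map2, n_map1))  # second mapping layer

def mapping(x_ctx, x_tgt):
    """Equation 1: Hadamard product of the factor responses,
    followed by two tanh mapping layers, yielding m_t."""
    factors = (U @ x_ctx) * (V @ x_tgt)
    return np.tanh(W1 @ np.tanh(W0 @ factors))

def reconstruct(x_ctx, m):
    """Equation 2: reconstruct the target from context and mapping
    code (sigmoid output, as for binary pianoroll input)."""
    return sigmoid(V.T @ ((W0.T @ (W1.T @ m)) * (U @ x_ctx)))

x_ctx, x_tgt = rng.random(d_ctx), rng.random(d_tgt)
m_t = mapping(x_ctx, x_tgt)
loss = np.mean((x_tgt - reconstruct(x_ctx, m_t)) ** 2)  # Equation 3 (MSE)
```

Note that the reconstruction reuses the transposed encoder weights, so the same parameters serve both inference of m_t and reconstruction of the target.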
Although this procedure naturally tends to lead to similar mapping codes for input/target pairs that have the same interval relationships, the training does not explicitly enforce such similarities, and consequently the mappings may not be maximally transposition-invariant. Under ideal transposition-invariance, by definition, the mappings would be identical across different pitch transpositions of an input/target pair. Suppose that a pair (x_{t-n}^t, x_{t+1}) leads to a mapping m (by Equation 1). Transposition-invariance implies that reconstructing a target x'_{t+1} from the pair (x'^t_{t-n}, m) should be as successful as reconstructing x_{t+1} from the pair (x_{t-n}^t, m) when (x'^t_{t-n}, x'_{t+1}) can be obtained from (x_{t-n}^t, x_{t+1}) by a single pitch transposition.

Our altered training procedure explicitly aims to achieve this characteristic of the mapping codes by penalizing the reconstruction error using mappings obtained from transposed input/target pairs. More formally, we define a transposition function shift(x, δ), shifting the values of a vector x of length M by δ steps (MIDI note numbers and CQT frequency bins for symbolic and audio data, respectively):

\mathrm{shift}(x, \delta) = (x_{(0+\delta) \bmod M}, \ldots, x_{(M-1+\delta) \bmod M})^\top,   (4)

and shift(x_{t-n}^t, δ) denotes the transposition of each single time step vector before concatenation and linearization.

The training procedure is then as follows. First, the mapping code m_t of an input/target pair is inferred as shown in Equation 1. Then, m_t is used to reconstruct a transposed version of the target from an equally transposed input (modifying Equation 2) as

\tilde{x}'_{t+1} = \sigma_g(V^\top (W_0^\top W_1^\top m_t \cdot U\,\mathrm{shift}(x_{t-n}^t, \delta))),   (5)

with δ ∈ [−30, 30] for the symbolic and δ ∈ [−60, 60] for the audio data.
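A minimal sketch of the shift function (Equation 4) and of one step of the altered objective (Equation 5 followed by the loss of Equation 3), assuming real-valued input and hence an MSE loss; `mapping` and `reconstruct` stand in for the GAE of Equations 1 and 2:

```python
import numpy as np

def shift(x, delta):
    """Equation 4: element i of the result is x[(i + delta) mod M],
    i.e. a circular shift along the pitch/frequency axis."""
    return np.roll(x, -delta)

def transposed_loss(x_ctx, x_tgt, delta, mapping, reconstruct):
    """One step of the Section 5.1 procedure: infer m_t from the
    original pair, reconstruct from the shifted context, and penalize
    against the equally shifted target."""
    m_t = mapping(x_ctx, x_tgt)
    x_rec = reconstruct(shift(x_ctx, delta), m_t)
    return np.mean((shift(x_tgt, delta) - x_rec) ** 2)
```

With delta drawn uniformly per training batch from the ranges quoted above, minimizing this loss pushes the mapping codes toward transposition-invariance.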
Finally, we penalize the error between the reconstruction of the transposed target and the actual transposed target (i.e., employing Equation 3) as

L(\mathrm{shift}(x_{t+1}, \delta), \tilde{x}'_{t+1}).   (6)

The transposition distance δ is randomly chosen for each training batch. This method amounts to both a form of guided training and data augmentation. Some weights (i.e., filters) in U and V resulting from this training are depicted in Figure 2.

5.2 Architecture and Training Details

The architecture and training details of the GAE are as follows. A temporal context length of n = 8 is used (the choice of n > 1 leads to higher robustness of the mapping codes to diatonic transposition). The factor layer has 1024 units for the symbolic data and 512 units for the spectrogram data. Furthermore, for all datasets, there are 128 neurons in the first mapping layer and 64 neurons in the second mapping layer (resulting in m_t ∈ R^64).

Figure 2: Some filter pairs ∈ {U, V} of a GAE trained on polyphonic Mozart piano pieces.

L2 weight regularization is applied to the weights U and V, as well as sparsity regularization [11] on the topmost mapping layer. The deviation of the norms of the columns of both weight matrices U and V from their average norm is penalized. Furthermore, we restrict these norms to a maximum value. We apply 50% dropout on the input and no dropout on the target, as proposed in [14]. The learning rate (1e-3) is gradually decremented to zero over the course of training.

6. EXPERIMENTS

In this section we describe several experimental analyses to validate the proposed approach. They are intended to test the degree of transposition-invariance of the learned mappings, as well as to assess their musical relevance (Sections 6.1 and 6.3). Finally, we put the learned representations to practice in a repeated-section discovery task for symbolic music and audio (Section 6.2).
6.1 Classification and Cluster Analysis

Our hypothesis is that the model learns relative pitch representations (i.e., intervals) from polyphonic absolute pitch sequences. In order to test this hypothesis, we conduct two experiments using the symbolic data.

In the first experiment, a ten-fold k-nn classification of intervals is performed (with k = 10), where the task is to identify all pitch intervals between notes in the input and the target of an input/target pair. If the learned mappings actually represent intervals, the classifier will perform substantially better on the mappings than on the input space. As intervals in music are transposition-invariant, the interval labels do not change when transposing in the input space. Thus, we perform the classification on the mappings of the original data and of randomly transposed data, to test if the mappings are indeed transposition-invariant.

We label the symbolic training data input/target pairs according to all intervals which occur between them, independent of the temporal distance of the notes exhibiting the intervals. Thus, each pair can have multiple labels. For each pair in the test set, the k-nn classifier predicts the set of interval labels that are present in the k neighbors of that pair. The classification is performed in the input space (using concatenated pairs) and in the mapping space. Using these predictions we determine the precision, recall, and F-score over the test set (cf. Table 1).

Table 1: Results of the k-nn classification in the mapping space and in the input space for the original symbolic data and data randomly transposed by [−24, 24] semitones. "All" is a lower bound (always predict all intervals); "None" returns the empty set.

Data             | Space         | Precision | Recall | F1
Original input   | Mapping space | 91.27     | 70.25  | 76.66
Original input   | Input space   | 65.58     | 46.05  | 50.59
Transposed input | Mapping space | 90.78     | 71.44  | 77.31
Transposed input | Input space   | 51.81     | 32.99  | 37.43
(baseline)       | All           | 26.40     | 100.0  | 40.05
(baseline)       | None          | 0.0       | 0.0    | 0.0
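The per-pair evaluation treats the predicted and true interval labels as sets; a minimal sketch (the helper name is ours):

```python
def precision_recall(true_intervals, predicted_intervals):
    """Set-based precision and recall for one multi-label test pair,
    as used for the interval k-nn evaluation in Section 6.1."""
    tp = len(true_intervals & predicted_intervals)
    precision = tp / len(predicted_intervals) if predicted_intervals else 0.0
    recall = tp / len(true_intervals) if true_intervals else 0.0
    return precision, recall
```

The worked example in the following paragraph (6 true intervals, 4 true positives, 4 false positives) follows directly from this definition.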
For example, when a pair contains 6 intervals and the classifier's estimate yields 4 true-positive and 4 false-positive interval occurrences, that pair is assigned a precision of 0.5 and a recall of 0.67.

In the second part of the experiment, the cluster centers of all intervals in the mapping space are determined. Again, each pair projected into the mapping space accounts for all intervals it exhibits and can therefore participate in more than one cluster. The mutual Euclidean distances between all cluster centers are displayed as a matrix (cf. Figure 3). An interpretation of the results follows in Section 7.

6.2 Discovery of Repeated Themes and Sections

The MIREX task Discovery of Repeated Themes and Sections for Symbolic Music and Audio² tests algorithms for their ability to identify repeated patterns in music. The commonly used JKUPDD dataset [3] contains 26 motifs, themes, and repeated sections annotated in 5 pieces by J. S. Bach, L. v. Beethoven, F. Chopin, O. Gibbons and W. A. Mozart. We use the MIDI and the audio versions of the dataset and preprocess them as described in Section 4.

We calculate the reciprocal of the Euclidean distances between all representations m_t of a song, resulting in a transposition-invariant similarity matrix X. Then, the values of the main diagonal are set to the minimal value of the matrix. Subsequently, the matrix is normalized and convolved with an identity matrix of size 15 × 15 to emphasize and smooth diagonals (Figure 4 shows a resulting matrix). The method used to determine repeated parts based on diagonals of high values in the self-similarity matrix is adopted from [25], with a different method to identify diagonals, as described below.
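The similarity-matrix construction just described, together with the diagonal score defined in Equations 7 and 8 below, can be sketched as follows (the `eps` guard and the function names are our additions):

```python
import numpy as np

def self_similarity(mappings, smooth=15, eps=1e-8):
    """Transposition-invariant self-similarity matrix: reciprocal
    pairwise Euclidean distances between mapping codes m_t, main
    diagonal set to the matrix minimum, min-max normalization, then
    'same'-size convolution with a smooth x smooth identity kernel
    to emphasize and smooth diagonals."""
    diff = mappings[:, None, :] - mappings[None, :, :]
    sim = 1.0 / (np.sqrt((diff ** 2).sum(-1)) + eps)
    np.fill_diagonal(sim, sim.min())
    sim = (sim - sim.min()) / (sim.max() - sim.min() + eps)
    T, pad = sim.shape[0], smooth // 2
    padded = np.pad(sim, pad)
    out = np.zeros_like(sim)
    for k in range(smooth):  # summing shifted copies == identity-kernel conv
        out += padded[k:k + T, k:k + T]
    return out / smooth

def diagonal_score(X, i, j, N):
    """Equations 7 and 8: weighted score over the last m values of the
    diagonal of length N starting at X[i, j]; later values (larger k)
    receive weights up to 2 and dominate the stopping decision."""
    m = min(10, N)
    return sum(X[i + k, j + k] * (1.0 + (k + m - N) / m) / m
               for k in range(N - m, N + 1))
```

Tracing a diagonal then amounts to increasing N from 1 and stopping once the score falls below the threshold γ given below.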
The function

s(i, j, N) = \sum_{k=N-m}^{N} X(i+k, j+k)\,\frac{w_k}{m}   (7)

returns the score for a diagonal of length N starting at X(i, j), and diagonals with a high score are considered to be repeated sections. For each i, j, we iteratively evaluate the score with N increasing from 1 in integer steps, until the score undercuts a threshold γ. Only the last m values, m = min(10, N), of the diagonal are taken into account, because those values indicate when to stop tracing. The factor

w_k = 1 + \frac{k + m - N}{m}   (8)

linearly weights the last m values of the diagonal, so that later values have more impact on the overall score.

² http://www.music-ir.org/mirex/wiki/2017:Discovery_of_Repeated_Themes_&_Sections

Figure 3: Distance matrix of cluster centers of intervals represented in mapping space. Darker cells indicate higher distances between the respective clusters; brighter cells indicate closeness.

Three empirically determined parameters influence the functioning of the method: (1) from the diagonals found, we only keep those spanning more than 2 whole notes; (2) all sections whose common boundaries start and end within the length of a half note are considered to be repetitions of each other; (3) the thresholds γ, determining if a diagonal should be considered a repetition in the symbolic and the audio data, are set to 0.9 and 0.81, respectively. The results are shown in Table 2 and discussed in Section 7.

6.3 Sensitivity Analysis

The sensitivity of the model to specific context information provides important insights into the functioning of the model. A common way of determining a network's sensitivity is to calculate the absolute value of the gradients of the network's predictions with respect to the input, holding the network parameters fixed [29]. Figure 5 shows the sensitivity of the model with respect to the temporal context. The model is particularly sensitive to note occurrences at t ∈ {0, −3, −7}.
This shows that the most informative notes for a prediction are its direct predecessors (t = 0), and notes which occur a quarter note (t = −3) and a half note (t = −7, i.e., eight sixteenth notes) before the prediction.

Figure 4: Symbolic music and corresponding self-similarity matrix calculated from transposition-invariant mapping codes. Warmer colors indicate similarity; colder colors indicate dissimilarity.

7. RESULTS AND DISCUSSION

The results of the k-nn classification on the raw data and on representations learnt by the model are shown in Table 1. Classification in the mapping space appreciably outperforms classification in the input space, and obtains similar values for mappings of the original data and the randomly transposed data. In contrast, when performing classification in the input space, the results deteriorate for the randomly transposed input and do not exceed the theoretical lower bound (i.e., always predicting all intervals). As the register and keys of the original data are limited, correlations between absolute and relative pitch exist. When transposing the input, the classifier can no longer make use of these absolute cues for relative pitch and performs weakly in the input space.

Figure 3 indicates which intervals are close to each other in the mapping space. An obvious regularity is the slightly brighter k-diagonals (i.e., parallels to the main diagonal) with k ∈ {−24, −12, 12, 24}, showing that two pitch intervals lead to similar mapping codes when they result in the same pitch class, such as the intervals +8 and −4 semitones, or −7 and −19 semitones.
This is an indication that the model learns the phenomenon of octave equivalence, even though the input to the model represents only absolute pitch. Another distinct feature is the stripe orthogonal to the main diagonal (i.e., where y = −x). This indicates that the model develops some notion of relative distances, by positioning intervals of the same distance (but different signs) close to each other. Note also that the mappings of certain intervals, notably 6 and −6, are distant from those of most other intervals (dark horizontal and vertical lines). This likely reflects the fact that tritone intervals are rare in diatonic music, and is further evidence of the musical significance of the learned mappings.

Table 2: Different precision, recall and F-scores (adopted from [34]; details on the measures are given in [3]) of different methods in the Discovery of Repeated Themes and Sections MIREX task, for symbolic music and audio. The F_3 score constitutes a summarization of all measures.

Symbolic
Algorithm            | F_est | P_est | R_est | F_o(.5) | P_o(.5) | R_o(.5) | F_o(.75) | P_o(.75) | R_o(.75) | F_3   | P_3   | R_3   | Time (s)
GAE intervals (ours) | 59.07 | 77.60 | 58.30 | 68.92   | 80.24   | 67.46   | 77.51    | 91.38    | 73.29    | 50.44 | 60.36 | 53.23 | 127
VMO symbolic [34]    | 60.79 | 74.57 | 56.94 | 71.92   | 79.54   | 68.78   | 75.98    | 75.98    | 75.99    | 56.68 | 68.98 | 53.56 | 4333
SIARCT-CFP [4]       | 33.70 | 21.50 | 78.00 | 76.50   | 78.30   | 74.70   | -        | -        | -        | -     | -     | -     | -
COSIATEC [19]        | 50.20 | 43.60 | 63.80 | 63.20   | 57.00   | 71.60   | 68.40    | 65.40    | 76.40    | 44.20 | 40.40 | 54.40 | 7297

Audio
GAE intervals (ours) | 57.67 | 67.46 | 59.52 | 58.85   | 61.89   | 56.54   | 68.44    | 72.62    | 64.86    | 51.61 | 59.60 | 55.13 | 194
VMO deadpan [34]     | 56.15 | 66.80 | 57.83 | 67.78   | 72.93   | 64.30   | 70.58    | 72.81    | 68.66    | 50.60 | 61.36 | 52.25 | 96
SIARCT-CFP [4]       | 23.94 | 14.90 | 60.90 | 56.87   | 62.90   | 51.90   | -        | -        | -        | -     | -     | -     | -
Nieto [25]           | 49.80 | 54.96 | 51.73 | 38.73   | 34.98   | 45.17   | 31.79    | 37.58    | 27.61    | 32.01 | 35.12 | 35.28 | 454

Figure 5: Absolute sensitivity of the model when looking backwards over the temporal context (in sixteenth notes), averaged over the whole dataset.
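The cluster analysis behind Figure 3 can be sketched as follows; the function name and the toy data are ours, and as in Section 6.1 a pair contributes to the center of every interval it contains:

```python
import numpy as np

def interval_center_distances(mappings, labels, intervals):
    """Mutual Euclidean distances between interval cluster centers in
    mapping space (cf. Figure 3). labels[i] is the set of interval
    labels exhibited by pair i."""
    centers = np.stack([
        np.mean([m for m, ls in zip(mappings, labels) if iv in ls], axis=0)
        for iv in intervals
    ])
    diff = centers[:, None, :] - centers[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))
```

In such a matrix, small values on the k-diagonals with k ∈ {−24, −12, 12, 24} correspond to the octave-related interval pairs discussed above.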
Table 2 shows the results of the repeated themes and sections discovery task, where the F_3 score is a good indicator of the overall performance of the models (see [3] for a thorough explanation of the respective measures). For the audio data, the current state-of-the-art F_3 score was raised from 50.60 to 51.61 by our proposed method. The method performs slightly worse on the symbolic data, which is counterintuitive at first sight, given that the results of other models suggest that this task is easier. Our hypothesis is that for the discovery of repeated sections, approximate matching leads to better results than exact comparison, simply because musical variation goes beyond chromatic transposition (towards which our model is invariant). For approximate matching, a spectrogram representation is better suited than symbolic vectors, as notes are blurred over more than one frequency bin, and harmonics may provide additional cues for similarity estimation. The proposed approach is computationally efficient, because the diagonal detector (cf. Equations 7 and 8) is rather simple and the transposition-invariance of the representations does not require explicit comparison of mutually transposed musical textures.

8. CONCLUSION AND FUTURE WORK

In this paper we have presented a computational approach to deriving (pitch) transposition-invariant vector space representations of music, both in the symbolic and in the audio domain. The representations encode the pitch intervals that occur in the music in a musically meaningful way, with tritone intervals (a rare interval in diatonic music) leading to more distinct representations, and octaves leading to more similar representations. Furthermore, the temporal sensitivity of the model reveals a beat pattern that shows increased sensitivity to pitch intervals occurring at beat multiples of each other.
The transposition-invariance of the representations makes it possible to detect transposed repetitions of musical sections in the symbolic and in the spectral domain of audio. We have demonstrated that this is beneficial in tasks such as the MIREX task Discovery of Repeated Themes and Sections. A simple diagonal-finding approach on a transposition-invariant self-similarity matrix produced by our model is sufficient to outperform the state of the art in the audio version of the task.

We believe it is worthwhile to further explore the utility of transposition-invariant music representations for other applications, including speech recognition, music summarization, music classification, transposition-invariant music alignment (including a cappella voices with pitch drift), query by humming, fast melody-based retrieval in large audio collections, and music generation. First results show that the proposed representations are useful for audio-to-score alignment [1] and for music prediction tasks [10].

9. ACKNOWLEDGMENTS

This research was supported by the EU FP7 (project Lrn2Cre8, FET grant number 610859) and the European Research Council (project CON ESPRESSIONE, ERC grant number 670035). We thank Oriol Nieto for providing us with the source code of his experiments [25].

10. REFERENCES

[1] Andreas Arzt and Stefan Lattner. Audio-to-score alignment using transposition-invariant features. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018.

[2] Emilios Cambouropoulos. A general pitch interval representation: Theory and applications. Journal of New Music Research, 25(3):231-251, 1996.

[3] Tom Collins. Discovery of repeated themes and sections. http://www.music-ir.org/mirex/wiki/2017:Discovery_of_Repeated_Themes_%26_Sections, 2017.

[4] Tom Collins, Andreas Arzt, Sebastian Flossmann, and Gerhard Widmer.
Siarct-cfp: Improving precision and the discovery of inexact musical patterns in point-set representations. In ISMIR, pages 549-554, 2013.

[5] Afshin Dehghan, Enrique G. Ortiz, Ruben Villegas, and Mubarak Shah. Who do I look like? Determining parent-offspring resemblance via gated autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1757-1764, 2014.

[6] Alain Droniou, Serena Ivaldi, and Olivier Sigaud. Learning a repertoire of actions with deep neural networks. In IEEE International Joint Conferences on Development and Learning and Epigenetic Robotics (ICDL-Epirob), pages 229-234. IEEE, 2014.

[7] Alain Droniou, Serena Ivaldi, and Olivier Sigaud. Deep unsupervised network for multimodal perception, representation and classification. Robotics and Autonomous Systems, 71:83-98, 2015.

[8] Valentin Emiya, Roland Badeau, and Bertrand David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1643-1654, 2010.

[9] Stefan Lattner and Maarten Grachten. Learning transformations of musical material using gated autoencoders. In Proceedings of the 2nd Conference on Computer Simulation of Musical Creativity, CSMC 2017, Milton Keynes, UK, September 11-13, 2017, 2017.

[10] Stefan Lattner, Maarten Grachten, and Gerhard Widmer. A predictive model for music based on learned interval representations. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018.

[11] Honglak Lee, Chaitanya Ekanadham, and Andrew Y. Ng. Sparse deep belief net model for visual area V2. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T.
Roweis, editors, Pr oceedings of the T wenty-F irst Annual Conference on Neural Informa- tion Pr ocessing Systems, V ancouver , British Columbia, Canada, December 3-6, 2007 , pages 873–880. Curran Associates, Inc., 2007. [12] Matija Marolt. A mid-lev el representation for melody- based retrie v al in audio collections. IEEE T ransactions on Multimedia , 10(8):1617–1625, 2008. [13] Josh McDermott and Andrew Oxenham. Music per- ception, pitch, and the auditory system. Curr ent Opin- ion in Neur obiology , 18:1–12, 2008. [14] Roland Memisevic. Gradient-based learning of higher- order image features. In IEEE International Confer- ence on Computer V ision (ICCV), 2011 , pages 1591– 1598. IEEE, 2011. [15] Roland Memisevic. On multi-view feature learning. In John Langford and Joelle Pineau, editors, Pr oceed- ings of the 29th International Confer ence on Machine Learning (ICML-12) , ICML ’12, pages 161–168, New Y ork, NY , USA, July 2012. Omnipress. [16] Roland Memise vic and Georgios Exarchakis. Learning in variant features by harnessing the aperture problem. In ICML (3) , pages 100–108, 2013. [17] Roland Memisevic and Geoffre y Hinton. Unsupervised learning of image transformations. In IEEE Confer ence on Computer V ision and P attern Recognition, 2007. CVPR. , pages 1–8. IEEE, 2007. [18] Roland Memisevic and Geof frey E Hinton. Learn- ing to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computa- tion , 22(6):1473–1492, 2010. [19] David Meredith. Cosiatec and siateccompress: Pattern discov ery by geometric compression. In International Society for Music Information Retrieval Confer ence , 2013. [20] David Meredith, Kjell Lemstr ¨ om, and Geraint A W ig- gins. Algorithms for disco vering repeated patterns in multidimensional representations of polyphonic music. Journal of Ne w Music Resear ch , 31(4):321–345, 2002. [21] V incent Michalski, Roland Memisevic, and Kishore K onda. 
”modeling deep temporal dependencies with recurrent grammar cells”. In Advances in neural infor- mation pr ocessing systems , pages 1925–1933, 2014. [22] Decebal Constantin Mocanu, Haitham Bou Ammar, Dietwig Lowet, K urt Driessens, Antonio Liotta, Ger- hard W eiss, and Karl T uyls. Factored four way con- ditional restricted Boltzmann machines for activity recognition. P attern Recognition Letters , 66:100–108, 2015. [23] Meinard M ¨ uller and Michael Clausen. Transposition- in variant self-similarity matrices. In Simon Dixon, David Bainbridge, and Rainer T ypke, editors, Pr o- ceedings of the 8th International Confer ence on Music Information Retrieval, ISMIR 2007, V ienna, Austria, September 23-27, 2007 , pages 47–50. Austrian Com- puter Society , 2007. [24] Eita Nakamura and Shinji T akaki. Characteristics of polyphonic music style and markov model of pitch- class interv als. In T om Collins, David Meredith, and Anja V olk, editors, Mathematics and Computation in Music - 5th International Confer ence, MCM 2015, London, UK, J une 22-25, 2015, Pr oceedings , v olume 9110 of Lecture Notes in Computer Science , pages 109–114. Springer , 2015. [25] Oriol Nieto and Morwaread M Farbood. Identifying polyphonic patterns from audio recordings using music segmentation techniques. In Pr oc. of the 15th Interna- tional Society for Music Information Retrie val Confer- ence , pages 411–416, 2014. [26] Bruno A Olshausen, Charles Cadieu, Jack Culpep- per , and David K W arland. Bilinear models of natural images. In Electronic Imaging 2007 , pages 649206– 649206. International Society for Optics and Photon- ics, 2007. [27] Laura Rimell, Amandla Mabona, Luana Bulat, and Douwe Kiela. Learning to negate adjecti ves with bi- linear models. EACL 2017 , page 71, 2017. [28] Jan Schlueter and Christian Osendorfer . Music simi- larity estimation with the mean-covariance restricted Boltzmann machine. 
In 10th International Conference on Machine Learning and Applications and W orkshops (ICMLA), 2011 , v olume 2, pages 118–123. IEEE, 2011. [29] Karen Simonyan, Andrea V edaldi, and Andre w Zisser- man. Deep inside conv olutional networks: V isualising image classification models and saliency maps. arXiv pr eprint arXiv:1312.6034 , 2013. [30] Joshua B T enenbaum and W illiam T Freeman. Sepa- rating style and content with bilinear models. Neural Computation , 12(6):1247–1283, 2000. [31] Pascal V incent, Hugo Larochelle, Isabelle La- joie, Y oshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful rep- resentations in a deep network with a local denois- ing criterion. J ournal of Machine Learning Resear ch , 11(Dec):3371–3408, 2010. [32] C V on der Malsburg. The correlation theory of brain function reprinted in e. domani, jl van hemmen and k. schulten (eds.), models of neural networks ii, 1981. [33] Thomas C W alters, David A Ross, and Richard F L yon. The interv algram: an audio feature for large- scale melody recognition. In Pr oc. of the 9th Interna- tional Symposium on Computer Music Modeling and Retrieval (CMMR) . Citeseer , 2012. [34] Cheng-i W ang, Jennifer Hsu, and Shlomo Dubnov . Music pattern discovery with variable markov oracle: A unified approach to symbolic and audio represen- tations. In Meinard M ¨ uller and Frans W iering, edi- tors, Proceedings of the 16th International Society for Music Information Retrieval Confer ence, ISMIR 2015, M ´ alaga, Spain, October 26-30, 2015 , pages 176–182, 2015. [35] Gerhard W idmer . Discov ering simple rules in com- plex data: A meta-learning algorithm and some surprising musical discoveries. Artificial Intellig ence , 146(2):129–148, 2003.
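As an illustration of the diagonal-finding idea mentioned in the conclusion, the following sketch builds a self-similarity matrix from per-frame feature vectors and extracts diagonal runs of high similarity as candidate repeated sections. This is not the authors' implementation: the use of cosine similarity, the threshold, and the minimum run length are all assumptions made for the example.

```python
import numpy as np

def self_similarity(features):
    """Cosine self-similarity matrix from a (frames x dims) feature matrix.

    With transposition-invariant interval features, a transposed repeat
    shows up as a high-similarity diagonal just like an exact repeat.
    """
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
    return f @ f.T

def find_diagonals(ssm, threshold=0.8, min_len=4):
    """Return (start_i, start_j, length) triples for diagonal runs where
    similarity stays above `threshold` for at least `min_len` frames."""
    n = ssm.shape[0]
    runs = []
    for lag in range(1, n):  # each off-main diagonal corresponds to one time lag
        hits = ssm.diagonal(lag) >= threshold
        start = None
        for k, hit in enumerate(np.append(hits, False)):  # sentinel closes open runs
            if hit and start is None:
                start = k
            elif not hit and start is not None:
                if k - start >= min_len:
                    runs.append((start, start + lag, k - start))
                start = None
    return runs

# Toy input: a 4-frame pattern, two filler frames, then the pattern again.
pattern = np.eye(4)
feats = np.vstack([pattern, np.full((2, 4), 0.5), pattern])
print(find_diagonals(self_similarity(feats)))  # the repeat at offset 6 is found
```

In practice one would also smooth the matrix and merge overlapping runs, but the core of the approach is no more than this scan over diagonals.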