Smart Edition of MIDI Files
Authors: Pierre Roy, François Pachet
Spotify

Abstract

We address the issue of editing musical performance data, in particular Midi files representing human musical performances. Editing such sequences raises specific issues due to the ambiguous nature of musical objects. The first source of ambiguity is that musicians naturally produce many deviations from the metrical frame. These deviations may be intentional or subconscious, but they play an important role in conveying the groove or feeling of a performance. Relations between musical elements are also usually implicit, creating even more ambiguity. A note is in relation with the surrounding notes in many possible ways: it can be part of a melodic pattern, it can also play a harmonic role with the simultaneous notes, or be a pedal-tone. All these aspects play an essential role that should be preserved, as much as possible, when editing musical sequences. In this paper, we contribute specifically to the problem of editing non-quantized, metrical musical sequences represented as Midi files. We first list a number of problems caused by the use of naive edit operations applied to performance data, using a motivating example. We then introduce a model, called Dancing Midi, based on 1) two desirable, well-defined properties for edit operations and 2) two well-defined operations, split and concat, with an implementation. We show that our model formally satisfies the two properties, and that it prevents most of the problems that occur with naive edit operations on our motivating example, as well as on a real-world example using an automatic harmonizer.

1 Introduction

The term music performance denotes all musical artefacts produced by one or more human musicians playing music, such as a pianist performing a score or accompanying a singer, a violin section playing an orchestration of a piece, or a jazz musician improvising a solo on a given lead sheet.
Music performance can be represented in various ways, depending on the context of use: printed notation, such as scores or lead sheets; audio signals; or performance acquisition data, such as piano-rolls or Midi files. Each of these representations captures partial information about the music that is useful in certain contexts, with its own limitations [4]. Printed notation offers information about the musical meaning of a piece, with explicit note names and chord labels (in, e.g., lead sheets), and precise metrical and structural information, but it tells little about the sound. Audio recordings render timbre and expression accurately, but provide no information about the score. Symbolic representations of musical performance, such as Midi, provide precise timings and are therefore well adapted to edit operations, either by humans or by software. The need for editing musical performance data arises from two situations. First, musicians often need to edit performance data when producing a new piece of music. For instance, a jazz pianist may play an improvised version of a song, but this improvisation should be edited to accommodate a posteriori changes in the structure of the song. The second need comes from the rise of AI-based automatic music generation tools. These tools usually work by analyzing existing human performance data to produce new ones (see, e.g., [3] for a survey). Whatever the algorithm used for learning and generating music, these tools call for editing means that preserve as far as possible the expressiveness of original sources. We address the issue of editing musical performance data represented as Midi files, while preserving as much as possible its semantics, in a sense defined below. However, editing music performance data raises specific issues related to the ambiguous nature of musical objects. The first source of ambiguity is that musicians produce many temporal deviations from the metrical frame.
These deviations may be intentional or subconscious, but they play an important part in conveying the groove or feeling of a performance. Relations between musical elements are also usually implicit, creating even more ambiguity. A note is in relation with the surrounding notes in many possible ways: it can be part of a melodic pattern, it can also play a harmonic role with the simultaneous notes, or be a pedal-tone. All these aspects, although not explicitly represented in a Midi file, play an essential role that should be preserved, as much as possible, when editing such musical sequences. The Midi format is widespread in the instrument industry and Midi editors are commonplace, for instance in Digital Audio Workstations. Paradoxically, the problem of editing Midi with semantic-preserving operations has not been addressed yet, to our knowledge. Attempts to provide semantically-preserving edit operations have been made in the audio domain (e.g., [13]), but these are not transferable to music performance data, as we explain below. In human-computer interaction, cut, copy, and paste [11] are the Holy Trinity of data manipulation. These three commands proved to be so useful that they are now incorporated in virtually every software application, such as word processing, programming environments, graphics creation, photography, audio signal, or movie editing tools. Recently, they have been extended to run across devices, enabling moving text or media from, for instance, a smartphone to a computer. These operations are simple and have a clear, unambiguous semantics: cut, for instance, consists in selecting some data, say a word in a text, removing it from the text, and saving it to a clipboard for later use. Each type of data to be edited raises its own editing issues that led to the development of specific editing techniques. For instance, edits of audio signals usually require cross-fades to prevent clicks.
Similarly, in movie editing, fade-ins and fade-outs are used to prevent harsh transitions in the image flow. Edge detection algorithms were developed to simplify object selection in image editing. The case of Midi data is no exception. Every note in a musical work is related to the preceding, succeeding, and simultaneous notes in the piece. Moreover, every note is related to the metrical structure of the music. In Section 2, we list a number of issues occurring when applying naive edit commands to a musical stream. In this paper, we restrict ourselves to a specific type of musical performance data: non-quantized, metrical music data, i.e., performances which are recorded with free expression but with a fixed, known tempo. This includes most of the Midi files available on the web (for instance [9]). This excludes Midi files consisting of free improvisation, or music performances with no fixed tempo. Note that these could also be included in the scope of our system, using automatic downbeat estimation methods, but we do not consider this case in this paper. It is not possible, to our knowledge, to define a precise semantics for musical performance in general. In this paper we contribute to the problem of editing non-quantized, metrical musical sequences represented as Midi files in the following way:

1. We list a number of problems caused by the use of naive edit operations applied to performance data, using a motivating example;

2. We then introduce a model, called Dancing Midi, based on 1) two desirable, well-defined properties for edit operations and 2) two well-defined operations, split and concat, with an implementation. These primitives can be used to create higher-level operations, such as cut, copy, or paste;

3. We show that our model formally satisfies the two properties;

4.
We show additionally that our model does not create most of the problems that occur with naive edit operations on our motivating example, as well as on a real-world example using an automatic harmonizer.

2 Motivating Example

Figure 1: A piano roll with five measures extracted from a piece by Brahms. Colors indicate note velocities (blue is soft, green is medium, and brown is loud). A typical edit operation: the goal is to cut the first two beats of Measure 3 and insert them at the beginning of Measure 6.

Figure 1 shows a piano roll representing five measures extracted from a Midi stream consisting of a performance capture of Johannes Brahms's Intermezzo in B minor. Consider the problem of cutting the first two beats of Measure 3 and inserting these two beats at the beginning of Measure 6. Figure 2 shows the piano roll produced when these operations are performed in a straightforward way, i.e., when considering notes as mere time intervals. Notes that are played across the split temporal positions are segmented, leading to several musical inconsistencies. First, long notes, such as the highest notes, are split into several contiguous short notes. This alters the listening experience, as several attacks are heard instead of a single one. Additionally, the note velocities (a Midi equivalent of loudness) possibly change at each new attack, which is unmusical. Another issue is that splitting notes with no consideration of the musical context leads to creating excessively short note fragments, which we call residuals, e.g., at the bottom right in Figure 2. Residuals are disturbing, especially if their velocity is high, and are somehow analogous to clicks in audio signals. Finally, a side-effect of this approach is that some notes are quantized (last two beats of Measure 5). As a result, slight temporal deviations present in the original Midi stream are lost in the process.
Such temporal deviations are important parts of the performance, as they convey the groove, or feeling, of the piece, as interpreted by the musician. Here is a list of musical issues occurring when raw-editing a Midi stream:

1. Creation of residuals, i.e., excessively brief notes;

2. Splitting long notes, creating superfluous attacks;

3. Creating surprising, inconsistent changes in note velocities;

4. Losing small temporal deviations with respect to the metrical structure, leading to unnecessary, undesirable quantization.

Figure 2: Raw-editing the piano roll in Figure 1 produces a poor musical result: long notes are split, residuals are created, some notes are quantized, and note velocities are inconsistent. This piano roll was obtained using the Apple Logic Pro X Midi editor, using the "split" option, see Section 3.

Figure 3: Solving the Midi edition problem stated in Figure 1 using the model we present here. There are no short note residuals, long notes are held, and there are no harsh changes in velocities. Small temporal deviations are preserved (no quantization). This is to be compared to Figure 2.

Figure 3 shows another solution to the problem, obtained using the model presented in this article. Comparing Measure 5 in Figures 2 and 3 shows obvious differences between the two approaches: none of the issues produced with raw edits, shown in Figure 2, are present in the piano roll shown in Figure 3.

3 State of the Art

Logic Pro X, the Digital Audio Workstation commercialized by Apple, features a full-featured Midi editor. As shown in Figure 4, when editing a Midi stream, the decisions regarding notes that overlap with the selected regions are left to the user, who has to decide whether to split, shorten (to fit the region boundaries), or keep the notes. Figure 2 shows the piano roll produced using the first option, split.
In the latter case, keep, when pasting somewhere else, the notes will be put back with their original duration, even if it exceeds the region boundaries. This forces the user to decide explicitly what to do, and the decision applies to all notes, regardless of the musical context. Besides, this strategy leads to creating overlapping notes, which create ambiguous situations, as the Midi format does not have a way to handle overlapping notes with the same pitch. The piano roll panel in Avid Pro Tools, another major Digital Audio Workstation, offers less control over Midi edits than Logic Pro X. Figure 5 shows the piano roll obtained with Pro Tools when using the basic copy and paste functions on the motivating example. This piano roll is essentially the same as that in Figure 2, except note velocities are not displayed.

Figure 4: The menu that pops up when splitting a Midi stream with the Apple Logic Pro X Digital Audio Workstation. Here, we split at both ends of a region covering the first two beats of Measure 3 (yellow zone). The system lets the user decide explicitly what they want regarding notes that cross the split region boundaries.

Figure 5: The piano roll produced when editing the motivating example using the piano roll module in Avid Pro Tools.

4 The Model

In this section, we present a model of temporal sequences that allows us to implement two primitives: split and concat. The split primitive is used to break a Midi stream (or Midi file) at a specified temporal position, yielding two Midi streams: the first one contains the music played before the split position and the second one contains the music played after. The concat operation takes two Midi streams as inputs and returns a single stream by appending the second stream to the first one.
The model is called Dancing Midi since the underlying technique (see Section 4.2) bears some similarity with the idea of Dancing Links that Donald Knuth developed in [6]. High-level edit operations, such as copy, cut, paste, or insert, may be performed by applying split and concat to the right streams. For example, to cut a Midi stream S between temporal positions t1 and t2, we execute the sequence of primitive operations:

1. (S0, S_r) ← split(S, t2), i.e., we split S at t2;

2. (S_l, S_m) ← split(S0, t1), i.e., we split S0 at t1;

3. Return concat(S_l, S_r).

The stream S_m is removed from S and could be stored in a Midi clipboard for later use. Similarly, to insert a Midi stream T into a stream S at temporal position t, one can perform:

1. (S1, S2) ← split(S, t), i.e., split S at t;

2. Return concat(concat(S1, T), S2).

In our model, a continuous subdivision of time coexists with a discrete, regular subdivision of time, in, e.g., beats or measures, which is equivalent to the metrical frame of a musical sequence. The constant distance between discrete temporal positions is denoted by δ. Musical events, e.g., notes, rests, chords, may occur at any continuous temporal position. On the contrary, split and concat are applicable only at discrete temporal positions, i.e., multiples of δ. In short, musical events may be placed at any position, and take arbitrary durations, within a regularly decomposed time frame. We made the practical choices to represent time by integers and to consider only sequences starting at time 0 and whose duration is a multiple of the segmentation subdivision δ. These choices aim at simplifying the implementation and clarifying the presentation without limiting the generality of the model¹. A (time) event e is defined by its start and end times, denoted by e− and e+, two nonnegative integers, with e− ≤ e+.
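The two compositions above can be made concrete with a toy sketch. Here Python strings stand in for Midi streams: slicing plays the role of split and string concatenation the role of concat. All names are illustrative, not the paper's implementation.

```python
# Toy stand-in for the primitives: a "stream" is a string and a temporal
# position is a character index. This only shows how the high-level edit
# operations compose from split and concat.

def split(S, t):
    """Break stream S at position t into a left and a right part."""
    return S[:t], S[t:]

def concat(S1, S2):
    """Append stream S2 to stream S1."""
    return S1 + S2

def cut(S, t1, t2):
    """Remove [t1, t2) from S; return the edited stream and the clipboard."""
    S0, Sr = split(S, t2)    # step 1: split S at t2
    Sl, Sm = split(S0, t1)   # step 2: split the left part at t1
    return concat(Sl, Sr), Sm

def insert(S, T, t):
    """Insert stream T into S at position t."""
    S1, S2 = split(S, t)
    return concat(concat(S1, T), S2)

edited, clipboard = cut("abcdef", 2, 4)   # edited == "abef", clipboard == "cd"
restored = insert(edited, clipboard, 2)   # restored == "abcdef"
```

Inserting the clipboard back at the cut position restores the original stream, which is exactly the kind of reversibility that Property (P1), introduced below, formalizes for the real primitives.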
We consider sequences of non-overlapping time events. A sequence S with a duration d(S) is an ordered list of time events E(S) = (e1, ..., en), such that:

• d(S) ≡ 0 (mod δ), i.e., the duration of S is a multiple of δ,

• en+ ≤ d(S), i.e., all events are within the sequence, and

• ei+ ≤ e(i+1)−, for all i = 1, ..., n − 1, i.e., there are no overlapping events.

The set of all such sequences is denoted by S_δ. The model handles sequences of non-overlapping time intervals (elements of S_δ), and therefore cannot directly deal with Midi streams, which contain overlapping intervals (e.g., chords consist of three or more overlapping intervals). Therefore, we decompose a Midi stream into individual sequences of non-overlapping events, one for each unique pitch and unique Midi channel. We show, in Section 4.5, how this approach applies to real Midi streams.

¹ In Midi files, time is represented by integer numbers, based on a predefined resolution, typically 960 ticks per beat.

4.1 Problem Statement

We address the problem of implementing split and concat in an efficient and sensible (from a musical viewpoint) way. The split primitive breaks a sequence at a specific temporal position and the concat primitive returns a sequence formed by concatenating two sequences:

split: S_δ × {0, δ, 2δ, ..., d(S)} → S_δ × S_δ
(S, t) ↦ (S_l, S_r),

where t is the segmentation position, S_l is the left part of S, from 0 to t, hence d(S_l) = t, and S_r is the right part of S, after position t, with d(S_r) = d(S) − t; and

concat: S_δ × S_δ → S_δ
(S1, S2) ↦ S = S1 ⊕ S2,

where S is a sequence of duration d(S) = d(S1) + d(S2) constructed by appending S2 to S1. The types above specify that split and concat create sequences of S_δ, i.e., sequences with no overlapping events and with all events falling within the sequence's bounds.
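The three defining conditions of S_δ translate directly into a membership test. A sketch, with events as (start, end) integer pairs sorted by start time (names and the value of δ are ours, for illustration):

```python
DELTA = 10  # the regular subdivision δ (illustrative value)

def is_valid_sequence(events, duration, delta=DELTA):
    """Check the three defining conditions of S_delta for a sorted list of
    (start, end) event pairs."""
    if duration % delta != 0:
        return False                   # d(S) must be a multiple of delta
    if any(s < 0 or s > e for s, e in events):
        return False                   # events need 0 <= start <= end
    if events and events[-1][1] > duration:
        return False                   # all events must lie within the sequence
    return all(events[i][1] <= events[i + 1][0]
               for i in range(len(events) - 1))  # no overlapping events
```

For example, a single event [8, 12] in a sequence of duration 20 is valid, while two overlapping events or a duration that is not a multiple of δ are rejected.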
We call residual an event whose duration is shorter than a predefined threshold ε. We define two properties for split and concat:

(P1) split and concat are the inverse of one another, i.e.,

a. ∀ S ∈ S_δ, ∀ t ∈ {0, δ, 2δ, ..., d(S)}, concat(split(S, t)) = S

b. ∀ S, T ∈ S_δ, split(concat(S, T), d(S)) = (S, T)

(P2) split and concat never create residuals, i.e.,

a. let S ∈ S_δ, ∀ t ∈ {0, δ, 2δ, ..., d(S)} and (S1, S2) = split(S, t), then ∀ e ∈ S1 ∪ S2, d(e) < ε ⇒ e ∈ S,

b. ∀ S, T ∈ S_δ, ∀ e ∈ S ⊕ T, d(e) < ε ⇒ e ∈ S ∪ T.²

Property (P1) states that splitting a sequence and merging back the resulting sequences produces the original sequence and, conversely, concatenating two sequences and splitting them again at the same position returns the two original sequences. This is to ensure that no information is lost upon splitting and concatenating sequences. Additionally, as we show in Section 4.3, this property offers the benefits of a powerful, generalized undo mechanism. Property (P2) ensures that no residual is created: the only residuals appearing in a sequence obtained using split or concat were already in the original sequence(s). Note that the second part, i.e., (P2)(b.), is a bit simplified, as we explain in Section 4.2.3. It is easy to design split and concat primitives that satisfy either (P1) or (P2). However, as we will illustrate now, it is difficult to enforce both (P1) and (P2). It is the combination of these two properties that ensures that no information is lost and that no residual is created.

² This is a simplification, as we explain in Section 4.2.3.

Figure 6 shows a simple sequence S of duration d(S) = 20, with a regular temporal subdivision of δ = 10 time units, containing an event e, with e− = 8 and e+ = 12. We consider the problem of splitting S in its middle (t = 10) and concatenating back the two extracted sequences.
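The tension between the two properties can be reproduced in a few lines of code on this example. The sketch below (names are ours; events are (start, end) pairs) implements raw event segmentation, with an optional residual filter:

```python
EPS = 3  # residual threshold (epsilon)

def raw_split(events, t, drop_residuals=False):
    """Raw segmentation at t: a crossing event is simply cut in two.
    With drop_residuals=True, fragments shorter than EPS are filtered out."""
    left, right = [], []
    for s, e in events:
        if e <= t:
            left.append((s, e))
        elif s >= t:
            right.append((s - t, e - t))
        else:                          # event crosses t: cut it exactly at t
            left.append((s, t))
            right.append((0, e - t))
    if drop_residuals:
        left = [(s, e) for s, e in left if e - s >= EPS]
        right = [(s, e) for s, e in right if e - s >= EPS]
    return left, right

# The sequence of Figure 6: one event [8, 12], split at t = 10.
S1, S2 = raw_split([(8, 12)], 10)
# S1 == [(8, 10)] and S2 == [(0, 2)]: two residuals of duration 2 < EPS,
# so (P2) is violated.
F1, F2 = raw_split([(8, 12)], 10, drop_residuals=True)
# F1 == F2 == []: no residuals, but the event is lost for good, so
# concatenating F1 and F2 cannot restore S, and (P1) is violated.
```

Without the filter, raw segmentation violates (P2); with it, the event disappears and (P1) is violated, which is exactly the dilemma the memory mechanism of the next section resolves.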
Figure 6: A simple sequence with a single event (gray box) and a regular subdivision of 10 time units.

Assuming the threshold for brief events is set to ε = 3 time units, split and concat should not create events shorter than 3. Figure 7 shows the result of splitting S into S1 and S2 and concatenating S1 and S2, performed in a straightforward way by raw event segmentation, i.e., cutting exactly at the specified position, with no additional processing. The concatenation S1 ⊕ S2 is identical to the original sequence S, satisfying Property (P1). However, this approach creates two events of duration 2 in the split sequences S1 and S2, which violates the residual Property (P2) for ε = 3. In Figure 8, on the contrary, short fragments are omitted, by adding a simple filter, fulfilling Property (P2), as no residuals show up in the split sequences S1 and S2. However, S1 ⊕ S2 is empty, which violates the reversibility Property (P1).

Figure 7: Straightforward approach: sequence S (top) is split at t = 10, yielding sequences S1 and S2 (middle), with two residuals (events of duration 2), then S1 and S2 are concatenated, resulting in S1 ⊕ S2, which is identical to S, as expected.

Figure 8: Applying the straightforward approach with a filter for residuals results in creating no residuals (S1 and S2 are empty, as expected), but violates reversibility, as S1 ⊕ S2 ≠ S.

This example suggests that some memory is needed to implement the split and concat operations so that they satisfy (P1) and (P2). We show in the next section how to define these operations with a minimal amount of memory.

4.2 Model Implementation

The implementation of the model is based on memory cells that store information about the events occurring at each segmentation position in a sequence.

4.2.1 Computing the Memory Cells

For a given segmentation position t, for each event e containing t, i.e., e− ≤ t ≤ e+, we compute the length of e that lies before t and the length of e after t. These two quantities are stored in two memory cells, a left and a right memory cell, as shown in Figure 9. The left cell stores only information related to events starting strictly before t and, conversely, the right cell stores information related to events ending strictly after t. There are five possible configurations, which are shown in Figure 9. When the sequence is split at position t, the memory cells of S at position t are distributed to the resulting sequences: the left cell is associated to the left sequence, at position t, and the right cell is associated to the right sequence, at position 0. These values will be used to concatenate these subsequences with other sequences, using the concat operation. Algorithm 1 computes the two memory cells for a sequence and a specific segmentation position.

Algorithm 1: Compute (l_l, l_r), (r_l, r_r), the left and right memory cells of S at t.

1: procedure ComputeMemory(S, t)
2:   if t = 0 then                                   ▷ t is the start time of S
3:     if ∃ e ∈ S such that e− = 0 then return (0, 0), (0, d(e))
4:     else return (0, 0), (0, 0)
5:   else if t = d(S) then                           ▷ t is the end time of S
6:     if ∃ e ∈ S such that e+ = d(S) then return (d(e), 0), (0, 0)
7:     else return (0, 0), (0, 0)
8:   else                                            ▷ t is within the sequence
9:     if ∃ e ∈ S with e− < t < e+ then return (t − e−, e+ − t), (t − e−, e+ − t)
10:    else if ∃ e1, e2 ∈ S with e1+ = e2− = t then return (d(e1), 0), (0, d(e2))
11:    else if ∃ e ∈ S with e− = t then return (0, 0), (0, d(e))
12:    else if ∃ e ∈ S with e+ = t then return (d(e), 0), (0, 0)
13:    else return (0, 0), (0, 0)

Figure 9: The left and right memory cells at segmentation position t. Case 1: no interval, the memory is "empty"; case 2: a left-touching interval of length 12, stored in the left memory cell; case 3: same as 2 except on the right, with length 5; case 4: two touching intervals, each stored in the corresponding cell; case 5: a crossing interval, the left and right memories store the interval's length before and after t.

4.2.2 Implementation of split

The split primitive, described by Algorithm 2, consists in distributing the intervals to the left and right sequences, S_l and S_r, as follows: intervals occurring before the segmentation position t are allocated to S_l and intervals occurring after t are allocated to S_r. The memory of the original sequence is copied to the resulting sequences in a similar way: memory cells stored before t are copied to the left sequence at the same position; memory cells of S stored after t are copied to the right sequence, with offset −t, as all sequences start at position 0 (see Section 4). A specific treatment is required at position t, to avoid creating residuals that could appear when splitting short intervals containing position t, in order to satisfy Property (P2)(a.). If an event e contains t, i.e., e− < t < e+, we consider the memory cells l_l, l_r, r_l, and r_r stored in M(S, t) to decide whether an event of length l_l should be added at the very end of the left sequence. The event is added if it is not a residual, i.e., l_l ≥ ε, or if it is a residual that already existed in S, i.e., l_r = 0 and l_l > 0. A similar treatment is applied to decide if an event of length r_r is inserted at the very beginning of S_r. Note that, doing this, we ignore the actual events, and only consider the memory. This is reflected in lines 16 and 18 in Algorithm 2.
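Algorithm 1 translates almost line by line into code. A sketch, with events as (start, end) pairs (function and variable names are ours):

```python
def compute_memory(events, t, duration):
    """Return ((l_l, l_r), (r_l, r_r)), the left and right memory cells of a
    sequence of (start, end) events at segmentation position t, following
    the case analysis of Algorithm 1."""
    if t == 0:                                   # t is the start of the sequence
        for s, e in events:
            if s == 0:
                return (0, 0), (0, e - s)
        return (0, 0), (0, 0)
    if t == duration:                            # t is the end of the sequence
        for s, e in events:
            if e == duration:
                return (e - s, 0), (0, 0)
        return (0, 0), (0, 0)
    for s, e in events:                          # t is within the sequence
        if s < t < e:                            # crossing interval (case 5)
            return (t - s, e - t), (t - s, e - t)
    left = next((e - s for s, e in events if e == t), 0)   # interval ending at t
    right = next((e - s for s, e in events if s == t), 0)  # interval starting at t
    return (left, 0), (0, right)

# Figure 6's crossing event: the memory at t = 10 records 2 units on each side.
# compute_memory([(8, 12)], 10, 20) == ((2, 2), (2, 2))
```

The last two lines also cover case 4 of Figure 9 (two touching intervals), where each cell receives the length of its own interval.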
Sequence S_l is such that the space between t − l_l and t is either free of events or contains a single event [t − l_l, t], possibly left-trimmed so it does not start before 0:

e+ ≤ t − M(S_l, t).l_l, ∀ e ∈ S_l,  or  [max{0, t − M(S_l, t).l_l}, t] ∈ S_l,   (1)

and S_r is such that the space between 0 and r_r is either free of events or contains a single event [0, r_r], possibly right-trimmed so it does not exceed the sequence's end time:

e− ≥ M(S_r, 0).r_r, ∀ e ∈ S_r,  or  [0, min{M(S_r, 0).r_r, d(S_r)}] ∈ S_r.   (2)

Note that the way the memory is computed (see Algorithm 1) ensures that (1) and (2) are satisfied by any sequence before it is split.

Algorithm 2: Split S at t.

1: procedure Split(S, t)          ▷ δ is the regular subdivision of time; d(S) = nδ; t = mδ with 0 < m < n
2:   S_l ← empty sequence with d(S_l) = t
3:   S_r ← empty sequence with d(S_r) = d(S) − t
4:   for i = 0, ..., m do         ▷ copies the memory cells from S
5:     M(S_l, iδ) ← M(S, iδ)
6:   for i = m, ..., n do         ▷ translate by −t, as S_r starts at 0
7:     M(S_r, iδ − t) ← M(S, iδ)
8:   for e ∈ S do
9:     if e+ ≤ t then
10:      add e to S_l
11:    else if e− ≥ t then
12:      add event [e− − t, e+ − t] to S_r
13:    else                       ▷ e satisfies e− < t < e+
14:      (l_l, l_r), (r_l, r_r) ← M(S, t)   ▷ shortcut notations
15:      if l_l ≥ ε or (l_r = 0 and l_l > 0) then
16:        add event [max(0, t − l_l), t] to S_l
17:      if r_r ≥ ε or (r_l = 0 and r_r > 0) then
18:        add event [0, min(d(S_r), r_r)] to S_r
19:   return S_l, S_r             ▷ d(S_l) = t = mδ and d(S_r) = d(S) − t

4.2.3 Implementation of concat

Concatenating two sequences S1 and S2 creates a new sequence S, whose duration is d(S) = d1 + d2, with d1 = d(S1) and d2 = d(S2). All events of S1 that end strictly before d1 are added to S, as well as all events of S2 that start strictly after 0. The only delicate operation is to decide what to do at position d1 in S. In Section 4.2.2, we have seen that all sequences satisfy (1) and (2). Therefore, to concatenate S1 and S2, we need to consider four cases:

1. e+ ≤ d1 − M(S1, d1).l_l, ∀ e ∈ S1, and e− ≥ M(S2, 0).r_r, ∀ e ∈ S2, i.e., S1 has no event overlapping with the temporal segment defined by M(S1, d1).l_l, its left-memory at d1, and, similarly, S2 has no event overlapping with the temporal segment defined by M(S2, 0).r_r, its right-memory at 0;

2. e+ ≤ d1 − M(S1, d1).l_l, ∀ e ∈ S1, and [0, min{M(S2, 0).r_r, d(S2)}] ∈ S2, i.e., S1 has no event overlapping with the temporal segment defined by M(S1, d1).l_l, its left-memory at d1, and S2 has an event occupying the temporal segment defined by M(S2, 0).r_r, its right-memory at 0;

3. [max{0, d1 − M(S1, d1).l_l}, d1] ∈ S1 and e− ≥ M(S2, 0).r_r, ∀ e ∈ S2, i.e., S1 has an event occupying the temporal segment defined by M(S1, d1).l_l, its left-memory at d1, and S2 has no event overlapping with the temporal segment defined by M(S2, 0).r_r, its right-memory at 0;

4. [max{0, d1 − M(S1, d1).l_l}, d1] ∈ S1 and [0, min{M(S2, 0).r_r, d(S2)}] ∈ S2, i.e., S1 has an event occupying the temporal segment defined by M(S1, d1).l_l, its left-memory at d1, and, similarly, S2 has an event occupying the temporal segment defined by M(S2, 0).r_r, its right-memory at 0.

These four cases correspond to the four if statements in Algorithm 3 at lines 18, 31, 33, and 35, respectively.

Case 1.
The space delimited by the memories of S1 and S2 does not contain any event. The question is to decide whether an event should be created, based on the values of the memories of S1 and S2. If the memories of S1 and S2 are identical, i.e., l_l = r_l and l_r = r_r, then we add the event e = [max{0, d1 − l_l}, min{d1 + d2, d1 + r_r}]. We add this event regardless of its duration, even if it is very small, i.e., l_l + r_r < ε, because such a short event was present at some point, before S1 and S2 were obtained by splitting a longer sequence. We must add this event to ensure that Property (P1) is not violated. See Figure 10. On the contrary, if the memories of S1 and S2 differ, the choice will depend on the values stored in the memories of S1 and S2. There are several cases to consider. If l_l = 0 or r_r = 0, we do nothing, as no events are recorded around the concatenation position. Otherwise, that is, when l_l > 0 and r_r > 0, we know that S1 (resp. S2) originally had an event containing position d(S1) (resp. 0), which disappeared after a split operation. We will create an additional event e = [max{0, d1 − l_l}, min{d1 + d2, d1 + r_r}], if d(e) = min{d1 + d2, d1 + r_r} − max{0, d1 − l_l} ≥ ε, to avoid creating a residual.

Case 2. The space delimited by the memories of S1 and S2 contains an event e starting at the onset of S2. The question is whether e should start earlier, i.e., "in S1", depending on the memories of S1 and S2. In this case, we start e at e− = max{0, d1 − l_l}; another option is to set e− = max{0, d1 − l_l, d1 − r_l}. In the second option, the algorithm will try to create an attack for the corresponding note based on the memory of S2. Both options are equally valid; the intuition is that favoring the memory of the left sequence tends to preserve temporal deviations in attacks as they appear in the left sequence.
Conversely, favoring the memory of the right sequence tends to replicate deviations of attacks as they appear in the right sequence.

Case 3. The space delimited by the memories of S1 and S2 contains an event e ending at the end of S1. The question is whether e should end later, "in S2", depending on the memories of S1 and S2. One possible implementation is to systematically end e at e+ = min{d1 + d2, d1 + r_r}. Another option is to end the event at e+ = min{d1 + d2, d1 + r_r, d1 + l_r}. Here again, both options are equally valid. The intuition is that favoring the memory of the left (resp. right) sequence tends to preserve temporal deviations in note durations as they appear in the left (resp. right) sequence.

Case 4. The space delimited by the memories of S1 and S2 contains an event ending at the end of S1 and another event starting at the onset of S2. In this case, if l_r > 0 and r_l > 0, we merge the two events into a single one; otherwise, we do nothing.

Algorithm 3 Concatenate S1 and S2.
 1: procedure Concat(S1, S2)                        ▷ m is the integer such that d(S) = mδ
 2:     d1 ← d(S1) and d2 ← d(S2)
 3:     S ← empty sequence with d(S) = d1 + d2
 4:     for i = 0, …, m do                          ▷ copy memory from S1 and S2, except at d1
 5:         if iδ < d1 then M(S, iδ) ← M(S1, iδ)
 6:         if iδ > d1 then M(S, iδ) ← M(S2, iδ − d1)
 7:     M(S, d1).l_l ← M(S1, d1).l_l                ▷ l_l at the end of S1
 8:     M(S, d1).l_r ← M(S1, d1).l_r                ▷ l_r at the end of S1
 9:     M(S, d1).r_l ← M(S2, 0).r_l                 ▷ r_l at the start of S2
10:     M(S, d1).r_r ← M(S2, 0).r_r                 ▷ r_r at the start of S2
11:     for e ∈ S1 s.t. e+ < d1 do                  ▷ events ending before d1 are added
12:         add e to S
13:     for e ∈ S2 s.t. e− > 0 do                   ▷ events starting after d1 are added
14:         add event (e− + d1, e+ + d1) to S
15:     (l_l, l_r), (r_l, r_r) ← M(S, d1)           ▷ shortcut notations
16:     P1 ← S1 is empty or S1[−1]+ ≤ d1 − l_l      ▷ S1[−1] denotes the last event in S1
17:     P2 ← S2 is empty or S2[0]− ≥ r_r            ▷ S2[0] denotes the first event in S2
18:     if P1 and P2 then
19:         if l_l = r_l and l_r = r_r and l_l > 0 then        ▷ same as initial situation for S1 and S2
20:             add [max{0, d1 − l_l}, min{d1 + d2, d1 + r_r}] to S   ▷ restore memory, regardless of ε
21:         else
22:             if l_l = 0 and r_r = 0 then
23:                 do nothing                                 ▷ no memory to consider
24:             else if l_l = 0 and r_r > 0 and r_l = 0 then
25:                 add e = [d1, min{d1 + d2, d1 + r_r}] to S  ▷ e ∈ S2 was not created by splitting
26:             else if r_r = 0 and l_l > 0 and l_r = 0 then
27:                 add e = [max{0, d1 − l_l}, d1] to S        ▷ e ∈ S1 was not created by splitting
28:             else                                           ▷ ambiguous case
29:                 if min{d1 + d2, d1 + r_r} − max{0, d1 − l_l} ≥ ε then
30:                     add e = [max{0, d1 − l_l}, min{d1 + d2, d1 + r_r}] to S   ▷ add e if not a residual
31:     else if P1 and not P2 then
32:         add [max{0, d1 − l_l}, d1 + S2[0]+] to S           ▷ extend S2[0] into the past (starts before d1)
33:     else if not P1 and P2 then
34:         add [S1[−1]−, min{d1 + d2, d1 + r_r}] to S         ▷ extend S1[−1] into the future (ends after d1)
35:     else
36:         if l_r > 0 and r_l > 0 then
37:             add [S1[−1]−, d1 + S2[0]+] to S                ▷ merge S1[−1] and S2[0]
38:         else
39:             add S1[−1] to S                                ▷ add both S1[−1] and S2[0]
40:             add [d1 + S2[0]−, d1 + S2[0]+] to S
41:     return S

Figure 10: Splitting a sequence S at t = 10 with a residual at t. The resulting sequences S1 and S2 are empty, but memorize the residual (light gray hatched boxes). When concatenating them, Algorithm 3 will recreate the residual.
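The memory bookkeeping of Algorithm 3 (lines 4 to 10), which copies the cells of S1 and S2 and assembles the junction cell at d1 from the end of S1 and the start of S2, can be sketched as follows. This is an illustrative fragment; the Memory class and the dict-based memory table are our assumptions, not the authors' data structures:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    ll: float = 0.0  # left extent of the left-hand memory cell
    lr: float = 0.0  # right extent of the left-hand memory cell
    rl: float = 0.0  # left extent of the right-hand memory cell
    rr: float = 0.0  # right extent of the right-hand memory cell

def merge_memories(mem1, mem2, d1):
    """Build the memory table of S = Concat(S1, S2), per Algorithm 3, lines 4-10.

    mem1, mem2: dicts mapping segmentation positions (multiples of delta)
    to Memory cells, for S1 and S2. Positions of S2 are shifted by d1.
    At the junction d1, the left-hand cells come from the end of S1 and
    the right-hand cells from the start of S2.
    """
    mem = {}
    for pos, cell in mem1.items():
        if pos < d1:
            mem[pos] = cell
    for pos, cell in mem2.items():
        if pos > 0:
            mem[pos + d1] = cell
    mem[d1] = Memory(
        ll=mem1[d1].ll, lr=mem1[d1].lr,  # left-hand side: end of S1
        rl=mem2[0].rl, rr=mem2[0].rr)    # right-hand side: start of S2
    return mem
```

The junction cell combines the two protected boundary cells, which is exactly the invariant discussed in Section 4.3.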
4.3 The Model Satisfies Properties (P1) and (P2)

Properties (P1) and (P2) respectively ensure that editing is non-destructive and that no undesired residuals are created. Therefore the model naturally provides an undo mechanism: if two sequences that originally formed a single sequence are concatenated together in their original position, the result is the original sequence. This undo mechanism is more general, however: even if the two sequences were used in other intermediate edit operations, it is always possible to recreate the initial sequence. This, combined with the possibility to copy sequences, gives rise to powerful sequence editing tools, as illustrated on Midi streams in Section 4.5.

The model relies on a memory structure computed at each segmentation position. This memory consists of two memory cells, representing the left-hand and right-hand sides of the sequence at the specified position. The essential invariant in the implementation of the split and concat primitives is that:

• the left-hand side memory of a sequence at its end position is never modified, and
• the right-hand side memory of a sequence at position 0 is never modified.

Therefore, arbitrarily editing the sequence never leads to any information loss, as the model always remembers information about the initial state of a sequence at its start and end positions. This is what makes it possible to satisfy Properties (P1) and (P2).

We will not fully demonstrate that the model satisfies Properties (P1) and (P2). From the description of Algorithms 2 and 3, it is clear that Property (P2) on residuals is satisfied, with the subtle case discussed in the fifth case of Algorithm 3 (see Section 4.2.3). Property (P1) on reversibility is easily checked in most cases. The only tricky case is when residuals are present in the edited sequences.
A full proof requires reviewing all possible configurations, which would be too long. However, the discussion in Section 4.2.3 covers the most difficult case, that of split residuals, which need to be recreated when concatenating two sequences that originally formed a single sequence.

4.4 Extending the Model

The model may be extended easily to handle additional musical information. A first extension is to associate each event in a sequence with some metadata. In the case of Midi files, it is natural to store the velocity and the channel of a note-on event with the corresponding event. This is straightforward to implement by storing the metadata in the memory cells computed by Algorithm 1, and by associating the best metadata with newly created events in Algorithms 2 and 3. Technically, each event e is associated with a value v(e), which may represent any metadata. A time event e is therefore a 3-tuple [e−, e+, v(e)]. The only modifications to the algorithms are:

1. in Algorithm 1, each memory cell is a 3-tuple, e.g., the left memory cell for event e is [l_l, l_r, v];
2. in Algorithm 2, in line 12, we add event [e− − t, e+ − t, v(e)] instead of simply [e− − t, e+ − t]; similarly, in line 16, we add [e−, t, v(e)], and in line 18, we add [0, e+ − t, v(e)];
3. in Algorithm 3, events receive the value stored in the corresponding memory; when we merge two events, we systematically choose to keep the metadata associated with the left event. This decision is arbitrary and we could decide otherwise, e.g., use the memory of the right sequence, or compute an "average" value.

In our model, residuals are defined by an absolute threshold duration ε. It is also natural to use a relative definition for residuals.
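Concretely, a test combining the absolute threshold with a relative one might look like the following sketch. The function name is ours, and the default values (0.15 beats and a 20% ratio) are the settings reported later in the paper to work well in practice:

```python
def is_residual(fragment_len, whole_len, eps=0.15, ratio=0.2):
    """Return True if a fragment of a split event should be discarded.

    A fragment is a residual if it is shorter than the absolute
    threshold eps (in beats), or shorter than `ratio` times the
    duration of the whole original event.
    """
    return fragment_len < eps or fragment_len < ratio * whole_len
```

With a 4-beat event, a half-beat head is longer than ε but still discarded by the relative test, while a full-beat head survives.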
For instance, when splitting a sequence at position t, if an event e is such that e− = t − a and e+ = t + b with b ≫ a > ε, depending on the musical context, one may want to dismiss the part of e occurring before t (of length a), although it is not a residual in the absolute sense, as a > ε. A typical case is a measure containing a long, whole note starting slightly before the bar line. When splitting at the bar line, the head of the note should be removed, as it obviously belongs to the measure starting at t, not to the measure ending at t. It is quite easy to extend the model to handle relative residuals, although it makes Algorithms 2 and 3 a bit longer, which is why it is not reported here. Technically, the head (or tail) of a split event is considered a residual if its duration is shorter than ε, or if it is shorter than a certain ratio of the duration of the whole event. Typical values for the ratio range from 1/10 to 1/3 (with a ratio of 1/3, a fragment shorter than a third of the original event is considered too small to exist on its own). This modification leads to a slight increase in the complexity of Algorithms 2 and 3 to ensure that the model still satisfies Properties (P1) and (P2).

4.5 Using the Model on Actual Music Performance

As we said in Section 4, the model handles sequences of non-overlapping events, and as such is not directly applicable to Midi files. However, in a Midi file, for a given Midi pitch and a given Midi channel, the successive note-on and note-off events form a sequence of non-overlapping time intervals. Therefore, the model is applicable to Midi files if we treat each note (fixed pitch and channel) individually. The targeted use of the model is to edit a Midi file capturing a musical performance, and therefore with non-quantized note events, but recorded with a specified tempo. The tempo may change during the piece, but we will consider it fixed here for the sake of clarity.
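This decomposition of a Midi stream into per-(pitch, channel) sequences of non-overlapping intervals can be sketched as follows. The event-tuple format is our assumption for illustration; a real implementation would read these events from a Midi parser:

```python
from collections import defaultdict

def midi_to_interval_sequences(events):
    """Group MIDI note events into per-(pitch, channel) interval sequences.

    events: iterable of (time, kind, pitch, channel) tuples, time in beats,
    kind in {"on", "off"}, sorted by time. For a fixed pitch and channel,
    the note-on/note-off pairs form non-overlapping intervals, which is
    the event-sequence format the model expects.
    """
    pending = {}                  # (pitch, channel) -> onset of the open note
    sequences = defaultdict(list)
    for time, kind, pitch, channel in events:
        key = (pitch, channel)
        if kind == "on":
            pending[key] = time
        elif kind == "off" and key in pending:
            sequences[key].append((pending.pop(key), time))
    return dict(sequences)
```

Each resulting sequence can then be split and concatenated independently with Algorithms 2 and 3.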
The model is well adapted to editing such Midi files with segmentation points set at a fraction of the meter, e.g., one beat, one bar, or half a beat. The split and concat primitives were implemented with this application in mind. The way we use the memory when concatenating sequences ensures that temporal deviations from the metrical structure of the music are preserved as much as possible, without creating disturbing residuals, and with the flexibility of a powerful undo mechanism. We describe here applications of our model to two real-world examples.

4.5.1 Transforming a 4/4 into a 3/4 Music Piece

Consider the piano roll in Figure 11, which shows eight measures of a Midi capture of a piano performance by French pianist Lionel Gaget, in the style of American jazz pianist Chick Corea. The time signature of this piece is 4/4. We aim at producing a 3/4 version of this piece by removing the last beat of every measure. We created two versions: Figure 12 shows a piano roll of the resulting Midi stream when performing raw edits, and Figure 13 shows the resulting music when performing edits with our model.

Figure 11: A piano roll showing the first eight measures of a Midi capture of a piano performance, in the style of American jazz pianist Chick Corea.

Figure 12: A raw edit of the Midi shown in Figure 11, obtained by removing the fourth beat of every measure, resulting in a version with a 3/4 time signature with many musical issues (in red boxes): some long notes are split, residuals are present, note velocities are inconsistent, and many notes are quantized.

The version in Figure 12 exhibits many musical issues that make it necessary to manually edit the resulting piano roll to preserve the groove of the original music. Here is a list of some of these issues:

• Split long notes, such as the B2 between measures 2 and 3;
• Residuals: A2 at the beginning of m. 4, B2 at the beginning of m. 5, a loud E3 at the end of m. 5, etc.;
• Quantized notes: heads of B1 and F3 at the beginning of m. 2, tails of E1 and B1 at the end of m. 6, etc.;
• Inconsistent velocities: the velocity of the split B2 changes between m. 2 and m. 3.

The piano roll in Figure 13, on the other hand, does not have any of the issues of the raw edit version and, as a result, sounds more natural and preserves the style of the original music more convincingly.

Figure 13: A Dancing Midi edit of the Midi shown in Figure 11, obtained by removing the fourth beat of every measure, resulting in a version with a 3/4 time signature with none of the musical issues mentioned (compare white boxes here with red boxes in Figure 12). Many notes are held across segmentation points, creating a smoother musical output.

4.5.2 Harmonization

Recent progress in Artificial Intelligence has led to powerful music generation systems, especially in the symbolic domain [2]. Computers have become extremely efficient at creating brief musical fragments in many styles [12, 5], but no algorithm has yet captured the art of spontaneously arranging musical material into longer convincing structures, such as songs. For instance, in the composition process using FlowMachines [8], human musicians are in charge of selecting and organizing musical material suggested by the computer to create large-scale structures conveying a sense of direction. These new ways of making music highlight the need for powerful tools to edit musical material.

Figure 14: The harmonization of "Autumn Leaves" as written in the lead sheet, from measure 20 to measure 33. The list of chords visible in the figure is Gm (measures 20-21), Cm7, F7, B♭maj7 (mm. 24-25), Am7♭5, D7♭9, Gm7, F7, Fm7, E7, E♭maj7, D7♭9, and Gm (mm. 32-33), using jazz notation for chord symbols, in a default voicing.

One of the main strengths of the Midi format is that it enables musical transformations, such as pitch-shifting (or transposition) and time-stretching, with no loss of quality.
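On the per-(pitch, channel) representation described earlier, chromatic transposition amounts to re-keying the sequences while leaving the time intervals untouched. This is a sketch under that assumed representation (a dict mapping (pitch, channel) to lists of (onset, end) intervals); the function name is ours:

```python
def transpose(sequences, semitones):
    """Chromatically transpose per-(pitch, channel) interval sequences.

    Event intervals (and thus the model's memory cells) are untouched;
    only the pitch keys shift. The result assumes pitch + semitones
    stays in the MIDI range 0-127.
    """
    return {(pitch + semitones, channel): list(events)
            for (pitch, channel), events in sequences.items()}
```

Because only the keys change, any memory attached to the time intervals is preserved verbatim, which is why transposition is lossless for the model.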
The model is naturally compatible with transpositions because the memory cells do not depend on the actual pitch of a note. This can be illustrated by integrating our model within any generative algorithm. As an illustration, we have developed a system that produces harmonizations for a given target Midi file in the style of a source Midi file. The harmonizer outputs a new Midi file with the same duration as the target file, created by editing the source file, using split and concat combined with chromatic transpositions, in such a way that the output file's harmony matches that of the target Midi file. The harmonizer uses Dynamic Programming [1] or similar techniques, such as Belief Propagation [7], to produce a file that optimizes the harmonic similarity with the target file and, at the same time, minimizes the edits in the source file, to preserve the style of the source as much as possible. We do not fully describe the harmonizer in this article [10].

Figures 15 and 16 show piano rolls obtained by harmonizing 14 measures of "Autumn Leaves" (see Figure 14) using a Midi capture of a piano performance of an intermezzo by Johannes Brahms. The raw edit in Figure 15 exhibits the same type of artifacts as Figure 12. Long notes are split at each segmentation point (here, every beat), which creates dissonant chords at measures 24-25. More generally, the style of the resulting Midi file is substantially different from that of the original source.

Figure 15: Harmonization of the 14 measures of "Autumn Leaves" shown in Figure 14, in the style of a piano performance of an intermezzo by Johannes Brahms, using raw Midi edits. The actual harmony is visible in the background (light green notes). Musical inconsistencies are indicated by red boxes.

Figure 16: The same harmonization as in Figure 15, except all edits are performed with the model. This piano roll has none of the musical issues of the piano roll in Figure 15.
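The harmonizer itself is not described in this article. Purely as an illustration of the kind of optimization it performs, and not the authors' actual algorithm, a dynamic program that picks one source segment per target position, trading an assumed harmonic-distance table against an edit (segment-switch) penalty, could be sketched as:

```python
def harmonize_path(cost, switch_penalty=1.0):
    """Pick one source-segment index per target position by dynamic programming.

    cost[t][s] is an assumed harmonic distance between target position t
    and source segment s. Moving to the next consecutive source segment
    (s -> s + 1) is free, so long uncut spans of the source are preserved;
    any other move counts as an edit and pays switch_penalty.
    Returns the list of chosen segment indices, one per target position.
    """
    n_t, n_s = len(cost), len(cost[0])
    best = [cost[0][s] for s in range(n_s)]
    back = [[0] * n_s for _ in range(n_t)]
    for t in range(1, n_t):
        new = [0.0] * n_s
        for s in range(n_s):
            # cheapest predecessor p, free if p + 1 == s (no cut in the source)
            c, p = min((best[p] + (0.0 if p + 1 == s else switch_penalty), p)
                       for p in range(n_s))
            new[s] = c + cost[t][s]
            back[t][s] = p
        best = new
    # backtrack the optimal path
    s = min(range(n_s), key=lambda i: best[i])
    path = [s]
    for t in range(n_t - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]
```

In this toy form, the optimum balances harmonic similarity with the target against the number of cuts in the source, which mirrors the trade-off described above.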
The source contains numerous long-held notes, and most chords are arpeggiated. In the Midi file in Figure 15, long-held notes are broken at every beat, making most chords non-arpeggiated (their notes are played simultaneously). Many residuals are introduced, and some of them make up chords (red boxes in measures 23 and 31) that sound wrong. On the contrary, Figure 16 shows a much cleaner result, with none of these musical issues.

5 Conclusion

We have presented a model for editing non-quantized, metrical musical sequences represented as Midi files. We first listed a number of problems caused by the use of naive edit operations applied to performance data, using a motivating example. We then introduced a model, called Dancing Midi, based on 1) two desirable, well-defined properties for edit operations and 2) two well-defined operations, split and concat, with an implementation. We showed that our model formally satisfies the two properties, and that it does not create most of the problems that occur with naive edit operations on our motivating example, as well as on a real-world example using an automatic harmonizer.

Our approach has limitations. First, the model requires two parameters (ε and the relative ratio), which have to be set by the user. Setting these parameters to 0.15 beats and 20%, respectively, turned out to work well for most music we had to deal with. However, there are cases where these parameters should be tuned, especially for atypical music (e.g., extreme tempos or very dense material). More generally, the model is monophonic by nature, so it does not make any inference on groups of notes (e.g., chords). This may produce, in rare cases, strange behavior (like treating one note of a chord differently from the others).
However, the reversibility of the operations (no undo mechanism is required, by construction) and the lightweight nature of the model (complexity is linear in the number of segmentation points) make it worth using in many cases, as it clearly improves on naive editing.

Acknowledgements

We thank Jonathan Donier for fruitful remarks and comments.

References

[1] Bellman, R. Dynamic Programming, 1st ed. Princeton University Press, Princeton, NJ, USA, 1957.
[2] Briot, J.-P., Hadjeres, G., and Pachet, F. Deep learning techniques for music generation: a survey. arXiv preprint arXiv:1709.01620 (2017).
[3] Briot, J.-P., and Pachet, F. Deep learning for music generation: challenges and directions. Neural Computing and Applications (2018), 1–13.
[4] Dannenberg, R. B. Music Representation Issues, Techniques, and Systems. Computer Music Journal 17, 3 (1993), 20–30.
[5] Hadjeres, G., Pachet, F., and Nielsen, F. DeepBach: a Steerable Model for Bach Chorales Generation. In International Conference on Machine Learning (2017), pp. 1362–1371.
[6] Knuth, D. E. Dancing links. eprint arXiv:cs/0011047 (Nov. 2000).
[7] Papadopoulos, A., Pachet, F., Roy, P., and Sakellariou, J. Exact sampling for regular and Markov constraints with belief propagation. In Principles and Practice of Constraint Programming (Cham, 2015), G. Pesant, Ed., Springer International Publishing, pp. 341–350.
[8] Papadopoulos, A., Roy, P., and Pachet, F. Assisted Lead Sheet Composition Using FlowComposer. In Principles and Practice of Constraint Programming (Cham, 2016), M. Rueher, Ed., Springer International Publishing, pp. 769–785.
[9] Raffel, C. The Lakh MIDI Dataset v0.1, 2016.
[10] Roy, P., Pachet, F., and Carré, B. The MIDI Harmonizer. eprint arXiv:to-appear (Nov. 2018).
[11] Tesler, L. A Personal History of Modeless Text Editing and Cut/Copy-Paste. Interactions 19, 4 (July 2012), 70–75.
[12] Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499 (2016).
[13] Whittaker, S., and Amento, B. Semantic speech editing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2004), ACM, pp. 527–534.