Exploring Conditioning for Generative Music Systems with Human-Interpretable Controls


Authors: Nicholas Meade, Nicholas Barreyre, Scott C. Lowe, Sageev Oore

Nicholas Meade*,1, Nicholas Barreyre*,1, Scott C. Lowe1,2, Sageev Oore1,2
1 Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
2 Vector Institute, Toronto, ON, Canada
nicholas.meade@dal.ca, nbarreyre@dal.ca, scottclowe@gmail.com, sageev@dal.ca

Abstract

Performance RNN is a machine-learning system designed primarily for the generation of solo piano performances using an event-based (rather than audio) representation. More specifically, Performance RNN is a long short-term memory (LSTM) based recurrent neural network that models polyphonic music with expressive timing and dynamics (Oore et al., 2018). The neural network uses a simple language model based on the Musical Instrument Digital Interface (MIDI) file format. Performance RNN is trained on the e-Piano Junior Competition Dataset (International Piano e-Competition, 2018), a collection of solo piano performances by expert pianists. As an artistic tool, one of the limitations of the original model has been the lack of usable controls. The standard form of Performance RNN can generate interesting pieces, but little control is provided over what specifically is generated. This paper explores a set of conditioning-based controls used to influence the generation process.

Introduction

Computational, automated, and stochastic generation of music are pursuits of long-standing interest (Hedges, 1978); recently there has been an increasing body of research interest in these subfields (Huang et al., 2019a; Dieleman, van den Oord, and Simonyan, 2018; Herremans, Chuan, and Chew, 2017; Huang et al., 2019b; Payne, 2019; Roberts et al., 2018).
In a typical auto-regressive language model, the system generates a discrete probability distribution P(event_0), samples from that distribution, and then uses its own sampled event history to condition the probability distribution over the next event to be predicted. An RNN model with a finite vocabulary is continually predicting P(event_t | event_{i<t}).

[…] the neurons will have a negative preactivation whenever c_0 = 0, rendering them entirely inactive under a ReLU activation function. Hence if the fraction of training samples where c_0 = 0 is insufficiently high, they will fail to break the symmetry before the model finds a local optimum, at which many of the neurons are permanently dead for all samples with c_0 = 0.

We found better results could be achieved simply by omitting the c_0 bit from the conditioning vector during training. This model can be conceptualised as learning a boosting procedure. A "baseline model" is learnt which is used when the control signal is unknown and the conditioning signal is set to c_i = 0 for all i > 0. But when the conditioning signal is non-zero, the weights w^(j)_ci are used to make fine-tuning improvements to the baseline model by increasing or decreasing the activation of each neuron in the first LSTM layer. As a consequence, the residual error of the baseline model is reduced when the control signal is available.

For one-hot control signals, we also found good results by using a uniform distribution across c_i when the true label was unknown (again, omitting a c_0 term).

Major-Minor Conditioning

The key of an excerpt of music is informative with regards to the pitches one would expect to dominate within the music, both in terms of number of occurrences and emphasis. By extracting the key signatures that were present in performance titles, we were able to provide a control signal to the model corresponding to the key of the piece.
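As a concrete sketch of the encodings discussed above (the helper names are ours, not from any released code), the zero-when-unknown scheme for a binary label and the uniform-when-unknown scheme for a one-hot label could look like:

```python
def encode_major_minor(label):
    """Encode a (possibly missing) major/minor label as a conditioning vector.

    Zero-when-unknown scheme: the c_0 'unknown' bit is omitted entirely,
    so an unlabelled sample contributes an all-zero conditioning vector
    and the LSTM falls back on its learnt "baseline model".
    """
    if label == "major":
        return [1.0, 0.0]
    if label == "minor":
        return [0.0, 1.0]
    if label is None:  # annotation missing: all zeros, no unknown flag
        return [0.0, 0.0]
    raise ValueError(f"unexpected label: {label!r}")


def encode_one_hot_uniform(label, classes):
    """One-hot control signal; uniform distribution when the label is unknown."""
    if label is None:
        return [1.0 / len(classes)] * len(classes)
    return [1.0 if c == label else 0.0 for c in classes]
```

In either scheme, the resulting vector would be concatenated with the event input at each timestep of the first LSTM layer.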
Ideally, we would like to condition on specific key signatures, such as A minor and C major. However, only a small fraction of the 2750 performances had key signature information within their titles, so we grouped the keys into two clusters: major keys and minor keys.

There were numerous performances which contained neither "major" nor "minor" in their title. Because of this, our first implementation was a vector with three flags: major → [1, 0, 0], minor → [0, 1, 0], unknown → [0, 0, 1]. However, as described in the previous section on encoding partially labelled control signals, this was unsuccessful. At generation time, the model was only able to generate music of comparable quality to the original, non-conditioned Performance RNN model if the control signal was set to [0, 0, 1]. This was evidence for our aforementioned conjecture that an "unknown flag" is a poor way to represent sparsely annotated data.

Tempo Keyword Conditioning

We extracted relevant keywords from the titles pertaining to the tempo of the piece, and placed them into 5 tempo groups where tempos within the same group were more or less synonymous. In addition, three expert musicians labelled the tempo of some additional pieces in our dataset. The tempos which we considered were adagio, allegretto, allegro, andante, and presto. Counts for each group are indicated in Table 2.

Tempo            Count
Adagio              57
Andante            131
Allegretto         160
Allegro            457
Presto              96
Total labelled     901
Unlabelled        1849

Table 2: Number of samples for each tempo, extracted from the titles of the pieces.

We attempted several different representations for tempo keywords. First we tried a one-hot representation over the tempo groups with two additional components: one was a flag indicating a mixed tempo and the other was a flag indicating whether the tempo was unknown.
Again, this representation's performance was underwhelming in practice, so the flags were removed and a zero vector was used for samples with unknown tempo.

Results for the latter implementation were significantly better, especially in combination with (stochastic) beam search. Most convincingly, tempo controls can be interpolated (i.e. from fast to slow) at generation time and there is clear correspondence in the tempo of the music. For example, Sample 4 demonstrates an example of generation that starts at adagio (very slow) and is conditioned on presto (very fast) at the very end. While the performances were not always pleasing to the ear, they followed the general trend of the given control.

The tempo of a piece indicates its pace or speed. Although tempo also conveys more nuanced information about the texture of the piece, broadly speaking, each tempo can be said to correspond to a certain number of beats per minute.

[Figure 2: The distributions of note density for adagio, allegretto, and presto, for both the generated samples and the ground truth.]

For evaluation purposes, as a proxy for the speed of a sample, we considered the note density — the average number of notes played per second — which can be evaluated computationally. We investigated the distribution of note densities generated by our model when conditioned on each of the tempos that occurred in the training set. We split each piece in the dataset into 30 s segments, and computed the note density for each segment.
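A minimal sketch of this measurement, assuming a simplified event representation of (time, event_type) tuples rather than Performance RNN's actual vocabulary:

```python
import math


def note_density(events, segment_len=30.0):
    """Note density of a segment: note onsets per second.

    `events` is a list of (time_seconds, event_type) tuples; only
    'note_on' events count towards the density.
    """
    onsets = sum(1 for _, kind in events if kind == "note_on")
    return onsets / segment_len


def segment_densities(events, piece_len, segment_len=30.0):
    """Split a piece into fixed 30 s segments and compute each segment's density."""
    n_segments = math.floor(piece_len / segment_len)
    densities = []
    for i in range(n_segments):
        start, end = i * segment_len, (i + 1) * segment_len
        seg = [ev for ev in events if start <= ev[0] < end]
        densities.append(note_density(seg, segment_len))
    return densities
```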
The note density was determined as the total number of note onset events in the segment, divided by its duration (30 s). For each tempo conditioning value, we generated 800 samples each of 45 s duration. Each of these samples was cropped[3] to a final length of 30 s, over which we again computed the note density. We also generated 800 samples from an unconditional ("vanilla") Performance RNN model trained without the tempo conditioning signal.

The results (shown in Figure 2) demonstrate the model learns the relationship between the tempo conditioning signal and the speed of the piece (in terms of notes per second). Furthermore, the distributions of note densities for the conditionally generated samples are similar to the ground-truth distributions from the source dataset.

Form Keyword Conditioning

Indicators of musical form were also present in the titles of many performances. The forms we extracted, and the sample counts for each, are listed in Table 3. Similarly to the tempo keywords, we used this information to condition the model during training.

Results were only obtained using the unknown-flag approach and the outcome was poor. Again, this provided evidence that such a paradigm does not work in practice. Further experiments are required to confirm our hypothesis that a zero-when-unknown encoding (or possibly some other alternative) will lead to better sounding results.

[3] We generated longer samples and then cropped them because it takes a short time for the network to settle after its initialisation.

Form             Count
Ballade             48
Dance                6
Espagnol             5
Etude              397
Fugue              156
Hungarian           18
Impromptu          155
Intermezzo          13
Mazurka              7
Polonaise           31
Prelude            219
Scherzo             67
Toccata             32
Variations         106
Waltz               61
Total labelled    1321
Unlabelled        1429

Table 3: Number of samples for each form, extracted from the titles of the pieces.

While the keywords in performance titles can be used
to develop human-interpretable controls, they come with a great deal of noise. For example, many of the MIDI files in the e-Piano Competition Dataset are actually recordings of multiple performances in sequence. What may be an accurate annotation for the first performance in the recording is often completely inaccurate for the following pieces.

Velocity Conditioning

The velocity of a note strike describes how hard a note is played. Notes with high velocity are perceived as loud, while notes with low velocities are perceived as quiet. By providing a velocity-based conditioning signal to Performance RNN, we aim to be able to control the perceived volume of generated performances. We first point out that loud passages are not simply equivalent to quieter passages but with the volume turned up, just as yelling is not simply equivalent to a loud whisper. Nor does the content stay constant: the choice of notes, the phrasing, the articulation, may all likely be distributed differently in loud passages when compared to quiet ones, just as what gets yelled is distributed differently, so to speak, from what gets whispered. Indeed, otherwise, increasing and decreasing all the velocities in a piece would have been another effective data augmentation technique.

The MIDI standard allows for velocities between 0 and 127, where 0 is the slowest possible velocity and 127 is the fastest. The representation used by Performance RNN is coarser, quantized down by a factor of 4, yielding 32 velocity bins. This provides a simpler input to the model, while still capturing most of the human-detectable difference between note velocities. To construct our velocity conditioning signal, we further quantized these bins into three approximately equipopulated groups. Roughly, these groups correspond to our perception of quiet, normal, and loud notes within a performance.
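This two-stage quantization could be sketched as follows; the group edges (0–14, 15–19, 20–31 in the 32-bin representation) are as given in the text, while the helper names are ours:

```python
def quantize_velocity(midi_velocity):
    """Quantize a raw MIDI velocity (0-127) down by a factor of 4
    into Performance RNN's 32 velocity bins (0-31)."""
    if not 0 <= midi_velocity <= 127:
        raise ValueError("MIDI velocity must be in [0, 127]")
    return midi_velocity // 4


def velocity_group(bin_index):
    """Map a 32-bin velocity to the quiet/normal/loud conditioning group."""
    if bin_index <= 14:
        return "quiet"
    if bin_index <= 19:
        return "normal"
    return "loud"
```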
Notes with velocities from 0 to 14, 15 to 19, and 20 to 31 (in the Performance RNN representation) were placed into the quiet, normal, and loud bins respectively.

To construct a control signal for the training samples, we measured the distribution of note velocities across the three bins during each training sample (each sample having a duration of approximately 30 s). This conditioning signal for each sample was thus static through each training minibatch. At generation time, the human operator can select the velocity distribution to generate from, biasing the model towards either low, medium, or high velocities as they desire. Sample 5, Sample 6, and Sample 12 have been conditioned on velocity to begin quietly, grow loud, and then end quietly.

We observed that excerpts generated by the model tended to embody a velocity distribution very similar to the control signal. To quantify how similar the distribution of velocities was, we computed the Kullback–Leibler (KL) divergence from the requested velocity distribution to the autoregressively-generated velocity distribution. We uniformly sampled 3-bin velocity distributions h = [h_x, h_y, h_z] from the 2-d plane constrained by h_x, h_y, h_z ∈ [0, 1] and h_x + h_y + h_z = 1. For each sample h^(i), we autoregressively generated a single 30 s audio clip using our model, trained as described above with a 3-bin velocity control signal, and measured the distribution of note velocities, ĥ^(i), in the associated MIDI file. We measured the KL divergence from the distribution of velocities in the control signal to the generated distribution, D_KL(h^(i) ‖ ĥ^(i)), and repeated this process 100 times. The median KL divergence was 0.023 bits, with 95% of samples falling in the range 0.001 bits to 0.168 bits. This was statistically significantly smaller than our null hypothesis of independent distributions (p < 0.001).
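The core of this evaluation can be sketched as follows; sampling the 3-bin simplex uniformly via two sorted uniform variates is a standard trick, and the function names are ours:

```python
import math
import random


def sample_simplex3(rng):
    """Uniformly sample a 3-bin distribution [h_x, h_y, h_z] with sum 1.

    Two sorted uniform variates partition [0, 1] into three uniform pieces.
    """
    a, b = sorted((rng.random(), rng.random()))
    return [a, b - a, 1.0 - b]


def kl_divergence_bits(p, q, eps=1e-12):
    """D_KL(p || q) in bits (log base 2); eps guards against zero bins."""
    return sum(pi * math.log2((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)
```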
To perform a statistical test on the median KL divergence, we independently sampled two 3-bin distributions h^(j,1) and h^(j,2) as described above and measured their KL divergence D_KL(h^(j,1) ‖ h^(j,2)). This was repeated 100 times, and we took the median over these 100 repetitions to obtain a single estimate for D_KL under the null hypothesis; this process was repeated 1000 times. Under the null hypothesis, the smallest observed value for the median D_KL was 0.335 bits.

We also constructed a temporally-dynamic velocity conditioning signal, using a 5 s forward-looking window. At each step of the model's training, the conditioning signal corresponded to the distribution of note velocities over the upcoming 5 s worth of events. However, this conditioning signal was very difficult to control when generating samples. If a static velocity distribution is used throughout generation, the lack of dynamicism (which was present during training) confuses the model and causes it to generate a near ceaseless stream of note-onsets, refusing to produce either note-off or time-shift events. We believe this failure mode is caused by the inconsistency between the history of notes input to the model (which are its own previous outputs) and the velocity signal (which does not change). During training, the model can use the recent history of notes and the changes in the velocity signal to more accurately determine which velocities to produce; however, when generating samples with a static velocity distribution this relationship breaks down. While we could implement a dynamic conditioning signal during generation, it is not clear that this would be successful if it was not also coupled to the notes generated by the model.

To increase the resolution of our control over the velocities, we also implemented a 5-class version of the static control signal.
Unlike the 3-bin variant, our 5 bins were not selected to be equipopulated. Instead we hand-selected bin edges which allowed us to capture the extremes of the distribution, and in turn, a greater degree of control at generation time. Our bins were [0, 6], [7, 14], [15, 19], [20, 23], and [24, 31], determined in the Performance RNN quantization of velocity.

Relative Position Conditioning

A 30 second excerpt taken from a piece can vary greatly depending upon where in the piece it was taken from. Beginnings often differ significantly from endings, and climaxes are often distinguishable from both. With relative-position conditioning our aim is to be able to control roughly what part of a piece a generated performance sounds like. In other words, can we generate performances that sound like the beginning or end of a performance?

Each MIDI file used to train Performance RNN is augmented and split into a series of 30 second examples. With relative-position conditioning we provide an additional signal to the model indicating what position in the original source piece a particular example was taken from. For instance, an example with an initial conditioning signal of zero would begin at the start of a piece, while an example with a signal starting at 0.90 would begin 90% through a performance. It is important to note that these signals increase within each example. As the example progresses through time, the signal increases proportionally. During generation, the control signal is increased relative to the average performance length in the dataset.

Joint Control Signals

It is also possible to condition the model on multiple control signals simultaneously. We explored the effect of conditioning the model on a pair of control signals at once, for several pairs of particular interest.
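A joint conditioning vector is simply the concatenation of the individual control signals. As a hypothetical sketch (names and signatures are ours), combining the relative-position scalar with a zero-when-unknown major/minor pair might look like:

```python
def relative_position(example_start, example_time, piece_length):
    """Relative-position signal: fraction of the source piece elapsed at the
    current point in the example; it increases as the example progresses."""
    return min((example_start + example_time) / piece_length, 1.0)


def joint_condition(rel_pos, key_pair):
    """Concatenate a relative-position scalar with a major/minor pair
    to form one joint conditioning vector."""
    return [rel_pos] + list(key_pair)
```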
We did not attempt to train a model conditioned on more than two control signals simultaneously; if the amount of metadata provided to the model becomes too large, the model will receive enough information to identify exactly which piece the training sample is from, increasing the risk of overfitting.

Relative-Position and Major/Minor Conditioning

One problem faced while conditioning on major/minor was that the control signal, derived from the title, was not representative of the entire performance. The key of the piece as stated in the title is often only accurate for the beginning (and end) of the performance. For instance, a piece written in G major may modulate to various other keys before it returns at the end to G major.

To counteract this problem, we trained a model conditioned jointly on both the key (major/minor) as indicated in the title and the relative position of the sample within the score. This allowed us to generate samples conditioned on the beginning of pieces in either major or minor, where the key signature information would be most accurate. This did not work very consistently, however.

Relative-Position and Composer

We attempted to get our system to generate both beginnings and endings in the style of certain composers[4]. In Sample 7 we can hear a clip generated to have characteristics of a Debussy-esque opening. Generally we found it quite hard to evaluate whether the outputs indeed sounded like openings or not. Endings were generally unsuccessful, although Sample 8 demonstrates an attempt at a Bach ending where one can hear the final cadence a few notes near the very beginning, but then the system kept generating material after that.

Tempo and Velocity Conditioning

Using our results from the tempo and velocity conditioned models, we combined the zero-when-unknown vector representation with the five-bin static velocity representation.
Our results, especially when combined with beam search (as described below), clearly give the user control over tempo and velocity. Nevertheless, the resulting samples often achieved their tempo and velocity settings differently than we expected. In some cases, the generated samples contained a great deal of silence.

In Sample 9, we can hear a successful example of joint tempo and velocity control, where the clip was conditioned to start quietly (low velocity) and slowly (adagio), and then become loud (high velocity) and fast (presto). Notably, from roughly 0:06–0:09 the slow part contains a run of very fast notes, but the phrasing is such that it still has an unwaveringly slow feel, while the faster part never gets nearly as fast as that run, but has a significantly faster feel (although it is not quite as fast as a typical presto).

Generation Parameters

Beam Search

In the original Performance RNN, music was generated autoregressively, with each output conditioned on the previous output. At each generation step, the output for that step is sampled from the distribution of possible outputs with probabilities equal to the likelihood values of each output as provided by the model. The logits can optionally be rescaled with a temperature parameter before the sampling step; a high temperature increases the entropy of the distribution, whereas a temperature of 0 is equivalent to selecting the most likely output at each step.

A purely autoregressive model is a greedy search, selecting the output at each single step without consideration for the future generation steps.

[4] We use the term "style" loosely here; we do not purport to be capturing the style of any of the composers at a deep level, just as many current image style transfer systems are not capturing the style of painters at a deep level.
However, sometimes it is better to select a less likely output for the current timestep in return for a payoff later of a more likely sequence overall.

One possible augmentation to this generation procedure is beam search. With beam search, our goal is to generate a series of outputs which collectively have a high joint log-likelihood. Throughout the beam search, we hold in memory n_beam options (beams) simultaneously, along with the log-likelihood of the sequence for each beam. For each beam, f_beam (branch factor) copies are made, and for each of these, n_steps outputs are autoregressively generated. Of the f_beam · n_beam options, the n_beam with the highest log-likelihood are retained. This process is repeated until the length of the beams reaches the desired length, whereupon the beam with the highest log-likelihood is selected.

We found beam search was prone to generating outputs with locally low entropy, such as repeating the same note or same two notes throughout the piece, similar to using plain autoregression with a low temperature. Intuitively, this is because generating a large number of candidate continuations and then keeping only those with maximum log-likelihood concentrates generation on the mode of the distribution, much as sampling at a low temperature does. To counteract this problem, we used a low branch factor of f_beam = 2 and a high n_steps = 240 events, a duration equivalent to approximately 6 seconds of the performance. We also chose n_beam = 8. These parameters gave good results, but were not heavily optimised and we expect they could be improved upon.

Another variant of this is stochastic beam search, which selects which beams to retain with probabilities based on their log-likelihoods. We also tried stochastic beam search (using a temperature of 1) with the same beam search parameters as above, and found this to give perceptually similar results.
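The procedure above can be sketched as follows, with a toy stand-in for the model's sampling step; `n_beam`, `f_beam`, and `n_steps` match the text, and everything else is our own naming:

```python
import math
import random


def beam_search(step_fn, n_beam=8, f_beam=2, n_steps=240, target_len=480, rng=None):
    """Beam search over an autoregressive sampler.

    step_fn(history, rng) -> (event, log_likelihood) samples one event.
    Each round, every beam is copied f_beam times, each copy is extended
    autoregressively by n_steps events, and the n_beam candidates with
    the highest joint log-likelihood are retained.
    """
    rng = rng or random.Random()
    beams = [([], 0.0)]  # (event history, joint log-likelihood)
    while len(beams[0][0]) < target_len:
        candidates = []
        for history, ll in beams:
            for _ in range(f_beam):
                h, l = list(history), ll
                for _ in range(n_steps):
                    event, step_ll = step_fn(h, rng)
                    h.append(event)
                    l += step_ll
                candidates.append((h, l))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:n_beam]
    return beams[0][0]  # highest log-likelihood sequence


# Toy usage: a stand-in "model" that emits event 0 with probability 0.9.
def toy_step(history, rng):
    event = 0 if rng.random() < 0.9 else 1
    return event, math.log(0.9 if event == 0 else 0.1)
```

Stochastic beam search differs only in the retention step: instead of keeping the top n_beam candidates deterministically, beams are sampled with probabilities derived from their (temperature-scaled) log-likelihoods.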
Discussion

The generated results mentioned above, along with additional samples, are all available at doi:10.5281/zenodo.3277294.

Some conditioning paradigms give more fine-grained influence on the outputs of the model, such as the velocity distribution. However, these are not necessarily easily interpretable by humans interfacing with the model. Meanwhile, other controls such as the tempo are more easily understood but offer less nuanced control over the behaviour of the model.

Further work is required to determine the best representation for discrete and sparsely annotated control signals. Initial experiments were often framed from a probabilistic viewpoint; when annotations were certain, we used a value of one in the respective component. However, this approach was combined with an "unknown" flag. While flags indicating the absence of a meaningful annotation are interpretable, they do not perform well in practice. Specifically, for tempo conditioning, we found that both uniformly distributing the input signal and a vector of all zeros worked better than a flag approach when annotations were not available. Further experiments should include the expected value in place of an unknown annotation.

There are numerous trade-offs that may be at play in the development and the functionality of machine learning (ML) based generative music systems. Some of these trade-offs arise from ML-related considerations, while others arise from human-computer interaction (HCI) related considerations. For example, a typical consideration in ML systems is the avoidance of overfitting; this is clearly understandable from a statistical perspective, and in our results we made efforts to present examples from models that we believe did not overfit.
But from a generation perspective, where the goal is to provide artistic tools, some overfitting might not be a particularly negative quality, depending on its particular effects, and relative to other considerations. For example, consider an auto-regressive generative model that is slightly overfit to certain training examples, i.e. musical passages, so that it occasionally recreates brief excerpts from those passages. This roughly corresponds to the notion of "quoting" other pieces and solos when improvising jazz solos. Artistically, that is not problematic at all: there are well-known solos which quote other well-known solos, and the downwards melodic run in Chopin's Fantaisie-Impromptu is verbatim identical to a run at the end of Beethoven's Moonlight Sonata. If these are the effects of overfitting, then a bit of it is not necessarily negative. Furthermore, if allowing for this can somehow provide an artistic tool with considerably more expressive user control, and indeed the user plans to be involved in the manipulation of the generated output, then relative to this criterion, the possibility of slight overfitting — resulting in occasional quoting of the training material — is an even lesser concern or possibly a benefit.

Conclusion

Interpretable controls for an LSTM-based RNN music generation system are possible. In designing such a system, the representation of control signals appears to be an important factor, especially in dealing with sparsely annotated data. There is no question that we are able to control the output of the model at generation time; however, achieving the intended musical effect still remains a challenge.

Acknowledgements

Many thanks to Ian Simon, Sander Dieleman, Douglas Eck, and to the Magenta team at Google Brain. We also thank Sidath Rankaduwa and Sonia Hellenbrand for assisting with labelling our dataset.
This work was carried out with the support of CIFAR, the Natural Sciences and Engineering Research Council of Canada (NSERC), and DeepSense.

References

Dieleman, S.; van den Oord, A.; and Simonyan, K. 2018. The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale. In 32nd Conference on Neural Information Processing Systems (NeurIPS).

Donahue, C.; Simon, I.; and Dieleman, S. 2018. Piano Genie. In NeurIPS 2018 Workshop on Machine Learning for Creativity and Design.

Hedges, S. A. 1978. Dice Music in the Eighteenth Century. Music and Letters 59(2):180-187.

Herremans, D.; Chuan, C.-H.; and Chew, E. 2017. A Functional Taxonomy of Music Generation Systems. ACM Comput. Surv. 50(5):69:1-69:30.

Huang, C.-Z. A.; Vaswani, A.; Uszkoreit, J.; Simon, I.; Hawthorne, C.; Shazeer, N.; Dai, A. M.; Hoffman, M. D.; Dinculescu, M.; and Eck, D. 2019a. Music Transformer. In International Conference on Learning Representations (ICLR).

Huang, S.; Li, Q.; Anil, C.; Bao, X.; Oore, S.; and Grosse, R. B. 2019b. TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer. In International Conference on Learning Representations (ICLR).

International Piano e-Competition. 2018. e-Piano Junior Competition, 2002-2018. http://www.piano-e-competition.com. Accessed: 2019-01-07.

Malik, I., and Ek, C. H. 2017. Neural Translation of Musical Style. In NIPS 2017 Workshop on Machine Learning for Creativity and Design.

Oore, S.; Simon, I.; Dieleman, S.; Eck, D.; and Simonyan, K. 2018. This Time with Feeling: Learning Expressive Musical Performance. Neural Computing and Applications.

Payne, C. 2019. MuseNet. https://openai.com/blog/musenet/. Accessed: 2019-05-05.

Roberts, A.; Engel, J.; Raffel, C.; Hawthorne, C.; and Eck, D. 2018. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. In International Conference on Machine Learning (ICML).
