User Specific Adaptation in Automatic Transcription of Vocalised Percussion

The goal of this work is to develop an application that enables music producers to use their voice to create drum patterns when composing in Digital Audio Workstations (DAWs). An easy-to-use and user-oriented system capable of automatically transcribing vocalisations of percussion sounds, called LVT - Live Vocalised Transcription, is presented.

Authors: Antonio Ramires, Rui Penha, Matthew E. P. Davies

António Ramires¹,² antonio.ramires@inesctec.pt
Rui Penha¹,² rui.penha@inesctec.pt
Matthew E. P. Davies¹ mdavies@inesctec.pt

¹ INESC TEC Sound and Music Computing Group, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
² Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal

Abstract

The goal of this work is to develop an application that enables music producers to use their voice to create drum patterns when composing in Digital Audio Workstations (DAWs). An easy-to-use and user-oriented system capable of automatically transcribing vocalisations of percussion sounds, called LVT - Live Vocalised Transcription, is presented.¹ LVT is developed as a Max for Live device which follows the "segment-and-classify" methodology for drum transcription, and includes three modules: i) an onset detector to segment events in time; ii) a module that extracts relevant features from the audio content; and iii) a machine-learning component that implements the k-Nearest Neighbours (kNN) algorithm for the classification of vocalised drum timbres. Due to the wide differences between distinct users' vocalisations of the same drum sound, a user-specific approach to vocalised transcription is proposed. In this approach, a given end-user trains the algorithm with their own vocalisations for each drum sound before inputting their desired pattern into the DAW. The user adaptation is achieved via a new Max external which implements Sequential Forward Selection (SFS) for choosing the most relevant features for a given set of input drum sounds. The evaluation of LVT addresses two objectives.
First, to investigate the improvement in performance with user-specific training, and second, to assess whether LVT can provide an optimised workflow for music production in Ableton Live when compared to existing drum transcription algorithms. The obtained results demonstrate that both objectives are met.

1 Introduction

The development of computers' performance capacity, and the consequent possibility of real-time Digital Signal Processing (DSP) for audio, led to the appearance of Digital Audio Workstations (DAWs), making the creation of computer music available to the general public. Following these advances, many new instruments and interfaces for creating electronic music have surfaced. With changes in music culture, music production and the way musicians work with their instruments have also changed. In other words, the ability to invent and reinvent the way music is produced is key to progress. Consequently, new proposals are necessary, such as the design of new techniques for the composition of music.

Within the genre of electronic music, sequencing drum patterns plays a critical role. However, inputting drum patterns into DAWs often requires high technical skill on the part of the user, either by physically performing the patterns by tapping them on MIDI drum pads, or by manually entering events via music editing software. For non-expert users both options can be very challenging, and can thus present a barrier to entry. However, the voice is an important and powerful instrument of rhythm production, so it can be used to express or "perform" drum patterns in a very intuitive way - so-called "beatboxing". To leverage this concept within a computational system, our goal is a system that helps users (both expert musicians and amateur enthusiasts) input the rhythm patterns they have in mind into a sequencer via the automatic transcription of vocalised percussion.
Our proposed tool is beneficial both from the perspective of workflow optimisation (by providing accurate real-time transcriptions) and as a means to encourage users to engage with technology in the pursuit of creative activities. From a technical standpoint, we seek to build on state-of-the-art techniques from the domain of music information retrieval (MIR) for drum transcription [2, 4], but actively targeted towards end-users and real-world music content production scenarios.

¹ This work is derived from the MSc dissertation of António Ramires, conducted in the Department of Electrical and Computer Engineering of the Faculty of Engineering, University of Porto.

2 Methodology

A vocalised drum transcription software, LVT, which can be trained with the user's own vocalisations, is proposed. LVT is developed as a Max for Live project - a visual programming environment, based on Max 7², which allows users to build instruments and effects for use within the Ableton Live³ DAW. To develop LVT, a dataset of vocalised percussion was compiled. A group of 20 participants (11 male, 9 female) were asked to record two short vocalised percussion tracks: one identical for all participants, and the other an improvised pattern. These input percussion tracks were recorded three times: on a low-quality laptop microphone, on an iPad microphone, and using a studio-quality microphone (AKG C4000B). All recorded audio tracks were manually annotated using Sonic Visualiser⁴, a free application for viewing and analysing the contents of music audio files. The participants spanned a wide range of experience in beatboxing (from beatboxing experts to those who had never vocalised drum patterns before) and covered a wide age range. Thus, we consider the annotated dataset to be representative of a wide range of potential users of the system, and highly heterogeneous in terms of the types of drum sounds.
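Each annotated track can then be loaded as a list of timed, labelled events. As a purely hypothetical illustration (the paper does not specify the export format), the following Python sketch assumes an annotation layer exported from Sonic Visualiser as plain text, with one tab-separated time/label pair per line:

```python
# Hypothetical parser for a ground-truth annotation layer exported as
# plain text ("<time in seconds>\t<drum label>" per line). The column
# layout is an assumption about the export settings, not a detail
# stated in the paper.

def parse_annotations(text):
    """Return a list of (time_seconds, label) tuples, one per event."""
    events = []
    for line in text.strip().splitlines():
        time_s, label = line.split("\t")
        events.append((float(time_s), label))
    return events
```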
Our proposed vocalised percussion transcription system was developed following a user-specific approach. LVT follows the "segment and classify" method for drum transcription [2] and integrates three main elements: i) an onset detector, to identify when each drum sound occurs; ii) a component that extracts features for each event; and iii) a machine learning component to classify the drum sounds. In the Max for Live environment, onset detection was performed with aubioonset~⁵. Feature extraction was performed in real time using existing Max objects: zsa.mfcc~, to characterise the timbre; zsa.descriptors [3], to provide spectral centroid, spread, slope, decrease and rolloff features; and the zerox~ object, to calculate the zero-crossing rate and number of zero crossings. The machine learning component is trained with the user's preferred vocalisations, and the features which give the best results for the provided input are selected. This is achieved using the Sequential Forward Selection (SFS) method [5] together with a k-Nearest Neighbours (kNN) classification algorithm, with the most significant features selected according to the accuracy obtained when testing on the training data (in our case, the annotated improvised patterns from each participant). SFS works by selecting the most significant feature according to a specific criterion (in this case, the classification accuracy) and adding it to an initially empty set, until there is no improvement or no features remain. The kNN algorithm was implemented using timbreID [1], and a new external for Max was developed to implement the SFS. A user interface was created in Max for Live to facilitate the use of the application by end-users. A screenshot of the LVT interface is shown in Fig. 1.
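The greedy selection loop described above can be sketched in Python as follows. This is an illustrative re-implementation under our own assumptions (leave-one-out scoring, squared Euclidean distance, k = 1), not the code of the Max external, and the helper names are hypothetical:

```python
# Sketch of Sequential Forward Selection (SFS) wrapped around a 1-NN
# classifier, mirroring how LVT picks the most discriminative features
# for one user's drum vocalisations. Assumptions: leave-one-out
# accuracy as the selection criterion, squared Euclidean distance, k=1.

def knn_accuracy(train, labels, feature_idx):
    """Leave-one-out 1-NN accuracy using only the features in feature_idx."""
    correct = 0
    for i, x in enumerate(train):
        best_dist, best_label = float("inf"), None
        for j, y in enumerate(train):
            if i == j:
                continue  # leave the query example out
            d = sum((x[f] - y[f]) ** 2 for f in feature_idx)
            if d < best_dist:
                best_dist, best_label = d, labels[j]
        correct += best_label == labels[i]
    return correct / len(train)

def sequential_forward_selection(train, labels, n_features):
    """Greedily add the feature that most improves accuracy; stop when
    no feature improves the score or none remain."""
    selected, best_acc = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        scored = [(knn_accuracy(train, labels, selected + [f]), f)
                  for f in remaining]
        acc, f = max(scored)
        if selected and acc <= best_acc:
            break  # no improvement: stop adding features
        best_acc = acc
        selected.append(f)
        remaining.remove(f)
        if best_acc == 1.0:
            break  # training data already perfectly separated
    return selected, best_acc
```

For example, given toy two-feature vectors for kick and snare vocalisations where only the first feature separates the classes, the loop selects that single feature and stops.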
It demonstrates the user-specific training stage, where a user inputs a set number of the drum timbres they intend to use, after which their vocalised percussion is transcribed and rendered as a MIDI file for subsequent synthesis.

To operate LVT, a user loads the device in Ableton Live and then vocalises the set of desired drum sounds they intend to use, e.g. five kick sounds followed by five snare sounds, followed by five hi-hat sounds. Once the expected number of drum sounds has been detected, the SFS algorithm identifies the subset of features which best separates the drum sounds for that user. After training, the user can vocalise rhythmic patterns which are automatically converted from audio to a MIDI representation in the DAW for later synthesis and editing.

² www.cycling74.com
³ http://www.ableton.com/en/
⁴ http://www.sonicvisualiser.org/
⁵ https://aubio.org/manpages/latest/aubioonset.1.html

Figure 1: User interface of the LVT device.

Table 1: Number of edit operations and F-measure for the AKG microphone.

            Edit operations           F-measure
            Modify   Add   Remove     Kick    Snare   Hi-hat
  Ableton     33      12     296      0.518   0.470   0.297
  LDT         52      24     206      0.538   0.204   0.419
  LVT         39       7      15      0.914   0.691   0.802

3 Results

The evaluation of LVT was designed to serve two purposes: first, to understand how a user-specific trained system performs against state-of-the-art drum transcription systems (which have been optimised over large datasets without any user-specific training), and second, to explore how LVT could improve a producer's workflow. We compared LVT against two existing drum transcription algorithms: LDT [4] and Ableton Live's built-in "Convert Drums to MIDI" function. For validation data we used the non-improvised vocalised patterns from our annotated dataset. To compare the accuracy of the systems we use the F-measure of the transcriptions.
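The F-measure for one drum class can be computed by matching predicted events to ground-truth events within a small tolerance window, a standard convention in onset and drum evaluation. The sketch below is a generic formulation; the 50 ms tolerance is our assumption, not a value stated in the paper:

```python
# Sketch of per-class F-measure scoring for a drum transcription.
# Each predicted event may match at most one unused ground-truth event
# within the tolerance window. The 50 ms window is an assumption.

def f_measure(predicted, truth, tolerance=0.05):
    """predicted/truth: sorted lists of event times (seconds) for one class."""
    matched = 0
    used = [False] * len(truth)
    for p in predicted:
        for i, t in enumerate(truth):
            if not used[i] and abs(p - t) <= tolerance:
                used[i] = True  # consume this ground-truth event
                matched += 1
                break
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, three correctly timed predictions against four ground-truth events give precision 1.0, recall 0.75 and an F-measure of 6/7.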
Then, to investigate how our system could improve a producer's workflow, the "effort" required to arrive at an accurate transcription was calculated by counting the number of editing operations needed to obtain the desired patterns. These operations are: to modify, to add, or to remove a MIDI note.

Table 1 summarises the results obtained by counting the total number of operations needed to obtain the desired pattern for the testing data recorded on the studio-quality AKG C4000B microphone, together with the corresponding F-measure per vocalised drum sound, for the three drum transcription systems. The results demonstrate that, for the studio-quality microphone, the vocalised drum transcription accuracy of LVT is substantially higher than that of the other systems, and far fewer modifications were required to obtain the desired patterns when editing its automatic transcriptions.

To see the effect of user-specific training on the performance of LVT, an example is provided where LVT is trained on one user and tested on another, and vice versa. When LVT is trained with a different person with different vocalisations, the accuracy of the transcription decreases, as shown in Fig. 2. The upper part of each screenshot shows the transcription of a user's vocalisations when LVT is trained with their own vocalisations, while the bottom part corresponds to the transcription when it is trained with the other user's. As can be seen, without user-specific training, many misclassifications occur.

By examining these results, we infer that LVT can provide a transcription closer to the ground truth than the existing state-of-the-art systems, as shown by the higher F-measure. In addition to LVT being trained per individual user, these results may also derive from the fact that LVT does not try to detect polyphonic events (more than one drum vocalisation at the same time) as the other systems do.
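One way to make the "effort" metric concrete is the following sketch, which counts modify/add/remove operations between a predicted and a target pattern quantised to time steps. Matching events by time step, with at most one event per step, is our reading of the procedure, not the paper's exact method:

```python
# Sketch of the edit-operation count from Section 3. Events are
# (time_step, drum) pairs; an event at the right step with the wrong
# drum label counts as "modify", a spurious event as "remove", and a
# missed event as "add". Assumes at most one event per time step.

def edit_operations(predicted, target):
    """Return (modify, add, remove) counts to turn predicted into target."""
    pred = dict(predicted)   # time_step -> drum label
    targ = dict(target)
    modify = sum(1 for t in pred if t in targ and pred[t] != targ[t])
    remove = sum(1 for t in pred if t not in targ)   # false positives
    add = sum(1 for t in targ if t not in pred)      # missed events
    return modify, add, remove
```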
Furthermore, LVT does not detect as many events as the other systems, and this has a strong influence on the number of false positives, and hence on the F-measure. The number of edit operations needed to achieve the desired transcription, presented in Table 1, shows that the end-user of the system does not have to perform as many actions when producing music, which has a positive impact on the workflow, leaving more time for creative experimentation.

Figure 2: (top) Transcription of the first user's vocalisations with LVT trained on the second user. (bottom) Transcription of the second user's vocalisations with LVT trained on the first user.

4 Conclusions

In this paper, we have presented LVT - a new interface for assistive music content creation. LVT allows Ableton Live users to sequence MIDI patterns that can be used for designing and performing rhythms with their voice. Existing state-of-the-art systems, including the one already built into Ableton Live, are not able to transcribe vocalised percussion as effectively, because these tools are trained on general recorded drum sounds which are typically not vocalised. Indeed, because different people vocalise drum sounds in different ways, LVT explicitly seeks to model and capture this behaviour via user-specific training. Our evaluation shows LVT to be very effective for a wide range of users and vocalisations, outperforming existing systems. Furthermore, we believe LVT can be applied to any kind of arbitrary non-pitched percussive sounds, provided that the training sound types are sufficiently different from one another and can thus be well separated in the audio feature space using SFS.

LVT is implemented as a Max for Live device, and thus fully integrates into Ableton Live, allowing users of all ability ranges to experiment with music sequencing driven by their own personal percussion vocalisations within an easy-to-use graphical user interface.
5 Acknowledgements

This work is financed by the ERDF - European Regional Development Fund - through the Operational Programme for Competitiveness and Internationalisation (COMPETE 2020 Programme) within project «POCI-01-0145-FEDER-006961», and by National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) - as part of project UID/EEA/50014/2013. Project TEC4Growth - Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact (NORTE-01-0145-FEDER-000020) is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).

References

[1] W. Brent. A timbre analysis and classification toolkit for Pure Data. In Proc. of ICMC, pages 224-229, 2010.
[2] O. Gillet and G. Richard. Transcription and separation of drum signals from polyphonic music. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):529-540, March 2008.
[3] M. Malt and E. Jourdan. Zsa.descriptors: a library for real-time descriptors analysis. In Proc. of 5th SMC Conference, pages 134-137, 2008.
[4] M. Miron, M. E. P. Davies, and F. Gouyon. An open-source drum transcription system for Pure Data and Max MSP. In Proc. of ICASSP, pages 221-225, May 2013.
[5] A. W. Whitney. A direct method of nonparametric measurement selection. IEEE Trans. Comput., 20(9):1100-1103, September 1971.
