Acoustic Modeling for Automatic Lyrics-to-Audio Alignment

Acoustic Modeling f or A utomatic L yrics-to-A udio Alignment Chitralekha Gupta, Emr e Yılmaz, Haizhou Li Department of Electrical and Computer Engineering, National Uni versity of Singapore { chitralekha, emre, haizhou.li } @nus.edu.sg Abstract Automatic lyrics to polyphonic audio alignment is a challeng- ing task not only because the vocals are corrupted by back- ground music, but also there is a lack of annotated polyphonic corpus for effecti ve acoustic modeling. In this work, we pro- pose (1) using additional speech and music-informed features and (2) adapting the acoustic models trained on a large amount of solo singing vocals towards polyphonic music using a small amount of in-domain data. Incorporating additional information such as voicing and auditory features together with conv entional acoustic features aims to bring robustness against the increased spectro-temporal variations in singing vocals. By adapting the acoustic model using a small amount of polyphonic audio data, we reduce the domain mismatch between training and testing data. W e perform sev eral alignment experiments and present an in-depth alignment error analysis on acoustic features, and model adaptation techniques. The results demonstrate that the proposed strate gy pro vides a significant error reduction of word boundary alignment over comparable existing systems, espe- cially on more challenging polyphonic data with long-duration musical interludes. Index T erms : L yrics-to-audio alignment, ASR, model adapta- tion, speech and music informed features 1. Introduction The goal of an automatic lyrics-to-audio alignment algorithm is the time synchronization between the lyrics and the singing vocals with or without background music. It potentially enables various applications such as generating karaoke scrolling lyrics, music video subtitling, and music retriev al. The task of lyrics-to-audio alignment is often seen as an e x- tension of the speech-to-te xt alignment task. ASR systems have been used to force-align lyrics to singing vocals [1–5]. Singing voice, ho wev er , cov ers a much wider range of intrinsic varia- tions than speech both in terms of timbre and fundamental fre- quencies [6]. One can reduce the mismatch between speech and singing signals by adapting the speech acoustic models with a small amount of singing data using maximum a posterior (MAP) or maximum likelihood linear regression (MLLR) [4, 5]. Mesaros et al. [4] used 49 fragments of songs, 20-30 seconds long, along with their manual transcriptions to adapt Gaussian mixture model (GMM)-hidden Markov model (HMM) speech models for singing. These studies provide a direction for solv- ing the problem of lyrics alignment in music, but they suf fer from a lack of lyrics annotated data. Kruspe [7] and Dzhambazov [8] presented systems for the lyrics alignment challenge in MIREX 2017. The acoustic mod- els in [7] were trained using 6,000 songs from the Smule’ s pub- lic solo-singing karaoke dataset called Digital Archive of Mo- bile Performances (D AMP) [9]. This dataset is collected via a karaoke app, therefore has no consistent recording condition, contains out-of-vocab ulary words, and incorrectly pronounced words because of unfamiliar lyrics [5]. Moreov er , the dataset does not hav e lyrics time annotation. Gupta et al. [5] designed a semi-supervised algorithm to au- tomatically obtain weak line-lev el lyrics annotation of a subset of approximately 50 hours of solo-singing D AMP data. 
Gupta et al. [5] designed a semi-supervised algorithm to automatically obtain weak line-level lyrics annotations for a subset of approximately 50 hours of solo-singing DAMP data. They adapted DNN-HMM speech acoustic models to singing voice with this data, which yielded a 36.32% word error rate (WER) in a free-decoding experiment on short solo-singing test phrases from the same dataset. In [10], these singing-adapted models were further enhanced to capture long-duration vowels with a duration-based lexicon modification, which reduced the WER to 29.65%. However, acoustic models trained on solo-singing data show a significant drop in performance when applied to singing vocals in the presence of background music (see the MIREX 2017 results: https://www.music-ir.org/mirex/wiki/2017:Automatic_Lyrics-to-Audio_Alignment_Results). Singing vocals are often highly correlated with the corresponding background music, resulting in overlapping frequency components [6]. The varied range of voice qualities of artists, combined with different types of musical instruments, makes lyrics alignment in polyphonic music highly challenging.

To suppress the background accompaniment, some approaches have incorporated singing voice separation as a pre-processing step [1, 4, 8, 11]. However, this step makes the system dependent on the performance of the singing voice separation algorithm, as separation artifacts may make the words unrecognizable. Moreover, it requires a separate training setup for the singing voice separation system.

Recently, multiple researchers have explored data-intensive approaches to lyrics-to-audio alignment. In MIREX 2018, Wang [12] presented a system that achieved a mean alignment error (AE) of 4.12 seconds on a standard test set for word alignment evaluation (Mauch's polyphonic dataset [2]). They used 7,300 annotated English songs from KKBOX Inc.'s music library to train GMM-HMM models. Stoller et al. [13] presented an end-to-end system based on the Wave-U-Net architecture that predicts character probabilities directly from raw audio. The system was trained on more than 44,000 songs with line-level lyrics annotations from Spotify's music library, and achieved an impressive 0.35 s mean AE on Mauch's dataset. However, end-to-end systems require a large amount of annotated polyphonic training data to perform well, as seen in [13], while publicly available acoustic resources for polyphonic music are limited.

In this study, we explore the use of additional speech- and music-informed features, along with the standard acoustic features, during acoustic model training for singing voice. In addition, we adapt an acoustic model trained on a large amount of solo-singing vocals using a limited amount of annotated polyphonic data to reduce the domain mismatch. The aim is to investigate how well content-informed features and adaptation methods capture the spectro-temporal characteristics of singing voice in polyphonic music.

2. Speech and music-informed features

Speech and singing have many similarities because they share the underlying physiological mechanisms of production, such as articulatory movements in vocal production [14, 15]. Modern ASR systems use conventional acoustic features such as mel-frequency cepstral coefficients (MFCCs) to capture the phonetic content, in conjunction with speaker representations such as i-vectors [16] to capture speaker information. These features have also been widely used for various MIR tasks such as genre classification, artist identification, and song identification [17-19].
However, the acoustic characteristics of singing and speech also differ in many ways, such as in pitch range, vibrato, and phoneme duration [20, 21]. Moreover, the presence of different kinds of musical accompaniment along with the singing vocals contributes additional frequency components to the music signal that may render the lyrics unrecognizable [6]. We hypothesize that including additional speech- and music-informed low-level descriptors in the acoustic modeling of sung lyrics will result in improved lyrics-to-audio alignment. Low-level descriptors provide discriminatory information about the temporal variations of the background music and the transitions between sung phonemes and notes, in addition to the timbral information provided by the conventional MFCC features.

The open-source feature extractor OpenSMILE (Open Speech and Music Interpretation by Large-space Extraction) [22] unites feature extraction algorithms from the speech processing and MIR communities. It provides various audio low-level descriptors (LLDs) that have been widely used for emotion recognition in speech [23], as well as for summarization [24], mood classification [25], and singing quality assessment [26] in music. In this work, we divide these features into five feature groups, namely voicing (V), energy (E), auditory (A), spectral (S), and chroma (C), as described in Table 1.

As indicated in early studies on speech-music discrimination [27, 28], the distribution of the first differential of pitch in singing voice shows a high concentration around zero delta pitch, corresponding to steady notes. A similar behavior is observed for the delta amplitude. Large changes in pitch are also observed in singing, corresponding to transitions between notes. These aspects are covered by the voicing and energy feature groups.

Singing vocals in the presence of background music or chorus are similar to speech in the presence of noise. Relative spectra (RASTA) [29] is a filtered representation of an audio signal that is robust to additive and convolutional noise; it suppresses the spectral components that change more quickly or more slowly than the typical range of speaking rates. Therefore, the auditory feature group is expected to be robust to background music and chorus.

The spectral group of features represents the "musical surface", which denotes the characteristics of music related to texture, timbre, and instrumentation, as coined by Tzanetakis et al. [30]. The statistics of the distribution of various spectral descriptors over time, such as spectral centroid, flux, and energy, represent the musical surface for pattern recognition purposes.

Chroma features have previously been used for tasks such as cover song identification and music audio classification [31]. These features consist of a 12-element vector, with each dimension representing the intensity associated with a particular musical semitone. While spectral features such as MFCCs represent the timbral characteristics, chroma features reflect the harmonic and melodic content of the music signal and have been shown to provide information independent of the spectral features [31].

Table 1: Description of the 5 acoustic feature groups.

Group ID | Feature Group | Description | #LLDs
A | Auditory | RASTA-style auditory spectrum, bands 1-26 (0-8 kHz) | 26 + deltas
E | Energy | Sum of auditory spectrum (loudness), sum of RASTA-style auditory spectrum, RMS energy, zero-crossing rate | 4 + deltas
C | Chroma | Intensities in 12 musical semitones | 12
S | Spectral | Spectral energy 250-650 Hz and 1-4 kHz; spectral roll-off points 0.25, 0.50, 0.75, 0.90; spectral flux, entropy, variance, skewness, kurtosis, slope; psychoacoustic sharpness; harmonicity | 15 + deltas
V | Voicing | F0, voicing, jitter (local, delta), shimmer, logarithmic HNR | 6 + deltas
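The paper does not specify the exact OpenSMILE configuration beyond the groups listed in Table 1. As a rough illustration of how such frame-level LLDs can be pulled and grouped, the sketch below uses the opensmile Python wrapper with the ComParE 2016 LLD set; the feature-set choice, the placeholder audio, and the keyword-based grouping are assumptions for illustration only (in particular, ComParE 2016 contains no chroma LLDs, so that group would need a separate extractor).

```python
# Illustrative LLD extraction with the opensmile Python wrapper; the feature
# set and the column-keyword grouping below are assumptions, not the authors'
# actual configuration.
import numpy as np
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,              # assumed feature set
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

# Placeholder audio: 5 seconds of noise at 16 kHz standing in for a vocal track.
sr = 16000
audio = np.random.randn(5 * sr).astype(np.float32)
llds = smile.process_signal(audio, sr)                           # pandas DataFrame: one row per frame

# Hypothetical keyword-based grouping approximating Table 1.
groups = {
    "V (voicing)":  [c for c in llds.columns if any(k in c for k in ("F0", "jitter", "shimmer", "HNR"))],
    "E (energy)":   [c for c in llds.columns if any(k in c for k in ("RMSenergy", "zcr", "loudness"))],
    "A (auditory)": [c for c in llds.columns if "audSpec" in c],  # RASTA-style auditory bands
    "S (spectral)": [c for c in llds.columns if "spectral" in c.lower()],
    # No chroma LLDs in ComParE 2016; the chroma group would come from another extractor.
}
for name, cols in groups.items():
    print(name, len(cols), "columns")
```

In the actual system, the selected LLDs (plus deltas where listed in Table 1) are appended to the conventional acoustic feature stream, as described in Section 4.3.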
3. Model adaptation for domain mismatch

Our goal is to build a framework that automatically aligns lyrics to polyphonic music audio. With an acoustic model trained on solo-singing data, we can reduce the mismatch with the test data in two ways: (a) by making the test data closer to the solo-singing acoustic models, i.e. applying vocal separation to the polyphonic test data, or (b) by adapting the acoustic models to polyphonic data. The former approach was explored in [11]. However, source separation algorithms are known to introduce artifacts in the extracted vocals, so the pipeline becomes dependent on the reliability of the source separation algorithm. In this work, we investigate the latter approach, i.e. adapting the acoustic model using a small amount of in-domain polyphonic data to reduce the domain mismatch. Model adaptation is achieved by initializing the hidden layers with the neural network acoustic model trained on the solo-singing data and retraining this model with extra forward-backward passes over the available polyphonic training data for a small number of epochs, possibly with a smaller learning rate.

As discussed earlier, acoustic modeling of singing vocals in the presence of background music is constrained by a lack of lyrics-annotated data. Recently, the multimodal DALI dataset [32] was introduced, consisting of 5,000+ polyphonic songs with note annotations and weak word-level, line-level, and paragraph-level lyrics annotations. It was created from a set of initial manual annotations of time-aligned lyrics made by non-expert users of karaoke games, for which the audio was not available. The corresponding audio candidates were then retrieved from the web, and an iterative method for obtaining a large-scale lyrics-annotated polyphonic music dataset was proposed. However, the reliability of these lyrics annotations has not been verified. The authors have released 105 songs as ground-truth data, for which the annotations were manually checked and corrected. In this work, we use this ground-truth data for domain adaptation.
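The adaptation described above is carried out inside the Kaldi nnet3 framework on the TDNN-F model of Section 4.2. As a framework-agnostic sketch of the same idea (warm-start from the solo-singing model, then a small number of extra passes over the limited polyphonic set, optionally with a reduced learning rate), here is a minimal PyTorch-style example; the model architecture, dimensions, checkpoint names, and data are placeholders, not the paper's actual recipe.

```python
# Minimal sketch of acoustic-model adaptation by continued training on a
# small in-domain (polyphonic) set. Placeholder model and data; the actual
# system is a Kaldi nnet3 CNN-TDNN-F recipe.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

FEAT_DIM, NUM_PDFS = 294, 3000          # assumed: 140 baseline + 154 LLD dims; senone count is a placeholder

acoustic_model = nn.Sequential(          # stand-in for the TDNN-F stack
    nn.Linear(FEAT_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, NUM_PDFS),
)
# Warm start: in practice, load the weights trained on solo-singing (DAMP) data, e.g.
# acoustic_model.load_state_dict(torch.load("solo_singing_model.pt"))

# Small polyphonic adaptation set (placeholder tensors: frame features and senone targets).
feats = torch.randn(10000, FEAT_DIM)
targets = torch.randint(0, NUM_PDFS, (10000,))
loader = DataLoader(TensorDataset(feats, targets), batch_size=256, shuffle=True)

base_lr = 1e-3                           # placeholder; the paper tunes LR vs. 0.5*LR (Table 5)
optimizer = torch.optim.SGD(acoustic_model.parameters(), lr=base_lr)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(1):                   # best setting reported: one extra epoch at the same initial LR
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(acoustic_model(x), y)
        loss.backward()
        optimizer.step()

torch.save(acoustic_model.state_dict(), "polyphonic_adapted_model.pt")
```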
4. Experimental setup

We conduct two sets of experiments to study the impact of the proposed acoustic modeling strategies on lyrics alignment: (1) we first assess the effect of the speech- and music-informed features on lyrics alignment in solo singing, and (2) we then investigate the effect of these features on lyrics alignment in polyphonic music, along with the model adaptation techniques. In this section, we detail the datasets used in the experiments, the acoustic model architecture, the system configurations, and the evaluation metrics for assessing the quality of the boundaries.

4.1. Datasets

All datasets used in the experiments are summarized in Table 2. The training data for solo-singing acoustic modeling is approximately 50 hours of the DAMP dataset [5, 9] with weak line-level lyrics transcriptions. We use the DALI ground-truth data for domain adaptation of the acoustic models to polyphonic music. It consists of 99 songs (of the 105 ground-truth songs, the audio links for 6 were not accessible from Singapore), which we divided into train, development (dev), and test sets in the ratio 70:9:20. We evaluate our alignment systems on two datasets: 7 songs from Hansen's a cappella and polyphonic datasets [33] (the word boundary ground truth of the songs "clocks" and "i kissed a girl" was not accurate, so they are excluded from this study), and 20 songs from Mauch's polyphonic dataset [2]. These datasets were used in the MIREX lyrics alignment challenges of 2017 and 2018 and consist of Western pop songs with manually annotated word-level boundaries. We tune our model adaptation scheme on the DALI-dev set, and also report alignment results on the DALI-test set.

Table 2: Dataset description (solo: solo singing; poly: singing mixed with music).

Name | Audio type | Content | Lyrics ground truth | Avg word length (s) / # words
Training/adaptation data:
DAMP train [5] | solo | 35,662 lines | line-level weak transcription | -
DALI train [32] | poly | 70 songs | word- and line-level boundaries | -
Test data:
DAMP test [5] | solo | 1,697 lines | line-level transcription | -
Hansen-solo [33] | solo | 7 songs | word-level boundaries | 0.485 / 2,212
Hansen-poly [33] | poly | 7 songs | word-level boundaries | 0.485 / 2,212
Mauch-poly [2] | poly | 20 songs | word-level boundaries | 0.871 / 5,052
DALI dev [32] | poly | 9 songs | word- and line-level boundaries | 0.471 / 2,305
DALI test [32] | poly | 20 songs | word- and line-level boundaries | 0.442 / 5,260

4.2. ASR architecture

The ASR system used in these experiments is trained using the Kaldi ASR toolkit [34]. A context-dependent GMM-HMM system with 40k Gaussians is trained on 39-dimensional MFCC features (including deltas and delta-deltas) to obtain the alignments for neural network training. The frame length and frame shift are 25 ms and 10 ms, respectively. A factorized time-delay neural network (TDNN-F) model [35] with additional convolutional layers (2 convolutional and 10 time-delay layers followed by a rank-reduction layer) is trained according to the standard Kaldi recipe (version 5.4). An augmented version of the solo-singing training data described in Section 4.1 is created by reducing (x0.9) and increasing (x1.1) the speed of each utterance [36]; this augmented data is used for training the neural network acoustic model. The default hyperparameters provided in the standard recipe are used, and no hyperparameter tuning is performed during acoustic model training. The baseline acoustic model is trained using 40-dimensional MFCCs as acoustic features, combined with i-vectors [37]. During neural network training [38], the frame subsampling rate is set to 3, giving an effective frame shift of 30 ms. A duration-based modified pronunciation lexicon is employed, as detailed in [10].

4.3. System configurations

The baseline acoustic model (C1) is trained on the solo-singing DAMP subset-train with 40-dimensional MFCCs and 100-dimensional i-vectors. To test the performance of the additional features, extracted using the OpenSMILE toolkit [22], we append the five feature groups, with a total dimension of 154, to the 140-dimensional baseline feature vector (C2). We also analyse the contribution of each feature group by appending only one feature group at a time, e.g. C2-V, C2-A, C2-E, etc. We further adapt the baseline model with vocal-extracted DALI-train data (C3, C4) and with polyphonic DALI-train data (C5, C6). We use a state-of-the-art implementation of Wave-U-Net based audio source separation [39] for vocal extraction from the polyphonic audio. The configurations are summarized in Table 3.

Table 3: System configurations. Baseline acoustic models are trained on the DAMP subset-train (Table 2). A, E, C, S, V are the feature group IDs from Table 1.

Config | Adaptation data | Features
C1 | - | MFCC, i-vectors
C2 | - | MFCC, i-vectors, AECSV
C3 | vocal-extracted DALI | MFCC, i-vectors
C4 | vocal-extracted DALI | MFCC, i-vectors, AECSV
C5 | polyphonic DALI | MFCC, i-vectors
C6 | polyphonic DALI | MFCC, i-vectors, AECSV
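As an illustration of the input dimensionalities implied by configurations C1 and C2 (40-dimensional MFCCs, 100-dimensional i-vectors, and 154 additional LLD dimensions), the following NumPy sketch assembles per-frame feature vectors. The arrays are random placeholders, and exact frame alignment between the MFCC and LLD streams is assumed; in the actual system this combination happens inside the Kaldi pipeline.

```python
# Illustrative per-frame input assembly for configurations C1 and C2.
import numpy as np

num_frames = 1200                        # e.g., ~12 s of audio at a 10 ms frame shift

mfcc = np.random.randn(num_frames, 40)   # conventional acoustic features
ivector = np.random.randn(100)           # utterance-level speaker/channel embedding
lld = np.random.randn(num_frames, 154)   # five OpenSMILE feature groups (Table 1)

# The i-vector is constant over the utterance, so it is tiled across frames.
ivector_frames = np.tile(ivector, (num_frames, 1))

baseline_input = np.hstack([mfcc, ivector_frames])        # C1: 140 dims per frame
augmented_input = np.hstack([mfcc, ivector_frames, lld])  # C2: 294 dims per frame
print(baseline_input.shape, augmented_input.shape)        # (1200, 140) (1200, 294)
```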
4.4. Evaluation metrics

Mean AE is the absolute deviation, in seconds, of the predicted word start times from the true word start times, averaged over all words in a dataset. Previous studies have reported this metric, but the mean AE is affected drastically by outliers. Therefore, to gauge the distribution of alignment errors, we also report the median (Med.) and standard deviation (Std.) of the absolute boundary errors. Moreover, we measure the percentage of hypothesized word boundaries that lie within an acceptable tolerance interval around the ground-truth boundary (%Correct, or %C). Observing the range of average word durations in Table 2, we set this tolerance to approximately half the average word duration, i.e. we report the percentage of word-start boundaries within 250 ms of the ground truth.
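A compact sketch of these metrics, assuming the hypothesized and reference word-start times have already been paired by word index:

```python
# Word-boundary alignment metrics: absolute error between hypothesized and
# reference word-start times, summarized by mean, median, standard deviation,
# and %Correct within a tolerance (250 ms in this paper).
import numpy as np

def boundary_metrics(hyp_starts, ref_starts, tolerance=0.25):
    """hyp_starts/ref_starts: word-start times in seconds, paired by word index."""
    errors = np.abs(np.asarray(hyp_starts) - np.asarray(ref_starts))
    return {
        "mean_AE": float(errors.mean()),
        "median_AE": float(np.median(errors)),
        "std_AE": float(errors.std()),
        "pct_correct": 100.0 * float((errors <= tolerance).mean()),
    }

# Toy usage with made-up boundary times (seconds).
ref = [0.50, 1.20, 2.05, 3.40, 5.10]
hyp = [0.48, 1.10, 2.60, 3.45, 5.90]
print(boundary_metrics(hyp, ref))
```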
5. Results and discussion

5.1. Performance on solo singing

In the first set of experiments, we explore the effect of each of the speech- and music-informed feature groups combined with MFCCs and i-vectors. The alignment results obtained with different feature configurations on Hansen's solo-singing dataset are shown in Table 4. Training the solo-singing acoustic models with the additional features reduces the average boundary error from 200 ms to 130 ms, while the standard deviation and %C also improve. We also observe that the auditory and spectral feature groups individually contribute to the improved performance. Many songs in this solo-singing dataset contain chorus sections, where other singers and the main singer may sing different lyrics at the same time; the robust RASTA features in the auditory group are observed to be helpful in such cases. Moreover, the individual groups perform worse than their combination, which implies that the groups provide complementary information.

Table 4: Mean AE performance on Hansen's solo-singing data with models trained on DAMP solo-singing data. The median of absolute boundary errors in all cases in this table is 0.03 s.

Config | Mean (s) | Std. (s) | %C
C1 | 0.20 | 0.75 | 91.5
C2 | 0.13 | 0.63 | 94.1
C2-A | 0.17 | 0.95 | 92.3
C2-E | 0.30 | 1.73 | 91.7
C2-C | 0.32 | 1.84 | 90.7
C2-S | 0.24 | 1.36 | 92.7
C2-V | 0.48 | 1.75 | 87.6

5.2. Performance on polyphonic audio

To reduce the domain mismatch between the solo-singing acoustic models and the polyphonic test data, we adopt three approaches: (a) vocal extraction of the polyphonic test data, as done in previous studies [4, 7, 11], (b) adapting the models with vocal-extracted polyphonic data, and (c) adapting the models with polyphonic data. We use DALI-train for adaptation, and DALI-dev to optimize the alignment performance (mean AE) by adjusting the initial learning rate (LR) and the number of epochs, as shown in Table 5. We choose the setting that performs the adaptation with the same initial LR within a single epoch, as it gives the best performance on the development set. The best result reported on the DALI-test set is also obtained with this setting. Note that the DALI-test data consists of short lines or utterances of 3-10 s, unlike the other test sets, in which the entire song of 2-3 minutes is given to the system. The short utterance duration of the DALI-test set results in relatively smaller mean AE values.

Table 5: Mean AE (s) for various adaptation configurations (LR: same initial learning rate; 0.5LR: half of the initial learning rate).

Dataset | C1 | LR, 1 epoch | LR, 2 epochs | LR, 3 epochs | 0.5LR, 1 epoch | 0.5LR, 2 epochs | 0.5LR, 3 epochs
DALI-dev | 0.288 | 0.170 | 0.182 | 0.173 | 0.171 | 0.198 | 0.201
DALI-test | 0.343 | 0.159 | 0.162 | 0.163 | 0.156 | 0.176 | 0.174

5.2.1. On vocal-extracted polyphonic test data

Table 6 summarizes the performance of the solo-singing models (C1, C2) and the models adapted with extracted vocals (C3, C4), with and without the additional features, on the vocal-extracted Hansen-poly and Mauch-poly test datasets. We observe that model adaptation makes only a slight difference in performance (cf. C1, C3), while the additional features improve the performance by a large margin (cf. C1, C2). MFCC features are known to be sensitive to background noise [40], so domain adaptation with extracted vocals containing distortions and artifacts is a possible reason for the poor performance of the adapted models. The additional features, on the other hand, are designed to be robust to noise, and thus improve the performance.

Table 6: AE performance on vocal-extracted Hansen-poly and Mauch-poly data (Med./Mean/Std. in seconds).

Config | Hansen-poly: Med. / Mean / Std. / %C | Mauch-poly: Med. / Mean / Std. / %C
C1 | 0.23 / 2.33 / 5.10 / 51.4 | 1.49 / 14.31 / 22.37 / 32.8
C2 | 0.15 / 0.94 / 2.76 / 69.9 | 0.26 / 4.05 / 8.30 / 49.0
C3 | 0.82 / 6.84 / 11.92 / 41.1 | 1.61 / 12.47 / 20.39 / 34.9
C4 | 0.21 / 2.35 / 5.22 / 59.6 | 0.36 / 5.19 / 9.52 / 41.8

5.2.2. On polyphonic test data (without vocal extraction)

Table 7 shows the lyrics alignment performance of the unadapted (C1, C2) and polyphonic-data-adapted (C5, C6) acoustic models on the Hansen-poly and Mauch-poly data. The poor performance of the solo-singing models (C1, C2) on polyphonic data is expected due to the domain mismatch, but here domain adaptation (C5, C6) gives a considerable improvement in performance. A comparison of Tables 6 and 7 shows that domain adaptation without vocal extraction performs better. This suggests that domain adaptation with a small amount of polyphonic data helps the acoustic model capture the spectro-temporal variations of singing vocals in polyphonic music, which offers a simple but effective solution in scenarios with limited polyphonic singing data.

Table 7: AE performance on Hansen-poly and Mauch-poly data (Med./Mean/Std. in seconds).

Config | Hansen-poly: Med. / Mean / Std. / %C | Mauch-poly: Med. / Mean / Std. / %C
C1 | 30.10 / 36.20 / 31.85 / 14.5 | 20.33 / 39.70 / 48.55 / 10.5
C2 | 2.88 / 9.57 / 13.38 / 27.7 | 2.93 / 14.69 / 22.59 / 25.8
C5 | 0.08 / 1.82 / 5.72 / 71.8 | 0.15 / 3.78 / 9.98 / 60.9
C6 | 0.11 / 2.37 / 6.85 / 64.7 | 0.18 / 1.93 / 5.90 / 57.5
One main difference between Hansen's and Mauch's datasets is that the songs in Mauch's dataset are rich in long-duration musical interludes without singing vocals, while Hansen's dataset has only a few such interludes. We observe that the content-informed features and domain adaptation help to improve the boundaries adjacent to these long interludes. Thus, the improvement in alignment performance is more evident on Mauch's dataset than on Hansen's dataset.

Although the mean AE of the boundaries is more than a second, the median error is less than 180 ms for the best performing systems. A comparison of the boundary error distributions of C1 on extracted vocals and C6 on polyphonic data (Figure 1) shows a large shift of boundaries towards zero error for both datasets. It also shows that some hypothesized boundaries remain far from the true boundaries, which needs to be investigated in future work.

Figure 1: Comparison of word boundary alignment error distributions between C1 on extracted-vocals test data and C6 on polyphonic test data on (a) Hansen's and (b) Mauch's datasets.

5.3. Comparison with existing literature

In Table 8, we compare our best results with past studies, and find that our strategy provides better results than all previous work except the end-to-end system [13]. An end-to-end system requires a large amount of data for reliable output, which we do not have access to. Our proposed strategies show a way to fuse knowledge-driven and data-driven methods to address the problem of lyrics-to-audio alignment in a low-resourced setting.

Table 8: Comparison of mean AE (s) with the existing literature.

System | Training data | Architecture | Hansen-poly | Mauch-poly
AK [7] (MIREX 2017) | 6,000 songs (DAMP, solo) | DNN-HMM | 7.34 | 9.03
GD [8, 41] (MIREX 2017) | 6,000 songs (DAMP, solo) | DNN-HMM | 10.57 | 11.64
CW [12] (MIREX 2018) | 7,300 songs (KKBOX, poly) | GMM-HMM | 2.07 | 4.13
DS [13] (ICASSP 2019) | 44,232 songs (Spotify, poly) | U-Net based end-to-end | - | 0.35
CG [11] (ICASSP 2019) | 35,662 lines (DAMP, solo) | SAT DNN-HMM | 1.39 | 6.34
Ours | 35,662 lines (DAMP, solo) + 70 songs (DALI, poly) | CNN-TDNN-F | 0.93 (median: 0.15) | 1.93 (median: 0.18)

6. Conclusions

In this study, we discuss two strategies for improved acoustic modeling for the task of lyrics-to-audio alignment. In particular, we propose to (1) employ additional features carrying speech- and music-related information together with conventional MFCCs, and (2) adapt a solo-singing acoustic model using a small amount of in-domain polyphonic data. We validated the robustness of these features to background music and their ability to capture the spectro-temporal variations in polyphonic singing vocals. The alignment experiments demonstrate that applying the described strategies reduces the mean AE to 1.93 s on Mauch's dataset, which is better than all results reported in the MIREX lyrics alignment challenges.

7. Acknowledgments

This research is supported by the Ministry of Education, Singapore, AcRF Tier 1 NUS Start-up Grant FY2016, "Non-parametric approach to voice morphing".

8. References

[1] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno, "LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1252-1261, 2011.
[2] M. Mauch, H. Fujihara, and M. Goto, "Integrating additional chord information into HMM-based lyrics-to-audio alignment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 200-210, 2012.
[3] M. McVicar, D. P. Ellis, and M. Goto, "Leveraging repetition for improved automatic lyric transcription in popular music," in Proc. ICASSP, 2014, pp. 3117-3121.
[4] A. Mesaros and T. Virtanen, "Automatic recognition of lyrics in singing," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, p. 4, 2010.
[5] C. Gupta, R. Tong, H. Li, and Y. Wang, "Semi-supervised lyrics and solo-singing alignment," in Proc. ISMIR, 2018.
[6] M. Ramona, G. Richard, and B. David, "Vocal detection in music with support vector machines," in Proc. ICASSP. IEEE, 2008, pp. 1885-1888.
[7] A. M. Kruspe, "Bootstrapping a system for phoneme recognition and keyword spotting in unaccompanied singing," in Proc. ISMIR, 2016, pp. 358-364.
[8] G. B. Dzhambazov and X. Serra, "Modeling of phoneme durations for alignment between polyphonic audio and lyrics," in 12th Sound and Music Computing Conference, 2015, pp. 281-286.
[9] Smule Sing!, "Smule Digital Archive of Mobile Performances (DAMP)," https://ccrma.stanford.edu/damp/, 2010 (accessed March 15, 2018).
[10] C. Gupta, H. Li, and Y. Wang, "Automatic pronunciation evaluation of singing," in Proc. INTERSPEECH, 2018, pp. 1507-1511.
[11] B. Sharma, C. Gupta, H. Li, and Y. Wang, "Automatic lyrics-to-audio alignment on polyphonic music using singing-adapted acoustic models," in Proc. ICASSP. IEEE, 2019.
[12] C.-C. Wang, "MIREX 2018: Lyrics-to-audio alignment for instrument accompanied singings," in MIREX 2018, 2018.
[13] D. Stoller, S. Durand, and S. Ewert, "End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model," in Proc. ICASSP. IEEE, 2019.
[14] R. J. Zatorre and S. R. Baum, "Musical melody and speech intonation: Singing a different tune," PLoS Biology, vol. 10, no. 7, p. e1001372, 2012.
[15] S. Zhang, R. C. Repetto, and X. Serra, "Study of the similarity between linguistic tones and melodic pitch contours in Beijing opera singing," in Proc. ISMIR, 2014, pp. 343-348.
[16] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[17] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293-302, 2002.
[18] J. Park, D. Kim, J. Lee, S. Kum, and J. Nam, "A hybrid of deep audio feature and i-vector for artist recognition," arXiv preprint arXiv:1807.09208, 2018.
[19] M. Mandel and D. Ellis, "Song-level features and support vector machines for music classification," in Proc. ISMIR, 2005.
[20] H. Fujihara and M. Goto, "Lyrics-to-audio alignment and its application," in Dagstuhl Follow-Ups, vol. 3. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2012.
[21] A. Loscos, P. Cano, and J. Bonada, "Low-delay singing voice alignment to text," in Proc. ICMC, 1999.
[22] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: The Munich versatile and fast open-source audio feature extractor," in Proc. ACM Multimedia. ACM, 2010, pp. 1459-1462.
[23] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi et al., "The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism," in Proc. INTERSPEECH, 2013.
[24] F. A. Raposo, D. M. de Matos, and R. Ribeiro, "An information-theoretic approach to machine-oriented music summarization," Pattern Recognition Letters, 2019.
[25] A. Alajanki, Y.-H. Yang, and M. Soleymani, "Benchmarking music emotion recognition systems," PLOS ONE, pp. 835-838, 2016.
[26] J. Böhm, F. Eyben, M. Schmitt, H. Kosch, and B. Schuller, "Seeking the superstar: Automatic assessment of perceived singing quality," in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 1560-1569.
[27] M. J. Carey, E. S. Parris, and H. Lloyd-Thomas, "A comparison of features for speech, music discrimination," in Proc. ICASSP, vol. 1. IEEE, 1999, pp. 149-152.
[28] C. Panagiotakis and G. Tziritas, "A speech/music discriminator based on RMS and zero-crossings," IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 155-166, 2005.
[29] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, 1994.
[30] G. Tzanetakis, G. Essl, and P. Cook, "Automatic musical genre classification of audio signals," in Proc. ISMIR, 2001.
[31] D. Ellis, "Classifying music audio with timbral and chroma features," in Proc. ISMIR, 2007.
[32] G. Meseguer-Brocal, A. Cohen-Hadria, and G. Peeters, "DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm," in Proc. ISMIR, 2018.
[33] J. K. Hansen, "Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients," in 9th Sound and Music Computing Conference (SMC), 2012, pp. 494-499.
[34] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proc. ASRU, 2011.
[35] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Proc. INTERSPEECH, 2018, pp. 3743-3747.
[36] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in Proc. INTERSPEECH, 2015, pp. 3586-3589.
[37] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Proc. ASRU, 2013, pp. 55-59.
[38] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Proc. INTERSPEECH, 2016, pp. 2751-2755.
[39] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," in Proc. ISMIR, 2018.
[40] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, "An overview of noise-robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745-777, 2014.
[41] G. Dzhambazov, "Knowledge-based probabilistic modeling for tracking lyrics in music audio signals," Ph.D. dissertation, Universitat Pompeu Fabra, 2017.
