Transcribing Lyrics From Commercial Song Audio: The First Step Towards Singing Content Processing
Che-Ping Tsai*, Yi-Lin Tuan*, Lin-shan Lee
National Taiwan University, Department of Electrical Engineering
r06922039@ntu.edu.tw, b02901048@ntu.edu.tw, lslee@gate.sinica.edu.tw
(* indicates equal contribution)

ABSTRACT

Spoken content processing (such as retrieval and browsing) is maturing, but singing content is still almost completely left out. Songs are human voice carrying plenty of semantic information, just as speech, and may be considered a special type of speech with highly flexible prosody. The various problems in song audio, for example the significantly changing phone durations over highly flexible pitch contours, make the recognition of lyrics from song audio much more difficult. This paper reports an initial attempt towards this goal. We collected music-removed versions of English songs directly from commercial singing content. The best results were obtained by a TDNN-LSTM with 3-fold speed-perturbation data augmentation plus some special approaches. The WER achieved (73.90%) was significantly lower than the baseline (96.21%), but still relatively high.

Index Terms: Lyrics, Song Audio, Acoustic Model Adaptation, Genre, Prolonged Vowels

1. INTRODUCTION

The exploding multimedia content over the Internet has created a new world of spoken content processing, for example the retrieval [1, 2, 3, 4, 5], browsing [6], summarization [1, 6, 7, 8], and comprehension [9, 10, 11, 12] of spoken content. On the other hand, a huge part of multimedia content has not yet been taken care of: singing content, i.e., content whose audio includes songs. Songs are human voice carrying plenty of semantic information, just as speech. It would be highly desirable if the huge quantities of singing content could be similarly retrieved, browsed, summarized, or comprehended by machine based on the lyrics, just as speech is. For example, it would be highly desirable if song retrieval could also be achieved based on the lyrics.

Singing voice can be considered a special type of speech with highly flexible and artistically designed prosody: the rhythm as artistically designed duration, pause, and energy patterns; the melody as artistically designed pitch contours with a much wider range; and the lyrics as artistically authored sentences to be uttered by the singer. Transcribing lyrics from song audio is therefore an extended version of automatic speech recognition (ASR) that takes these differences into account.

On the other hand, singing voice and speech differ widely in both acoustic and linguistic characteristics. Singing signals are often accompanied by extra music and harmony, which are noise for recognition. The highly flexible pitch contours with a much wider range [13, 14] and the significantly changing phone durations in songs, including prolonged vowels [15, 16] over smoothly varying pitch contours, create many problems not present in speech. The falsetto in singing voice may be an extra type of human voice not present in normal speech. Regarding linguistic characteristics [17, 18], word repetition and meaningless words (e.g. "oh") frequently appear in the artistically authored lyrics.

Applying ASR technologies to singing voice has been studied for a long time, yet not much work has been reported, probably because the recognition accuracy has remained relatively low compared to the experience with speech.
But such low accuracy is actually natural considering the various difficulties caused by the significant differences between singing voice and speech. An additional major problem is probably the lack of a singing voice database, which has pushed researchers to collect their own closed datasets [13, 16, 18] and made it difficult to compare results across different works.

Having the language model learned from a dataset of lyrics is definitely helpful [16, 18]. Hosoya et al. [17] achieved this with a finite state automaton, and Sasou et al. [13] actually prepared a language model for each song. To cope with the acoustic characteristics of singing voice, Sasou et al. [13, 15] proposed an AR-HMM to take care of high-pitched sounds and prolonged vowels, while more recently Kawai et al. [16] handled prolonged vowels by extending the vowel parts in the lexicon, both achieving good improvements. Adaptation from models trained with speech is attractive, and various approaches were compared by Mesaros et al. [19].

In this paper, we wish our work to be compatible with as much available singing content as possible, so in this initial effort we collected about five hours of music-removed versions of English songs directly from commercial singing content on YouTube; the descriptive term "music-removed" implies that the background music has been removed in some way. Because many very impressive earlier works were based on Japanese songs [13, 14, 15, 16, 17], direct comparison is difficult. We analyzed various approaches with HMMs, deep learning with data augmentation, and acoustic adaptation at the fragment, song, singer, and genre levels, primarily based on fMLLR [20]. We also trained the language model with a corpus of lyrics, modified the pronunciation lexicon, and increased the self-loop transition probabilities of the HMMs for prolonged vowels. Initial results are reported.

2. DATABASE

2.1. Acoustic Corpus

To make our work easier and compatible with more available singing content, we collected 130 music-removed (or vocal-only) English songs from www.youtube.com so as to consider only the vocal line (samples of our collected data: https://youtu.be/QA6x9MLgsc8). The music-removing processes were conducted by the video owners, and the resulting audio contains the original vocal recordings by the singers and vocal elements intended for remix purposes.

After an initial test with a speech recognition system trained on LibriSpeech [21], we dropped 20 songs whose WERs exceeded 95%. The remaining 110 pieces of music-removed commercial English popular songs were produced by 15 male singers, 28 female singers, and 19 groups, where "group" means more than one singer. No further preprocessing was performed, so the data preserves many characteristics of vocals extracted from commercial polyphonic music, such as harmony, scat, and silent parts. Some pieces also contain overlapping verses and residual background music, and some frequency components may be truncated. Below, this database is referred to as the vocal data.

These songs were manually segmented into fragments with durations ranging from 10 to 35 seconds, primarily at the ends of verses. We then randomly divided the vocal data by singer and split it into training and testing sets, giving a total of 640 fragments in the training set and 97 fragments in the testing set; the singers in the two sets do not overlap. The details of the vocal data are listed in Table 1.

                # songs   # singers   pop    electronic   rock   hiphop   R&B/soul   total
Training set       95         49      202.2      85.8     51.1    30.0      87.5     271
Testing set        15         13       20.3      22.0     17.7     8.4       9.1      42.8

Table 1. Information of the training and testing sets of the vocal data. All lengths (per genre and total) are measured in minutes.
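The singer-disjoint split described above can be expressed in a few lines of code. The sketch below is only illustrative: it assumes a hypothetical list pairing each fragment with a singer ID, and the test ratio and random seed are placeholders; the paper does not specify the exact splitting procedure.

    import random
    from collections import defaultdict

    def split_by_singer(fragments, test_ratio=0.15, seed=0):
        """fragments: list of (fragment_id, singer_id) pairs (hypothetical format).
        Returns singer-disjoint training and testing fragment lists."""
        by_singer = defaultdict(list)
        for frag_id, singer_id in fragments:
            by_singer[singer_id].append(frag_id)
        singers = sorted(by_singer)
        random.Random(seed).shuffle(singers)          # shuffle singers, not fragments
        n_test = max(1, round(test_ratio * len(singers)))
        test_singers = set(singers[:n_test])
        train = [f for s in singers[n_test:] for f in by_singer[s]]
        test = [f for s in test_singers for f in by_singer[s]]
        return train, test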
Because the music genre may affect the singing style and the audio (for example, hiphop has rap parts and rock has shouting vocals), we obtained five frequently observed genre labels for the vocal data from Wikipedia [22]: pop, electronic, rock, hiphop, and R&B/soul. The details are also listed in Table 1; note that a song may belong to multiple genres.

To train the initial speech models to be adapted to singing voice, we used 100 hours of English clean speech data from LibriSpeech.

2.2. Linguistic Corpus

In addition to the dataset from LibriSpeech (803M words, 40M sentences), we collected 574k pieces of lyrics text (129.8M words in total) from lyrics.wikia.com, a lyrics website. The lyrics were normalized by removing punctuation marks and unnecessary words (such as [CHORUS]), and the lyrics of the songs in our vocal data were removed from this dataset.

3. RECOGNITION APPROACHES AND SYSTEM STRUCTURE

Fig. 1 shows the overall structure, based on Kaldi [23], for training the acoustic models used in this work. The right-most block is the vocal data, and the series of blocks on the left are the feature extraction processes over the vocal data. Features I, II, III, and IV represent four different versions of features used here. For example, Feature IV was derived by splicing Feature III with 4 left-context and 4 right-context frames, Feature III was obtained by performing an fMLLR transformation over Feature II, and Feature I was mean and variance normalized, etc.

[Fig. 1: The overall structure for training the acoustic models.]

The second series of boxes from the right are forced alignment processes performed over the various versions of features of the vocal data, with the results denoted as Alignments a, b, c, d, e. For example, Alignment a is the forced alignment obtained by aligning Feature I of the vocal data with the LibriSpeech SAT triphone model (denoted as Model A at the top middle).

The series of blocks in the middle of Fig. 1 are the different versions of trained acoustic models. For example, Model B is a monophone model trained with Feature I of the vocal data based on Alignment a; Model C is very similar, except that it is based on Alignment b, which is obtained with Model B, etc.

Four further sets of models, E, F, G, and H, follow: Model E includes Models E-1, 2, 3, 4, while Models F, G, and H include F-1, 2, G-1, 2, 3, and H-1, 2, 3. Take Model E-4, with fragment-level adaptation, as the example within Model E. Here every fragment of a song (10-35 sec long) was used to train a distinct fragment-level fMLLR matrix, with which Feature III was obtained. Using all these fragment-level fMLLR features, a single Model E-4 was trained with Alignment d. Models E-1, 2, 3 were trained similarly at the genre, singer, and song levels. The fragment-level Model E-4 turned out to be the best within Model E in the experiments.

3.1. DNN, BLSTM and TDNN-LSTM

The deep learning models (Models F, G, H) are based on Alignment e, produced by the best GMM-HMM model. Models F-1 and F-2 are a regular DNN and a multi-target DNN, respectively; in the latter, the LibriSpeech phonemes and the vocal data phonemes are taken as two separate targets. The multi-target model tries to adapt the speech model to the vocal model, with the first several layers shared and the final layers separated.
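A minimal sketch of this shared-trunk, two-head arrangement is given below in PyTorch-style code. The hidden dimension, the number of shared layers, and the output target counts are assumptions for illustration only; the paper does not specify these details for Model F-2.

    import torch.nn as nn

    class MultiTargetDNN(nn.Module):
        """Lower layers shared; separate output layers for the two target sets."""
        def __init__(self, feat_dim, hidden_dim, n_speech_targets, n_vocal_targets,
                     n_shared=3):
            super().__init__()
            layers, in_dim = [], feat_dim
            for _ in range(n_shared):                    # shared trunk
                layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
                in_dim = hidden_dim
            self.shared = nn.Sequential(*layers)
            self.speech_head = nn.Linear(hidden_dim, n_speech_targets)  # LibriSpeech targets
            self.vocal_head = nn.Linear(hidden_dim, n_vocal_targets)    # vocal data targets

        def forward(self, x, task="vocal"):
            h = self.shared(x)
            return self.speech_head(h) if task == "speech" else self.vocal_head(h)

During training, minibatches drawn from the speech data and from the vocal data would update the corresponding head together with the shared trunk.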
Data augmentation with speed perturbation [24] was implemented in Models G and H to increase the quantity of training data and to deal with the problem of changing singing rates. For 3-fold augmentation, two extra copies of the training data were obtained by modifying the audio speed by factors of 0.9 and 1.1; for 5-fold, the speed factors were empirically set to 0.9, 0.95, 1.05, and 1.1. 1-fold means the original training data only.

Models G-1, 2, 3 used projected LSTMs (LSTMP) [25] with 40-dimensional MFCCs and 50-dimensional i-vectors, with an output delay of 50 ms. BLSTMs were used at 1-fold, 3-fold, and 5-fold. Models H-1, 2, 3 used TDNN-LSTMs [26], also at 1-fold, 3-fold, and 5-fold, with the same features as Model G.

3.2. Special Approaches for Prolonged Vowels

Considering the many errors caused by the frequently appearing prolonged vowels in song audio, we considered the two approaches below, illustrated in Fig. 2.

[Fig. 2: Approaches for prolonged vowels: (a) extended lexicon (vowels can be repeated or not); (b) increased self-loop transition probabilities (transition probabilities to the next state reduced by the factor r).]

3.2.1. Extended Lexicon

The previously proposed approach [16] was adopted here, as shown by the example in Fig. 2(a). For the word "apple", each vowel within the word (but not the consonants) can either be repeated or not, so for a word with n vowels, 2^n pronunciations become possible. In the experiments below, we only did this for words with n <= 3.

3.2.2. Increased Self-loop Transition Probabilities

This is shown in Fig. 2(b). Assume a vowel HMM has m+1 states (including an end state), and let the original self-loop probability of state i be 1 - p_i and the probability of transitioning to the next state be p_i, for i = 1, 2, ..., m. We increased the self-loop transition probabilities by replacing each p_i with r * p_i (with r < 1), so the self-loop probability of state i becomes 1 - r * p_i. This was done for the vowel HMMs only, not for the consonants.
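To make these two tricks concrete, the sketch below generates the 2^n repeated-vowel pronunciation variants for a lexicon entry and rescales the forward transition probabilities of a vowel HMM by the factor r. The ARPAbet-style vowel set, the example pronunciation of "apple", and the representation of the transitions as a simple list of forward probabilities are assumptions for illustration; the actual Kaldi lexicon and HMM topology formats are not reproduced here.

    from itertools import product

    # assumed ARPAbet-style vowel inventory (stress digits stripped before lookup)
    VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
              "IH", "IY", "OW", "OY", "UH", "UW"}

    def extended_pronunciations(phones, max_vowels=3):
        """Return all variants in which each vowel appears once or twice (2^n total)."""
        vowel_pos = [i for i, p in enumerate(phones) if p.rstrip("012") in VOWELS]
        if len(vowel_pos) > max_vowels:          # only applied to words with n <= 3 vowels
            return [list(phones)]
        variants = []
        for repeats in product((1, 2), repeat=len(vowel_pos)):
            out, k = [], 0
            for i, p in enumerate(phones):
                if i in vowel_pos:
                    out.extend([p] * repeats[k]); k += 1
                else:
                    out.append(p)
            variants.append(out)
        return variants

    def rescale_forward_probs(forward_probs, r=0.9):
        """Replace each forward probability p_i by r*p_i; self-loops become 1 - r*p_i."""
        return [r * p for p in forward_probs]

    # e.g. "apple" AE1 P AH0 L has two vowels, so 2^2 = 4 pronunciation variants
    print(extended_pronunciations(["AE1", "P", "AH0", "L"]))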
4. EXPERIMENTS

4.1. Data Analysis

4.1.1. Language Model (LM) Statistics

We analyzed the perplexity and out-of-vocabulary (OOV) rate of the two language models (trained with LibriSpeech and with lyrics, respectively) on the transcriptions of the testing set of the vocal data. Both models are 3-gram models pruned with SRILM using the same threshold. The LM trained with lyrics was found to have a significantly lower perplexity (123.92 vs. 502.06) and a much lower OOV rate (0.55% vs. 1.56%).

4.1.2. Pitch Distribution

Fig. 3 depicts the histograms of the pitch distributions for speech and for the different genders of vocal.

[Fig. 3: Histogram of pitch distribution.]

We can see that the pitch values of the vocal data are significantly higher with a much wider range, and that female singers produce slightly higher pitch values than male singers and groups.

4.2. Recognition Results

The primary recognition results are listed in Table 2.

Acoustic model                                             WER(%)   PER(%)
LibriSpeech LM
 (1)  Model A: LibriSpeech (SAT)                            96.21    87.17
 (2)  Model E-4: fragment-level                             88.26    77.18
Lyrics LM
 (3)  Model E-4: fragment-level                             80.40    68.80
Lyrics LM + extended lexicon
 (4)  Model B: Monophone                                    86.57    76.10
 (5)  Model C: Triphone                                     81.58    71.11
 (6)  Model D: Triphone                                     82.02    72.10
 (7)  Model E-4: fragment-level                             77.08    66.04
 (8)  Model E-4: fragment-level + increased trans. prob.    76.62    65.79
 (9)  Model F-1: DNN (regular)                              75.56    65.64
 (10) Model F-2: DNN (multi-target)                         75.84    65.56
 (11) Model G-1: BLSTM (1-fold)                             79.94    70.27
 (12) Model G-2: BLSTM (3-fold)                             74.32    63.86
 (13) Model G-3: BLSTM (5-fold)                             75.35    65.50
 (14) Model H-1: TDNN-LSTM (1-fold)                         79.01    69.20
 (15) Model H-2: TDNN-LSTM (3-fold)                         73.90    64.33
 (16) Model H-3: TDNN-LSTM (5-fold)                         74.53    63.70

Table 2. Word error rate (WER) and phone error rate (PER) over the testing set of the vocal data.

Word error rate (WER) is taken as the major performance measure, while phone error rate (PER) is also listed as a reference. Rows (1)(2) at the top are for the language model trained with LibriSpeech data, while rows (3)-(16) are for the language model trained with the lyrics corpus. In addition, in rows (4)-(16) the lexicon was extended with possible repetitions of vowels as explained in Subsection 3.2.1. Rows (1)-(8) are for GMM-HMMs only, while rows (9)-(16) are for DNNs, BLSTMs, and TDNN-LSTMs.

Row (1) is for Model A in Fig. 1, trained on LibriSpeech data with SAT and used together with the language model also trained on LibriSpeech. The extremely high WER (96.21%) indicates the wide mismatch between speech and song audio and the great difficulty of transcribing song audio. This is taken as the baseline of this work.

After going through the series of Alignments a, b, c, d and training the series of Models B, C, D, we obtained the best GMM-HMM model, Model E-4 (fMLLR at the fragment level), as explained in Section 3 and shown in Fig. 1. As shown in row (2) of Table 2, with the same LibriSpeech LM, Model E-4 reduced the WER to 88.26%, an absolute improvement of 7.95% (row (2) vs. (1)); this is the gain achieved by the series of GMM-HMM models alone. When we replaced the LibriSpeech language model with the lyrics language model while keeping the same Model E-4, we obtained a WER of 80.40%, an absolute improvement of 7.86% (row (3) vs. (2)); this is the gain from the lyrics language model alone. We then substituted the extended lexicon (with vowels repeated or not, as described in Subsection 3.2.1) for the normal one while using exactly the same Model E-4; the WER of 77.08% in row (7) indicates that the extended lexicon alone brought an absolute improvement of 3.32% (row (7) vs. (3)). Furthermore, the increased self-loop transition probabilities (r = 0.9) of Subsection 3.2.2 for the vowel HMMs brought a further 0.46% improvement on top of the extended lexicon (row (8) vs. (7)). These results show that prolonged vowels did cause recognition problems, and the proposed approaches did help.

Rows (4)(5)(6) for Models B, C, D show the incremental improvements obtained by training the acoustic models with the series of improved Alignments a, b, c, which led to Model E-4 in row (7).

Some preliminary tests with p-norm DNNs with varying parameters were then performed. The best results so far were obtained with 4 hidden layers and 600 and 150 hidden units for the p-norm nonlinearity [27]. Row (9) shows an absolute improvement of 1.52% for the regular DNN (row (9), Model F-1, vs. row (7)). Row (10) is for Model F-2, the multi-target DNN.

Rows (11)(12)(13) show the results of BLSTMs with the different amounts of data augmentation described in Subsection 3.1. Models G-1, 2, 3 used three layers with 400 hidden states and 100 units for the recurrent and projection layers; since the amounts of training data differed, the numbers of training epochs were 15, 7, and 5, respectively. Data augmentation brought a large improvement of 5.62% (row (12) vs. (11)), while the 3-fold BLSTM outperformed the 5-fold one by 1.03%. The trend for Model H (rows (14)(15)(16)) is the same as for Model G: 3-fold turned out to be the best. Row (15), the 3-fold TDNN-LSTM, achieved the lowest WER of 73.90%, with the architecture T130 T130 L130 T520 T520 L130 T520 T520 L130, where Tn denotes a TDNN layer of size n and Lm a forward LSTM layer with m hidden units. The WERs achieved here are still relatively high, indicating the difficulty of the task and the need for further research.
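For readers unfamiliar with the Tn/Lm notation, the sketch below stacks the T130 T130 L130 T520 T520 L130 T520 T520 L130 layers in PyTorch, treating each TDNN layer as a 1-D convolution over time. The temporal context of each TDNN layer (kernel size 3 here), the input dimension (40 MFCCs + 50 i-vector components), and the number of output targets are assumptions for illustration; the actual Kaldi configuration is not reproduced here.

    import torch.nn as nn

    class TDNNLSTM(nn.Module):
        """Rough sketch of the T130 T130 L130 T520 T520 L130 T520 T520 L130 stack."""
        def __init__(self, feat_dim=90, n_targets=3000):
            super().__init__()
            spec = [("T", 130), ("T", 130), ("L", 130),
                    ("T", 520), ("T", 520), ("L", 130),
                    ("T", 520), ("T", 520), ("L", 130)]
            self.blocks = nn.ModuleList()
            in_dim = feat_dim
            for kind, width in spec:
                if kind == "T":      # TDNN layer ~ temporal convolution (context assumed)
                    self.blocks.append(nn.Sequential(
                        nn.Conv1d(in_dim, width, kernel_size=3, padding=1), nn.ReLU()))
                else:                # forward (unidirectional) LSTM layer
                    self.blocks.append(nn.LSTM(in_dim, width, batch_first=True))
                in_dim = width
            self.output = nn.Linear(in_dim, n_targets)

        def forward(self, x):        # x: (batch, time, feat_dim)
            for block in self.blocks:
                if isinstance(block, nn.LSTM):
                    x, _ = block(x)
                else:
                    x = block(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (B, C, T)
            return self.output(x)    # per-frame acoustic scores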
4.3. Different Levels of fMLLR Adaptation

In Fig. 1, Model E includes models obtained with fMLLR at different adaptation levels, Models E-1, 2, 3, 4, but in Table 2 only Model E-4 is listed. Complete results for Models E-1, 2, 3, 4 are listed in Table 3, all with the lyrics language model and the extended lexicon.

Acoustic model                        WER(%)   PER(%)
 (1) Model E-1, genre-level            84.24    68.92
 (2) Model E-2, singer-level           78.53    68.48
 (3) Model E-3, song-level             78.80    68.24
 (4) Model E-4, fragment-level         77.08    66.04

Table 3. Model E: GMM-HMM with fMLLR at different adaptation levels, all with the lyrics language model and the extended lexicon.

Row (4) here is for Model E-4, i.e., fMLLR at the fragment level, and is exactly row (7) of Table 2. Rows (1)(2)(3) are the same as row (4), except that adaptation is at the genre, singer, and song levels. We see that the fragment level is the best, probably because a fragment (10-35 sec long) is the smallest unit and the acoustic characteristics of the signal within a fragment are almost uniform (same genre, same singer, and the same song).

4.4. Error Analysis

From the data, we found that errors frequently occurred under some specific circumstances, such as high-pitched voice, widely varying phone durations, overlapping verses (multiple people singing simultaneously), and residual background music.

[Fig. 4: Sample recognition errors produced by Model E-4 (fragment-level), as in row (7) of Table 2.]

Figure 4 shows a sample recognition result obtained with Model E-4 as in row (7) of Table 2, illustrating errors caused by high-pitched voice and overlapping verses. At first, the model successfully decoded the words "what doesn't kill you makes", but then the pitch went high and a lower-pitched harmony was added, and the recognition results went totally wrong.

5. CONCLUSION

In this paper we report initial results on transcribing lyrics from commercial song audio using different sets of acoustic models, adaptation approaches, language models, and lexicons. Techniques for the special characteristics of song audio were considered. The achieved WER is relatively high compared to experience in speech recognition. However, considering the much more difficult problems in song audio and the wide differences between speech and singing voice, the results here may serve as useful references for future work.

6. REFERENCES

[1] Lin-shan Lee, James Glass, Hung-yi Lee, and Chun-an Chan, "Spoken content retrieval - beyond cascading speech recognition with text retrieval," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 9, pp. 1389-1420, 2015.

[2] Ciprian Chelba, Timothy J. Hazen, and Murat Saraclar, "Retrieval and browsing of spoken content," IEEE Signal Processing Magazine, vol. 25, no. 3, 2008.

[3] Martha Larson, Gareth J. F. Jones, et al., "Spoken content retrieval: A survey of techniques and technologies," Foundations and Trends in Information Retrieval, vol. 5, no. 4-5, pp. 235-422, 2012.
[4] Anupam Mandal, K. R. Prasanna Kumar, and Pabitra Mitra, "Recent developments in spoken term detection: a survey," International Journal of Speech Technology, vol. 17, no. 2, pp. 183-198, 2014.

[5] Hung-Yi Lee and Lin-Shan Lee, "Improved semantic retrieval of spoken content by document/query expansion with random walk over acoustic similarity graphs," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 80-94, 2014.

[6] Lin-shan Lee and Berlin Chen, "Spoken document understanding and organization," IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 42-60, 2005.

[7] Sz-Rung Shiang, Hung-yi Lee, and Lin-shan Lee, "Supervised spoken document summarization based on structured support vector machine with utterance clusters as hidden variables," in INTERSPEECH, 2013, pp. 2728-2732.

[8] Hung-yi Lee, Yu-yu Chou, Yow-Bang Wang, and Lin-shan Lee, "Unsupervised domain adaptation for spoken document summarization with structured support vector machine," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8347-8351.

[9] Bo-Hsiang Tseng, Sheng-syun Shen, Hung-Yi Lee, and Lin-Shan Lee, "Towards machine comprehension of spoken content: Initial TOEFL listening comprehension test by machine," Interspeech 2016, pp. 2731-2735, 2016.

[10] Wei Fang, Juei-Yang Hsu, Hung-yi Lee, and Lin-Shan Lee, "Hierarchical attention model for improved machine comprehension of spoken content," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 232-238.

[11] Hung-yi Lee, Sz-Rung Shiang, Ching-feng Yeh, Yun-Nung Chen, Yu Huang, Sheng-Yi Kong, and Lin-shan Lee, "Spoken knowledge organization by semantic structuring and a prototype course lecture system for personalized learning," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 5, pp. 883-898, 2014.

[12] Sheng-syun Shen, Hung-yi Lee, Shang-wen Li, Victor Zue, and Lin-shan Lee, "Structuring lectures in massive open online courses (MOOCs) for efficient learning by linking similar sections and predicting prerequisites," in INTERSPEECH, 2015, pp. 1363-1367.

[13] Akira Sasou, Masataka Goto, Satoru Hayamizu, and Kazuyo Tanaka, "An auto-regressive, non-stationary excited signal parameter estimation method and an evaluation of a singing-voice recognition," in Acoustics, Speech, and Signal Processing, 2005. Proceedings (ICASSP '05). IEEE International Conference on. IEEE, 2005, vol. 1, pp. I-237.

[14] Dairoku Kawai, Kazumasa Yamamoto, and Seiichi Nakagawa, "Lyric recognition in monophonic singing using pitch-dependent DNN."

[15] Akira Sasou, "Singing voice recognition considering high-pitched and prolonged sounds," in Signal Processing Conference, 2006 14th European. IEEE, 2006, pp. 1-4.

[16] Dairoku Kawai, Kazumasa Yamamoto, and Seiichi Nakagawa, "Speech analysis of sung-speech and lyric recognition in monophonic singing," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 271-275.

[17] Toru Hosoya, Motoyuki Suzuki, Akinori Ito, Shozo Makino, Lloyd A. Smith, David Bainbridge, and Ian H. Witten, "Lyrics recognition from a singing voice based on finite state automaton for music information retrieval," in ISMIR, 2005, pp. 532-535.
[18] Annamaria Mesaros and Tuomas Virtanen, "Recognition of phonemes and words in singing," in Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 2146-2149.

[19] Annamaria Mesaros and Tuomas Virtanen, "Adaptation of a speech recognizer for singing voice," in Signal Processing Conference, 2009 17th European. IEEE, 2009, pp. 1779-1783.

[20] Mark J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75-98, 1998.

[21] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206-5210.

[22] Wikipedia, "Plagiarism - Wikipedia, the free encyclopedia," 2004, [Online; accessed 22-July-2004].

[23] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.

[24] Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, "Audio augmentation for speech recognition," in INTERSPEECH, 2015.

[25] Haşim Sak, Andrew Senior, and Françoise Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[26] Vijayaditya Peddinti, Yiming Wang, Daniel Povey, and Sanjeev Khudanpur, "Low latency acoustic modeling using temporal convolution and LSTMs," IEEE Signal Processing Letters, 2017.

[27] Xiaohui Zhang, Jan Trmal, Daniel Povey, and Sanjeev Khudanpur, "Improving deep neural network acoustic models using generalized maxout networks," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 215-219.

[28] Annamaria Mesaros and Tuomas Virtanen, "Automatic recognition of lyrics in singing," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, no. 1, pp. 546047, 2010.

[29] Anna M. Kruspe and IDMT Fraunhofer, "Bootstrapping a system for phoneme recognition and keyword spotting in unaccompanied singing," in 17th International Conference on Music Information Retrieval (ISMIR), New York, NY, USA, 2016.