The fifth CHiME Speech Separation and Recognition Challenge: Dataset, task and baselines



Jon Barker (1), Shinji Watanabe (2), Emmanuel Vincent (3), and Jan Trmal (2)

(1) University of Sheffield, UK
(2) Center for Language and Speech Processing, Johns Hopkins University, Baltimore, USA
(3) Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France

j.p.barker@sheffield.ac.uk, shinjiw@jhu.edu, emmanuel.vincent@inria.fr, jtrmal@gmail.com

Abstract

The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing, and machine learning. This paper introduces the 5th CHiME Challenge, which considers the task of distant multi-microphone conversational ASR in real home environments. Speech material was elicited using a dinner party scenario, with efforts taken to capture data that is representative of natural conversational speech, and was recorded by 6 Kinect microphone arrays and 4 binaural microphone pairs. The challenge features a single-array track and a multiple-array track and, for each track, distinct rankings will be produced for systems focusing on robustness with respect to distant-microphone capture vs. systems attempting to address all aspects of the task, including conversational language modeling. We discuss the rationale for the challenge and provide a detailed description of the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR.

Index Terms: Robust ASR, noise, reverberation, conversational speech, microphone array, 'CHiME' challenge.

1. Introduction

Automatic speech recognition (ASR) performance in difficult reverberant and noisy conditions has improved tremendously in the last decade [1-5]. This can be attributed to advances in speech processing, audio enhancement, and machine learning, but also to the availability of real speech corpora recorded in cars [6, 7], quiet indoor environments [8, 9], noisy indoor and outdoor environments [10, 11], and challenging broadcast media [12, 13]. Among the applications of robust ASR, voice command in domestic environments has attracted much interest recently, due in particular to the release of Amazon Echo, Google Home, and other devices targeting home automation and multimedia systems. The CHiME-1 [14] and CHiME-2 [15] challenges and corpora have contributed to popularizing research on this topic, together with the DICIT [16], Sweet-Home [17], and DIRHA [18] corpora. These corpora feature single-speaker reverberant and/or noisy speech recorded or simulated in a single home, which precludes the use of modern speech enhancement techniques based on machine learning. The recently released voiceHome corpus [19] addresses this issue, but the amount of data remains fairly small.

In parallel to research on acoustic robustness, research on conversational speech recognition has also made great progress, as illustrated by the recent announcements of super-human performance [20, 21] achieved on the Switchboard telephone conversation task [22] and by the ASpIRE challenge [23]. Distant-microphone recognition of noisy, overlapping, conversational speech is now widely believed to be the next frontier.
Early attempts in this direction can be traced back to the ICSI [24], CHIL [25], and AMI [26] meeting corpora, the LLSEC [27] and COSINE [28] face-to-face interaction corpora, and the Sheffield Wargames corpus [29]. These corpora were recorded using advanced microphone array prototypes which are not commercially available and, as a result, could only be installed in a few laboratory rooms. The Santa Barbara Corpus of Spoken American English [30] stands out as the only large-scale corpus of naturally occurring spoken interactions between a wide variety of people recorded in real everyday situations, including face-to-face or telephone conversations, card games, food preparation, on-the-job talk, story-telling, and more. Unfortunately, it was recorded via a single microphone.

The CHiME-5 Challenge aims to bridge the gap between these attempts by providing the first large-scale corpus of real multi-speaker conversational speech recorded via commercially available multi-microphone hardware in multiple homes. Speech material was elicited using a 4-person dinner party scenario and recorded by 6 distant Kinect microphone arrays and 4 binaural microphone pairs in 20 homes. The challenge features a single-array track and a multiple-array track. Distinct rankings will be produced for systems focusing on acoustic robustness vs. systems aiming to address all aspects of the task.

The paper is structured as follows. Sections 2 and 3 describe the data collection procedure and the task to be solved. Section 4 presents the baseline systems for array synchronization, speech enhancement, and ASR, along with the corresponding results. We conclude in Section 5.

2. Dataset

2.1. The scenario

The dataset is made up of the recordings of twenty separate dinner parties taking place in real homes. Each dinner party has four participants: two acting as hosts and two as guests. The party members are all friends who know each other well and who are instructed to behave naturally. Efforts have been taken to make the parties as natural as possible.

The only constraints are that each party should last a minimum of 2 hours and should be composed of three phases, each corresponding to a different location: i) kitchen, preparing the meal in the kitchen area; ii) dining, eating the meal in the dining area; iii) living, a post-dinner period in a separate living room area.

Participants were allowed to move naturally from one location to another, but with the instruction that each phase should last at least 30 minutes. Participants were left free to converse on any topics of their choosing. Some personally identifying material was redacted post-recording as part of the consent process. Background television and commercial music were disallowed in order to avoid capturing copyrighted content.

2.2. Audio

Each party has been recorded with a set of six Microsoft Kinect devices. The devices have been strategically placed such that there are always at least two capturing the activity in each location. Each Kinect device has a linear array of 4 sample-synchronised microphones and a camera. The raw microphone signals and video have been recorded, with each Kinect recorded onto a separate laptop computer. Floor plans were drafted to record the layout of the living space and the approximate location and orientation of each Kinect device.

In addition to the Kinects, to facilitate transcription, each participant wore a set of Soundman OKM II Classic Studio binaural microphones. The audio from these was recorded via a Soundman A3 adapter onto Tascam DR-05 stereo recorders also worn by the participants.

2.3. Transcriptions

The parties have been fully transcribed. For each speaker, a reference transcription is constructed in which, for each utterance produced by that speaker, the start and end times and the word sequence are manually obtained by listening to the speaker's binaural recording (the reference signal). For each other recording device, the utterance's start and end times are produced by shifting the reference timings by an amount that compensates for the asynchrony between devices (see Section 4.1).

The transcriptions can also contain the following tags: [noise], denoting any non-language noise made by the speaker (e.g., coughing, loud chewing); [inaudible], denoting speech that is not clear enough to be transcribed; [laughs], denoting instances of laughter; and [redacted], denoting parts of the signals that have been zeroed out for privacy reasons.
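To make the shape of these transcription entries concrete, the Python sketch below mocks up a single utterance entry. All field names, device identifiers, and offset values are hypothetical illustrations rather than the official CHiME-5 JSON schema; the point is only that each utterance carries a word sequence plus per-device start and end times obtained by shifting the reference (binaural) timing.

```python
# Hypothetical illustration of one utterance entry in a CHiME-5-style
# transcription file. Field names and values are assumptions for this
# sketch, not the official schema.
utterance = {
    "speaker": "P01",                        # speaker wearing the reference binaural mic
    "words": "shall we start on the salad [laughs]",
    "location": "kitchen",                   # dev/eval only (see Section 3.1)
    "ref": "U02",                            # reference Kinect array (dev/eval only)
    "start_time": {"P01": 125.40},           # seconds, on the reference recording
    "end_time": {"P01": 128.92},
}

# Per-device timings compensate for the estimated asynchrony between devices
# (see the array synchronization baseline). The offsets below are made up.
device_offsets = {"U01": -0.31, "U02": 0.12}
for device, offset in device_offsets.items():
    utterance["start_time"][device] = round(utterance["start_time"]["P01"] + offset, 2)
    utterance["end_time"][device] = round(utterance["end_time"]["P01"] + offset, 2)

print(utterance["start_time"])  # e.g. {'P01': 125.4, 'U01': 125.09, 'U02': 125.52}
```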
3. Task

3.1. Training, development, and evaluation sets

The 20 parties have been divided into disjoint training, development, and evaluation sets, as summarised in Table 1. There is no overlap between the speakers in each set.

Table 1: Overview of CHiME-5 datasets.

  Dataset   Parties   Speakers   Hours   Utterances
  Train     16        32         40:33   79,980
  Dev        2         8          4:27    7,440
  Eval       2         8          5:12   11,028

For the development and evaluation data, the transcription file also contains a speaker location and a 'reference' array for each utterance. The location can be 'kitchen', 'dining room', or 'living room', and the reference array (the target for speech recognition) is chosen to be one that is situated in the same area.

3.2. Tracks and ranking

The challenge features two tracks:
• single-array: only the reference array can be used to recognise a given evaluation utterance;
• multiple-array: all arrays can be used.

For each track, two separate rankings will be produced:
• Ranking A: systems based on conventional acoustic modeling and using the supplied official language model. The outputs of the acoustic model must remain frame-level tied phonetic (senone) targets, and the lexicon and language model must not be modified.
• Ranking B: all other systems, e.g., systems based on end-to-end processing or systems whose lexicon and/or language model have been modified.

In other words, ranking A focuses on acoustic robustness only, while ranking B addresses all aspects of the task.

3.3. Instructions

A set of instructions has been provided to ensure that systems are broadly comparable and that participants respect the application scenario. In particular, systems are allowed to exploit knowledge of the utterance start and end times, the utterance speaker label, and the speaker location label. During evaluation, participants can use the entire session recording from the reference array (for the single-array track) or from all arrays (for the multiple-array track), i.e., one can use the past and future acoustic context surrounding the utterance to be recognised. For training and development, participants are also provided with the binaural microphone signals and the floor plans.

Participants are forbidden from manually modifying the data or the annotations (e.g., manual refinement of the utterance timings or transcriptions). All parameters must be tuned on the training set or the development set. Participants can evaluate different versions of their system, but the final submission must be the one that performs best on the development data, and this will be ranked according to its performance on the evaluation data. While some modifications of the development set are necessarily allowed (e.g., automatic signal enhancement or refinement of utterance timings), participants have been cautioned against techniques designed to fit the development data to the evaluation data (e.g., by selecting subsets, or by systematically varying its pitch or level). Such "biased" transformations are forbidden.

The challenge has been designed to promote research covering all stages in the recognition pipeline. Hence, participants are free to replace or improve any component of the baseline system, or even to replace the entire baseline with their own system. However, the architecture of the system will determine whether a participant's result is ranked in category A or category B (see Section 3.2).

Participants will evaluate their own systems and will be asked to return overall WERs for the development and evaluation data, plus WERs broken down by session and location. They will also be asked to submit the corresponding lattices in Kaldi format to allow their scores to be validated, plus a technical description of their system.

4. Baselines

4.1. Array synchronization

While signals recorded by the same device are sample-synchronous, there is no precise synchronisation between devices, and synchronisation across devices cannot be guaranteed. The signal start times are approximately synchronised post-recording using a synchronisation tone that was played at the beginning of each recording session. However, devices can drift out of synchrony due to small variations in clock speed (clock drift) and due to frame dropping. To correct for this, a cross-correlation approach is used to estimate the delay between one of the binaural recorders, chosen as the reference, and all other devices [31]. These delays are estimated at regular 10-second intervals throughout the recording. Using the delay estimates, separate utterance start and end times have been computed for each device and are recorded in the JSON transcription files.
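The sketch below illustrates the windowed cross-correlation idea in Python with NumPy/SciPy. It is a simplified illustration rather than the official synchronization tool; the 16 kHz sampling rate, the window length, and the search range are assumptions.

```python
import numpy as np
from scipy.signal import correlate

def estimate_delays(reference, other, fs=16000, window_s=10.0, max_lag_s=1.0):
    """Estimate the delay (in samples) of `other` relative to `reference`
    at regular intervals, via windowed cross-correlation.
    A rough sketch of the approach, not the official CHiME-5 tool."""
    win = int(window_s * fs)
    max_lag = int(max_lag_s * fs)
    delays = []
    for start in range(0, min(len(reference), len(other)) - win, win):
        ref_seg = reference[start:start + win]
        oth_seg = other[start:start + win]
        # Full cross-correlation; zero lag sits at index len(ref_seg) - 1.
        xcorr = correlate(oth_seg, ref_seg, mode="full")
        centre = len(ref_seg) - 1
        # Restrict the search to +/- max_lag around zero lag.
        search = xcorr[centre - max_lag:centre + max_lag + 1]
        delays.append(int(np.argmax(search)) - max_lag)
    return np.array(delays)  # one delay estimate (samples) per 10 s window
```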
4.2. Speech enhancement

CHiME-5 uses a weighted delay-and-sum beamformer (BeamformIt [32]) as the default multichannel speech enhancement approach, similar to the CHiME-4 recipe [11]. Beamforming is performed using the four microphone signals of the reference array. The reference array information is provided by the organizers through the JSON transcription file.
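For intuition, the sketch below implements a bare-bones delay-and-sum beamformer over the four channels of one array, assuming the per-channel alignment delays have already been estimated (e.g., by cross-correlation as above). BeamformIt itself also estimates the time delays of arrival and the channel weights; this is only a minimal illustration, not the baseline code.

```python
import numpy as np

def delay_and_sum(channels, delays, weights=None):
    """Minimal delay-and-sum beamformer sketch (not BeamformIt itself).

    channels : list of 1-D numpy arrays, one per microphone of the array
    delays   : non-negative integer delays (in samples) that time-align
               each channel with the earliest one
    weights  : per-channel weights (uniform if None)
    """
    n_ch = len(channels)
    if weights is None:
        weights = np.ones(n_ch) / n_ch
    length = min(len(c) for c in channels) - max(delays)
    output = np.zeros(length)
    for ch, d, w in zip(channels, delays, weights):
        # Advance each channel by its delay so that all channels line up,
        # then accumulate the weighted sum.
        output += w * ch[d:d + length]
    return output
```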
4.3. Conventional ASR

The conventional ASR baseline is distributed through the Kaldi GitHub repository [33] (https://github.com/kaldi-asr/kaldi/tree/master/egs/chime5/s5) and is described briefly below.

4.3.1. Data preparation (stages 0 and 1)

These stages provide Kaldi-format data directories, lexicons, and language models. We use the CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) as the basic pronunciation dictionary. However, since the CHiME-5 conversations are spontaneous speech and a number of words are not present in the CMU dictionary, we use grapheme-to-phoneme (G2P) conversion based on Phonetisaurus [34] (https://github.com/AdolfVonKleist/Phonetisaurus) to provide pronunciations for these out-of-vocabulary (OOV) words.

The language model is selected automatically based on its perplexity on the training data; at the time of writing, the selected LM is a 3-gram model trained with the MaxEnt modeling method as implemented in the SRILM toolkit [35-37]. The total vocabulary size is 128K words, augmented by the G2P process mentioned above.

4.3.2. Enhancement (stage 2)

This stage calls the BeamformIt-based speech enhancement introduced in Section 4.2.

4.3.3. Feature extraction and data arrangement (stages 3-6)

These stages include MFCC-based feature extraction for GMM training, and training data preparation (250k utterances, in data/train_worn_u100k). The training data combines both left and right channels (150k utterances) of the binaural microphone data (data/train_worn) and a subset (100k utterances) of all Kinect microphone data (data/train_u100k). Note that we observed some performance improvements when using larger amounts of training data instead of the above subset. However, we have limited the size of the data in the baseline so that experiments can be run without requiring unreasonable computational resources.

4.3.4. HMM/GMM (stages 7-16)

Training and recognition are performed with a hidden Markov model (HMM) / Gaussian mixture model (GMM) system. The GMM stages include standard triphone-based acoustic model building with various feature transformations, including linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), and feature-space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT).

4.3.5. Data cleanup (stage 17)

This stage removes several irregular utterances, which improves the final performance of the system [38]. In total, 15% of the utterances in the training data are excluded by this cleaning process, which yields consistent improvements in the subsequent LF-MMI TDNN training.

4.3.6. LF-MMI TDNN (stage 18)

This is an advanced time-delay neural network (TDNN) baseline using lattice-free maximum mutual information (LF-MMI) training [39]. This baseline requires much larger computational resources: multiple GPUs for TDNN training (18 hours with 2-4 GPUs), many CPUs for i-vector and lattice generation, and large storage space for data augmentation (speed perturbation).

In summary, compared with the previous CHiME-4 baseline [11], the CHiME-5 baseline introduces: 1) grapheme-to-phoneme conversion; 2) data cleanup; 3) lattice-free MMI training. With these techniques, we can provide a reasonable ASR baseline for this challenging task.

4.4. End-to-end ASR

CHiME-5 also provides an end-to-end ASR baseline based on ESPnet (https://github.com/espnet/espnet), which uses Chainer [40] and PyTorch [41] as its underlying deep learning engines.

4.4.1. Data preparation (stage 0)

This is the same as the Kaldi data directory preparation discussed in Section 4.3.1. However, the end-to-end ASR baseline does not require lexicon generation or FST preparation. This stage also includes beamforming based on the BeamformIt toolkit, as introduced in Section 4.2.

4.4.2. Feature extraction (stage 1)

This stage uses Kaldi feature extraction to generate log-Mel filterbank and pitch features (83 dimensions in total); a rough sketch of the filterbank computation is given below.
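As an approximation of the filterbank part of this step (the actual recipe uses Kaldi's extractor, and the 80-filterbank/3-pitch split adding up to 83 dimensions is our assumption), the snippet below computes log-Mel filterbank features with librosa; pitch extraction is omitted.

```python
import numpy as np
import librosa

def logmel_features(wav_path, n_mels=80, fs=16000):
    """Rough sketch of log-Mel filterbank extraction (librosa-based).
    The CHiME-5 baseline uses Kaldi's extractor and appends pitch
    features; this only approximates the filterbank part."""
    y, _ = librosa.load(wav_path, sr=fs)
    mel = librosa.feature.melspectrogram(
        y=y, sr=fs, n_fft=400, hop_length=160, n_mels=n_mels)  # 25 ms / 10 ms frames
    return np.log(np.maximum(mel, 1e-10)).T  # shape: (frames, n_mels)
```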
This stage also prepares the training data (350k utterances, in data/train_worn_u200k), which combines both left and right channels (150k utterances) of the binaural microphone data (data/train_worn) and a subset (200k utterances) of all Kinect microphone data (data/train_u200k).

4.4.3. Data conversion for ESPnet (stage 2)

This stage converts all the information included in the Kaldi data directory (transcriptions, speaker IDs, and input and output lengths), except for the input features, into a single JSON file (data.json). It also creates a character table (the 45 characters appearing in the transcriptions).

4.4.4. Language model training (stage 3)

A character-based LSTM language model is trained using either the Chainer or the PyTorch backend; it is integrated with the decoder network in the subsequent recognition stage.

4.4.5. End-to-end model training (stage 4)

A hybrid CTC/attention-based encoder-decoder network [42] is trained using either the Chainer or the PyTorch backend. The total training time is 12 hours with a single GPU (TitanX) when using the PyTorch backend, which is less than the computational resources required for the Kaldi LF-MMI TDNN training (18 hours with 2-4 GPUs).

4.4.6. Recognition (stage 5)

Speech recognition is performed by combining the LSTM language model and the end-to-end ASR model trained in the previous stages, using multiple CPUs.

4.5. Baseline results

Tables 2 and 3 provide the word error rates (WERs) obtained with the binaural (oracle) microphones and with the reference Kinect array (challenge baseline), respectively.

Table 2: WERs for the development set using the binaural microphones (oracle).

  System                        Dev WER (%)
  Conventional (GMM)            72.8
  Conventional (LF-MMI TDNN)    47.9
  End-to-end                    67.2

Table 3: WERs for the development set using the reference Kinect array with beamforming (challenge baseline).

  System                        Dev WER (%)
  Conventional (GMM)            91.7
  Conventional (LF-MMI TDNN)    81.3
  End-to-end                    94.7

The WERs of the challenge baseline are quite high for all methods, due to the very challenging environments of CHiME-5. (Note that the current end-to-end ASR baseline performs poorly due to an insufficient amount of training data. However, its result was better than that of the Kaldi GMM system when the binaural microphones were used for testing, which shows end-to-end ASR to be a promising direction for this challenging environment.) Comparing these tables, there is a significant performance difference between the array and the binaural microphone results (e.g., 33.4% absolute for the LF-MMI TDNN), which indicates that the main difficulty of this challenge comes from the distance between the source and the microphones, in addition to the spontaneous and overlapped nature of the speech, which is present in both the array and binaural conditions. A major part of the challenge therefore lies in developing speech enhancement techniques that can improve the challenge baseline to the level of the binaural microphone performance.

Table 4 shows the WER of the LF-MMI TDNN system on the development set for each session and room; challenge participants have to submit this form scored on the evaluation set. We observe that performance is poorest in the kitchen condition, probably due to the kitchen background noises and the greater degree of speaker movement that occurs in this location.

Table 4: WERs of the LF-MMI TDNN system for each session and room condition (development set).

             S02    S09
  KITCHEN    87.3   81.6
  DINING     79.5   80.6
  LIVING     79.0   77.6
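Since all baseline comparisons above are reported as WERs, a minimal reference implementation of the metric may be useful. The sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words; this is the standard definition, not the official Kaldi scoring script used by the challenge.

```python
def wer(reference, hypothesis):
    """Word error rate via edit distance (substitutions + deletions +
    insertions, divided by the number of reference words), in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return 100.0 * dist[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("shall we start on the salad", "shall we start the salads"))  # 33.33
```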
5. Conclusion

The 'CHiME' challenge series is aimed at evaluating ASR in real-world conditions. This paper has presented the 5th edition, which targets conversational speech in an informal dinner party scenario recorded with multiple microphone arrays. The full dataset and state-of-the-art software baselines have been made publicly available. A set of challenge instructions has been carefully designed to allow meaningful comparison between systems and to maximise the scientific outcomes. The submitted systems and the results will be announced at the 5th 'CHiME' ISCA Workshop.

6. Acknowledgements

We would like to thank Google for funding the full data collection and annotation, Microsoft Research for providing the Kinects, and Microsoft India for sponsoring the 5th 'CHiME' Workshop. E. Vincent acknowledges support from the French National Research Agency in the framework of the project VOCADOM "Robust voice command adapted to the user and to the context for AAL" (ANR-16-CE33-0006).

7. References

[1] T. Virtanen, R. Singh, and B. Raj, Eds., Techniques for Noise Robustness in Automatic Speech Recognition. Wiley, 2012.
[2] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recognition — A Bridge to Practical Applications. Elsevier, 2015.
[3] S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey, Eds., New Era for Robust Speech Recognition — Exploiting Deep Learning. Springer, 2017.
[4] E. Vincent, T. Virtanen, and S. Gannot, Eds., Audio Source Separation and Speech Enhancement. Wiley, 2018.
[5] S. Makino, Ed., Audio Source Separation. Springer, 2018.
[6] http://aurora.hsnr.de/aurora-3/reports.html
[7] J. H. L. Hansen, P. Angkititrakul, J. Plucienkowski, S. Gallant, U. Yapanel, B. Pellom, W. Ward, and R. Cole, "'CU-Move': Analysis & corpus development for interactive in-vehicle speech systems," in Proc. Eurospeech, 2001, pp. 2023-2026.
[8] L. Lamel, F. Schiel, A. Fourcin, J. Mariani, and H. Tillman, "The translingual English database (TED)," in Proc. 3rd Int. Conf. on Spoken Language Processing (ICSLP), 1994.
[9] E. Zwyssig, F. Faubel, S. Renals, and M. Lincoln, "Recognition of overlapping speech using digital MEMS microphone arrays," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7068-7072.
[10] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third 'CHiME' speech separation and recognition challenge: Analysis and outcomes," Computer Speech and Language, vol. 46, pp. 605-626, 2017.
[11] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, "An analysis of environment, microphone and data simulation mismatches in robust speech recognition," Computer Speech and Language, vol. 46, pp. 535-557, 2017.
[12] G. Gravier, G. Adda, N. Paulsson, M. Carré, A. Giraudel, and O. Galibert, "The ETAPE corpus for the evaluation of speech-based TV content processing in the French language," in Proc. 8th Int. Conf. on Language Resources and Evaluation (LREC), 2012, pp. 114-118.
[13] P. Bell, M. J. F. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, and P. C. Woodland, "The MGB challenge: Evaluating multi-genre broadcast media recognition," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015, pp. 687-693.
[14] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, "The PASCAL CHiME speech separation and recognition challenge," Computer Speech and Language, vol. 27, no. 3, pp. 621-633, May 2013.
[15] E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, "The second CHiME speech separation and recognition challenge: An overview of challenge systems and outcomes," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2013, pp. 162-167.
[16] A. Brutti, L. Cristoforetti, W. Kellermann, L. Marquardt, and M. Omologo, "WOZ acoustic data collection for interactive TV," in Proc. 6th Int. Conf. on Language Resources and Evaluation (LREC), 2008, pp. 2330-2334.
[17] M. Vacher, B. Lecouteux, P. Chahuara, F. Portet, B. Meillon, and N. Bonnefond, "The Sweet-Home speech and multimodal corpus for home automation interaction," in Proc. 9th Int. Conf. on Language Resources and Evaluation (LREC), 2014, pp. 4499-4509.
[18] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, and M. Omologo, "The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015, pp. 275-282.
[19] N. Bertin, E. Camberlein, E. Vincent, R. Lebarbenchon, S. Peillon, É. Lamandé, S. Sivasankaran, F. Bimbot, I. Illina, A. Tom, S. Fleury, and E. Jamet, "A French corpus for distant-microphone speech processing in real homes," in Proc. Interspeech, 2016, pp. 2781-2785.
[20] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "Achieving human parity in conversational speech recognition," arXiv:1610.05256, 2017.
[21] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall, "English conversational telephone speech recognition by humans and machines," arXiv:1703.02136, 2017.
[22] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, 1992, pp. 517-520.
[23] M. Harper, "The automatic speech recognition in reverberant environments (ASpIRE) challenge," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015, pp. 547-554.
[24] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, "The ICSI meeting corpus," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2003, pp. 364-367.
[25] D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. Chu, A. Tyagi, J. Casas, J. Turmo, L. Cristoforetti, F. Tobia, A. Pnevmatikakis, V. Mylonakis, F. Talantzis, S. Burger, R. Stiefelhagen, K. Bernardin, and C. Rochet, "The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms," Language Resources and Evaluation, vol. 41, no. 3-4, pp. 389-407, 2007.
[26] S. Renals, T. Hain, and H. Bourlard, "Interpretation of multiparty meetings: The AMI and AMIDA projects," in Proc. 2nd Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), 2008, pp. 115-118.
[27] https://www.ll.mit.edu/mission/cybersec/HLT/corpora/SpeechCorpora.html
[28] A. Stupakov, E. Hanusa, D. Vijaywargi, D. Fox, and J. Bilmes, "The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments," Computer Speech and Language, vol. 26, no. 1, pp. 52-66, 2011.
[29] C. Fox, Y. Liu, E. Zwyssig, and T. Hain, "The Sheffield wargames corpus," in Proc. Interspeech, 2013, pp. 1116-1120.
[30] J. W. Du Bois, W. L. Chafe, C. Meyer, S. A. Thompson, R. Englebretson, and N. Martey, "Santa Barbara corpus of spoken American English, parts 1-4," Linguistic Data Consortium.
[31] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, Aug. 1976.
[32] X. Anguera, C. Wooters, and J. Hernando, "Acoustic beamforming for speaker diarization of meetings," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011-2023, 2007.
[33] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.
[34] J. R. Novak, N. Minematsu, and K. Hirose, "WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding," in Proc. 10th International Workshop on Finite State Methods and Natural Language Processing, 2012, pp. 45-49.
[35] J. Wu and S. Khudanpur, "Building a topic-dependent maximum entropy model for very large corpora," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, May 2002, pp. I-777-I-780.
[36] T. Alumäe and M. Kurimo, "Efficient estimation of maximum entropy language models with N-gram features: an SRILM extension," in Proc. Interspeech, Chiba, Japan, September 2010.
[37] A. Stolcke et al., "SRILM — an extensible language modeling toolkit," in Proc. Interspeech, 2002, pp. 901-904. [Online]. Available: http://www.speech.sri.com/projects/srilm/
[38] V. Peddinti, V. Manohar, Y. Wang, D. Povey, and S. Khudanpur, "Far-field ASR without parallel data," in Proc. Interspeech, 2016, pp. 1996-2000.
[39] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Proc. Interspeech, 2016, pp. 2751-2755.
[40] S. Tokui, K. Oono, S. Hido, and J. Clayton, "Chainer: a next-generation open source framework for deep learning," in Proc. Workshop on Machine Learning Systems (LearningSys) at the 29th Annual Conference on Neural Information Processing Systems (NIPS), vol. 5, 2015.
[41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in Proc. Workshop on The Future of Gradient-Based Machine Learning Software and Techniques (Autodiff) at the Annual Conference on Neural Information Processing Systems (NIPS), 2017.
[42] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, 2017.
