End-to-End Monaural Multi-speaker ASR System without Pretraining


Authors: Xuankai Chang, Yanmin Qian, Kai Yu, Shinji Watanabe

END-TO-END MONAURAL MULTI-SPEAKER ASR SYSTEM WITHOUT PRETRAINING

Xuankai Chang 1,2, Yanmin Qian 1, Kai Yu 1, Shinji Watanabe 2
1 SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
2 Center for Language and Speech Processing, Johns Hopkins University, U.S.A.
xuank@sjtu.edu.cn, yanminqian@sjtu.edu.cn, kai.yu@sjtu.edu.cn, shinjiw@jhu.edu

ABSTRACT

Recently, end-to-end models have become a popular alternative to traditional hybrid models in automatic speech recognition (ASR). Multi-speaker speech separation and recognition is a central task in the cocktail party problem. In this paper, we present a state-of-the-art monaural multi-speaker end-to-end automatic speech recognition model. In contrast to previous studies on monaural multi-speaker speech recognition, this end-to-end framework is trained to recognize multiple label sequences completely from scratch. The system requires only the speech mixture and the corresponding label sequences, without needing any intermediate supervision obtained from non-mixture speech or its labels/alignments. Moreover, we exploit an individual attention module for each separated speaker and scheduled sampling to further improve performance. Finally, we evaluate the proposed model on 2-speaker mixed speech generated from the WSJ corpus and on the wsj0-2mix dataset, a speech separation and recognition benchmark. The experiments demonstrate that the proposed methods improve the ability of the end-to-end model to separate overlapping speech and to recognize the separated streams. The proposed model yields ~10.0% relative performance gains in terms of both CER and WER.

Index Terms — Cocktail party problem, multi-speaker speech recognition, end-to-end speech recognition, CTC, attention mechanism
1. INTRODUCTION

In the deep learning era, single-speaker automatic speech recognition systems have achieved a lot of progress. Deep neural network (DNN) and hidden Markov model (HMM) based hybrid systems have attained surprisingly good performance [1, 2, 3]. Recently, there has been growing interest in developing end-to-end models for speech recognition [4, 5, 6], in which the various modules of a hybrid system, such as the acoustic model (AM) and language model (LM), are folded into a single neural network. Two major approaches to end-to-end speech recognition are connectionist temporal classification (CTC) [7, 8, 9] and attention-based encoder-decoder models [10, 11]. The performance of conventional deep-learning-based speech recognition systems has been reported to be comparable with, or even to surpass, human performance [3]. However, it is still extremely difficult to solve the cocktail party problem [12, 13, 14, 15], which refers to the task of separating and recognizing the speech of a specific speaker when it is interfered by noise and speech from other speakers. (This work was done while Xuankai Chang was an intern at Johns Hopkins University.)

To address the monaural multi-speaker speech separation and recognition problem, there has been a lot of research on single-channel multi-speaker speech separation and recognition, which aims to separate the overlapping speech and recognize each resulting stream individually, given a single-channel multi-speaker speech mixture. In [16, 17], a method called deep clustering (DPCL) was proposed for speech separation. DPCL separates the mixed speech by training a neural network to project each time-frequency (T-F) unit into a high-dimensional embedding space, in which pairs of T-F units are close to each other if they have the same dominating speaker and farther away otherwise.
In addition to segmentation using k-means clustering, a permutation-free mask objective was proposed to refine the output [17]. In [18, 19], a speech separation method called permutation invariant training (PIT) was proposed to train a compact deep neural network with an objective that minimizes the average minimum square error of the best output-target assignment at the utterance level. PIT was later extended to train speech recognition models for multi-speaker speech mixtures by directly optimizing the ASR objective [20, 21, 22, 23, 24]. In [25, 26], a joint CTC/attention-based encoder-decoder network for end-to-end speech recognition [4, 5] was applied to multi-speaker speech recognition. First, an encoder separates the mixed speech into hidden vector sequences, one for each speaker. Then an attention-based decoder generates the label sequence for each speaker. To avoid the label permutation problem, a CTC objective is used in a permutation-free manner right after the encoder to determine the order of the label sequences. However, the model needs to be pretrained on single-speaker speech before decent performance can be achieved.

In this paper, we explore several new methods to refine the end-to-end speech recognition model for multi-speaker speech. Firstly, we revise the model in [26] so that pretraining on single-speaker speech is no longer required, without loss of performance. Secondly, we propose speaker-parallel attention modules. In previous work, the separated speech streams were treated equally in the decoder, regardless of energy and speaker characteristics. We introduce a separate attention module [27] for each speaker to enhance speaker tracing and to alleviate the burden on the encoder, as in [23].
A third method is scheduled sampling [28], which randomly chooses the token fed back as history from either the ground truth or the model prediction, reducing the gap between training and inference in sequence prediction tasks. This is extremely helpful in our setup, since the separation is not always perfect and we often observe mixed label results; scheduled sampling helps the model recover from such errors during inference.

The rest of the paper is organized as follows. In Section 2, the end-to-end monaural multi-speaker ASR model and the proposed new methods are described. In Section 3, we evaluate the proposed approach on the 2-speaker mixed WSJ data set and present the experiments and analysis. Finally, the paper is concluded in Section 4.

2. END-TO-END MULTI-SPEAKER JOINT CTC/ATTENTION-BASED ENCODER-DECODER

In this section, we first describe the end-to-end ASR system for multi-speaker speech used in [26]. Then we introduce two techniques that improve the training process and performance of the end-to-end multi-speaker ASR system, namely speaker-parallel attention and scheduled sampling [28].

2.1. End-to-End Multi-speaker ASR

In [4, 5, 29], an end-to-end speech recognition model was proposed that takes advantage of both connectionist temporal classification (CTC) and an attention-based encoder-decoder, using the CTC objective to enhance the alignment ability of the model. An end-to-end model for multi-speaker speech recognition was proposed in [26], extending the joint CTC/attention-based encoder-decoder network to multi-speaker speech mixtures and adding permutation-free training to the objective function to address the permutation problem. The model is shown in Fig. 1, in which the modules Attention 1 and Attention 2 share parameters. The input speech mixture is first explicitly separated in the encoder into multiple sequences of vectors, each representing a speaker source.
These sequences are fed into the decoder to compute the conditional probabilities.

The encoder of the model can be divided into three stages, namely Encoder_Mix, Encoder_SD and Encoder_Rec. Let O denote an input speech mixture from S speakers. The first stage, Encoder_Mix, is the mixture encoder, which encodes the input speech mixture O as an intermediate representation H. This representation is then processed by S speaker-differentiating (SD) encoders, Encoder_SD, whose outputs are the feature sequences H^s, s = 1, ..., S. The last stage, Encoder_Rec, transforms the feature sequences into high-level representations G^s, s = 1, ..., S. The encoder is computed as

  H = Encoder_Mix(O),                        (1)
  H^s = Encoder_SD^s(H),   s = 1, ..., S,    (2)
  G^s = Encoder_Rec(H^s),  s = 1, ..., S.    (3)

In the single-speaker joint CTC/attention-based encoder-decoder network, the CTC objective function is attached right after the encoder and used as an auxiliary task to train the encoder of the attention model [4, 5, 29]. In the multi-speaker framework, the CTC objective function is additionally used to perform permutation-free training as in Eq. (4), which is referred to as permutation invariant training in [15, 18, 20, 21, 22, 23, 24, 30, 31]:

  π̂ = argmin_{π ∈ P} Σ_s Loss_ctc(Y^s, R^{π(s)}),    (4)

where Y^s is the output sequence variable computed from the encoder output G^s, π(s) is the s-th element of a permutation π of {1, ..., S}, P is the set of all such permutations, and R^{π(s)} are the reference labels for the S speakers. The permutation π̂ with minimum CTC loss is then used to assign the reference labels in the attention-based decoder, in order to reduce the computational cost.

After obtaining the representations G^s, s = 1, ..., S from the encoder, an attention-based decoder network is used to decode these streams and output a label sequence Y^s for each representation stream, according to the permutation determined by the CTC objective function.
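To make the assignment search of Eq. (4) concrete, the following minimal Python sketch picks the minimum-loss permutation by brute force. Here `pairwise_loss` is a hypothetical precomputed table of Loss_ctc(Y^s, R^r) values; this illustrates the search itself, not ESPnet's implementation. The S! enumeration is tractable for the small S considered here (S = 2).

```python
from itertools import permutations

def best_permutation(pairwise_loss):
    """Eq. (4): choose the output-to-reference assignment with the
    minimum total CTC loss. pairwise_loss[s][r] is an assumed
    precomputed Loss_ctc(Y^s, R^r) for output stream s, reference r."""
    S = len(pairwise_loss)
    best_pi, best_total = None, float("inf")
    # P is the set of all S! permutations of {0, ..., S-1}
    for pi in permutations(range(S)):
        total = sum(pairwise_loss[s][pi[s]] for s in range(S))
        if total < best_total:
            best_pi, best_total = pi, total
    return best_pi, best_total

# 2-speaker toy example: stream 0 matches reference 1 and vice versa
losses = [[5.0, 1.0],
          [2.0, 6.0]]
pi_hat, total = best_permutation(losses)  # pi_hat == (1, 0), total == 3.0
```

The same chosen permutation π̂ is then reused for the attention losses, so the expensive assignment search runs only once per utterance.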
For each pair of representation and reference label index (s, π̂(s)), the decoding process is described by the following equations:

  p_att(Y^{s,π̂(s)} | O) = Π_n p_att(y_n^{s,π̂(s)} | O, y_{1:n−1}^{s,π̂(s)}),              (5)
  c_n^{s,π̂(s)}, a_n^{s,π̂(s)} = Attention(a_{n−1}^{s,π̂(s)}, e_{n−1}^{s,π̂(s)}, G^s),     (6)
  e_n^{s,π̂(s)} = Update(e_{n−1}^{s,π̂(s)}, c_{n−1}^{s,π̂(s)}, y_{n−1}^{s,π̂(s)}),         (7)
  y_n^{s,π̂(s)} ~ Decoder(c_n^{s,π̂(s)}, y_{n−1}^{s,π̂(s)}),                              (8)

where c_n^{s,π̂(s)} denotes the context vector, a_n^{s,π̂(s)} the attention weights, e_n^{s,π̂(s)} the hidden state of the decoder, and r_n^{π̂(s)} the n-th element of the reference label sequence. During training, the reference label r_{n−1}^{π̂(s)} in R is used as the history in a teacher-forcing manner, instead of y_{n−1}^{s,π̂(s)} in Eqs. (7) and (8). Eq. (5) gives the probability of the target label sequence Y = {y_1, ..., y_N} predicted by the attention-based encoder-decoder, in which the probability of y_n at the n-th step depends on the previous sequence y_{1:n−1}. The final loss function is defined as

  L_mtl = λ L_ctc + (1 − λ) L_att,                (9)
  L_ctc = Σ_s Loss_ctc(Y^s, R^{π̂(s)}),           (10)
  L_att = Σ_s Loss_att(Y^{s,π̂(s)}, R^{π̂(s)}),    (11)

where λ is the interpolation factor, 0 ≤ λ ≤ 1.

2.2. Speaker-parallel attention modules

[Fig. 1. End-to-end multi-speaker speech recognition model in the 2-speaker case. The input mixture O passes through the mixture encoder, SD encoders 1 and 2, and the recognition encoder to produce representations G^1 and G^2; Attention 1 and Attention 2 feed the decoder, and permutation invariant training over the CTC losses assigns the references R^1 and R^2.]

Due to differences in speaker characteristics and energy, the encoder usually has to compensate for those differences while separating the speech.
The motivation of the proposed speaker-parallel attention module is to alleviate that burden on the encoder and to let the attention-decoder also learn to filter the separated speech, while keeping the model compact. In light of [23], we propose to use independent attention modules, called speaker-parallel attention. Fig. 1 illustrates the architecture of the model, in which Attention 1 and Attention 2 no longer share parameters. The computation in Eq. (6) is rewritten in a stream-specific way, for the s-th stream, as

  c_n^{s,π̂(s)}, a_n^{s,π̂(s)} = Attention_s(a_{n−1}^{s,π̂(s)}, e_{n−1}^{s,π̂(s)}, G^s).    (12)

2.3. Scheduled sampling

We generally train the decoder network in a teacher-forcing fashion, meaning that the reference label token r_n, not the predicted token y_n, is used to predict the next token in the sequence during training. However, during inference only the predicted token y_n from the model itself is available. This mismatch may lead to performance degradation, especially in the multi-speaker speech recognition task, which is susceptible to the label permutation problem. We alleviate this problem using the scheduled sampling technique [28]. During training, the history token is chosen randomly: with probability p from the prediction and with probability (1 − p) from the ground truth. Thus Eqs. (7) and (8) become

  e_n^{s,π̂(s)} = Update(e_{n−1}^{s,π̂(s)}, c_{n−1}^{s,π̂(s)}, h),    (13)
  y_n^{s,π̂(s)} ~ Decoder(c_n^{s,π̂(s)}, h),                         (14)

where

  b ~ Bernoulli(p),                                                 (15)
  h = r_{n−1}^{π̂(s)} if b = 0,  y_{n−1}^{s,π̂(s)} if b = 1.         (16)

3. EXPERIMENT

3.1. Experimental setup

To evaluate our method, we used artificially generated single-channel two-speaker mixed signals from the Wall Street Journal (WSJ) speech corpus, following [26] and using the tool released by MERL (http://www.merl.com/demos/deep-clustering/create-speaker-mixtures.zip).
We used WSJ SI284 to generate the training data, Dev93 for development and Eval92 for evaluation. The durations of the training, development and evaluation sets of the mixed data are 98.5 hr, 1.3 hr and 0.8 hr, respectively. In Section 3.4, we also compare our model with previous work on the wsj0-2mix dataset, a standard speech separation and recognition benchmark [16, 17, 25].

The input features are 80-dimensional log Mel filterbank coefficients with pitch features and their delta and delta-delta features, extracted using Kaldi [32]. The input features are normalized to zero mean and unit variance. All the joint CTC/attention-based encoder-decoder networks for end-to-end speech recognition were built on the ESPnet framework [6]. The networks were initialized randomly from a uniform distribution in the range −0.1 to 0.1. We used the AdaDelta algorithm with ρ = 0.95 and ε = 1e−8. During training, the interpolation factor λ in Eq. (9) was set to 0.2.

We revised the deep neural network, replacing the original encoder layers with shallower but wider layers [33], so that the performance is good enough without pretraining on single-speaker speech. To make the models comparable, all the neural networks have the same depth and similar size. We use VGG-motivated CNN layers and bidirectional long short-term memory recurrent neural networks with projection (BLSTMP) as the encoder. The total depth of the encoder is 5, namely two CNN blocks and three BLSTMP layers. For all models, the decoder network has one layer of unidirectional long short-term memory (LSTM) with 300 cells.
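The zero-mean, unit-variance feature normalization mentioned above can be sketched as follows. This is a simplified per-utterance version assuming global statistics over the given vectors; the actual pipeline computes such statistics with Kaldi/ESPnet utilities over the training set.

```python
def normalize(features):
    """Per-dimension zero-mean, unit-variance normalization of a list
    of feature vectors (a simplified sketch of the mean/variance
    normalization applied to the log Mel filterbank inputs)."""
    dims = len(features[0])
    n = len(features)
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in features) / n
                 for d in range(dims)]
    # guard constant dimensions against division by zero
    stds = [(v ** 0.5) if v > 0 else 1.0 for v in variances]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)]
            for f in features]

# a constant dimension maps to zero; a varying one to unit variance
out = normalize([[0.0, 5.0], [2.0, 5.0]])  # [[-1.0, 0.0], [1.0, 0.0]]
```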
During decoding, we combined the joint CTC/attention score with the score of a pretrained word-level recurrent neural network language model (RNNLM), a 1-layer LSTM with 1000 cells trained on the transcriptions of WSJ SI284, in a shallow fusion manner. We set the beam width to 30. The interpolation factor λ used during decoding was 0.3, and the weight for the RNNLM was 1.0.

3.2. Performance of baseline systems

In this section, we describe the performance of the baseline end-to-end ASR systems on multi-speaker mixed speech. The first baseline is the joint CTC/attention-based encoder-decoder network for single-speaker speech trained on the WSJ corpus, whose performance is 0.9% CER and 1.9% WER on the eval92 5k test set with the closed vocabulary. In its encoder, three BLSTMP layers follow the CNN, and each BLSTMP layer has 1024 memory cells in each direction. The second baseline is the joint CTC/attention-based encoder-decoder network for multi-speaker speech. A 2-layer CNN is used as Encoder_Mix. The following BLSTMP layers are also 3 deep, with one BLSTMP layer as Encoder_SD and two BLSTMP layers as Encoder_Rec. The attention-decoder in the multi-speaker system is shared among the representations G^s and has the same architecture as in the single-speaker system. The results are shown in Table 1.

  Model                 | dev CER | eval CER
  single-speaker        |  79.13  |  76.52
  multi-speaker [26]    |   n/a   |  13.7
  multi-speaker         |  15.14  |  12.20

  Model                 | dev WER | eval WER
  single-speaker        | 113.47  | 112.21
  multi-speaker         |  24.90  |  20.43

Table 1. Performance (avg. CER & WER) (%) on the 2-speaker mixed WSJ corpus: comparison between end-to-end single-speaker and multi-speaker joint CTC/attention-based encoder-decoder systems.

For the single-speaker system, the CER and WER are measured by comparing the output against the reference labels of both speakers.
From the table, we can see that the speech recognition system designed for multi-speaker input improves performance on overlapped speech significantly, leading to more than 80.0% relative reduction in both average CER and WER. For comparison, we also include the CER result from [26]; the newly constructed end-to-end multi-speaker system without pretraining achieves better performance.

3.3. Performance of speaker-parallel attention with scheduled sampling

In this section we report the evaluation of our proposed methods. The first method is speaker-parallel attention, which introduces an independent attention module for each speaker source instead of a shared one. The rest of the network is the same as the baseline multi-speaker model: a 2-layer CNN Encoder_Mix, a 1-layer BLSTMP Encoder_SD, a 2-layer BLSTMP Encoder_Rec, and a shared 1-layer LSTM decoder. The performance is shown in Table 2. The speaker-parallel attention module reduces the average CER by 9% and the average WER by 8% relative. The results also show that the CER remains high, so the gap between training with teacher forcing and inference is large. We therefore applied scheduled sampling with probability p = 0.2 in Eq. (15), which led to a further improvement. Finally, the system using both speaker-parallel attention and scheduled sampling obtains a relative ~10.0% reduction in both CER and WER on the evaluation set.

  Model                        | dev CER | eval CER
  multi-speaker (baseline)     |  15.14  |  12.20
  + speaker parallel attention |  14.80  |  11.11
  ++ scheduled sampling        |  14.78  |  10.93

  Model                        | dev WER | eval WER
  multi-speaker (baseline)     |  24.90  |  20.43
  + speaker parallel attention |  24.88  |  18.76
  ++ scheduled sampling        |  24.52  |  18.44

Table 2. Performance (avg. CER & WER) (%) on the 2-speaker mixed WSJ corpus.
Comparison between end-to-end multi-speaker joint CTC/attention-based encoder-decoder systems.

[Fig. 2. Visualization of the attention weight sequences for two overlapped speakers: (a) attention weights for speaker 1; (b) attention weights for speaker 2. In each panel, the left part is from the single-attention model and the right part from the speaker-parallel-attention model; the horizontal axis is the output token index and the vertical axis the input sequence index.]

We visualize the attention weight sequences for the two overlapped speakers, generated by the baseline single-attention multi-speaker end-to-end model and by the proposed speaker-parallel-attention multi-speaker end-to-end model. The horizontal axis represents the output token sequence and the vertical axis the input sequence to the attention module. The left parts of Figs. 2 (a) and (b) show the attention weights for speaker 1 and speaker 2 generated by the single-attention model; the right parts show the weights generated by the proposed speaker-parallel-attention model. We observe that the right parts are smoother and clearer, with more concentrated attention weights. This observation conforms with the characteristics of the alignment between output and input sequences in speech recognition, and further shows the superiority of the proposed speaker-parallel attention.

3.4. Comparison with previous work

We then compared our work with other related works. We trained and tested our model on the wsj0-2mix dataset that was first used in [16].
Table 3 shows the WER results of hybrid systems, including PIT-ASR [24] and DPCL-based speech separation with Kaldi-based ASR [17], and of the end-to-end systems constructed in [26] and in this paper. All were evaluated with the same evaluation data and metric as in [17], based on wsj0-2mix. Note that the model in [26] was trained on a different, larger training dataset than that used in the other experiments. From Table 3, we can observe that the new system built with the proposed methods is significantly better than the others.

  Model                                        | Avg. WER
  DPCL+ASR [17]                                |  30.8
  PIT-ASR [24]                                 |  28.2
  End-to-end ASR (Char/Word-LM) [26]           |  28.2
  Proposed end-to-end ASR with SPA (Word LM)   |  25.4

Table 3. WER (%) on the 2-speaker mixed WSJ0 corpus. The comparison is between our proposed end-to-end ASR with speaker-parallel attention (SPA) and previous works, including DPCL+ASR, PIT-ASR and end-to-end ASR systems.

4. CONCLUSION

In this paper, we have introduced a state-of-the-art end-to-end multi-speaker speech recognition system under the joint CTC/attention-based encoder-decoder framework. More specifically, a new neural network architecture enabled us to train the model from random initialization, and we adopted the speaker-parallel attention module and scheduled sampling to improve performance over the previous end-to-end multi-speaker speech recognition system. The experiments on 2-speaker mixed speech recognition show that the proposed strategy obtains a relative ~10.0% reduction in both CER and WER.

5. ACKNOWLEDGEMENT

We are grateful to Matthew Maciejewski and Tian Tan for their comments on an earlier version of the manuscript.
6. REFERENCES

[1] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, pp. 82–97, 2012.
[2] Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran, "Deep convolutional neural networks for LVCSR," in IEEE ICASSP, 2013, pp. 8614–8618.
[3] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, "The Microsoft 2016 conversational speech recognition system," in IEEE ICASSP, 2017, pp. 5255–5259.
[4] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in IEEE ICASSP, 2017, pp. 4835–4839.
[5] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE J. Sel. Topics Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[6] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., "ESPnet: End-to-end speech processing toolkit," arXiv preprint arXiv:1804.00015, 2018.
[7] Alex Graves and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in ICML, 2014, pp. 1764–1772.
[8] Yajie Miao, Mohammad Gowayyed, and Florian Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in IEEE ASRU, 2015, pp. 167–174.
[9] Zhehuai Chen, Yimeng Zhuang, Yanmin Qian, and Kai Yu, "Phone synchronous speech recognition with CTC lattices," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 1, pp. 86–97, 2017.
[10] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "End-to-end continuous speech recognition using attention-based recurrent NN: First results," arXiv preprint arXiv:1412.1602, 2014.
[11] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in IEEE ICASSP, 2016, pp. 4960–4964.
[12] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al., "The AMI meeting corpus: A pre-announcement," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39.
[13] Martin Cooke, John R. Hershey, and Steven J. Rennie, "Monaural speech separation and recognition challenge," Computer Speech & Language, vol. 24, no. 1, pp. 1–15, 2010.
[14] Jon Barker, Shinji Watanabe, Emmanuel Vincent, and Jan Trmal, "The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," arXiv preprint arXiv:1803.10609, 2018.
[15] Yanmin Qian, Chao Weng, Xuankai Chang, Shuai Wang, and Dong Yu, "Past review, current progress, and challenges ahead on the cocktail party problem," Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 1, pp. 40–63, Jan 2018.
[16] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in IEEE ICASSP, 2016, pp. 31–35.
[17] Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R. Hershey, "Single-channel multi-speaker separation using deep clustering," in INTERSPEECH, 2016, pp. 545–549.
[18] Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in IEEE ICASSP, 2017, pp. 241–245.
[19] Morten Kolbæk, Dong Yu, Zheng-Hua Tan, and Jesper Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM TASLP, vol. 25, no. 10, pp. 1901–1913, 2017.
[20] Dong Yu, Xuankai Chang, and Yanmin Qian, "Recognizing multi-talker speech with permutation invariant training," in INTERSPEECH, 2017, pp. 2456–2460.
[21] Z. Chen, J. Droppo, J. Li, and W. Xiong, "Progressive joint modeling in unsupervised single-channel overlapped speech recognition," IEEE/ACM TASLP, vol. 26, no. 1, pp. 184–196, Jan 2018.
[22] Zhehuai Chen and Jasha Droppo, "Sequence modeling in unsupervised single-channel overlapped speech recognition," in IEEE ICASSP, Calgary, Canada, April 2018, pp. 4809–4813.
[23] Xuankai Chang, Yanmin Qian, and Dong Yu, "Monaural multi-talker speech recognition with attention mechanism and gated convolutional networks," in INTERSPEECH, 2018, pp. 1586–1590.
[24] Yanmin Qian, Xuankai Chang, and Dong Yu, "Single-channel multi-talker speech recognition with permutation invariant training," Speech Communication, vol. 104, pp. 1–11, 2018.
[25] Shane Settle, Jonathan Le Roux, Takaaki Hori, Shinji Watanabe, and John R. Hershey, "End-to-end multi-speaker speech recognition," in IEEE ICASSP, 2018, pp. 4819–4823.
[26] Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, and John R. Hershey, "A purely end-to-end system for multi-speaker speech recognition," in ACL (Volume 1: Long Papers), 2018, pp. 2620–2630.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[28] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in NIPS, Volume 1, 2015, pp. 1171–1179.
[29] Takaaki Hori, Shinji Watanabe, and John Hershey, "Joint CTC/attention decoding for end-to-end speech recognition," in ACL (Volume 1: Long Papers), 2017, vol. 1, pp. 518–529.
[30] Xuankai Chang, Yanmin Qian, and Dong Yu, "Adaptive permutation invariant training with auxiliary information for monaural multi-talker speech recognition," in IEEE ICASSP, 2018.
[31] Tian Tan, Yanmin Qian, and Dong Yu, "Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition," in IEEE ICASSP, 2018.
[32] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE ASRU, 2011.
[33] Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney, "Improved training of end-to-end attention models for speech recognition," arXiv preprint, 2018.
