Multi-Stream End-to-End Speech Recognition


Authors: Ruizhi Li (Student Member, IEEE), Xiaofei Wang (Member, IEEE), Sri Harish Mallidi (Member, IEEE), Shinji Watanabe (Senior Member, IEEE), Takaaki Hori (Senior Member, IEEE), and Hynek Hermansky (Life Fellow, IEEE)

Abstract—Attention-based methods and Connectionist Temporal Classification (CTC) networks have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a multi-stream framework based on joint CTC/Attention E2E ASR, with parallel streams represented by separate encoders aiming to capture diverse information. On top of the regular attention networks, a Hierarchical Attention Network (HAN) is introduced to steer the decoder toward the most informative encoders. A separate CTC network is assigned to each stream to force monotonic alignments. Two representative frameworks are proposed and discussed: the Multi-Encoder Multi-Resolution (MEM-Res) framework and the Multi-Encoder Multi-Array (MEM-Array) framework. In the MEM-Res framework, two heterogeneous encoders with different architectures, temporal resolutions, and separate CTC networks work in parallel to extract complementary information from the same acoustics. Experiments are conducted on Wall Street Journal (WSJ) and CHiME-4, resulting in relative Word Error Rate (WER) reductions of 18.0-32.1% and a best WER of 3.6% on the WSJ eval92 test set. The MEM-Array framework aims at improving far-field ASR robustness using multiple microphone arrays, each handled by a separate encoder. Compared with the best single-array results, the proposed framework achieves relative WER reductions of 3.7% and 9.7% on the AMI and DIRHA multi-array corpora, respectively, and also outperforms conventional fusion strategies.

Index Terms—End-to-End Speech Recognition, Joint CTC/Attention, Encoder-Decoder, Connectionist Temporal Classification, Hierarchical Attention Network, Multi-Encoder Multi-Resolution, Multi-Encoder Multi-Array

Affiliations: Ruizhi Li, Xiaofei Wang, Shinji Watanabe, and Hynek Hermansky are with Johns Hopkins University (JHU), USA (e-mail: {ruizhili, xiaofeiwang, shinjiw, hynek}@jhu.edu). Sri Harish Mallidi is with Amazon, USA (e-mail: mallidih@amazon.com). Takaaki Hori is with Mitsubishi Electric Research Laboratories (MERL), USA (e-mail: thori@merl.com).

I. INTRODUCTION

Recent advancements in deep neural networks have enabled several practical applications of automatic speech recognition (ASR) technology. The main paradigm for an ASR system is the so-called hybrid approach [1], which involves training a Deep Neural Network (DNN) to predict context-dependent phoneme states (or senones) from acoustic features. During inference, the predicted senone distributions are provided as inputs to a decoder, which combines them with a lexicon and a language model to estimate the word sequence. Despite the impressive accuracy of the hybrid system, it requires a hand-crafted pronunciation dictionary based on linguistic assumptions, extra training steps to derive context-dependent phonetic models, and text preprocessing such as tokenization for languages without explicit word boundaries.
Consequently, it is quite difficult for non-experts to develop ASR systems for new applications, especially for new languages.

End-to-End (E2E) speech recognition approaches are designed to directly output word or character sequences from the input audio signal. This model subsumes several disjoint components of the hybrid ASR model (acoustic model, pronunciation model, language model) into a single neural network. As a result, all components of an E2E model can be trained jointly to optimize a single objective. Three dominant end-to-end architectures for ASR are Connectionist Temporal Classification (CTC) [2]–[4], attention-based encoder-decoder models [5], [6], and recurrent neural network transducers [7], [8]. While CTC efficiently addresses a sequence-to-sequence problem (mapping speech vectors to a word sequence) by avoiding the alignment pre-construction step using dynamic programming, it assumes conditional independence of the label sequence given the input. The attention model does not assume conditional independence of the label sequence, resulting in a more flexible model. However, attention-based methods have difficulty satisfying the monotonic property of speech-label alignments. Previous publications enhance the monotonic behavior in various ways [9]–[13]; these studies are similar in that they apply local attention to windowed encoder outputs to enforce monotonicity. A joint CTC/Attention framework was proposed in [14]–[16], using the monotonic model, CTC, to alleviate this issue. The joint model was shown to provide state-of-the-art E2E results on several benchmark datasets [16].

In this work, we propose a multi-stream architecture within the joint CTC/Attention framework. The multi-stream paradigm was successfully used in hybrid ASR [17]–[20], motivated by observations of multiple parallel processing streams in the human speech-processing cognitive system. For instance, forming streams by band-pass filtering the signal and applying stream dropout was proposed to handle noise-robustness scenarios, mimicking the human auditory process [17], [19]. However, multi-stream approaches have not been investigated for E2E ASR models. This paper is an extension of our prior study [21], which successfully applied the proposed multi-stream concept to multi-array ASR. Here, we present a general formulation of the multi-stream framework and two practical E2E applications (MEM-Res and MEM-Array) with additional experiments and discussions. The framework has the following highlights:

1) Multiple encoders operate in parallel as information streams. Two ways of forming the streams are proposed in this work according to different applications: parallel encoders with different architectures and temporal resolutions operate on the same acoustics, which we refer to as the Multi-Encoder Multi-Resolution (MEM-Res) model; parallel input speech from multiple microphone arrays is fed into separate but identical encoders, which we refer to as the Multi-Encoder Multi-Array (MEM-Array) model.

2) The Hierarchical Attention Network (HAN) [22]–[24] is introduced to dynamically combine knowledge from parallel streams. While one way of performing information fusion is to apply a single attention mechanism across the outputs of multiple encoders [24], several studies demonstrated the benefits of multiple attention mechanisms [22]–[27].
In [28], [29], secondary attention modules provide a way to incorporate additional contextual information beneficial to the task. Inspired by advances in hierarchical attention mechanisms for document classification [22], multi-modal video description [23], and machine translation [24], we adopt HAN in our multi-stream model. The encoder that carries the most discriminative information for the prediction can dynamically receive a higher weight: on top of the per-encoder attention mechanism, stream attention is employed to steer toward the stream that carries more task-related information.

3) Each encoder is associated with a separate CTC network to guide the frame-wise alignment process for each stream and potentially achieve better performance.

In the MEM-Res model, two parallel encoders with heterogeneous structures are mutually complementary in characterizing the speech signal. In E2E ASR, the encoder acts as an acoustic model providing higher-level features for decoding. The Bi-directional Long Short-Term Memory (BLSTM) encoder architecture has been widely used due to its ability to model temporal sequences and their long-term dependencies; deep Convolutional Neural Networks (CNNs) were introduced to model spectral local correlations and reduce spectral variations in the E2E framework [15], [30]. An encoder combining a CNN with recurrent layers was suggested to address the limitations of the LSTM. While temporal subsampling in the RNN and max-pooling in the CNN aim to reduce computational complexity and enhance robustness, these subsampling techniques are likely to lose temporal resolution. The MEM-Res model therefore combines an RNN-based and a CNN-RNN-based network to form a complementary multi-stream encoder.

In addition to MEM-Res, the MEM-Array model is another application of our multi-stream E2E framework. Far-field ASR using multiple microphone arrays has become an important strategy in the speech community toward smart-speaker scenarios in a meeting room or home environment [31]–[33]. Individually, a microphone array can bring a substantial performance improvement with algorithms such as beamforming [34] and masking [35]. However, what kind of information can be extracted from each array, and how to make multiple arrays work in cooperation, remain challenging. Time synchronization among arrays is one of the main challenges that most distributed setups face [36]. Without prior knowledge of speaker-array distance or video monitoring, it is difficult to estimate which array carries more reliable information or is less corrupted.

According to reports from the CHiME-5 challenge [33], which targets multi-array conversational speech recognition in home environments, the common ways of utilizing multiple arrays in a hybrid ASR system are selecting the array with the highest Signal-to-Noise/Signal-to-Interference Ratio (SNR/SIR) for decoding [37] or fusing the decoding results by voting for the most confident words [38], e.g., ROVER [39]. Similar to our previous work [40], [41], combination using the classifier's posterior probabilities followed by lattice generation has been an alternative approach [20], [42], [43]. Compared to using full decoding results with path pruning, combination using the posteriors preserves all the information from the test speech as well as from the classifier.
In terms of the combination strategy, ASR performance monitors have been designed [44], yielding stream confidences that guide the linear fusion of array streams. While most E2E ASR studies address single-channel tasks or multi-channel tasks from a single microphone array [45]–[48], the multi-array scenario is still unexplored within the E2E framework. The MEM-Array model is proposed to address this problem. The output of each microphone array is modeled by a separate encoder, and multiple encoders with the same configuration act as the acoustic models for the individual arrays. Note that we integrate beamformed signals instead of using all multi-channel signals in the multi-stream framework, which is computationally efficient; this design also makes use of powerful beamforming algorithms.

This paper is organized as follows: Section II explains the joint CTC/Attention model. Section III describes the proposed multi-stream framework, including MEM-Res and MEM-Array. Experiments, results, and several analyses for the MEM-Res and MEM-Array models are presented in Sections IV and V, respectively. Finally, Section VI concludes the paper.

II. JOINT CTC/ATTENTION MECHANISM

In this section, we review the joint CTC/attention architecture, which takes advantage of both CTC and attention-based end-to-end ASR approaches during training and decoding.

A. Connectionist Temporal Classification (CTC)

CTC enforces a monotonic mapping from a T-length speech feature sequence, $X = \{x_t \in \mathbb{R}^D \mid t = 1, 2, \ldots, T\}$, to an L-length letter sequence, $C = \{c_l \in \mathcal{U} \mid l = 1, 2, \ldots, L\}$. Here $x_t$ is a D-dimensional acoustic vector at frame $t$, and $c_l$ is the letter at position $l$ from $\mathcal{U}$, a set of distinct letters. The CTC network introduces a many-to-one function from frame-wise latent variable sequences, $Z = \{z_t \in \mathcal{U} \cup \{\text{blank}\} \mid t = 1, 2, \ldots, T\}$, to letter predictions of shorter length. The function is many-to-one because many CTC paths correspond to the same label sequence after merging repeated characters and removing blank symbols. With several conditional independence assumptions, the posterior distribution $p(C|X)$ is represented as follows:

$$p(C|X) \approx \sum_{Z} \prod_{t} p(z_t|X) \triangleq p_{\mathrm{ctc}}(C|X), \quad (1)$$

where $p(z_t|X)$ is a frame-wise posterior distribution, which is often modeled using a BLSTM, and $p_{\mathrm{ctc}}(C|X)$ defines the CTC objective function. CTC preserves the benefit of enforcing monotonic speech-label alignments while avoiding the HMM/GMM construction step and the preparation of a pronunciation dictionary.
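The following is a minimal sketch, not the authors' ESPnet code, of how the frame-wise posteriors of a BLSTM can be scored against a label sequence with PyTorch's built-in CTC loss, which marginalizes over all blank-augmented alignments $Z$ as in Eq. (1); all sizes and tensor names are assumptions for illustration.

```python
# Minimal sketch: frame-wise posteriors from a BLSTM scored with CTC (Eq. (1)).
import torch
import torch.nn as nn

T, D, U = 120, 83, 50          # frames, feature dim, letter-set size (index 0 = blank)
blstm = nn.LSTM(D, 320, num_layers=2, bidirectional=True, batch_first=True)
proj = nn.Linear(2 * 320, U)   # maps hidden vectors to per-frame letter logits
ctc_loss = nn.CTCLoss(blank=0)

x = torch.randn(1, T, D)                       # one utterance of acoustic features
h, _ = blstm(x)                                # frame-wise hidden vectors h_t
log_probs = proj(h).log_softmax(dim=-1)        # p(z_t | X), shape (1, T, U)

labels = torch.randint(1, U, (1, 20))          # dummy 20-letter target sequence C
loss = ctc_loss(log_probs.transpose(0, 1),     # CTCLoss expects (T, batch, U)
                labels,
                input_lengths=torch.tensor([T]),
                target_lengths=torch.tensor([20]))
print(loss.item())                             # -log p_ctc(C | X)
```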
B. Attention-based Encoder-Decoder

As one of the most commonly used sequence modeling techniques, the attention-based framework selectively encodes an audio sequence of variable length into a fixed-dimension vector representation, which is then consumed by the decoder to produce a distribution over the outputs. We can directly estimate the posterior distribution $p(C|X)$ using the chain rule:

$$p(C|X) = \prod_{l=1}^{L} p(c_l \mid c_1, \ldots, c_{l-1}, X) \triangleq p_{\mathrm{att}}(C|X), \quad (2)$$

where $p_{\mathrm{att}}(C|X)$ is defined as the attention-based objective function.

Typically, a BLSTM-based encoder transforms the speech vectors $X$ into frame-wise hidden vectors $h_t$. If the encoder subsamples the input by a factor $s$, there are $\lfloor T/s \rfloor$ time steps in $H = \{h_1, \ldots, h_{\lfloor T/s \rfloor}\}$. The letter-wise context vector $r_l$ is formed as a weighted summation of the frame-wise hidden vectors $H$ using a content-based attention network [6]:

$$r_l = \sum_{t=1}^{\lfloor T/s \rfloor} a_{lt} h_t, \quad (3)$$
$$a_{lt} = \mathrm{ContentAttention}(q_{l-1}, h_t), \quad (4)$$

where $a_{lt}$ is the attention weight, a soft alignment of $h_t$ for output $c_l$, and $q_{l-1}$ is the previous decoder state. $\mathrm{ContentAttention}(\cdot)$ is defined as follows:

$$e_{lt} = g^{\top} \tanh(\mathrm{Lin}(q_{l-1}) + \mathrm{LinB}(h_t)), \quad (5)$$
$$a_{lt} = \mathrm{Softmax}(\{e_{lt}\}_{t=1}^{\lfloor T/s \rfloor}), \quad (6)$$

where $g$ is a learnable vector parameter and $\{e_{lt}\}_{t=1}^{\lfloor T/s \rfloor}$ is a $\lfloor T/s \rfloor$-dimensional vector. $\mathrm{Lin}(\cdot)$ and $\mathrm{LinB}(\cdot)$ denote linear transformations without and with a bias term, respectively.

In comparison to CTC, not requiring conditional independence assumptions is one of the advantages of the attention-based model. However, the attention is too flexible to satisfy the monotonic alignment constraint in speech recognition tasks.

C. Joint CTC/Attention

The joint CTC/Attention architecture benefits from both CTC and attention-based models, since the attention-based encoder-decoder is trained together with CTC within a Multi-Task Learning (MTL) framework. The encoder is shared between the CTC and attention branches, and the objective function to be maximized is a logarithmic linear combination of the CTC and attention objectives, i.e., $p_{\mathrm{ctc}}(C|X)$ and $p^{\dagger}_{\mathrm{att}}(C|X)$:

$$\mathcal{L}_{\mathrm{MTL}} = \lambda \log p_{\mathrm{ctc}}(C|X) + (1 - \lambda) \log p^{\dagger}_{\mathrm{att}}(C|X), \quad (7)$$

where $\lambda$ is a tunable scalar satisfying $0 \le \lambda \le 1$, and $p^{\dagger}_{\mathrm{att}}(C|X)$ is an approximated letter-wise objective in which the probability of each prediction is conditioned on the previous true labels.

During inference, the joint CTC/Attention model performs a label-synchronous beam search. The most probable letter sequence $\hat{C}$ given the speech input $X$ is computed according to

$$\hat{C} = \arg\max_{C \in \mathcal{U}^{*}} \{\lambda \log p_{\mathrm{ctc}}(C|X) + (1 - \lambda) \log p_{\mathrm{att}}(C|X) + \gamma \log p_{\mathrm{lm}}(C)\}, \quad (8)$$

where the external RNN-LM probability $\log p_{\mathrm{lm}}(C)$ is added with a scaling factor $\gamma$. For each partial hypothesis $h$ in the beam search, the joint score, i.e., the log probability of the hypothesized label sequence, is computed as

$$\alpha(h) = \lambda \alpha_{\mathrm{ctc}}(h) + (1 - \lambda) \alpha_{\mathrm{att}}(h) + \gamma \alpha_{\mathrm{lm}}(h), \quad (9)$$

where the attention decoder score $\alpha_{\mathrm{att}}(h)$ can be accumulated recursively from the hypothesis scores of the previous step. For the CTC score $\alpha_{\mathrm{ctc}}(h)$, we utilize the CTC prefix probability, defined as the cumulative probability of all label sequences that have $h$ as their prefix [49], [50]. In this work, we use the look-ahead word-based language model [51] to provide the RNN-LM score $\alpha_{\mathrm{lm}}(h)$. This language model enables decoding with only a word-based model, rather than using a multi-level LM that relies on a character-level LM until the identity of the word is determined.
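As a rough illustration of how the decoding scores of Eqs. (8)-(9) are combined, the sketch below shows one label-synchronous beam expansion; the score functions are hypothetical placeholders (not the real CTC prefix, attention, or look-ahead LM scorers), and only the weighted combination and pruning are meant to be faithful to the text.

```python
# Minimal sketch of the joint scoring in Eqs. (8)-(9) with placeholder scorers.
from typing import Callable, List, Tuple

def joint_score(hyp: List[int],
                ctc_score: Callable[[List[int]], float],
                att_score: Callable[[List[int]], float],
                lm_score: Callable[[List[int]], float],
                lam: float = 0.3, gamma: float = 1.0) -> float:
    """alpha(h) = lambda*alpha_ctc(h) + (1-lambda)*alpha_att(h) + gamma*alpha_lm(h)."""
    return lam * ctc_score(hyp) + (1.0 - lam) * att_score(hyp) + gamma * lm_score(hyp)

def beam_step(beam: List[Tuple[List[int], float]], vocab: range, beam_width: int,
              ctc_score, att_score, lm_score) -> List[Tuple[List[int], float]]:
    """One label-synchronous expansion: extend every hypothesis by one letter,
    rescore with the joint score, and keep the best `beam_width` hypotheses."""
    candidates = [(hyp + [c], joint_score(hyp + [c], ctc_score, att_score, lm_score))
                  for hyp, _ in beam for c in vocab]
    return sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
```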
III. PROPOSED MULTI-STREAM FRAMEWORK

The proposed multi-stream architecture is shown in Fig. 1. For ease of understanding, we focus on the two-stream architecture. Two encoders operate in parallel to capture information in different ways, followed by an attention fusion mechanism together with per-encoder CTC. An external RNN-LM is also involved during the inference step. We describe the details of each component in the following sections.

A. Parallel Encoders as Multi-Stream

Similar to acoustic modeling in conventional ASR, each encoder maps the audio features into higher-level feature representations for use by the CTC and attention models:

$$h_t^{(i)} = \mathrm{Encoder}^{(i)}(X^{(i)}), \quad i \in \{1, \ldots, N\}, \quad (10)$$

where the superscript $i \in \{1, \ldots, N\}$ indexes $\mathrm{Encoder}^{(i)}$ corresponding to stream $i$, $h_t^{(i)}$ is the frame-wise hidden vector of stream $i$ introduced in Sec. II-B, and $N$ denotes the number of streams. $X^{(i)}$ in Eq. (10) represents a $T^{(i)}$-length speech feature sequence, i.e., $X^{(i)} = \{x_t^{(i)} \in \mathbb{R}^D \mid t = 1, 2, \ldots, T^{(i)}\}$. Note that frame-level synchronization across streams is not mandatory, since the $T^{(i)}$, $i \in \{1, \ldots, N\}$, can differ in the proposed model. Together with a stream-specific subsampling factor $s^{(i)}$, stream $i$ has $\lfloor T^{(i)}/s^{(i)} \rfloor$ time instances at the encoder-output level; the rounding in $\lfloor T^{(i)}/s^{(i)} \rfloor$ is performed inside the encoder depending on its architecture.

[Fig. 1: The Multi-Stream End-to-End Framework. Two encoders feed per-encoder attention and CTC modules; a stream attention layer fuses the letter-wise context vectors before the decoder, and an RNN-LM is used during inference.]

For simplicity, a multi-stream model with N = 2 is depicted in Fig. 1, where two parallel encoders take different input features, $X^{(1)}$ with $T^{(1)}$ frames and $X^{(2)}$ with $T^{(2)}$ frames, respectively. Each encoder operates at a different temporal resolution with subsampling factors $s^{(1)}$ and $s^{(2)}$, where subsampling can be performed in the RNN or in a max-pooling layer of the CNN.

B. Hierarchical Attention

Since the encoders model the speech signals differently, each capturing acoustic knowledge in its own way, encoder-level fusion is suitable for boosting the network's ability to retrieve the relevant information. We adopt the Hierarchical Attention Network (HAN) of [22] for information fusion. The decoder with HAN is trained to selectively attend to the appropriate encoder, based on the context of each prediction in the sentence as well as the higher-level acoustic features from the encoders, to achieve a better prediction. The letter-wise context vectors $r_l^{(i)}$ from the individual encoders are computed as follows:

$$r_l^{(i)} = \sum_{t=1}^{\lfloor T^{(i)}/s^{(i)} \rfloor} a_{lt}^{(i)} h_t^{(i)}, \quad i \in \{1, \ldots, N\}, \quad (11)$$

where the attention weights $\{a_{lt}^{(i)}\}$, with $\sum_{t=1}^{\lfloor T^{(i)}/s^{(i)} \rfloor} a_{lt}^{(i)} = 1$, are obtained using a content-based attention mechanism. Note that since the encoders perform downsampling, the summation in Eq. (11) runs up to $\lfloor T^{(i)}/s^{(i)} \rfloor$ for each individual stream. The fusion context vector $r_l$ is obtained as a convex combination of the $r_l^{(i)}$, $i \in \{1, \ldots, N\}$:

$$r_l = \sum_{i=1}^{N} \beta_l^{(i)} r_l^{(i)}, \quad (12)$$
$$\beta_l^{(i)} = \mathrm{ContentAttention}(q_{l-1}, r_l^{(i)}), \quad i \in \{1, \ldots, N\}. \quad (13)$$

The stream-level attention weight $\beta_l^{(i)}$, with $\sum_{i=1}^{N} \beta_l^{(i)} = 1$, is estimated from the previous decoder state $q_{l-1}$ and the context vector $r_l^{(i)}$ of individual encoder $i$, as described in Eq. (13). The fusion context vector is then fed into the decoder to predict the next letter.
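A minimal sketch of the two-level attention, assuming simplified shapes and sizes rather than the authors' ESPnet implementation, is given below: a content-based attention module (Eqs. (5)-(6)) is applied per stream over frames, and the same type of module is applied again across streams to produce the fusion context vector of Eqs. (11)-(13).

```python
# Minimal sketch of frame-level and stream-level (HAN) content attention.
import torch
import torch.nn as nn

class ContentAttention(nn.Module):
    """e_t = g^T tanh(Lin(q) + LinB(h_t)); a = softmax(e). Works for any number
    of 'time' positions, so it serves both frame-level and stream-level attention."""
    def __init__(self, q_dim, h_dim, att_dim=320):
        super().__init__()
        self.lin_q = nn.Linear(q_dim, att_dim, bias=False)   # Lin(.)
        self.lin_h = nn.Linear(h_dim, att_dim, bias=True)    # LinB(.)
        self.g = nn.Linear(att_dim, 1, bias=False)           # g^T

    def forward(self, q, h):                 # q: (B, q_dim), h: (B, T, h_dim)
        e = self.g(torch.tanh(self.lin_q(q).unsqueeze(1) + self.lin_h(h)))  # (B, T, 1)
        a = torch.softmax(e, dim=1)                                         # (B, T, 1)
        r = (a * h).sum(dim=1)                                              # (B, h_dim)
        return r, a.squeeze(-1)

# Two streams with different lengths (no frame synchronization required).
B, h_dim, q_dim = 1, 640, 300
h1 = torch.randn(B, 200, h_dim)          # stream 1, s(1) = 1
h2 = torch.randn(B, 50, h_dim)           # stream 2, s(2) = 4
q_prev = torch.randn(B, q_dim)           # previous decoder state q_{l-1}

frame_att1 = ContentAttention(q_dim, h_dim)   # Attention(1) over frames of stream 1
frame_att2 = ContentAttention(q_dim, h_dim)   # Attention(2) over frames of stream 2
stream_att = ContentAttention(q_dim, h_dim)   # stream-level attention (HAN)

r1, _ = frame_att1(q_prev, h1)                               # Eq. (11), stream 1
r2, _ = frame_att2(q_prev, h2)                               # Eq. (11), stream 2
r_l, beta = stream_att(q_prev, torch.stack([r1, r2], dim=1)) # Eqs. (12)-(13)
print(beta)                                                  # stream weights, sum to 1
```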
C. Training and Decoding with Per-encoder CTC

In the CTC/Attention model with a single encoder, the CTC objective serves as an auxiliary task that speeds up the process of learning monotonic alignments and provides a sequence-level objective. In the multi-stream framework, we introduce per-encoder CTC, where a separate CTC mechanism is active for each encoder stream during training and decoding. Sharing one CTC network among encoders acts as a soft constraint that limits the potential of diverse encoders to reveal complementary information; here, sharing CTC refers to the case in which the linear layers mapping hidden vectors to the CTC Softmax are shared across encoders. When the encoders have different temporal resolutions and network architectures, per-encoder CTC can further align speech with labels in a monotonic order and customize the sequence modeling of the individual streams. During training and decoding, we follow Eqs. (7) and (8) with the CTC objective $\log p_{\mathrm{ctc}}(C|X)$ changed as follows:

$$\log p_{\mathrm{ctc}}(C|X) = \frac{1}{N} \sum_{i=1}^{N} \log p_{\mathrm{ctc}}^{(i)}(C|X), \quad (14)$$

i.e., the joint CTC loss is the average of the per-encoder CTC objectives. In the beam search, the CTC prefix score of a hypothesized sequence $h$ is altered as follows:

$$\alpha_{\mathrm{ctc}}(h) = \frac{1}{N} \sum_{i=1}^{N} \alpha_{\mathrm{ctc}}^{(i)}(h), \quad (15)$$

where equal weight is assigned to each CTC network.

D. Multi-Encoder Multi-Resolution

[Fig. 2: Multi-Encoder Multi-Resolution Architecture. A BLSTM encoder (no subsampling) and a VGGBLSTM encoder (subsampling factor 4) operate on the same input features.]

As one realization of the multi-stream framework, we propose a Multi-Encoder Multi-Resolution (MEM-Res) architecture with two encoders, one RNN-based and one CNN-RNN-based. Both encoders take the same input features in parallel but operate at different temporal resolutions, aiming to capture complementary information in the speech, as depicted in Fig. 2.

The RNN-based encoder is designed to model temporal sequences and their long-range dependencies. Subsampling in the BLSTM is often used to decrease computational cost, but it may discard information that the RNN could otherwise model. In MEM-Res, the BLSTM encoder therefore consists only of BLSTM layers that extract the frame-wise hidden vectors $h_t^{(1)}$ without subsampling in any layer, i.e., $s^{(1)} = 1$:

$$h_t^{(1)} = \mathrm{Encoder}^{(1)}(X) \triangleq \mathrm{BLSTM}_t(X), \quad (16)$$

where the BLSTM encoder is labeled with index 1. The combination of CNN and RNN allows the convolutional feature extractor applied to the input to reveal local correlations in both the time and frequency dimensions. The RNN block on top of the CNN makes it easier to learn temporal structure from the CNN output, avoiding direct modeling of speech features with more underlying variation. The pooling layer is essential in the CNN to reduce the spatial size of the representation and control over-fitting. In MEM-Res, we use the initial layers of the VGG net architecture [52], listed in Table I, followed by BLSTM layers, as the VGGBLSTM encoder labeled with index 2:

$$h_t^{(2)} = \mathrm{Encoder}^{(2)}(X) \triangleq \mathrm{VGGBLSTM}_t(X). \quad (17)$$

The two max-pooling layers with stride 2 downsample the input features by a factor of $s^{(2)} = 4$ in both the temporal and spectral directions.

TABLE I: Initial Six-Layer VGG Configuration

  Convolution 2D   in = 1,   out = 64,  filter = 3 × 3
  Convolution 2D   in = 64,  out = 64,  filter = 3 × 3
  Maxpool 2D       patch = 2 × 2, stride = 2 × 2
  Convolution 2D   in = 64,  out = 128, filter = 3 × 3
  Convolution 2D   in = 128, out = 128, filter = 3 × 3
  Maxpool 2D       patch = 2 × 2, stride = 2 × 2
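For illustration, the sketch below builds the two MEM-Res encoders with assumed padding, layer sizes, and tensor names (it omits the projection layers and is not the authors' ESPnet code): a plain BLSTM with $s^{(1)} = 1$ and a VGG front-end following Table I, whose two max-pooling layers yield $s^{(2)} = 4$, so the two streams produce hidden-vector sequences of different lengths.

```python
# Minimal sketch of the two MEM-Res encoders at different temporal resolutions.
import torch
import torch.nn as nn

class VGGFrontEnd(nn.Module):
    """Initial six VGG layers from Table I (padding=1 assumed to keep sizes)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, stride=2),
        )

    def forward(self, x):                    # x: (B, T, D) acoustic features
        y = self.net(x.unsqueeze(1))         # (B, 128, T//4, D//4)
        b, c, t, f = y.shape
        return y.permute(0, 2, 1, 3).reshape(b, t, c * f)

B, T, D = 1, 200, 83                         # 80 fbank + 3 pitch dims (assumed)
x = torch.randn(B, T, D)

blstm1 = nn.LSTM(D, 320, num_layers=4, bidirectional=True, batch_first=True)
h1, _ = blstm1(x)                            # stream 1: (B, 200, 640), s(1) = 1

vgg = VGGFrontEnd()
feat2 = vgg(x)                               # (B, 50, 128 * 20)
blstm2 = nn.LSTM(feat2.size(-1), 320, num_layers=4,
                 bidirectional=True, batch_first=True)
h2, _ = blstm2(feat2)                        # stream 2: (B, 50, 640), s(2) = 4

print(h1.shape, h2.shape)                    # different temporal resolutions
```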
E. Multi-Encoder Multi-Array

In this section, we present another realization of the multi-stream framework for the multi-array ASR task, the Multi-Encoder Multi-Array (MEM-Array) model.

1) Conventional Multi-Array ASR: In our previous work, we proposed a stream attention framework to improve far-field performance in the hybrid approach using distributed microphone array(s) [41]. Specifically, we generated more reliable Hidden Markov Model (HMM) state posterior probabilities by linearly combining the posteriors from each array stream under the supervision of ASR performance monitors. In general, the posterior combination strategy outperformed conventional methods, such as signal-level fusion and the word-level technique ROVER [39], in the prescribed multi-array configuration. Accordingly, stream attention weights estimated from the de-correlated intermediate features should be more reliable; we adopt this assumption in the MEM-Array framework.

[Fig. 3: Multi-Encoder Multi-Array Architecture. The beamformed signal of each array is fed to its own encoder with subsampling factor 4.]

2) Multi-Array Architecture with Stream Attention: Based on the multi-stream model, the proposed MEM-Array architecture in Fig. 3 has two encoders, each mapping the speech features of a single array to higher-level representations $h_t^{(i)}$, where $i \in \{1, 2\}$ indexes $\mathrm{Encoder}^{(i)}$ corresponding to array $i$. Note that $\mathrm{Encoder}^{(1)}$ and $\mathrm{Encoder}^{(2)}$ have the same configuration and receive parallel speech data collected from the multiple microphone arrays. As introduced in Sec. III-D, CNN layers are often used with BLSTM layers on top to extract frame-wise hidden vectors. We explore two types of encoder structures, BLSTM (RNN-based) and VGGBLSTM (CNN-RNN-based) [53]:

$$h_t^{(i)} = \mathrm{Encoder}^{(i)}(X^{(i)}), \quad i \in \{1, 2\}, \quad (18)$$
$$\mathrm{Encoder}^{(i)}(\cdot) = \mathrm{BLSTM}(\cdot) \ \text{or} \ \mathrm{VGGBLSTM}(\cdot). \quad (19)$$

Note that the BLSTM encoders are equipped with an additional projection layer after each BLSTM layer. In both encoder architectures, a subsampling factor $s^{(1)} = s^{(2)} = 4$ is applied to decrease the computational cost. Specifically, the convolution layers of the VGGBLSTM encoder downsample the input features by a factor of 4, so that no subsampling is performed in the recurrent layers.

In the multi-stream setting, one inherent problem is that the contribution of each stream (array) changes dynamically. In particular, when one of the streams receives corrupted audio, the network should be able to pay more attention to the other streams for robustness. Inspired by the advances in linear posterior combination [41] and hierarchical attention fusion [22]–[24], stream-level fusion on the letter-wise context vector is used in this work to achieve encoder selectivity, as introduced in Sec. III-B. In comparison to fusion on the frame-wise hidden vectors $h_t^{(i)}$, stream-level fusion can handle temporal misalignment among multiple arrays at the stream level. Furthermore, adding an extra microphone array $j$ can be implemented simply with an additional term $\beta_l^{(j)} r_l^{(j)}$ in Eq. (12).
IV. EXPERIMENTS: MEM-RES MODEL

A. Experimental Setup

We demonstrate the proposed MEM-Res model on two datasets: WSJ1 [54] (81 hours) and CHiME-4 [55] (18 hours). For WSJ1, we used the standard configuration: "si284" for training, "dev93" for validation, and "eval92" for testing. The CHiME-4 dataset is a noisy speech corpus recorded or simulated using a tablet equipped with 6 microphones in four noisy environments: a cafe, a street junction, public transport, and a pedestrian area. For training, we used both "tr05_real" and "tr05_simu" together with the additional WSJ1 corpora to support end-to-end training. "dt05_multi_isolated_1ch_track" was used for validation. We evaluated the real recordings of the evaluation set in the 1-, 2-, and 6-channel settings; the BeamformIt [56] method was applied for multi-channel evaluation. In all experiments, 80-dimensional mel-scale filterbank coefficients with additional 3-dimensional pitch features served as the input features.

TABLE II: Comparison among Single-Encoder End-to-End Models with BLSTM or VGGBLSTM as the Encoder, the MEM-Res Model, and Prior End-to-End Models (WER %: CHiME-4 et05_real_1ch, WSJ1 eval92)

  Model                                CHiME-4   WSJ1
  BLSTM (Single-Encoder)
    CTC                                  62.7     36.4
    ATT                                  50.2     20.8
    CTC+ATT                              29.2      4.6
  VGGBLSTM (Single-Encoder)
    CTC                                  50.6     19.1
    ATT                                  42.2     17.2
    CTC+ATT                              29.6      5.6
  BLSTM+VGGBLSTM (ROVER)
    CTC+ATT                              30.8      5.9
  BLSTM+VGGBLSTM (MEM-Res)
    CTC                                  49.1     15.2
    ATT                                  44.3     18.9
    CTC(shared)+ATT                      26.8      4.4
    CTC(shared)+ATT+HAN                  26.9      4.3
    CTC(per-enc)+ATT                     26.6      4.1
    CTC(per-enc)+ATT+HAN                 26.4      3.6
  Previous Studies
    RNN-CTC [3]                           -         8.2
    Eesen [4]                             -         7.4
    Temporal LS + Cov. [57]               -         6.7
    E2E+regularization [58]               -         6.3
    Scatt+pre-emp [59]                    -         5.7
    Joint e2e+look-ahead LM [51]          -         5.1
    RCNN+BLSTM+CLDNN [60]                 -         4.3
    EE-LF-MMI [61]                        -         4.1

Encoder(1) contained four BLSTM layers, each with 320 cells in both directions, followed by a 320-unit linear projection layer. Encoder(2) combined the convolution layers with an RNN-based network that had the same architecture as Encoder(1). A content-based attention mechanism with 320 attention units was used for both the encoder-level and frame-level attention mechanisms. The decoder was a one-layer unidirectional LSTM with 300 cells. We used 50 distinct labels, including the 26 English letters and special tokens such as punctuation marks and sos/eos. We incorporated the look-ahead word-level RNN-LM [51], a 1-layer LSTM with 1000 cells and a 65K vocabulary, i.e., a 65K-dimensional Softmax output layer. In addition to the original speech transcriptions, the WSJ text data with 37M words from 1.6M sentences was supplied as LM training data. The RNN-LM was trained separately using Stochastic Gradient Descent (SGD) with a learning rate of 0.5 for 60 epochs.

The MEM-Res model was implemented with the PyTorch backend of ESPnet [62]. Training used the AdaDelta algorithm with gradient clipping on single GPUs (GTX 1080 Ti), with a mini-batch size of 15. We also applied a unigram label smoothing technique to avoid over-confident predictions. The beam width was set to 30 for WSJ1 and 20 for CHiME-4 in decoding. For models jointly trained with the CTC and attention objectives, λ = 0.2 was used for training and λ = 0.3 for decoding. The RNN-LM scaling factor γ was 1.0 for all experiments, with the exception of γ = 0.1 when decoding attention-only models.
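To make the training objective concrete, the sketch below combines Eq. (7) with the per-encoder CTC average of Eq. (14) using λ = 0.2; the tensor and function names are assumptions for illustration, not the ESPnet implementation, and label smoothing is omitted.

```python
# Minimal sketch of the multi-task loss: Eq. (7) with per-encoder CTC (Eq. (14)).
import torch

def multi_stream_mtl_loss(per_encoder_ctc_losses, att_loss, lam=0.2):
    """per_encoder_ctc_losses: list of -log p_ctc^(i)(C|X) scalars, one per stream.
    att_loss: -log p_att(C|X) from the attention decoder (teacher forcing).
    Returns the quantity to minimize, i.e. -L_MTL."""
    ctc_loss = torch.stack(per_encoder_ctc_losses).mean()   # Eq. (14)
    return lam * ctc_loss + (1.0 - lam) * att_loss          # Eq. (7)

# Example with dummy per-stream losses:
loss = multi_stream_mtl_loss([torch.tensor(42.0), torch.tensor(57.0)], torch.tensor(31.0))
print(loss)   # 0.2 * 49.5 + 0.8 * 31.0
```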
B. Results

The overall experimental results on WSJ1 and CHiME-4 are shown in Table II. Compared to the joint CTC/Attention single-encoder models, the proposed MEM-Res model with per-encoder CTC and HAN achieved relative WER improvements of 9.6% (29.2% → 26.4%) on CHiME-4 and 21.7% (4.6% → 3.6%) on WSJ1. We compared the MEM-Res model with other end-to-end approaches, and it outperformed all systems from previous studies. We also conducted experiments using the ROVER technique [63] to fuse the two single-encoder models at the word level, and our proposed model showed substantial improvements over this baseline. We further designed experiments with fixed encoder-level attention $\beta_l^{(1)} = \beta_l^{(2)} = 0.5$; the MEM-Res model with HAN outperformed the variants without parameterized stream attention. Moreover, per-encoder CTC consistently enhanced performance with or without HAN. In particular, on WSJ1 the model shows a notable decrease in WER (4.3% → 3.6%) with per-encoder CTC. Our results further confirm the effectiveness of the joint CTC/Attention architecture in comparison to models using either CTC or the attention network alone.

TABLE III: Comparison between the MEM-Res Model and a VGGBLSTM Single-Encoder Model with Similar Network Size (WER %; relative reduction in parentheses)

  Data                     Single-Encoder (21.9M)   Proposed Model (21.3M)
  CHiME-4 et05_real_1ch    32.2                     26.4 (18.0%)
  CHiME-4 et05_real_2ch    26.8                     21.9 (18.3%)
  CHiME-4 et05_real_6ch    21.7                     17.2 (20.8%)
  WSJ1 eval92               5.3                      3.6 (32.1%)

For a fair comparison, we increased the number of BLSTM layers from 4 to 8 in Encoder(2) to train a single-encoder model. As shown in Table III, the MEM-Res system outperforms the single-encoder model by a significant margin with a similar number of parameters, 21.9M vs. 21.3M. On CHiME-4, we evaluated the model using real test data from the 1-, 2-, and 6-channel setups, obtaining an average of 19% relative improvement across the three setups. On WSJ1, our MEM-Res framework achieved 3.6% WER on eval92, a relative improvement of 32.1%.

TABLE IV: Effect of the Multi-Resolution Configuration (s(1), s(2)), where s(1) and s(2) are the Subsampling Factors for Encoder(1) and Encoder(2) (WER %)

  Data                     (4,4)   (2,4)   (1,4)
  CHiME-4 et05_real_1ch    29.1    27.0    26.4
  WSJ1 eval92               4.5     4.2     3.6

The results in Table IV show the contribution of multiple resolutions. The WER increased as the subsampling factor s(1) approached s(2) = 4 on both datasets; in other words, the fusion worked better when the two encoders were more heterogeneous, which supports our hypothesis. As shown in Table V, we analyzed the average stream-level attention weight of Encoder(2) as we gradually decreased its number of LSTM layers while keeping Encoder(1) in its original configuration. The aim was to show that HAN is able to attend to the appropriate encoder in search of the right knowledge. As suggested by the table, attention shifts from Encoder(2) to Encoder(1) as we intentionally make Encoder(2) weaker.

TABLE V: Analysis of the Hierarchical Attention Mechanism when Fixing Encoder(1) and Changing the Number of LSTM Layers in Encoder(2) (WER %: CHiME-4)

  # LSTM Layers in VGGBLSTM   Average Stream Attention for VGGBLSTM   WER %
  0                           0.27                                    30.6
  1                           0.52                                    29.8
  2                           0.75                                    28.9
  3                           0.82                                    27.8
  4                           0.81                                    26.4

V. EXPERIMENTS: MEM-ARRAY MODEL

A. Experimental Setup

Two datasets, the AMI Meeting Corpus and DIRHA English WSJ, were used to demonstrate the MEM-Array model.
The AMI meeting corpus [31] was created in three instrumented meeting rooms, with a focus on developing meeting browsing technology. It contains 100 hours of far-field, signal-synchronized recordings collected using two microphone arrays placed in each room. The training, development, and evaluation sets comprise 81 hours, 9 hours, and 9 hours of meeting recordings, respectively. The DIRHA English WSJ corpus [32] was part of the DIRHA project, which addresses the challenge of speech interaction via distant microphones. A total of 32 microphones were used in a domestic environment consisting of a living room and a kitchen. Two microphone arrays, a circular array and a linear array in the living room, were chosen as the parallel streams. A contaminated version of the original WSJ0 and WSJ1 corpora, generated with room impulse responses for the corresponding arrays, was used for training. The development set for cross-validation was simulated with typical domestic background noise and reverberation. The evaluation set has 409 read utterances from WSJ text recorded by six native English speakers in a real domestic setting.

For both datasets, two microphone arrays (denoted Str1 and Str2) were used to train a MEM-Array model; the array configuration for each dataset is described in Table VI. Note that for each array, the multi-channel input was synthesized into single-channel audio using the Delay-and-Sum beamforming technique with the BeamformIt toolkit [56]. Experiments were conducted with the configuration described in Table VII.

TABLE VI: Description of the Array Configuration in the Two-Stream E2E Experiments

  Dataset   Str1 (Stream 1)        Str2 (Stream 2)
  AMI       8-mic Circular Array   Edinburgh: 8-mic Circular Array
                                   Idiap: 4-mic Circular Array
                                   TNO: 10-mic Linear Array
  DIRHA     6-mic Circular Array   11-mic Linear Array

TABLE VII: Experimental Configuration (MEM-Array)

  Feature
    Single Stream            80-dim fbank + 3-dim pitch
    Multi Stream             Array(1): 80+3; Array(2): 80+3
  Model
    Encoder type             BLSTM or VGGBLSTM
    Encoder layers           BLSTM: 4; VGGBLSTM [53]: 6 (CNN) + 4
    Encoder units            320 cells (BLSTM layers)
    (Stream) Attention       Content-based
    Decoder type             1-layer 300-cell LSTM
    CTC weight λ (train)     AMI: 0.5; DIRHA: 0.2
    CTC weight λ (decode)    AMI: 0.3; DIRHA: 0.3
  RNN-LM
    Type                     Look-ahead word-level RNNLM [51]
    Train data               AMI: AMI; DIRHA: WSJ0-1 + extra WSJ text data
    LM weight γ              AMI: 0.5; DIRHA: 1.0
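For readers unfamiliar with the front-end step above, the following toy NumPy sketch illustrates the basic idea of delay-and-sum beamforming (estimate each channel's delay to a reference channel by cross-correlation, shift, and average). It is only an illustration under simplifying assumptions; the experiments use the BeamformIt toolkit, not this code.

```python
# Toy delay-and-sum beamformer: align channels to a reference, then average.
import numpy as np

def delay_and_sum(channels: np.ndarray, ref: int = 0, max_delay: int = 160) -> np.ndarray:
    """channels: (num_mics, num_samples) array of roughly time-aligned recordings."""
    reference = channels[ref]
    out = np.zeros_like(reference, dtype=np.float64)
    for ch in channels:
        # Cross-correlate with the reference within +/- max_delay samples.
        corr = np.correlate(ch, reference, mode="full")
        mid = len(ch) - 1                                  # zero-lag index
        window = corr[mid - max_delay: mid + max_delay + 1]
        delay = int(np.argmax(window)) - max_delay
        out += np.roll(ch, -delay)                         # compensate the estimated delay
    return out / len(channels)                             # average the aligned channels

# Example: 8 noisy, slightly delayed copies of a sine tone.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)
mics = np.stack([np.roll(clean, int(rng.integers(-40, 40)))
                 + 0.3 * rng.standard_normal(clean.shape) for _ in range(8)])
enhanced = delay_and_sum(mics)
```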
B. Results

Similar to the MEM-Res experiments, we start with a discussion of the single-stream architecture, followed by an analysis of the effectiveness of the proposed MEM-Array model.

Results for the single-array models are summarized in Table VIII. Comparing the two encoder architectures on both datasets, VGGBLSTM noticeably outperforms BLSTM as the encoder type. With the help of CTC and an external RNNLM, substantial improvements were observed in all Stream 1 cases. The best-performing architecture (VGGBLSTM+CTC+ATT+RNNLM) was chosen for the Stream 2 experiments in Table VIII.

TABLE VIII: Exploration of the Best Encoder and Decoding Strategy for the Single-Stream E2E Model

  Model (Single Stream)          AMI Eval          DIRHA Real
                                 CER      WER      CER      WER
  BLSTM (Str1)    Attention      45.1     60.9     42.7     68.7
                  + CTC          41.7     63.0     38.5     74.8
                  + Word RNNLM   41.7     59.1     29.4     47.4
  VGGBLSTM (Str1) Attention      43.2     59.7     39.5     71.4
                  + CTC          40.2     62.0     30.1     61.8
                  + Word RNNLM   39.6     56.9     21.2     35.1
  VGGBLSTM (Str2)                45.6     64.0     22.5     38.4

As illustrated in Table IX, our proposed framework was able to fuse information from both streams successfully, achieving lower error rates than the best single-array systems, i.e., AMI (56.9% → 54.9%) and DIRHA (35.1% → 31.7%). Moreover, several conventional fusion strategies are compared in Table IX: signal-level fusion through WAV alignment and averaging, feature-level frame-by-frame concatenation, and word-level prediction fusion using ROVER. The MEM-Array model outperformed all three fusion techniques, even when the number of BLSTM layers was doubled in signal-level fusion to give a comparable number of parameters (33.7M vs. 31.6M).

TABLE IX: WER (%) Comparison between the Proposed Multi-Stream Approach and Alternative Single-Stream Strategies (Encoder: VGGBLSTM, Att + CTC + RNNLM)

  Approach                              #Param             AMI Eval   DIRHA Real
  Single-stream model
    Concatenating Str1 & Str2           23.3M              56.7       33.5
    WAV alignment and average           26.2M              56.7       43.5
    + model parameter extension         33.7M              56.9       39.6
  Two single-stream models
    ROVER Str1 & Str2                   52.5M (26.2 × 2)   60.7       37.0
  Multi-stream model
    Proposed framework                  31.6M              54.9       31.7

To investigate the robustness of the stream attention, we designed an experiment in which Str1 was injected with zero-mean, unit-variance Gaussian noise at the signal level while Str2 was kept untouched. Fig. 4 displays an example from the DIRHA evaluation set during inference. Noise corruption of Str1 ((a) → (c)) made its attention alignments fairly blurred and thus less trusted. As expected, an on-average positive shift of the stream attention weights toward Str2 was observed.

[Fig. 4: Comparison of the alignments between characters (y-axis) and acoustic frames (x-axis) before ((a) Str1; (b) Str2) and after ((c) Str1; (d) Str2) noise corruption of Str1. (e) shows the attention weight shift of Str2 between the two cases (x-axis is the letter sequence).]

Table X shows the fusion results for six streams in the hybrid ASR system from our previous study [41]. Relative WER reductions of 7.2% and 5.8% were reported compared to the best single-stream performance. Meanwhile, the MEM-Array system with two streams reduced the WER by 9.7% relative. Despite the additional training data involved in E2E, MEM-Array points to a promising direction for fusing more streams.

TABLE X: WER (%) Comparison between the Hybrid and End-to-End Systems on the DIRHA Dataset (#Num Denotes the Number of Streams)

  System   #Num   Method        Best Stream   WER
  Hybrid   6      post. comb.   29.2          27.1 (7.2%)
  Hybrid   6      ROVER         29.2          27.5 (5.8%)
  E2E      2      proposed      35.1          31.7 (9.7%)
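For completeness, the signal-level corruption used in the robustness analysis above (zero-mean, unit-variance Gaussian noise added to the Str1 waveform, Str2 untouched) can be reproduced with a few lines such as the following; file names and the I/O library are assumptions, not part of the experimental pipeline.

```python
# Minimal sketch of the Str1 corruption used in the robustness analysis.
import numpy as np
import soundfile as sf   # assumed I/O library; any WAV reader/writer works

audio, sr = sf.read("str1_utterance.wav")               # hypothetical input file
corrupted = audio + np.random.default_rng(0).standard_normal(audio.shape)
sf.write("str1_utterance_noisy.wav", corrupted, sr)     # fed to Encoder(1) as before
```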
VI. CONCLUSION

In this work, we presented a multi-stream framework for building an end-to-end ASR system. Higher-level frame-wise acoustic features were extracted by parallel encoders with various configurations of input features, architectures, and temporal resolutions. Stream attention was achieved through a hierarchical connection between the decoder and the encoders. We also found that assigning a CTC network to each individual encoder further helped the diverse encoders reveal complementary information.

Two realizations of the multi-stream framework were proposed, the MEM-Res model and the MEM-Array model, targeting different applications. In the MEM-Res architecture, RNN-based and CNN-RNN-based encoders, with subsampling only in the convolutional layers, characterize the same speech in different ways. The model outperformed various single-encoder models, reaching state-of-the-art performance on WSJ among end-to-end systems. For further study, exploring hierarchical feedback from different decoder layers as well as advanced convolutional layers, such as ResNet, and self-attention layers has the potential to improve the WER even more. In multi-array scenarios, taking advantage of all the information that each array shares and contributes is crucial. The MEM-Array model represents each array with one encoder, followed by attention fusion at the context-vector level, where no frame synchronization of the parallel streams is required. Thanks to the success of jointly training per-encoder CTC and attention, substantial WER reductions were shown on both the AMI and DIRHA corpora, demonstrating the potential of the proposed architecture. An efficient extension to more streams and an exploration of scheduled training of the encoders remain to be investigated.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation under Grants No. 1704170 and No. 1743616, and by a Google faculty award to Hynek Hermansky.

REFERENCES

[1] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury et al., "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, vol. 29, 2012.
[2] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proc. of ICML, 2006, pp. 369–376.
[3] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. of ICML, 2014, pp. 1764–1772.
[4] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Proc. of ASRU, 2015, pp. 167–174.
[5] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. of ICASSP, 2015.
[6] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Proc. of NIPS, 2015, pp. 577–585.
[7] A. Graves, "Sequence transduction with recurrent neural networks," in Proc. of ICML Workshop on Representation Learning, 2012.
[8] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. of ICASSP. IEEE, 2013, pp. 6645–6649.
[9] A. Tjandra, S. Sakti, and S. Nakamura, "Local monotonic attention mechanism for end-to-end speech and language processing," in IJCNLP, 2017.
[10] J. Hou, S. Zhang, and L.-R. Dai, "Gaussian prediction based attention for online end-to-end speech recognition," in Proc. of INTERSPEECH, 2017.
[11] T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proc. of EMNLP, Lisbon, Portugal, Sep. 2015, pp. 1412–1421.
[12] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, "Online and linear-time attention by enforcing monotonic alignments," in Proc. of ICML, 2017, pp. 2837–2846.
[13] C.-C. Chiu and C. Raffel, "Monotonic chunkwise attention," in Proc. of ICLR, 2018.
[14] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proc. of ICASSP, 2017, pp. 4835–4839.
[15] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, "Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM," in Proc. of INTERSPEECH, 2017.
[16] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[17] S. H. R. Mallidi, "A practical and efficient multistream framework for noise robust speech recognition," Ph.D. dissertation, Johns Hopkins University, 2018.
[18] H. Hermansky, "Multistream recognition of speech: Dealing with unknown unknowns," Proceedings of the IEEE, vol. 101, no. 5, pp. 1076–1088, 2013.
[19] S. H. Mallidi and H. Hermansky, "Novel neural network based fusion for multistream ASR," in Proc. of ICASSP. IEEE, 2016, pp. 5680–5684.
[20] H. Hermansky, "Coding and decoding of messages in human speech communication: Implications for machine recognition of speech," Speech Communication, 2018.
[21] X. Wang, R. Li, S. H. Mallidi, T. Hori, S. Watanabe, and H. Hermansky, "Stream attention-based multi-array end-to-end speech recognition," in Proc. of ICASSP. IEEE, 2019, pp. 7105–7109.
[22] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in NAACL HLT, 2016, pp. 1480–1489.
[23] C. Hori, T. Hori, T.-Y. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi, "Attention-based multimodal fusion for video description," in Proc. of ICCV. IEEE, 2017, pp. 4203–4212.
[24] J. Libovický and J. Helcl, "Attention strategies for multi-source sequence-to-sequence learning," in Proc. of ACL, vol. 2, 2017, pp. 196–202.
[25] T. Hayashi, S. Watanabe, T. Toda, and K. Takeda, "Multi-head decoder for end-to-end speech recognition," in Proc. of INTERSPEECH, 2018, pp. 801–805.
[26] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., "State-of-the-art speech recognition with sequence-to-sequence models," in Proc. of ICASSP. IEEE, 2018, pp. 4774–4778.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017, pp. 5998–6008.
[28] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, "Deep context: end-to-end contextual speech recognition," in Proc. of SLT. IEEE, 2018, pp. 418–425.
[29] S. Kim and F. Metze, "Dialog-context aware end-to-end speech recognition," in Proc. of SLT. IEEE, 2018, pp. 434–440.
[30] Y. Zhang, W. Chan, and N. Jaitly, "Very deep convolutional networks for end-to-end speech recognition," in Proc. of ICASSP, 2017.
[31] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal et al., "The AMI meeting corpus: A pre-announcement," in Proc. of MLMI. Springer, 2005, pp. 28–39.
[32] M. Ravanelli, P. Svaizer, and M. Omologo, "Realistic multi-microphone data simulation for distant speech recognition," in Proc. of INTERSPEECH, 2016.
[33] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, "The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in Proc. of INTERSPEECH, 2018, pp. 1561–1565.
[34] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, "An analysis of environment, microphone and data simulation mismatches in robust speech recognition," Computer Speech & Language, vol. 46, pp. 535–557, 2017.
[35] Z. Wang, X. Wang, X. Li, Q. Fu, and Y. Yan, "Oracle performance investigation of the ideal masks," in IWAENC 2016. IEEE, 2016, pp. 1–5.
[36] S. Markovich-Golan, A. Bertrand, M. Moonen, and S. Gannot, "Optimal distributed minimum-variance beamforming approaches for speech enhancement in wireless acoustic sensor networks," Signal Processing, vol. 107, pp. 4–20, 2015.
[37] J. Du et al., "The USTC-iFlytek systems for CHiME-5 challenge," in CHiME-5, 2018.
[38] N. Kanda et al., "The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays," in CHiME-5, 2018.
[39] J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in Proc. of ASRU. IEEE, 1997, pp. 347–354.
[40] X. Wang, Y. Yan, and H. Hermansky, "Stream attention for far-field multi-microphone ASR," arXiv preprint arXiv:1711.11141, 2017.
[41] X. Wang, R. Li, and H. Hermansky, "Stream attention for distributed multi-microphone speech recognition," in Proc. of INTERSPEECH, 2018, pp. 3033–3037.
[42] H. Misra, H. Bourlard, and V. Tyagi, "New entropy based combination rules in HMM/ANN multi-stream ASR," in Proc. of ICASSP, vol. 2. IEEE, 2003, pp. II–741.
[43] F. Xiong et al., "Channel selection using neural network posterior probability for speech recognition with distributed microphone arrays in everyday environments," in CHiME-5, 2018.
[44] S. H. Mallidi, T. Ogawa, and H. Hermansky, "Uncertainty estimation of DNN classifiers," in Proc. of ASRU. IEEE, 2015, pp. 283–288.
[45] T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, and X. Xiao, "Unified architecture for multichannel end-to-end speech recognition with neural beamforming," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1274–1288, 2017.
[46] S. Braun, D. Neil, J. Anumula, E. Ceolini, and S.-C. Liu, "Multi-channel attention for end-to-end speech recognition," in Proc. of INTERSPEECH, 2018, pp. 17–21.
[47] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey, "Multichannel end-to-end speech recognition," in Proc. of ICML, 2017, pp. 2632–2641.
[48] S. Kim and I. Lane, "End-to-end speech recognition with auditory attention for multi-microphone distance speech recognition," in Proc. of INTERSPEECH, 2017, pp. 3867–3871.
[49] A. Graves, "Supervised sequence labelling with recurrent neural networks," Ph.D. dissertation, Universität München, 2008.
[50] T. Hori, S. Watanabe, and J. Hershey, "Joint CTC/attention decoding for end-to-end speech recognition," in Proc. of ACL, 2017, pp. 518–529.
[51] T. Hori, J. Cho, and S. Watanabe, "End-to-end speech recognition with word-based RNN language models," in Proc. of SLT. IEEE, 2018, pp. 389–396.
[52] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint, 2014.
[53] J. Cho, M. K. Baskar, R. Li, M. Wiesner, S. H. Mallidi, N. Yalta, M. Karafiat, S. Watanabe, and T. Hori, "Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling," in Proc. of SLT, 2018.
[54] L. D. Consortium, "CSR-II (WSJ1) complete," Linguistic Data Consortium, Philadelphia, vol. LDC94S13A, 1994.
[55] E. Vincent, S. Watanabe, J. Barker, and R. Marxer, "The 4th CHiME speech separation and recognition challenge," 2016.
[56] X. Anguera, C. Wooters, and J. Hernando, "Acoustic beamforming for speaker diarization of meetings," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011–2022, 2007.
[57] J. Chorowski and N. Jaitly, "Towards better decoding and language model integration in sequence to sequence models," arXiv preprint arXiv:1612.02695, 2016.
[58] Y. Zhou, C. Xiong, and R. Socher, "Improved regularization techniques for end-to-end speech recognition," arXiv preprint, 2017.
[59] N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, and E. Dupoux, "End-to-end speech recognition from the raw waveform," arXiv preprint arXiv:1806.07098, 2018.
[60] Y. Wang, X. Deng, S. Pu, and Z. Huang, "Residual convolutional CTC networks for automatic speech recognition," arXiv preprint arXiv:1702.07793, 2017.
[61] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, "End-to-end speech recognition using lattice-free MMI," in Proc. of INTERSPEECH, 2018, pp. 12–16.
[62] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in Proc. of INTERSPEECH, 2018, pp. 2207–2211.
[63] J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in Proc. of ASRU, Dec. 1997, pp. 347–354.

Ruizhi Li has been a Ph.D. student at Johns Hopkins University since 2014. His research interests include machine learning and spoken language processing. He received his B.E. degree in Electrical Engineering from Beijing University of Chemical Technology in 2012, and his M.S. degree in Electrical Engineering from Washington University in St. Louis in 2014. He is a student member of the IEEE.

Xiaofei Wang has been a postdoctoral research fellow at the Center for Language and Speech Processing at Johns Hopkins University in Baltimore, MD, USA, since 2016. He received the Ph.D. from the University of Chinese Academy of Sciences in 2015 and the B.E. from Huazhong University of Science and Technology, China, in 2010. From 2015 to 2016, he was an Assistant Professor at the Institute of Acoustics, Chinese Academy of Sciences. His research interests are far-field automatic speech recognition and speech enhancement. He is a member of IEEE and ISCA.

Sri Harish Mallidi is an applied scientist at Amazon, Seattle, USA, where he works on algorithms and technologies for large-scale, real-time automatic speech recognition systems. He received his Doctor of Philosophy from the Center for Language and Speech Processing, Johns Hopkins University, in 2018 with Prof. Hynek Hermansky. Prior to this, he obtained his B.Tech (2008) and M.S. (2010) in Electronics and Communications from the International Institute of Information Technology, Hyderabad (IIIT-H), India. His research interests include machine learning methods for speech recognition, speech activity detection, keyword spotting, and speaker recognition and diarization.

Shinji Watanabe is an Associate Research Professor at Johns Hopkins University, Baltimore, MD, USA. He received his B.S., M.S., and PhD (Dr. Eng.) degrees in 1999, 2001, and 2006, respectively, from Waseda University, Tokyo, Japan.
He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009, and a Senior Principal Research Scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, from 2012 to 2017. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published more than 150 papers in top journals and conferences, and received several awards including the best paper award from the IEICE in 2003. He served as an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing, and is a member of several technical committees, including the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC) and the Machine Learning for Signal Processing Technical Committee (MLSP).

Takaaki Hori (SM'14) received the B.E. and M.E. degrees in electrical and information engineering from Yamagata University, Yonezawa, Japan, in 1994 and 1996, respectively, and the Ph.D. degree in system and information engineering from Yamagata University in 1999. From 1999 to 2015, he was engaged in research on speech recognition and spoken language understanding at Cyber Space Laboratories and Communication Science Laboratories in Nippon Telegraph and Telephone (NTT) Corporation, Japan. He was a visiting scientist at the Massachusetts Institute of Technology (MIT) from 2006 to 2007. Since 2015, he has been a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, Massachusetts, USA. He has coauthored more than 100 peer-reviewed papers in speech and language research fields. He received the 24th TELECOM System Technology Award from the Telecommunications Advancement Foundation in 2009, the IPSJ Kiyasu Special Industrial Achievement Award from the Information Processing Society of Japan in 2012, and the 58th Maejima Hisoka Award from the Tsushinbunka Association in 2013.

Hynek Hermansky (LF'17, F'01, SM'92, M'83, SM'78) received the Dr. Eng. degree from the University of Tokyo and the Dipl. Ing. degree from Brno University of Technology, Czech Republic. He is the Julian S. Smith Professor of Electrical Engineering and the Director of the Center for Language and Speech Processing at Johns Hopkins University in Baltimore, Maryland. He is also a Professor at the Brno University of Technology, Czech Republic. He has been working in speech processing for over 30 years. His main research interests are in acoustic processing for speech recognition. He is a Life Fellow of the IEEE and a Fellow of the International Speech Communication Association (ISCA). He is the General Chair of INTERSPEECH 2021, was the General Chair of the 2013 IEEE Automatic Speech Recognition and Understanding Workshop, was in charge of plenary sessions at the 2011 ICASSP in Prague, was the Technical Chair at the 1998 ICASSP in Seattle, and was an Associate Editor for the IEEE Transactions on Speech and Audio. He is also a Member of the Editorial Board of Speech Communication, was twice an elected Member of the Board of ISCA, a Distinguished Lecturer for IEEE, a Distinguished Lecturer for ISCA, and the recipient of the 2013 ISCA Medal for Scientific Achievement.
