Gated Recurrent Unit Based Acoustic Modeling with Future Context
Authors: Jie Li, Xiaorui Wang, Yuanyuan Zhao
Jie Li 1, Xiaorui Wang 1, Yuanyuan Zhao 2, Yan Li 1
1 Kwai, Beijing, P.R. China
2 Institute of Automation, Chinese Academy of Sciences, Beijing, P.R. China
{lijie03, wangxiaorui, liyan}@kuaishou.com, yyzhao5231@ia.ac.cn

Abstract

The use of future contextual information is typically shown to be helpful for acoustic modeling. However, for the recurrent neural network (RNN), it is not easy to model the future temporal context effectively while keeping the model latency low. In this paper, we attempt to design an RNN acoustic model that is capable of utilizing the future context effectively and directly, with model latency and computation cost as low as possible. The proposed model is based on the minimal gated recurrent unit (mGRU), with an input projection layer inserted into it. Two context modules, temporal encoding and temporal convolution, are specifically designed for this architecture to model the future context. Experimental results on the Switchboard task and an internal Mandarin ASR task show that the proposed model performs much better than long short-term memory (LSTM) and mGRU models, while enabling online decoding with a maximum latency of 170 ms. This model even outperforms a very strong baseline, TDNN-LSTM, with smaller model latency and almost half as many parameters.

Index Terms: speech recognition, acoustic modeling, future temporal context, gated recurrent unit

1. Introduction

It is typically beneficial for acoustic modeling to make full use of future contextual information. In the literature, there are a variety of methods to realize this idea for different model architectures. For the feed-forward neural network (FFNN), this context is usually provided by splicing a fixed set of future frames into the input representation [1].
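As a concrete illustration of such frame splicing, here is a minimal NumPy sketch; the window sizes, function name, and edge handling are assumptions for illustration, not the exact recipe of [1]:

```python
import numpy as np

def splice_frames(feats, left=2, right=2):
    """Build FFNN inputs by concatenating each frame with its context
    window [t-left, t+right]; edge frames are padded by repetition."""
    T, d = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].reshape(-1)
                     for t in range(T)])

feats = np.random.randn(100, 40)             # 100 frames of 40-dim features
inputs = splice_frames(feats, left=2, right=2)
print(inputs.shape)  # -> (100, 200): each input spans 5 frames
```

With `right > 0`, each spliced input looks `right` frames into the future, which is exactly how the FFNN gains access to future context at the cost of `right` frames of latency.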
Other approaches modify the FFNN model structure itself. The authors in [2, 3] proposed a model called the feedforward sequential memory network (FSMN), a standard FFNN equipped with learnable memory blocks in the hidden layers that encode long context information into a fixed-size representation. The time delay neural network (TDNN) [4, 5] is another FFNN architecture that has been shown to be effective in modeling long-range dependencies through temporal convolution over context.

For the unidirectional recurrent neural network (RNN), future context is usually incorporated by delaying the prediction of the output labels [6]. However, this method provides only quite limited modeling power over future context, as shown in [7]. For the bidirectional RNN, future context is captured by processing the data in the backward direction with a separate RNN layer [8, 9, 10]. Although bidirectional versions have been shown to outperform unidirectional ones by a large margin [11, 12], the latency of bidirectional models is significantly larger, making them unsuitable for online speech recognition. To overcome this limitation, chunk-based training and decoding schemes such as context-sensitive-chunk (CSC) [13, 14] and latency-controlled (LC) BLSTM [11, 15] have been investigated. However, the model latency is still quite high, since in all these online variants, inference is restricted to chunk-level increments to amortize the computation cost of the backward RNN. For example, the decoding latency of LC-BLSTM in [15] is about 600 ms, the sum of the chunk size N_c and the number of future context frames N_r. To overcome the shortcomings of the chunk-based methods, Peddinti et al. [7] proposed the use of temporal convolution, in the form of TDNN layers, to model the future temporal context while affording inference with frame-level increments.
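Such a TDNN layer can be viewed as a 1-D convolution over time: hidden vectors at a few fixed offsets are spliced and passed through an affine transform that is tied across frames. The following toy NumPy sketch illustrates the idea; the offsets and layer widths are illustrative assumptions, not the configuration of [7]:

```python
import numpy as np

def tdnn_layer(x, W, b, offsets):
    """One TDNN layer: at each frame t, splice the inputs at t + offset
    for each offset, then apply an affine transform tied across time
    (i.e., temporal convolution). Edges are handled by clamping."""
    T, d = x.shape
    out = []
    for t in range(T):
        ctx = [x[min(max(t + o, 0), T - 1)] for o in offsets]
        out.append(W @ np.concatenate(ctx) + b)
    return np.stack(out)

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 8))    # 50 frames, 8-dim input
offsets = (-1, 0, 1)                # one past, current, one future frame
W = rng.standard_normal((16, 8 * len(offsets)))
b = np.zeros(16)
h = tdnn_layer(x, W, b, offsets)
print(h.shape)  # -> (50, 16)
```

Because the lookahead per layer is fixed (here max(offsets) = 1 frame), the total future context grows with depth while inference can still proceed in frame-level increments.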
The proposed model, called TDNN-LSTM, interleaves temporal convolution (TDNN layers) with unidirectional long short-term memory (LSTM) [16, 17, 18, 19] layers. This model was shown to outperform bidirectional LSTM on two automatic speech recognition (ASR) tasks, while enabling online decoding with a maximum latency of 200 ms [7]. However, TDNN-LSTM's ability to model the future context comes from the TDNN part, whereas the LSTM itself is incapable of utilizing future information effectively.

In this paper, we attempt to design an RNN acoustic model that can model the future context effectively and directly, without depending on extra layers such as TDNN layers. In addition, the model latency and computation cost should be as low as possible. With this purpose, we choose the minimal gated recurrent unit (mGRU) [20] as our base RNN model. mGRU is a revised version of the GRU [21, 22] and contains only one multiplicative gate, making its computational cost much smaller than that of the GRU and vanilla LSTM [19]. Based on mGRU, we propose to insert a linear input projection layer into it, yielding a model called mGRUIP. The inserted linear projection layer compresses the input vector and hidden state vector simultaneously. Since the size of this layer is much smaller than the cell number, mGRUIP contains far fewer parameters than mGRU. Beyond this, the input projection layer has two other advantages. The first is that inserting this layer benefits ASR performance: our experiments on a 309-hour Switchboard task show that mGRUIP outperforms mGRU significantly. This finding is consistent with that for the LSTM with an input projection layer (LSTMIP) [23].
The second (and most important) advantage is that the input projection forms a bottleneck in the recurrent layer, making it possible to design a module on top of it that utilizes future context information effectively without significantly increasing the model size. In this work, we design two kinds of context modules specifically for mGRUIP, making it capable of modeling future temporal context effectively and directly. The first module is referred to as temporal encoding, in which an mGRUIP layer is equipped with a context block that encodes future context information into a fixed-size representation, similar to the FSMN. Temporal encoding is performed at the input projection layer, so the increase in computation cost is quite small. The second module borrows the idea of the TDNN and is called temporal convolution, as the transforms in it are tied across time steps. In temporal convolution, future context information from several frames is spliced together and compressed by the input projection layer. Thanks to the small dimensionality of the projection, temporal convolution adds quite few additional parameters. These two context modules are shown to be quite effective on two ASR tasks, while maintaining low-latency (170 ms) online decoding. Compared with LSTM and mGRU, mGRUIP with temporal convolution provides more than 13% relative WER reduction on the full Switchboard Hub5'00 test set, while on our 1400-hour internal Mandarin ASR task, the relative gain ranges from 13% to 24% across test sets. What is more, the proposed model outperforms TDNN-LSTM with smaller decoding latency and almost half as many parameters.

This paper is organized as follows. Section 2 describes the model architecture of the GRU and its variants, including the proposed mGRUIP and the two context modules. Related work is introduced in Section 3.
We report our experimental results on two ASR tasks in Section 4 and conclude this work in Section 5.

2. Model Architectures

In this section, we first give a brief introduction to the model structures of the GRU and mGRU. Then the proposed mGRUIP and the two context modules are introduced in detail.

2.1. GRU

The GRU model is defined by the following equations (the layer index l is omitted for simplicity):

z_t = σ(W_z x_t + U_z h_{t−1} + b_z)   (1)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r)   (2)
h̃_t = tanh(W_h x_t + U_h (h_{t−1} ∗ r_t) + b_h)   (3)
h_t = z_t ∗ h_{t−1} + (1 − z_t) ∗ h̃_t   (4)

Here z_t and r_t are the update and reset gate vectors respectively, and ∗ denotes element-wise multiplication. The activations of both gates are element-wise logistic sigmoid functions σ(·), constraining the values of z_t and r_t to the range from 0 to 1. h_t is the output state vector for the current time frame t, while h̃_t is the candidate state obtained with a hyperbolic tangent. The network is fed by the current input vector x_t (speech features or the output vector of the previous layer), and the parameters of the model are W_z, W_r, W_h (the feed-forward connections), U_z, U_r, U_h (the recurrent weights), and the bias vectors b_z, b_r, b_h.

2.2. mGRU

mGRU, short for minimal GRU, is a revised version of the GRU described above. It was proposed in [20] and contains two modifications: removing the reset gate and replacing the hyperbolic tangent with the ReLU activation. This leads to the following update equations:

z_t = σ(W_z x_t + U_z h_{t−1} + b_z)   (5)
h̃_t = ReLU(BN(W_h x_t + U_h h_{t−1}) + b_h)   (6)
h_t = z_t ∗ h_{t−1} + (1 − z_t) ∗ h̃_t   (7)

where BN denotes batch normalization.

2.3. mGRUIP

In this work, a novel model called mGRUIP is proposed by inserting a linear input projection layer into mGRU.
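As a reference point for the modification just described, one mGRU time step (equations (5)-(7)) can be sketched in NumPy as follows; batch normalization is omitted for brevity, and the function name and toy dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgru_step(x_t, h_prev, Wz, Uz, bz, Wh, Uh, bh):
    """One mGRU time step (eqs. 5-7); batch normalization omitted."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)             # update gate, eq. (5)
    h_cand = np.maximum(0.0, Wh @ x_t + Uh @ h_prev + bh)  # ReLU candidate, eq. (6)
    return z_t * h_prev + (1.0 - z_t) * h_cand             # interpolation, eq. (7)

# toy dimensions: n_i = 4 input units, n_c = 3 cells
rng = np.random.default_rng(0)
n_i, n_c = 4, 3
shapes = [(n_c, n_i), (n_c, n_c), (n_c,), (n_c, n_i), (n_c, n_c), (n_c,)]
params = [0.1 * rng.standard_normal(s) for s in shapes]
h = mgru_step(rng.standard_normal(n_i), np.zeros(n_c), *params)
print(h.shape)  # -> (3,)
```

Note that, with the reset gate removed, the only multiplicative gate left is z_t, which interpolates between the previous state and the candidate state.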
In mGRUIP, the output state vector h_t is calculated from the input vector x_t by the following equations:

v_t = W_v [x_t; h_{t−1}]   (8)
z_t = σ(W_z v_t + b_z)   (9)
h̃_t = ReLU(BN(W_h v_t) + b_h)   (10)
h_t = z_t ∗ h_{t−1} + (1 − z_t) ∗ h̃_t   (11)

In mGRUIP, the current input vector x_t and the previous output state vector h_{t−1} are concatenated and compressed into a lower-dimensional projected vector v_t by the weight matrix W_v. The update gate activation z_t and the candidate state vector h̃_t are then calculated from the projected vector v_t.

mGRUIP reduces the parameters of mGRU significantly. The total number of parameters in a standard mGRU network, ignoring the biases, is

N_mGRU = n_i × n_c × 2 + n_c × n_c × 2

where n_c is the number of hidden neurons and n_i the number of input units. For mGRUIP, this value becomes

N_mGRUIP = (n_i + n_c) × n_p + n_p × n_c × 2

where n_p is the number of units in the input projection layer. Assuming n_c equals n_i, the ratio of these two numbers is

N_mGRUIP / N_mGRU = n_p / n_c

In a typical configuration we can set n_c = 1024 and n_p = 512, so mGRUIP has just half the parameters of mGRU, making the computation quite efficient. Despite this, our experiments on the Switchboard task show that mGRUIP outperforms mGRU with the same number of neurons n_c. What is more, increasing n_c while decreasing n_p can further enlarge the gains.

2.4. mGRUIP with Context Modules

The input projection layer forms a bottleneck in mGRUIP, making it possible to utilize the future context effectively while keeping the increase in model size acceptable. In this paper, two kinds of context modules, namely temporal encoding and temporal convolution, are specifically designed for mGRUIP.

2.4.1.
mGRUIP with Temporal Encoding

In temporal encoding, context information from several future frames is encoded into a fixed-size representation at the input projection layer. Thus equation (8) of the standard mGRUIP becomes:

v_t^l = W_v^l [x_t^l; h_{t−1}^l] + Σ_{i=1}^{K} f(v_{t+s×i}^{l−1})   (12)

where the summation in equation (12) stands for temporal encoding. In particular, v_{t+s×i}^{l−1} is the input projection vector of layer l−1 at frame t+s×i, s ≥ 1 is the step stride, and K is the order of the future context. f(·) denotes the transform applied to v^{l−1}. We tried three forms: identity (f(x) = x), scale (f(x) = m ∗ x), and affine transform (f(x) = Wx). Preliminary results show that the identity function gives slightly better performance than the other two forms, so we choose f(x) = x for the rest of this paper. Note that in this case temporal encoding brings no additional parameters to mGRUIP.

2.4.2. mGRUIP with Temporal Convolution

Temporal encoding uses the projection vectors of the lower layer (v_{t+s×i}^{l−1}) to represent the future context, while in temporal convolution the future information is extracted from the output state vectors of the lower layer and then compressed by the input projection. Equation (8) becomes:

v_t^l = W_v^l [x_t^l; h_{t−1}^l] + W_p^l [h_{t+s×1}^{l−1}; ···; h_{t+s×K}^{l−1}]   (13)

where the last term represents temporal convolution. In particular, h_{t+s×i}^{l−1} is the output state vector of layer l−1 at frame t+s×i. As in temporal encoding, s is the step stride and K is the context order. According to this equation, the vectors h_{t+s×i}^{l−1} from K future frames are spliced together and projected to a lower-dimensional space by the matrix W_p^l. Assuming the number of hidden neurons in layer l−1 is n_c, temporal convolution brings K × n_c × n_p additional parameters. However, since the value of n_p is usually quite small and we generally splice no more than two frames (K ≤ 2), the increase in model size is limited and acceptable.

3. Related Work

The authors in [23] proposed inserting an input projection layer into the vanilla LSTM to reduce computation cost. In this work, we apply this idea to mGRU [20], obtaining a model called mGRUIP, which is shown to be both more effective and more efficient than mGRU.

TDNN-LSTM [7] is one of the most powerful acoustic models that can utilize future context effectively while keeping model latency relatively low. However, its ability to model future temporal context comes from the TDNN and has nothing to do with the LSTM layers. In this work, thanks to the input projection layer, we enable mGRUIP itself to model the future context effectively and directly, by equipping it with one of the two proposed context modules, temporal encoding and temporal convolution. These two modules borrow ideas from the FSMN [2, 3] and the TDNN [4, 5] respectively. The difference is that the FSMN and TDNN are FFNNs, so both of them need to model the future context as well as the past information to capture long-term dependencies, whereas the two proposed context modules are placed in an RNN layer and only need to focus on the future context, leaving the history to be modeled by the recurrent connections.

Row convolution [24], which encodes future context by applying a context-independent weight matrix, is another method to model the future context in an RNN, and the idea is similar to the two proposed context modules. However, row convolution in [24] is placed only above all recurrent layers, while in this work we place context modules in all hidden layers (except the first one).
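To make the two context modules concrete, here is a toy NumPy sketch of the additive terms in equations (12) and (13); all function names and dimensions are illustrative assumptions:

```python
import numpy as np

def temporal_encoding(Wv, xh, v_future):
    """Eq. (12) with the identity transform f(x) = x: sum the K future
    projection vectors of the lower layer into the current projection."""
    return Wv @ xh + np.sum(v_future, axis=0)

def temporal_convolution(Wv, xh, Wp, h_future):
    """Eq. (13): splice K future output states of the lower layer and
    project them with Wp, then add to the current projection."""
    return Wv @ xh + Wp @ np.concatenate(h_future)

# toy dimensions: n_i = n_c = 6, projection n_p = 2, context order K = 2
rng = np.random.default_rng(0)
n_i = n_c = 6
n_p, K = 2, 2
xh = rng.standard_normal(n_i + n_c)        # [x_t; h_{t-1}]
Wv = rng.standard_normal((n_p, n_i + n_c))
v_future = rng.standard_normal((K, n_p))   # v^{l-1} at frames t+s, t+2s
h_future = [rng.standard_normal(n_c) for _ in range(K)]
Wp = rng.standard_normal((n_p, K * n_c))

v_enc = temporal_encoding(Wv, xh, v_future)
v_conv = temporal_convolution(Wv, xh, Wp, h_future)

# temporal encoding adds no parameters; temporal convolution adds K*n_c*n_p
assert Wp.size == K * n_c * n_p
```

Both variants only change how the projected vector v_t is computed; the gating equations (9)-(11) are untouched.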
This layer-wise context expansion gives the higher layers the ability to learn wider temporal relationships than the lower layers. What is more, the objective function also differs: connectionist temporal classification (CTC) [25] in [24] versus lattice-free MMI (LF-MMI) [26] in this work.

4. Experiments

In this section, we evaluate the effectiveness and efficiency of the proposed mGRUIP on two ASR tasks. The first is the 309-hour Switchboard conversational telephone speech task, and the second is an internal Mandarin voice input task with 1400 hours of training data. All the models in this paper are trained with the LF-MMI objective function computed on 33 Hz outputs [26].

4.1. Switchboard ASR Task

The training data set consists of the 309-hour Switchboard-I training data. Evaluation is performed in terms of word error rate (WER) on the full Switchboard Hub5'00 test set, which consists of two subsets: Switchboard (SWB) and CallHome (CHE). The experimental setup follows [26]. We use the speed-perturbation technique [28] for 3-fold data augmentation and iVectors to perform instantaneous adaptation of the neural network [29]. WER results are reported after 4-gram LM rescoring of lattices generated with a trigram LM. For details of the model training, the reader is directed to [26].

4.1.1. Baseline Models

Two baseline models, LSTM and mGRU, are trained for this task. Both contain 5 hidden layers, and the cell number of each layer is 1024. For the LSTM, we add a recurrent projection layer on top of the memory blocks with a dimension of 512, compressing the cell output from 1024 to 512 dimensions. For the mGRU, to reduce the parameters of the softmax output matrix, we insert a 512-dimensional linear bottleneck layer between the last hidden layer and the softmax layer. Both models are trained with an output delay of 50 ms. The input feature to both models at time step t is a spliced version of frames t − 2 through t + 2.
Therefore, they both have a model latency of 70 ms. Following [7], we use a mixed frame rate (MFR) across layers. In particular, the first hidden layer operates at a 100 Hz frame rate while the higher layers use a frame rate of 33 Hz.

4.1.2. mGRUIP

To evaluate the effectiveness of the proposed mGRUIP, we train two 5-layer models, mGRUIP-A and mGRUIP-B, with different architectures. In mGRUIP-A, each hidden layer consists of 1024 cells (n_c = 1024, the same as the baseline models), and the input projection layer has 512 units (n_p = 512). For mGRUIP-B, the cell number is 2560 and the projection dimension is 256. The training configurations are kept the same as for the baseline models.

Table 1: Performance comparison of LSTM, mGRU and mGRUIP on the Switchboard task.

Model       #Param (M)   WER (%)
                         SWB    CHE    Total
LSTM        19.7         10.3   20.7   15.6
mGRU        22.1         10.2   20.6   15.5
mGRUIP-A    13.1          9.8   19.0   14.5
mGRUIP-B    16.2          9.7   18.8   14.3

The performance of the two mGRUIP models and the two baseline models is shown in Table 1. For the two baselines, mGRU has more parameters and performs slightly better than LSTM. The proposed mGRUIP-A contains far fewer parameters than the baseline mGRU (13.1M vs. 22.1M) but performs significantly better on the full test set (14.5 vs. 15.5). This means the input projection layer not only reduces the parameters of mGRU but also benefits performance. mGRUIP-B outperforms mGRUIP-A, showing that we can improve ASR performance by increasing the cell number while avoiding a significant model-size increase by reducing the projection dimension. Compared with mGRU, mGRUIP-B provides a 7.7% relative WER reduction on the full test set while using 5.9M fewer parameters. In the following experiments, we set n_c = 2560 and n_p = 256 for the mGRUIP-related models.

4.1.3.
mGRUIP with Context Modules

Obviously, temporal encoding and temporal convolution can utilize more future context information as K and s in equations (12) and (13) increase. However, this also increases the model latency and, for temporal convolution, the model parameters. After extensive experiments, we found the most cost-effective settings for the two context modules to be as follows:

Table 2: The most cost-effective settings for the two context modules.

Layer   l = 2   l = 3   l = 4   l = 5
K × s   1 × 1   1 × 3   1 × 3   1 × 3

As shown in Table 2, the four higher mGRUIP layers (all except the first) are equipped with context modules. The context order K is 1 for all of them, and the step stride s is 3 for the highest three layers and 1 for the second hidden layer (l = 2), keeping the operating frame rates the same as the baselines. With this setting, the latency of mGRUIP increases from 70 ms to 170 ms.

Table 3 shows the performance of mGRUIP with the two context modules. We also train a TDNN-LSTM model following [7]; the results are shown in the second line of Table 3.

Table 3: Performance comparison of LSTM, TDNN-LSTM and mGRUIP with context modules on the Switchboard task.

Model              #Param (M)   Latency (ms)   WER (%)
                                               SWB    CHE    Total
LSTM               19.7          70            10.3   20.7   15.6
TDNN-LSTM          34.8         200             9.0   19.7   14.4
mGRUIP-B           16.2          70             9.7   18.8   14.3
 +Ctx Encd         16.2         170             9.5   18.0   13.8
 +Ctx Conv         18.7         170             9.2   17.8   13.5
MFR-BLSTM [7]      -            2020            9.0   -      13.6
TDNN-BLSTM-C [7]   -            2130            9.0   -      13.8

Several observations can be made from Table 3. First, both context modules improve the ASR performance of mGRUIP. Temporal convolution is more powerful than temporal encoding, at the cost of some additional parameters. Second, compared with the LSTM, mGRUIP-B equipped with temporal convolution provides a 13.5% relative WER reduction, at the cost of only 100 ms of additional model latency.
Third, mGRUIP-B with temporal convolution is more effective than TDNN-LSTM on the full test set (13.5 vs. 14.4), with smaller model latency and far fewer parameters (18.7M vs. 34.8M). What is more, compared with the two most powerful models in [7] (the last two lines of Table 3), the proposed model outperforms them on the full set with much smaller model latency (170 ms vs. about 2000 ms).

4.2. Internal Mandarin ASR Task

The second task is an internal Mandarin ASR task whose training set contains 1400 hours of mobile recording data. The performance is evaluated on five publicly available test sets, three clean and two noisy. The three clean sets are:

• AiShell dev: the development set of the released corpus AiShell-1 [30], containing 14326 utterances.
• AiShell test: the test set of the released corpus AiShell-1, containing 7176 utterances.
• THCHS-30 Clean: the clean test set of the THCHS-30 database [31], containing 2496 utterances.

The two noisy test sets are:

• THCHS-30 Car: THCHS-30 Clean corrupted by car noise at a noise level of 0 dB.
• THCHS-30 Cafe: THCHS-30 Clean corrupted by cafeteria noise at a noise level of 0 dB.

Three ASR systems are built for this task: LSTM, TDNN-LSTM, and mGRUIP-B with temporal convolution. The model architectures and training configurations are the same as for the Switchboard task. Results are shown in Table 4.

Table 4: Performance of different models on the internal Mandarin ASR task.

Test set          CER (%)                         CERR
                  LSTM    TDNN-LSTM   mGRUIP
AiShell dev        5.39    4.81        4.66      13.5%
AiShell test       6.62    5.98        5.71      13.8%
THCHS-30 Clean    11.93   10.97       10.38      13.0%
THCHS-30 Car      12.69   11.38       10.77      15.1%
THCHS-30 Cafe     53.19   44.20       40.26      24.3%

The CERR column in Table 4 gives the relative CER reduction of mGRUIP over LSTM. mGRUIP performs much better than the baseline LSTM model on this task.
On the three clean test sets, the CERR is about 13%, and the gain is even larger on the two very noisy sets, ranging from 15% to 24%.

5. Conclusions

The aim of this paper is to design an RNN acoustic model capable of utilizing the future context effectively and directly, with model latency and computation cost as low as possible. To achieve this goal, we choose the minimal GRU as our base model and propose to insert an input projection layer into it to further reduce the parameters. To model the future context effectively, we design two kinds of context modules, temporal encoding and temporal convolution, specifically for this architecture. Experimental results on the Switchboard task and an internal Mandarin ASR task show that the proposed model performs much better than LSTM and mGRU models, while enabling online decoding with a latency of 170 ms. This model even outperforms a very strong baseline, TDNN-LSTM, with smaller model latency and almost half as many parameters.

6. References

[1] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech & Language Processing, vol. 20, no. 1, pp. 30-42, 2012.
[2] S. Zhang, C. Liu, H. Jiang, S. Wei, L. Dai, and Y. Hu, "Feedforward sequential memory networks: A new structure to learn long-term dependency," Computer Science, 2015.
[3] S. Zhang, H. Jiang, S. Xiong, S. Wei, and L. R. Dai, "Compact feedforward sequential memory networks for large vocabulary continuous speech recognition," in INTERSPEECH, 2016, pp. 3389-3393.
[4] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," Readings in Speech Recognition, vol. 1, no. 2, pp. 393-404, 1990.
[5] V. Peddinti, D. Povey, and S.
Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in INTERSPEECH, 2015.
[6] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," Computer Science, pp. 338-342, 2014.
[7] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur, "Low latency acoustic modeling using temporal convolution and LSTMs," IEEE Signal Processing Letters, vol. PP, no. 99, pp. 1-1, 2017.
[8] M. Schuster and K. K. Paliwal, Bidirectional Recurrent Neural Networks. IEEE Press, 1997.
[9] A. Graves, S. Fernandez, and J. Schmidhuber, Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. Springer Berlin Heidelberg, 2005.
[10] A. Graves, N. Jaitly, and A. R. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Automatic Speech Recognition and Understanding, 2014, pp. 273-278.
[11] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, "Highway long short-term memory RNNs for distant speech recognition," Computer Science, pp. 5755-5759, 2015.
[12] A. Zeyer, R. Schlüter, and H. Ney, "Towards online-recognition with deep bidirectional LSTM acoustic models," in INTERSPEECH, 2016, pp. 3424-3428.
[13] K. Chen and Q. Huo, Training Deep Bidirectional LSTM Acoustic Model for LVCSR by a Context-Sensitive-Chunk BPTT Approach. IEEE Press, 2016.
[14] K. Chen, Z. J. Yan, and Q. Huo, "A context-sensitive-chunk BPTT approach to training deep LSTM/BLSTM recurrent neural networks for offline handwriting recognition," in International Conference on Document Analysis and Recognition, 2016, pp. 411-415.
[15] S. Xue and Z. Yan, "Improving latency-controlled BLSTM acoustic models for online speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 5340-5344.
[16] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory. Springer Berlin Heidelberg, 1997.
[17] F. A. Gers, J. Schmidhuber, and F. Cummins, Learning to Forget: Continual Prediction with LSTM. Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, 1999.
[18] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count," in IEEE-INNS-ENNS International Joint Conference on Neural Networks, 2000, pp. 189-194, vol. 3.
[19] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, p. 602, 2005.
[20] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "Improving speech recognition by revising gated recurrent units," INTERSPEECH, pp. 1308-1312, 2017.
[21] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," Computer Science, 2014.
[22] J. Chung, C. Gulcehre, K. H. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," Eprint Arxiv, 2014.
[23] T. Masuko, "Computational cost reduction of long short-term memory based on simultaneous compression of input and hidden state," in Automatic Speech Recognition and Understanding, 2017.
[24] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, and G. Diamos, "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in ICML, 2015.
[25] A. Graves and F. Gomez, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in International Conference on Machine Learning, 2006, pp. 369-376.
[26] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S.
Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in INTERSPEECH, 2016, pp. 2751-2755.
[27] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," Proc. Interspeech, 2013.
[28] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," Proc. Interspeech, 2015.
[29] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Automatic Speech Recognition and Understanding, 2014, pp. 55-59.
[30] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," 2017.
[31] D. Wang, X. Zhang, and Z. Zhang, "THCHS-30: A free Chinese speech corpus," 2015. [Online]. Available: http://arxiv.org/abs/1512.01882