High Order Recurrent Neural Networks for Acoustic Modelling
C. Zhang & P.C. Woodland
Cambridge University Engineering Dept., Trumpington St., Cambridge, CB2 1PZ, U.K.
{cz277,pcw}@eng.cam.ac.uk

ABSTRACT

Vanishing long-term gradients are a major issue in training standard recurrent neural networks (RNNs), which can be alleviated by long short-term memory (LSTM) models with memory cells. However, the extra parameters associated with the memory cells mean an LSTM layer has four times as many parameters as an RNN with the same hidden vector size. This paper addresses the vanishing gradient problem using a high order RNN (HORNN), which has additional connections from multiple previous time steps. Speech recognition experiments using British English multi-genre broadcast (MGB3) data showed that the proposed HORNN architectures for rectified linear unit and sigmoid activation functions reduced word error rates (WER) by 4.2% and 6.3% over the corresponding RNNs, and gave similar WERs to a (projected) LSTM while using only 20%–50% of the recurrent layer parameters and computation.

1. INTRODUCTION

A recurrent neural network (RNN) is an artificial neural network layer where hidden layer outputs from the previous time step form part of the input used to process the current time step [1, 2]. This allows information to be preserved through time and is well suited to sequence processing problems, such as acoustic and language modelling for automatic speech recognition [3, 4]. However, training RNNs with sigmoid activation functions by gradient descent can be difficult. The key issues are exploding and vanishing gradients [5]: the long-term gradients that are back-propagated through time can either continually increase (explode) or decrease to zero (vanish). This causes RNN training either to fail to capture long-term temporal relations, or to take standard update steps that put parameters out of range.
Many methods have been proposed to solve the gradient exploding and vanishing problems. While simple gradient clipping has been found to work well in practice to prevent gradients from exploding [4], circumventing vanishing gradients normally requires more sophisticated strategies [6]. For instance, [7] uses Hessian-free training, which makes use of second-order derivative information. Modifying the recurrent layer structure is another approach. The use of both rectified linear unit (ReLU) and sigmoid activation functions with trainable amplitudes was proposed to maintain the magnitude of RNN long-term gradients [8–10]. A gating technique is used in the long short-term memory (LSTM) model, where additional parameters implement a memory circuit which can remember long-term information from the recurrent layer [11]. A model similar to the LSTM is the gated recurrent unit [12]. More recently, residual [13] and highway connections [14] were proposed to train very deep feed-forward models, allowing gradients to pass more easily through many layers, and various similar ideas have been applied to recurrent models [15–20]. Among these approaches, the LSTM has recently become the dominant type of recurrent architecture. However, due to the extra parameters associated with gating, LSTMs use four times as many parameters as standard RNNs with the same hidden layer size, which significantly increases storage and computation in both training and testing.

In this paper, we propose another RNN modification, the high order RNN (HORNN), as an alternative to the LSTM. It handles vanishing gradients by adding connections from hidden state values at multiple previous time steps to the RNN input. By interpreting the RNN layer hidden vector as a continuous valued hidden state, the connections are termed high order since they introduce dependencies on multiple previous hidden states.

(Thanks to Mark Gales and the MGB3 team for the MGB3 setup used.)
Acoustic modelling using HORNNs is investigated for both sigmoid and ReLU activation functions. In the sigmoid case, it is found that additional high order connections are beneficial. Furthermore, analogous to the projected LSTM (LSTMP) [22], a linear recurrent projection layer can be used by HORNNs to reduce the number of parameters, which results in the projected HORNN (HORNNP). Experimental results show that the HORNN/HORNNP models (both sigmoid and ReLU) have similar word error rates (WERs) to LSTM/LSTMP models with the same hidden vector size, while using fewer than half the parameters and computation. Furthermore, HORNNs were also found to outperform RNNs with residual connections in terms of both speed and WER.

This paper is organised as follows. Section 2 reviews RNN and LSTM models. The (conditional) Markov property of RNNs is described in Sec. 3, which leads to HORNNs and architectures for both sigmoid and ReLU activation functions. The experimental setup and results are given in Sec. 4 and Sec. 5, followed by conclusions.

2. RNN AND LSTM MODELS

In this paper, an RNN refers to an Elman network [2] that produces its output hidden vector at step t, h_t, based on the previous output h_{t-1} and the current input x_t by

    h_t = f(a_t) = f(W x_t + U h_{t-1} + b),    (1)

where W and U are the weights, b is the bias, and f(·) and a_t are the activation function and its input activation value. In general, h_t is processed by a number of further layers to obtain the final network output. It is well known that when f(·) is the sigmoid, denoted σ(·), RNNs suffer from the vanishing gradient issue, since

    ∂σ(a_t)/∂a_t = σ(a_t)(1 − σ(a_t)) ≤ 1/4,

which enforces gradient magnitude reductions in backpropagation [3]. Note that ReLU RNNs suffer less from this issue.
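As an illustration of Eqn. (1) and the derivative bound above, here is a minimal NumPy sketch (our own; variable names and shapes are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_step(W, U, b, x_t, h_prev):
    """One Elman RNN step, Eqn. (1): h_t = sigmoid(W x_t + U h_{t-1} + b)."""
    return sigmoid(W @ x_t + U @ h_prev + b)

# The sigmoid derivative sigma(a)(1 - sigma(a)) never exceeds 1/4, so each
# step of backpropagation through time shrinks the long-term gradient.
a = np.linspace(-10.0, 10.0, 1001)
grad = sigmoid(a) * (1.0 - sigmoid(a))
print(grad.max())  # maximum is 1/4, attained at a = 0
```

Repeated multiplication by factors of at most 1/4 (together with U) is what drives the long-term gradient towards zero.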
[To appear in Proc. ICASSP 2018, April 15–20, 2018, Calgary, Canada. © IEEE 2018.]

In contrast to a standard RNN, the LSTM model resolves gradient vanishing by using an additional linear state c_t at each step of the sequence, which can be viewed as a memory cell. At each step, a new cell candidate c̃_t is created to encode the information from the current step. c_t is first updated by interpolating c_{t−1} with c̃_t based on the forget gate f_t and input gate i_t, and then converted to the LSTM hidden state by transforming with the hyperbolic tangent (tanh) and scaling by the output gate o_t. This procedure simulates a memory circuit where f_t, i_t, and o_t are analogous to its logic gates [11]. More specifically, an LSTM layer step t is evaluated as

    i_t = σ(W_i x_t + U_i h_{t−1} + V_i c_{t−1} + b_i),
    f_t = σ(W_f x_t + U_f h_{t−1} + V_f c_{t−1} + b_f),
    c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c),
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t,
    o_t = σ(W_o x_t + U_o h_{t−1} + V_o c_t + b_o),
    h_t = o_t ⊙ tanh(c_t),

where ⊙ represents the element-wise product, and the V matrices are diagonal and serve as a "peephole". Although LSTMs work very well on a large variety of tasks, they are computationally very expensive. The representations for each temporal step, c̃_t, are extracted in the same way as the RNN h_t. However, the additional cost of finding c_t over c̃_t requires three times the computation and parameter storage, since i_t, f_t, and o_t all need to be calculated.

3. HIGH ORDER RNN ACOUSTIC MODELS

In this section, HORNNs are proposed by relaxing the first-order Markov conditional independence constraint.

3.1.
Markov Conditional Independence

The posterior probability of the T-frame label sequence y_{1:T} given the T-frame input sequence x_{1:T} can be found by integrating over all possible continuous hidden state sequences h̃_{1:T}:

    P(y_{1:T} | x_{1:T}) = ∫ P(y_{1:T} | h̃_{1:T}, x_{1:T}) p(h̃_{1:T} | x_{1:T}) dh̃_{1:T}
                         = ∫ ∏_{t=1}^{T} P(y_t | y_{1:t−1}, h̃_{1:T}, x_{1:T}) p(h̃_t | h̃_{1:t−1}, x_{1:T}) dh̃_{1:T}.

When implemented using an RNN, P(y_t | y_{1:t−1}, h̃_{1:T}, x_{1:T}) = P(y_t | h̃_t), which is produced by the layers after the RNN layer. From Eqn. (1), h̃_t depends only on h̃_{t−1} and x_t, i.e.,

    p(h̃_t | h̃_{1:t−1}, x_{1:T}) = p(h̃_t | h̃_{t−1}, x_t).    (2)

Since the initial hidden state is given (often set to h_0 = 0), all subsequent states h_{1:T} are determined by Eqn. (1), which means p(h̃_t | h_{t−1}, x_t) is a Kronecker delta function:

    p(h̃_t | h_{t−1}, x_t) = 1 if h̃_t = h_t, and 0 otherwise.

Hence P(y_{1:T} | x_{1:T}) = ∏_{t=1}^{T} P(y_t | h_t = f(W x_t + U h_{t−1} + b)).

Eqn. (2) is the 1st-order Markov conditional independence property [23]: the current state h_t depends only on its immediately preceding state h_{t−1} and the current input x_t. This property differs from the 1st-order Markov property by also conditioning on x_t.¹ Note that this property also applies to bidirectional RNNs [24], which is easy to show by defining a new hidden state h_t^bid = {h_t^fwd, h_t^bwd}, where h_t^fwd and h_t^bwd are the forward and backward RNN hidden states.

¹ For language modelling, h_t has the standard Markov property, as the RNN models P(y_{1:T}) without conditioning on x_{1:T}.

3.2. HORNNs for Sigmoid and ReLU Activation Functions

In this paper, the gradient vanishing issue is tackled by relaxing the first-order Markov conditional independence constraint. Hence, not only the directly preceding state h_{t−1} but also previous states h_{t−n} (n > 1) are used when calculating h_t.
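The relaxed recurrence can be sketched as follows (a generic n-order step of our own devising; the specific two-connection forms actually used in this paper are Eqns. (4) and (5) below):

```python
import numpy as np

def high_order_step(W, Us, b, x_t, h_hist, f):
    """One step of an n-order recurrent layer: h_t depends on x_t and
    the n previous states h_{t-1}, ..., h_{t-n}.
    Us[i-1] multiplies h_{t-i}; h_hist[-i] holds h_{t-i}."""
    a = W @ x_t + b
    for i, U in enumerate(Us, start=1):
        a = a + U @ h_hist[-i]
    return f(a)

# n = 2 example with a sigmoid activation and zero initial states.
rng = np.random.default_rng(0)
Dh, Dx = 4, 3
W = rng.normal(size=(Dh, Dx))
Us = [rng.normal(size=(Dh, Dh)) for _ in range(2)]
b = np.zeros(Dh)
h_hist = [np.zeros(Dh), np.zeros(Dh)]  # h_{t-2}, h_{t-1}
h_t = high_order_step(W, Us, b, rng.normal(size=Dx), h_hist,
                      lambda a: 1.0 / (1.0 + np.exp(-a)))
print(h_t.shape)  # (4,)
```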
This adds additional high order connections to the RNN architecture and results in a HORNN. From a training perspective, including high order states creates shortcuts for backpropagation that allow additional long-term information to flow more easily. Specifically, the gradients w.r.t. h_{t−1} of a general n-order RNN can be obtained by

    ∂F/∂h_{t−1} = ∑_{i=1}^{n} (∂F/∂h_{t+i−1}) (∂h_{t+i−1}/∂h_{t−1}),    (3)

where F is the training criterion. For n > 1, Eqn. (3) sums multiple terms, which helps to prevent the gradient from vanishing. From an inference (testing) perspective, an RNN assumes sufficient past temporal information has been embedded in the representation h_{t−1}, but using a fixed-size h_{t−1} means that information from distant long-term steps may not be properly integrated with new short-term information. The HORNN architecture allows more direct access to past long-term information.

There are many alternative ways of using h_{t−n} in the calculation of h_t in the HORNN framework. This paper assumes that the high order connections are linked to the input at step t. It was found to be sufficient to use only one high order connection at the input, i.e.,

    h_t = f(W x_t + U_1 h_{t−1} + U_n h_{t−n} + b).    (4)

Here h_{t−n} can be viewed as a kind of "memory" whose temporal resolution is modified by U_n. From our experiments, the structure in Eqn. (4) allowed ReLU HORNNs to give similar WERs to LSTMs. However, when using sigmoid HORNNs, a slightly different structure is needed to reach a similar WER. This has an extra high order connection from h_{t−m} to the sigmoid function input, i.e.,

    h_t = f(W x_t + U_1 h_{t−1} + U_n h_{t−n} + h_{t−m} + b).    (5)

Here, h_{t−m} is directly added to the sigmoid input without impacting the temporal resolution at t, since h_{t−m} is a previous sigmoid output. Eqns. (4) and (5) are used for ReLU and sigmoid HORNNs throughout the paper.

3.3. Parameter Control using Matrix Factorisation

Comparing Eqns.
(4) and (5) to Eqn. (1), a HORNN increases the number of RNN layer parameters from (D_x + D_h)D_h + D_h to (D_x + 2D_h)D_h + D_h, where D_x and D_h are the sizes of x_t and h_t. One method to reduce the increase in parameters is to project the hidden state vectors to a lower dimension D_p with a recurrent linear projection P [22]. This factorises U_1 and U_n in Eqns. (4) and (5) to U_{p1}P and U_{pn}P with a low-rank approximation. The projected HORNNs (denoted HORNNP) for ReLU and sigmoid activations are hence defined as

    h_t = f(W x_t + U_{p1} P h_{t−1} + U_{pn} P h_{t−n} + b)    (6)

and

    h_t = f(W x_t + U_{p1} P h_{t−1} + U_{pn} P h_{t−n} + h_{t−m} + b),    (7)

and the number of parameters used is D_h D_p + (D_x + 2D_p)D_h + D_h. The resulting parameter reduction ratio is approximately 2D_h / 3D_p (given D_h > D_p ≫ D_x). Note that the same idea was used by the projected LSTM (LSTMP) to factorise U_i, U_f, U_c, and U_o [22], which reduces the number of LSTM parameters from 4(D_x + D_h)D_h + 7D_h to D_h D_p + 4(D_x + D_p)D_h + 7D_h.

Next we compare the computational complexity of LSTMs and HORNNs. Given that multiplying an l × m matrix by an m × n matrix (l ≠ m ≠ n) requires lmn multiply-adds, and ignoring all element-wise operations, the testing complexity for a HORNNP layer is O(T(D_x + 3D_p)D_h), whereas for an LSTMP it is O(T D_h D_p + 4T(D_x + D_p)D_h). This shows that HORNNPs use less than 3/5 of the calculations of LSTMPs. It has been found that HORNNPs often result in a 50% speed up over LSTMPs in our current HTK implementation [25–27].

3.4. Related Work

After independently developing the HORNN for acoustic modelling, we found that similar ideas had previously been applied to rather different tasks [28–32]. However, both the research focus and the model architectures were different to this paper. In particular, the model proposed in [28, 31] is equivalent to Eqn.
(4) without subsampling the high order hidden vectors, and [32] applied that model to TIMIT phone recognition. Furthermore, previous studies did not discuss the high order connections in the Markov property framework.

Adding h_{t−m} to the input of the sigmoid function in Eqn. (5) is similar to the residual connection in residual networks [13]. A residual RNN (ResRNN) with a recurrent kernel depth of two (d = 2) can be written as

    h_t = f(U_{d2} f(W x_t + U_{d1} h_{t−1} + b) + h_{t−m}),    (8)

where m = 1 [17]. Another related model is the recent residual memory network [21], which can be viewed as an unfolded HORNN defined in Eqn. (4) with U_1 and b being zero, W being distinct untied parameters in each unfolded layer, and n > 1 being any positive integer. In addition, since highway networks [14] can be viewed as a generalised form of residual networks, highway RNNs and LSTMs are also related to this work [15, 19]. Note that it is also possible to combine the residual and highway ideas with HORNNs by increasing the recurrent depth.

4. EXPERIMENTAL SETUP

The proposed HORNN models were evaluated by training systems on multi-genre broadcast (MGB) data from the MGB3 speech recognition challenge task [33, 34]. The audio is from BBC TV programmes covering a range of genres. A 275 hour (275h) full training set was selected from 750 episodes where the sub-titles have a phone matched error rate < 40% compared to the lightly supervised output [35], which was used as the training supervision. A 55 hour (55h) subset was sampled at the utterance level from the 275h set. A 63k word vocabulary [36] was used with a trigram word-level language model (LM) estimated from both the acoustic transcripts and a separate 640 million word MGB subtitle archive. The test set, dev17b, contains 5.55 hours of audio data and 5,201 manually segmented utterances from 14 episodes of 13 shows.
This is a subset of the official full development set (dev17a) with data that overlaps the training and test sets excluded. System outputs were evaluated with confusion network decoding (CN) [37] as well as 1-best Viterbi decoding.

All experiments were conducted with an extended version of HTK 3.5 [25, 26]. The LSTM was implemented following [22]. A 40d log-Mel filter bank analysis was used and expanded to an 80d vector with its Δ coefficients. The data was normalised at the utterance level for the mean and at the show-segment level for the variance [38]. The inputs at each recurrent model time step were single frames delayed by 5 steps [22, 39]. All models were trained using the cross-entropy criterion with frame-level shuffling. All recurrent models were unfolded for 20 time steps, and the gradients of the shared parameters were normalised by dividing by the sharing counts [26]. The maximum parameter changes were constrained by update value clipping with a threshold of 0.32 for a minibatch with 800 samples. About 6k/9k decision tree clustered triphone tied-states along with GMM-HMM/DNN-HMM system training alignments were used for the 55h/275h training sets. One hidden layer with the same dimension as h_t was added between the recurrent and output layers for all models. The NewBob+ learning rate scheduler [26, 27] was used to train all models with the setup from our previous MGB systems [38]. An initial learning rate of 5 × 10⁻⁴ was used for all ReLU models, while an initial rate of 2 × 10⁻³ was used to train all the other models. Since regularisation plays an important role in RNN/LSTM training, weight decay factors were carefully tuned to maximise the performance of each system.

5. EXPERIMENTAL RESULTS

5.1. 55 Hour Single Layer HORNN Experiments

Initial experiments studied various HORNN architectures in order to investigate suitable values of n for the ReLU model in Eqn. (4), and of both m and n for the sigmoid model in Eqn. (5).
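For reference, the two recurrences under comparison, Eqn. (4) (ReLU) and Eqn. (5) (sigmoid), can be sketched in NumPy as follows (our own sketch, not the HTK implementation used in the paper):

```python
import numpy as np

def relu_hornn_step(W, U1, Un, b, x_t, h_hist, n):
    """Eqn. (4): h_t = ReLU(W x_t + U1 h_{t-1} + Un h_{t-n} + b).
    h_hist holds past hidden states, with h_hist[-1] = h_{t-1}."""
    a = W @ x_t + U1 @ h_hist[-1] + Un @ h_hist[-n] + b
    return np.maximum(a, 0.0)

def sigmoid_hornn_step(W, U1, Un, b, x_t, h_hist, n, m):
    """Eqn. (5): as Eqn. (4), but with h_{t-m} added directly to the
    sigmoid input (no extra weight matrix), like a residual connection."""
    a = W @ x_t + U1 @ h_hist[-1] + Un @ h_hist[-n] + h_hist[-m] + b
    return 1.0 / (1.0 + np.exp(-a))
```

At the start of an utterance, states before t = 1 would be taken as zero vectors, following the h_0 = 0 convention of Sec. 3.1.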
To save computation, the 55h subset was used for training. All models had one recurrent layer with the h_t size fixed to 500. An LSTM and a standard RNN were created as baselines, which had 1.16M and 0.29M parameters in the recurrent layers respectively. A ResRNN, defined by Eqn. (8), was also tested as an additional baseline using both ReLU and sigmoid functions.² The ResRNNs had the same number of parameters (0.54M) as the HORNNs. Note that rather than only the standard case with m = 1 [17], m ∈ [1, 4] was examined, which falls into the high order framework when m > 1. For HORNNs, n ∈ [2, 6] was tested; m was fixed to 1 for all sigmoid HORNNs.

From the results shown in Figure 1, the LSTM gives lower WERs than a standard RNN, but the ReLU ResRNN with m set to 1 or 2 had a similar WER to the LSTM. ReLU HORNNs gave WERs at least as low as the LSTM and the best ReLU ResRNN systems. Sigmoid HORNNs gave better WERs than sigmoid ResRNNs and similar WERs to those from the LSTM. The performance can be further improved by using the p-sigmoid [40] as the HORNN activation function, which associates a linear scaling factor with each recurrent layer output unit and makes it more similar to a ReLU. In addition, HORNNs were faster than both the LSTM and ResRNNs. ResRNNs were slightly slower than HORNNs, since the second matrix multiplication depends on the first one at each recurrent step. For the rest of the experiments, all ReLU HORNNs used n = 4, and all sigmoid HORNNs used m = 1 and n = 2.

5.2. Projected and Multi-Layered HORNN Results

Next, projected LSTMs and projected HORNNs were compared. First, D_h (the size of h_t) and D_p (the projected vector size) were fixed to 500 and 250 respectively for the single recurrent layer (1L) LSTMP and HORNNP models. The LSTMP baseline L55h1 had 0.79M parameters, and the HORNNP systems S55h1 and R55h1 had 0.42M parameters. From Table 1, the HORNNPs have similar WERs to the LSTMP.
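As a sanity check (our own arithmetic, assuming the 80-dimensional input features of Sec. 4), the recurrent-layer parameter counts quoted in this section follow directly from the formulas in Secs. 2 and 3.3:

```python
def rnn_params(Dx, Dh):
    """Elman RNN recurrent layer, Eqn. (1): (Dx + Dh)Dh weights + Dh biases."""
    return (Dx + Dh) * Dh + Dh

def lstm_params(Dx, Dh):
    """LSTM layer: four gate blocks, three diagonal peepholes, four biases."""
    return 4 * (Dx + Dh) * Dh + 7 * Dh

def hornn_params(Dx, Dh):
    """HORNN, Eqn. (4) or (5): one extra Dh x Dh matrix over the RNN."""
    return (Dx + 2 * Dh) * Dh + Dh

def lstmp_params(Dx, Dh, Dp):
    """Projected LSTM [22]: Dh Dp + 4(Dx + Dp)Dh + 7 Dh."""
    return Dh * Dp + 4 * (Dx + Dp) * Dh + 7 * Dh

def hornnp_params(Dx, Dh, Dp):
    """Projected HORNN, Eqns. (6)/(7): Dh Dp + (Dx + 2 Dp)Dh + Dh."""
    return Dh * Dp + (Dx + 2 * Dp) * Dh + Dh

# Dx = 80, Dh = 500, Dp = 250 reproduce the figures quoted above:
# ~0.29M (RNN), ~1.16M (LSTM), ~0.54M (HORNN/ResRNN),
# ~0.79M (LSTMP, L55h1) and ~0.42M (HORNNP, S55h1/R55h1).
for name, v in [("RNN", rnn_params(80, 500)),
                ("LSTM", lstm_params(80, 500)),
                ("HORNN", hornn_params(80, 500)),
                ("LSTMP", lstmp_params(80, 500, 250)),
                ("HORNNP", hornnp_params(80, 500, 250))]:
    print(name, v)
```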
By further reducing D_p to 125, the HORNN systems S55h2 and R55h2 reduced the number of parameters to 0.23M and gave similar WERs to the LSTM and LSTMP (L55h1) with only 20% and 29% of the recurrent layer parameters. The values of D_h and D_p for the HORNNs were then increased to 800 and 400 respectively, to make the overall number of recurrent layer parameters (1.02M) closer to that of the 500d LSTM (1.16M). This produced systems S55h3 and R55h3. The LSTMP was also modified to D_h = 600 and D_p = 300 to have 1.10M parameters. From the results in Table 1, S55h3 and R55h3 both outperformed L55h2 by a clear margin, since the 800d representations embed more accurate temporal information than the 600d ones. The p-sigmoid function was not used for HORNNPs since the linear projection layer also scales h_t.

Finally, the LSTMP and HORNNP were compared by stacking another recurrent layer. With two recurrent layers (2L) of D_h = 500 and D_p = 250, the 2L HORNNP systems S55h4 and R55h4 had 0.92M parameters and still produced similar WERs to the 2L LSTMP system L55h3 (with 1.91M parameters).

² This is also the first application of such ResRNNs to acoustic modelling.

[Fig. 1 bar charts omitted.]
Fig. 1. %WERs of 55h systems on dev17b. Systems use a trigram LM with Viterbi decoding (tg) or CN decoding (cn).
These results indicate that rather than spending most of the calculations on maintaining the LSTM memory cell, it is more effective to use HORNNs and spend the computational budget on extracting better temporal representations using wider and deeper recurrent layers.

  ID      System               D_h   D_p   tg     cn
  L55h1   1L LSTMP             500   250   32.9   32.1
  L55h2   1L LSTMP             600   300   32.7   32.0
  L55h3   2L LSTMP             500   250   31.3   30.6
  S55h1   1L sigmoid HORNNP    500   250   32.8   31.9
  S55h2   1L sigmoid HORNNP    500   125   33.0   32.1
  S55h3   1L sigmoid HORNNP    800   400   31.6   30.9
  S55h4   2L sigmoid HORNNP    500   250   31.4   30.7
  R55h1   1L ReLU HORNNP       500   250   32.0   31.4
  R55h2   1L ReLU HORNNP       500   125   32.5   31.8
  R55h3   1L ReLU HORNNP       800   400   31.4   30.7
  R55h4   2L ReLU HORNNP       500   250   31.4   30.7

Table 1. %WERs for various 55h systems on dev17b. Systems use a trigram LM with Viterbi decoding (tg) or CN decoding (cn).

5.3. Experiments on the 275 Hour Data Set

To ensure that the previous results scale to a significantly larger training set, some selected LSTMP and HORNNP systems were built on the full 275h set. Here D_h and D_p were set to 1000 and 500, which increased the number of recurrent layer parameters to better model the full training set. From Table 2, for both single recurrent layer and two recurrent layer architectures, HORNNs still produced similar WERs to the corresponding LSTMPs. This validates, on a larger data set, our previous finding that the proposed HORNN structures can work as well as the widely used LSTMs for acoustic modelling while using far fewer parameters. In addition, along with the multi-layered structure, HORNNs can also be applied to other kinds of recurrent models by replacing the RNNs and LSTMs, such as in the bidirectional [24] and grid [39, 41, 42] structures etc. Finally, a 7 layer (7L) sigmoid DNN system, D275h1, was built following [38] as a reference.
  ID       System               D_h    D_p   tg     cn
  L275h1   1L LSTMP             1000   500   26.5   26.0
  S275h1   1L sigmoid HORNNP    1000   500   26.4   25.8
  R275h1   1L ReLU HORNNP       1000   500   26.4   25.9
  L275h3   2L LSTMP             1000   500   25.7   25.2
  S275h4   2L sigmoid HORNNP    1000   500   25.6   25.2
  R275h4   2L ReLU HORNNP       1000   500   25.3   25.0
  D275h1   7L sigmoid DNN       1000   -     28.4   27.5

Table 2. %WERs for a selection of 275h systems on dev17b. Systems use a trigram LM with Viterbi decoding (tg) or CN decoding (cn).

6. CONCLUSIONS

This paper proposed the use of HORNNs for acoustic modelling to address the vanishing gradient problem in training recurrent neural networks. Two different architectures were proposed to cover both ReLU and sigmoid activation functions. These yielded 4%–6% WER reductions over the standard RNNs with the same activation function. Furthermore, additional structures were investigated: reducing the number of HORNN parameters with a linear recurrent projection layer; and adding another recurrent layer. In all cases, compared to the projected LSTMs and the residual RNNs, it was shown that HORNNs gave similar WER performance while being significantly more efficient in computation and storage. When the savings in parameter number and computation are used to implement wider or deeper recurrent layers, (projected) HORNNs gave a 4% relative reduction in WER over the comparable (projected) LSTMs.

7. REFERENCES

[1] D.E. Rumelhart, J.L. McClelland, & the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, MIT Press, 1986.
[2] J.L. Elman, "Finding structure in time", Cognitive Science, vol. 14, pp. 179–211, 1990.
[3] T. Robinson, M. Hochberg, & S. Renals, "The use of recurrent neural networks in continuous speech recognition", in Automatic Speech and Speaker Recognition, pp. 233–258, Springer, 1996.
[4] T. Mikolov, Statistical Language Models based on Neural Networks, Ph.D.
thesis, Brno University of Technology, Brno, Czech Republic, 2012.
[5] Y. Bengio, P. Simard, & P. Frasconi, "Learning long-term dependencies with gradient descent is difficult", IEEE Transactions on Neural Networks, vol. 5, pp. 157–166, 1994.
[6] R. Pascanu, T. Mikolov, & Y. Bengio, "On the difficulty of training recurrent neural networks", Proc. ICML, Atlanta, 2013.
[7] I. Sutskever, J. Martens, & G. Hinton, "Generating text with recurrent neural networks", Proc. ICML, New York, 2011.
[8] E. Salinas & L.F. Abbott, "A model of multiplicative neural responses in parietal cortex", Proc. National Academy of Sciences U.S.A., vol. 93, pp. 11956–11961, 1996.
[9] R.L.T. Hahnloser, "On the piecewise analysis of networks of linear threshold neurons", Neural Networks, vol. 11, pp. 691–697, 1998.
[10] S.L. Goh & D.P. Mandic, "Recurrent neural networks with trainable amplitude of activation functions", Neural Networks, vol. 16, pp. 1095–1100, 2003.
[11] S. Hochreiter & J. Schmidhuber, "Long short-term memory", Neural Computation, vol. 9, pp. 1735–1780, 1997.
[12] J. Chung, C. Gulcehre, K.H. Cho, & Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling", arXiv.org, 1412.3555, 2014.
[13] K. He, X. Zhang, S. Ren, & J. Sun, "Deep residual learning for image recognition", Proc. CVPR, Las Vegas, 2016.
[14] R.K. Srivastava, K. Greff, & J. Schmidhuber, "Highway networks", arXiv.org, 1505.00387, 2015.
[15] J.G. Zilly, R.K. Srivastava, J. Koutník, & J. Schmidhuber, "Recurrent highway networks", arXiv.org, 1607.03474, 2016.
[16] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, & J. Glass, "Highway long short-term memory RNNs for distant speech recognition", Proc. ICASSP, Shanghai, 2016.
[17] Y. Wang & F. Tian, "Recurrent residual learning for sequence classification", Proc. EMNLP, Austin, 2016.
[18] A. van den Oord, N. Kalchbrenner, & K. Kavukcuoglu, "Pixel recurrent neural networks", Proc.
ICML, New York, 2016.
[19] G. Pundak & T.N. Sainath, "Highway-LSTM and recurrent highway networks for speech recognition", Proc. Interspeech, Stockholm, 2017.
[20] J. Kim, M. El-Khamy, & J. Lee, "Residual LSTM: Design of a deep recurrent architecture for distant speech recognition", Proc. Interspeech, Stockholm, 2017.
[21] M.K. Baskar, M. Karafiát, L. Burget, K. Veselý, F. Grézl, & J.H. Černocký, "Residual memory networks: Feed-forward approach to learn long-term temporal dependencies", Proc. ICASSP, New Orleans, 2017.
[22] H. Sak, A. Senior, & F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling", Proc. Interspeech, Singapore, 2014.
[23] Y. Bengio & P. Frasconi, "Credit assignment through time: Alternatives to backpropagation", Advances in NIPS 6, 1993.
[24] M. Schuster & K.K. Paliwal, "Bidirectional recurrent neural networks", IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.
[25] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, A. Ragni, V. Valtchev, P. Woodland, & C. Zhang, The HTK Book (for HTK version 3.5), Cambridge University Engineering Department, 2015.
[26] C. Zhang & P.C. Woodland, "A general artificial neural network extension for HTK", Proc. Interspeech, Dresden, 2015.
[27] C. Zhang, Joint Training Methods for Tandem and Hybrid Speech Recognition Systems using Deep Neural Networks, Ph.D. thesis, University of Cambridge, Cambridge, UK, 2017.
[28] T. Lin, B.G. Horne, P. Tiňo, & C. Lee Giles, "Learning long-term dependencies in NARX recurrent neural networks", IEEE Transactions on Neural Networks, vol. 7, pp. 1329–1338, 1996.
[29] P. Tiňo, M. Čerňanský, & L. Beňušková, "Markovian architectural bias of recurrent neural networks", IEEE Transactions on Neural Networks, vol. 15, pp. 6–15, 2004.
[30] I. Sutskever & G.
Hinton, "Temporal-kernel recurrent neural networks", Neural Networks, vol. 23, pp. 239–243, 2010.
[31] R. Soltani & H. Jiang, "Higher order recurrent neural networks", arXiv.org, 1605.00064, 2016.
[32] H. Huang & B. Mak, "To improve the robustness of LSTM-RNN acoustic models using higher-order feedback from multiple histories", Proc. Interspeech, Stockholm, 2017.
[33] http://www.mgb-challenge.org
[34] P. Bell, M.J.F. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, & P.C. Woodland, "The MGB challenge: Evaluating multi-genre broadcast media transcription", Proc. ASRU, Scottsdale, 2015.
[35] P. Lanchantin, M.J.F. Gales, P. Karanasou, X. Liu, Y. Qian, L. Wang, P.C. Woodland, & C. Zhang, "Selection of Multi-Genre Broadcast data for the training of automatic speech recognition systems", Proc. Interspeech, San Francisco, 2016.
[36] K. Richmond, R. Clark, & S. Fitt, "On generating Combilex pronunciations via morphological analysis", Proc. Interspeech, Makuhari, 2010.
[37] G. Evermann & P. Woodland, "Large vocabulary decoding and confidence estimation using word posterior probabilities", Proc. ICASSP, Istanbul, 2000.
[38] P.C. Woodland, X. Liu, Y. Qian, C. Zhang, M.J.F. Gales, P. Karanasou, P. Lanchantin, & L. Wang, "Cambridge University transcription systems for the Multi-Genre Broadcast challenge", Proc. ASRU, Scottsdale, 2015.
[39] B. Li & T.N. Sainath, "Reducing the computational complexity of two-dimensional LSTMs", Proc. Interspeech, Stockholm, 2017.
[40] C. Zhang & P.C. Woodland, "Parameterised sigmoid and ReLU hidden activation functions for DNN acoustic modelling", Proc. Interspeech, Dresden, 2015.
[41] N. Kalchbrenner, I. Danihelka, & A. Graves, "Grid long short-term memory", Proc. ICLR, San Juan, 2016.
[42] F.L. Kreyssig, C. Zhang, & P.C. Woodland, "Improved TDNNs using deep kernels and frequency dependent Grid-RNNs", Proc.
ICASSP, Calgary, 2018.