Approximating Stacked and Bidirectional Recurrent Architectures with the Delayed Recurrent Neural Network



Javier S. Turek (1), Shailee Jain (2), Vy A. Vo (1), Mihai Capotă (1), Alexander G. Huth (2,3), Theodore L. Willke (1)

(1) Intel Labs, Hillsboro, Oregon, USA. (2) Department of Computer Science, The University of Texas at Austin, Austin, Texas, USA. (3) Department of Neuroscience, The University of Texas at Austin, Austin, Texas, USA. Correspondence to: Javier S. Turek.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

Abstract

Recent work has shown that topological enhancements to recurrent neural networks (RNNs) can increase their expressiveness and representational capacity. Two popular enhancements are stacked RNNs, which increase the capacity for learning non-linear functions, and bidirectional processing, which exploits acausal information in a sequence. In this work, we explore the delayed-RNN, a single-layer RNN that has a delay between the input and output. We prove that a weight-constrained version of the delayed-RNN is equivalent to a stacked RNN. We also show that the delay gives rise to partial acausality, much like bidirectional networks. Synthetic experiments confirm that the delayed-RNN can mimic bidirectional networks, solving some acausal tasks similarly and outperforming them in others. Moreover, we show similar performance to bidirectional networks in a real-world natural language processing task. These results suggest that delayed-RNNs can approximate topologies including stacked RNNs, bidirectional RNNs, and stacked bidirectional RNNs, but with equivalent or faster runtimes for the delayed-RNNs.

1. Introduction

Recurrent neural networks (RNNs) have successfully been used for sequential tasks like language modeling (Sutskever et al., 2011), machine translation (Sutskever et al., 2014), and speech recognition (Amodei et al., 2016). They approximate complex, non-linear temporal relationships by maintaining and updating an internal state for every input element. However, they face several challenges while modeling long-term dependencies, motivating work on variant architectures. Firstly, due to the long credit assignment paths in RNNs, the gradients might vanish or explode (Bengio et al., 1994). This has led to gated variants like the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) that can retain information over long timescales. Secondly, it is well known that deeper networks can more efficiently approximate a broader range of functions (Bengio et al., 2007; Bianchini & Scarselli, 2014). While RNNs are deep in time, they are limited in the number of non-linearities applied to recent inputs.

To increase depth, there has been extensive work on stacking RNNs into multiple layers (Schmidhuber, 1992; Bengio, 2009). In vanilla stacked RNNs, each layer applies a non-linearity and passes information to the next layer, while also maintaining a recurrent connection to itself. To effectively propagate gradients across the hierarchy, skip or shortcut connections can be used (Raiko et al., 2012; Graves, 2013). Alternatives like recurrent highway networks (Zilly et al., 2017) introduce non-linearities between timesteps through "micro-ticks" (Graves, 2016). Pascanu et al. (2014) increase depth by adding feedforward layers between state-to-state transitions. Gated feedback networks (Chung et al., 2015) allow for layer-to-layer interactions between adjacent timesteps. All these variants thus introduce topological modifications to retain information over longer timescales and model hierarchical temporal dependencies.
Another development is the bidirectional RNN (Bi-RNN) (Schuster & Paliwal, 1997; Graves & Schmidhuber, 2005). While RNNs are inherently causal, Bi-RNNs model acausal interactions by processing sequences in both forward and backward directions. They achieve state-of-the-art performance on part-of-speech tagging (Plank et al., 2016) and sentiment analysis (Baziotis et al., 2017), demonstrating that some natural language processing (NLP) tasks benefit greatly from combining past and future inputs.

The successes of these RNN architectural variants seem to derive from two common properties: depth and acausality. In this paper we investigate the delayed-recurrent neural network (d-RNN), an extremely simple variant that adds both depth and acausality to the RNN. The d-RNN is a single-layer RNN that imposes depth in time by delaying the output of the model. We analyze the d-RNN and prove that when it is constrained with sparse weights, the model is equivalent to a stacked RNN. Further, noting that the delay introduces acausal processing, we use a d-RNN to approximate bidirectional recurrent networks. We show empirically that a d-RNN has the capability to solve some tasks similarly to stacked and bidirectional RNNs, and outperform them in others. Additionally, we show that even if the d-RNN approximation carries some error, this model can provide much faster runtimes than alternatives.

2. Background

Given a sequential input \{x_t\}_{t=1...T}, x_t \in \mathbb{R}^q, a single-layer RNN is defined by:

\hat{h}_t = f(\hat{W}_x x_t + \hat{W}_h \hat{h}_{t-1} + \hat{b}_h),   (1)
\hat{y}_t = g(\hat{W}_o \hat{h}_t + \hat{b}_o),   (2)

where f(·) and g(·) are element-wise activation functions such as tanh and softmax, \hat{h}_t \in \mathbb{R}^n is the hidden state at timestep t with n units, and \hat{y}_t \in \mathbb{R}^m is the network output. Learned parameters include the input weights \hat{W}_x, recurrent weights \hat{W}_h, bias term \hat{b}_h, output weights \hat{W}_o, and bias term \hat{b}_o. The initial hidden state is denoted \hat{h}_0.

Stacked recurrent units are typically used to provide depth in RNNs (Schmidhuber, 1992; Bengio, 2009). Based on Eq. (1) and (2), a stacked RNN with k layers is given by:

h^{(1)}_t = f(W^{(1)}_x x_t + W^{(1)}_h h^{(1)}_{t-1} + b^{(1)}_h),   i = 1   (3)
h^{(i)}_t = f(W^{(i)}_x h^{(i-1)}_t + W^{(i)}_h h^{(i)}_{t-1} + b^{(i)}_h),   i = 2...k   (4)
y_t = g(W_o h^{(k)}_t + b_o),   (5)

where the activation functions and parameterization follow the single-layer RNN. Separate weights and bias terms for each layer i are given by W^{(i)}_x, W^{(i)}_h, and b^{(i)}_h. The hidden state for this layer at timestep t is h^{(i)}_t. The stacked RNN has initial hidden state vectors h^{(1)}_0 ... h^{(k)}_0 corresponding to the k layers. The hat operator is used for vectors and matrices in the single-layer RNN, while those without are for the stacked RNN.

3. Delayed-Recurrent Neural Network

One way to increase depth in RNNs is to stack recurrent layers, as suggested above. An alternative is to consider time as a means to increase depth within a single-layer RNN. However, single-layer RNNs are limited in the number of non-linearities applied to recent inputs: there is a single non-linearity between the most recent input x_t and its respective output \hat{y}_t. Previous efforts (Pascanu et al., 2014; Graves, 2016; Zilly et al., 2017) overcame this limitation by incorporating intermediate non-linearities between input elements in different ways. These solutions add computational steps between elements in the sequence, greatly increasing runtime complexity. In this work, we explore the delayed-recurrent neural network (d-RNN), in which effective depth is increased by introducing a "delay" between the input and output.

Formally, we define a d-RNN to be a single-layer recurrent neural network as in Equations (1) and (2), such that for any input x_t the respective output is obtained in \hat{y}_{t+d}, i.e., d timesteps later (Figure 1). We refer to d as the "delay" of the network. The initial hidden state, \hat{h}_0, for a d-RNN is initialized in the same manner as an RNN.

Figure 1. A delayed-recurrent neural network (d-RNN) processing a sequence of T elements. The output is delayed by d = 2 timesteps. The first output element is in \hat{y}_3 and the last in \hat{y}_{T+d}. The input sequence has d additional elements, such as '[NULL]' symbols. During training, the outputs are compared with the T elements of the labeled sequence \{z_j\}_j.

Delaying the output requires special considerations on the data that differ slightly from an RNN. Input sequences need to have T + d elements instead of T. Depending on the task being solved, this can be achieved by adding a "null" input element (e.g., the zero vector), or by including d additional elements in the input sequence. When doing a forward pass over the d-RNN for inference, outputs from t = 1 to d are discarded, as we expect the output for x_1 to be at \hat{y}_{1+d}. The output sequence goes from \hat{y}_{1+d} to \hat{y}_{T+d}, and has T elements.
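As a concrete illustration of the definitions above, the following NumPy sketch runs a single-layer RNN (Eq. (1)-(2), here with tanh for f and the identity for g) over an input padded with d zero '[NULL]' vectors, and discards the first d outputs. All names and dimensions are illustrative, not taken from the authors' code:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, bh):
    # Eq. (1): h_t = f(Wx x_t + Wh h_{t-1} + bh), with f = tanh
    return np.tanh(Wx @ x + Wh @ h + bh)

def drnn_forward(xs, d, Wx, Wh, bh, Wo, bo, h0):
    """Run a single-layer d-RNN over T inputs.

    The input is padded with d zero ('[NULL]') vectors, the network is
    unrolled for T + d steps, and the first d outputs are discarded so
    that the output for x_t is read at step t + d.
    """
    T, q = xs.shape
    padded = np.vstack([xs, np.zeros((d, q))])   # T + d input elements
    h, ys = h0, []
    for t in range(T + d):
        h = rnn_step(padded[t], h, Wx, Wh, bh)
        ys.append(Wo @ h + bo)                   # Eq. (2), g = identity
    return np.array(ys[d:])                      # outputs y_{1+d} .. y_{T+d}

# Toy dimensions: q = 3 inputs, n = 5 hidden units, m = 2 outputs, delay d = 2.
rng = np.random.default_rng(0)
q, n, m, T, d = 3, 5, 2, 7, 2
Wx, Wh, bh = rng.normal(size=(n, q)), rng.normal(size=(n, n)), np.zeros(n)
Wo, bo, h0 = rng.normal(size=(m, n)), np.zeros(m), np.zeros(n)
xs = rng.normal(size=(T, q))
ys = drnn_forward(xs, d, Wx, Wh, bh, Wo, bo, h0)
assert ys.shape == (T, m)   # one output per labeled element z_1 .. z_T
```

With d = 0 the function reduces to an ordinary single-layer RNN forward pass.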
Training loss is computed by comparing z_t, the expected output for input x_t, with \hat{y}_{t+d}. Thus, gradients are backpropagated only from the delayed outputs \hat{y}_{1+d}, ..., \hat{y}_{T+d}. In this way, any modified recurrent cell, such as an LSTM or GRU, can be trained with delayed output to obtain a delayed version of the architecture, e.g., d-LSTM or d-GRU.

3.1. Complexity

Consider an RNN with n units, where input elements have dimension q and output elements have dimension m. Computing one timestep of this RNN requires three matrix-vector multiplications with complexity O(nq + nm + n^2). Applying the non-linear functions f(·) and g(·) requires O(m + n). Hence, each step of this RNN has runtime complexity O(nq + nm + n^2). For a sequence of length T, the overall computational effort is O(T(nq + nm + n^2)). For a d-RNN, the number of timesteps is increased by the delay d, giving a total runtime complexity of O((T + d)(nq + nm + n^2)).

While the d-RNN incurs some cost, it is cheaper than alternative methods such as micro-steps (Graves, 2016; Zilly et al., 2017), where additional timesteps are inserted between each pair of elements in both the input and output sequences. The runtime complexity of each micro-step is similar to an RNN step, leading the micro-step model complexity to grow with the number of micro-steps d proportionally to O(dT). In contrast, the d-RNN model complexity only grows proportionally to O(d + T).

3.2. Stacked RNNs are d-RNNs

The mathematical structure of a stacked RNN is similar to a single-layer RNN with the addition of between-layer connections that add depth.
Here we show that any stacked RNN can be flattened into a single-layer d-RNN that produces the exact sequence of hidden states and outputs. We exchange the depth from the between-layer connections with temporal depth applied through a delay in the output. To illustrate this, we rewrite the parameters of a single-layer RNN using the weights and bias terms of a k-layer stacked RNN from Equations (3)-(5):

\hat{W}_h =
\begin{bmatrix}
W^{(1)}_h & 0 & \cdots & \cdots & 0 \\
W^{(2)}_x & W^{(2)}_h & 0 & & \vdots \\
0 & \ddots & \ddots & \ddots & \vdots \\
\vdots & & W^{(i)}_x & W^{(i)}_h & 0 \\
0 & \cdots & 0 & W^{(k)}_x & W^{(k)}_h
\end{bmatrix},   (6)

\hat{b}_h = \begin{bmatrix} b^{(1)}_h \\ \vdots \\ b^{(k)}_h \end{bmatrix}, \quad
\hat{W}_x = \begin{bmatrix} W^{(1)}_x \\ 0 \\ \vdots \\ 0 \end{bmatrix},   (7)

\hat{W}_o = \begin{bmatrix} 0 & \cdots & 0 & W_o \end{bmatrix}, \quad
\hat{b}_o = b_o,   (8)

where \hat{W}_x \in \mathbb{R}^{kn \times q} are the input weights, \hat{W}_h \in \mathbb{R}^{kn \times kn} the recurrent weights, \hat{b}_h \in \mathbb{R}^{kn} the biases, \hat{W}_o \in \mathbb{R}^{m \times kn} the output weights, and \hat{b}_o \in \mathbb{R}^m the output biases.

One can see from Eq. (6)-(8) that each layer in the stacked RNN is converted into a group of units in the single-layer RNN. The block bidiagonal structure of the recurrent weight matrix \hat{W}_h makes the hidden state act as a buffer, where each group of units only receives input from itself and the previous group. Information processed through this buffering mechanism eventually arrives at the output after k - 1 timesteps. In fact, the obtained model is a d-RNN with delay d = k - 1 and sparsely constrained weights. Note that the d-RNN performs the same computations as the stacked version by trading depth in layers for depth in time.

Next, we define the following notation: for a vector v \in \mathbb{R}^{kn} with k blocks, the subvector v^{\{i\}} \in \mathbb{R}^n refers to its i-th block following the partition from Equations (6)-(8). We now prove that a d-RNN parametrized by Eq. (6)-(8) is exactly equivalent to the stacked RNN in Eqs. (3)-(5).
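Before the formal statement, the flattening in Eq. (6)-(8) can be checked numerically. The sketch below builds the block-bidiagonal \hat{W}_h from a random stacked RNN and verifies that the flattened network reproduces the stacked outputs with delay k - 1. This is a toy NumPy illustration with tanh activations; zero biases and zero initial states are assumed so that the initialization condition of Theorem 1 holds trivially (the general case uses Lemma 1):

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, q, m, T = 3, 4, 2, 2, 6          # layers, units/layer, in/out dims, length
f = np.tanh

# Random stacked-RNN weights; biases and initial states are zero.
Wx = [rng.normal(size=(n, q if i == 0 else n)) for i in range(k)]
Wh = [rng.normal(size=(n, n)) for i in range(k)]
Wo, xs = rng.normal(size=(m, n)), rng.normal(size=(T, q))

# --- stacked RNN, Eqs. (3)-(5) ---
h = [np.zeros(n) for _ in range(k)]
ys = []
for t in range(T):
    inp = xs[t]
    for i in range(k):
        h[i] = f(Wx[i] @ inp + Wh[i] @ h[i])
        inp = h[i]
    ys.append(Wo @ h[-1])

# --- flattened d-RNN, Eqs. (6)-(8): block-bidiagonal recurrent matrix ---
Wh_hat = np.zeros((k * n, k * n))
for i in range(k):
    Wh_hat[i*n:(i+1)*n, i*n:(i+1)*n] = Wh[i]       # diagonal blocks: W_h^(i)
    if i > 0:
        Wh_hat[i*n:(i+1)*n, (i-1)*n:i*n] = Wx[i]   # subdiagonal blocks: W_x^(i)
Wx_hat = np.vstack([Wx[0]] + [np.zeros((n, q))] * (k - 1))
Wo_hat = np.hstack([np.zeros((m, (k - 1) * n)), Wo])

d = k - 1
h_hat, ys_hat = np.zeros(k * n), []
for t in range(T + d):
    x = xs[t] if t < T else np.zeros(q)            # null padding at the end
    h_hat = f(Wx_hat @ x + Wh_hat @ h_hat)
    ys_hat.append(Wo_hat @ h_hat)

# The d-RNN output equals the stacked output, delayed by k - 1 steps.
assert np.allclose(ys, ys_hat[d:])
```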
The proof can be extended to more complex recurrent cells; we include a proof for LSTMs in the supplementary material.

Theorem 1. Given an input sequence \{x_t\}_{t=1...T} and a stacked RNN with k layers defined by Equations (3)-(5) with activation functions f(·) and g(·), and initial states \{h^{(i)}_0\}_{i=1...k}, the d-RNN with delay d = k - 1, defined by Equations (6)-(8) and initialized with \hat{h}_0 such that \hat{h}^{\{i\}}_{i-1} = h^{(i)}_0, \forall i = 1...k, produces the same output sequence but delayed by k - 1 timesteps, i.e., \hat{y}_{t+k-1} = y_t for all t = 1...T. Further, the sequences of hidden states at each layer i are equivalent with delay i - 1, i.e., \hat{h}^{\{i\}}_{t+i-1} = h^{(i)}_t for all 1 ≤ i ≤ k and t ≥ 1.

Proof. See Section 1 of the supplementary material.

Theorem 1 makes the assumption that \hat{h}_0 in the d-RNN can be initialized such that it achieves \hat{h}^{\{i\}}_{i-1} = h^{(i)}_0 for all blocks. Lemma 1 below implies that the initialization for the d-RNN with constrained weights can always be computed from the stacked RNN. The intuition behind it is that we can compute recursively from \hat{h}^{\{i\}}_{i-1} = h^{(i)}_0 back to \hat{h}^{\{i\}}_0 for block i, while inverting the activation function. All commonly used activation functions are surjective, thus it is enough to know the right-inverse of the activation function f(·) (see the proof of the Lemma). For example, when f(·) is the ReLU, the right-inverse is the identity function r(d) = d.

Lemma 1. Let f : \mathbb{R} \to D be a surjective activation function that maps elements in \mathbb{R} to elements in an interval D. Also, let h^{(i)}_0 \in D^n for i = 1...k be the hidden state initialization for a stacked RNN with k layers as defined in (3)-(4).
Then, there exists an initial hidden state vector \hat{h}_0 \in \mathbb{R}^{kn} for a single-layer network in Equations (6)-(7) such that \hat{h}^{\{i\}}_{i-1} = h^{(i)}_0 for all i = 1...k.

Proof. See Section 2 of the supplementary material.

From this theorem we see that k-layer stacked RNNs can be perfectly expressed as a single-layer d-RNN. In this case, the d-RNN has a specific sparsity structure in its weight matrices that is not present in the generic RNN or d-RNN. As the stacked RNN and the d-RNN with sparsely constrained weights are equivalent, there is no difference in favor of which one is used in practice, and their runtime complexities are the same: we can always obtain a version with reduced computational effort for one model by executing the other and translating the result. Moreover, they are interchangeable using the weight matrix definitions in Equations (6)-(8).

3.2.1. Relation to Other Topologies

Suppose one takes a weight-constrained d-RNN and adds non-zero elements to regions not populated by weights in Eq. (6). These non-zero weights do not correspond to existing connections in the stacked RNN. So what do they correspond to?

To explore this question we illustrate a 4-layer stacked RNN in Figure 2(a). Here, solid arrows show the standard stacked RNN connections. The d-RNN weight matrices \hat{W}_h, \hat{W}_x, and \hat{W}_o are shown in Figure 2(b), where the color of each block matches the corresponding arrow in Figure 2(a). Blocks on the main diagonal of \hat{W}_h connect groups of units to themselves recurrently, while blocks on the subdiagonal correspond to connections between layers in the stacked RNN. More generally, block (i, j) in \hat{W}_h corresponds to a connection from h^{(j)}_t to h^{(i)}_{t+j-i+1} in the stacked RNN. Thus, blocks in the lower triangle (i.e., i > j + 1) correspond to connections that point backwards in time, and from a lower layer to a higher layer. For example, the orange block (3, 1) in Figure 2(b) (and the dashed orange lines in Figure 2(a)) connects layer 1 at time t to layer 3 at time t - 1. Conversely, blocks in the upper triangle (i.e., j > i) point forward in time and from a higher layer to a lower layer. For example, the red block (3, 4) in Figure 2(b) (and the dashed red lines in Figure 2(a)) connects layer 4 at time t to layer 3 at time t + 2.

Thus we see that adding weights to empty regions in the weight-constrained d-RNN can mimic the behavior of many stacked recurrent architectures that have previously been proposed. Among others, it can approximate the IndRNN (Li et al., 2018), td-RNN (Zhang et al., 2016), skip-connections (Graves, 2013), and all-to-all layer networks (Chung et al., 2015). Simply removing the constraints on \hat{W}_h during training will enable a d-RNN to learn the necessary stacked architecture. However, unlike an ordinary RNN, this requires the output to be delayed based on the desired stacking depth. Further, while the single-layer network has the same total number of units as the corresponding stacked RNN, relaxing the constraints on \hat{W}_h means that the single-layer network has many more parameters.

3.3. Approximating Bidirectional RNNs

We previously showed how a d-RNN can be made equivalent to a stacked RNN by constraining its weight matrices. Without these constraints, the d-RNN has the ability to peek at "future" inputs: it computes the delayed output for time t at \hat{y}_{t+d} using also the inputs x_{t+1}, ..., x_{t+d} that are beyond timestep t. A similar idea was used in the past as a baseline for bidirectional recurrent neural networks (Bi-RNNs) (Schuster & Paliwal, 1997; Graves & Schmidhuber, 2005).
These papers showed that Bi-RNNs were superior to d-RNNs for relatively simple problems, but it is not clear that this comparison holds true for problems that require more non-linear solutions. If a recurrent network can compute the output for time t by exploiting future input elements, what conditions are necessary to approximate its Bi-RNN counterpart? Moreover, can the d-RNN obtain the same results? And, given these conditions, is there a benefit to using the d-RNN instead of the Bi-RNN?

Figure 2. A stacked RNN is equivalent to a single-layer d-RNN under the given sparse weight constraints. The d-RNN produces the same representations as the stacked network. (a) Stacked RNN with k = 4 layers where connections show the different weight parameters. (b) Weights of the d-RNN that are equivalent to connections in the stacked RNN.

Figure 3. Number of non-linearities that can be applied to past and future sequence elements with respect to the current input (∆t = 0). The d-RNN only sees d steps into the future.

Figure 3 shows the number of non-linear transformations that each network can apply to any input element before computing the output at timestep t_0. The generic RNN processes only past inputs (t ≤ t_0), and the number of non-linearities decreases for inputs closer to timestep t_0. The Bi-RNN has identical behavior for causal inputs but is augmented symmetrically for acausal inputs. In contrast, the d-RNN has similar behavior for the causal inputs but with a higher number of non-linearities. This trend continues for the first d acausal inputs, with a decreasing number of non-linearities until the number reaches zero at t = t_0 + d + 1. In order for a d-RNN to have at least as many non-linearities as a Bi-RNN for every element in a sequence, it would need a delay that is twice the sequence length. However, a d-RNN could beat a Bi-RNN when the non-linear influence of nearby acausal inputs on the learned function is larger than that of elements farther in the future. In these cases, stacking Bi-RNNs would be needed to achieve the same objective.

Using a d-RNN to approximate a Bi-RNN can also decrease computational cost. For a sequence of length T, a stacked Bi-RNN needs to compute both forward and backward RNNs for each layer before it can compute the next one. This synchronization requirement hinders parallelization and increases runtime. In contrast, the forward pass for the d-RNN takes T + d steps, but does not suffer from synchronization. Thus, on highly parallel hardware such as CPUs and GPUs, the runtime of a k-layer stacked Bi-RNN should be at least k times slower than an RNN or d-RNN. Beyond computational costs, d-RNNs can also be used where it is critical to output values in (near) real-time applications (Guo et al., 2016; Arik et al., 2017). A d-RNN requires only the last d elements and a hidden state to compute a new value, whereas bidirectional architectures need to process an entire backward pass of the sequence.

4. Experiments

We test the capabilities of the d-RNN in four experiments designed to shed more light on the relationships between d-RNNs, RNNs, Bi-RNNs, and stacked networks. For this purpose, the RNN implementation we use is an LSTM network, which avoids vanishing gradients and retains more information over long periods. The delayed LSTM networks are denoted as d-LSTMs.
To train each d-LSTM, the input sequences are padded at the end with zero vectors, and the loss is computed by ignoring the first "delay" timesteps, as explained in Section 3. All models are trained using the Adam optimization algorithm (Kingma & Ba, 2015) with learning rate 0.001, β1 = 0.9, and β2 = 0.999. During training, the gradients are clipped (Pascanu et al., 2013) at 1.0 to avoid explosions. Experiments were implemented using PyTorch 1.1.0 (Paszke et al., 2017), and code can be found at http://www.anonymous.com/anonymous.

4.1. Sequence Reversal

First, we propose a simple test to illustrate how the d-LSTM can interpolate between a regular LSTM and a Bi-LSTM. In this test we require the recurrent architectures to output a sequence in reverse order while reading it, i.e., y_t = x_{T-t+1} for t = 1, ..., T. Solving this task perfectly is only possible when a network has acausal access to the sequence. Moreover, depending on how many acausal elements a network can access, it is possible to analytically calculate the expected maximum performance that the network can achieve. Given a sequence of length T with elements from a vocabulary \{1, ..., V\}, a causal network such as the regular LSTM can output the second half of the elements correctly and guess those in the first half with probability 1/V. When a network has access to d acausal elements, it can start outputting correct elements before reaching the halfway point, and can achieve an expected true positive rate (TPR) of

\frac{1}{2}\left(1 + \frac{1}{V}\right) + \left\lfloor \frac{d+1}{2} \right\rfloor \frac{1}{T}\left(1 - \frac{1}{V}\right).

We generate data sequences of length T = 20 by uniformly sampling integer values between 1 and V = 4. The training set consists of 10,000 sequences, the validation set of 2,000, and the test set of 2,000.
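The expected-TPR bound above can be evaluated directly. A small sketch (the floor term reflects that every two extra steps of delay make one more position recoverable; at d = 0 it reduces to the causal baseline, and at d = T - 1 the whole sequence is visible before any output is produced):

```python
from math import floor

def expected_tpr(T, V, d):
    """Expected true-positive rate for sequence reversal with delay d.

    A causal network (d = 0) outputs the second half correctly and
    guesses the first half with probability 1/V; each two extra steps
    of delay make one more position recoverable.
    """
    return 0.5 * (1 + 1 / V) + floor((d + 1) / 2) * (1 / T) * (1 - 1 / V)

# T = 20, V = 4 as in the experiment:
assert abs(expected_tpr(20, 4, 0) - 0.625) < 1e-12   # causal LSTM baseline
assert abs(expected_tpr(20, 4, 19) - 1.0) < 1e-12    # full lookahead
```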
Output sequences are the input sequences reversed. Values in the input sequences are fed as one-hot vector representations. All networks output via a linear layer with a softmax function that converts to a vector of V probabilities, to which the cross-entropy loss is applied. The LSTM and d-LSTM networks have 100 hidden units, while the Bi-LSTM has 70 in each direction in order to keep the total number of parameters constant. We use batches of 100 sequences and train for 1,000 epochs with early stopping after 10 epochs and ∆ = 1e-3.

Figure 4 shows accuracy on this task as a function of the applied delay. The LSTM does not use acausal information and is unable to reverse more than half of the input sequence. Conversely, the Bi-LSTM has full access to every element in the sequence, and can perfectly solve the task. For the d-LSTM network, performance increases as we increase the delay in the output, reaching the same level as the Bi-LSTM once the network has access to the entire sequence before being required to produce any output (delay 19). This experiment demonstrates that the d-LSTM can "interpolate" between the LSTM and Bi-LSTM by choosing a delay that ranges between zero and the length of the input sequence.

4.2. Evaluating Network Capabilities

The first experiment showed how a d-LSTM with sufficient delay can mimic a Bi-LSTM. In the next experiment we aim to compare how well d-LSTM, LSTM, and Bi-LSTM networks approximate functions with varying degrees of non-linearity and acausality. Drawing inspiration from Schuster & Paliwal (1997), we require each recurrent network to learn the function

y_t = \sin\left(\gamma \sum_{j=-c+1}^{a} w_{j+c}\, x_{t+j}\right),

where w is a linear filter.
The parameter γ scales the argument of the sine function and thus controls the degree of non-linearity of the function: for small γ the function is roughly linear, while for large γ the function is highly non-linear. Integers a ≥ 0 (acausal) and c ≥ 0 (causal) control the lengths of the acausal and causal portions of the linear filter w that is applied to the input x.

Figure 4. Comparison of different delay values for a d-LSTM network for reversing a sequence. LSTM and Bi-LSTM networks are shown for reference. The network is capable of achieving the expected statistical bound. The d-LSTM with the highest delay is capable of solving the task as well as the Bi-LSTM.

We generate datasets with different combinations of γ ∈ [0.1, ..., 5.0] and a ∈ [0, ..., 10], choosing c such that a + c = 20. For each combination, we generate a filter w with 20 elements drawn uniformly from [0.0, 1.0), and random input sequences with T = 50 elements drawn from a uniform distribution on [0.0, 1.0). In total, there are 10,000 generated sequences for training, 2,000 for validation, and 2,000 for testing with each set of parameter values. The output is computed following the previous formula, with zero padding at the borders. We generate 5 repetitions of each dataset with different filters w and inputs x.

We train LSTM, d-LSTM with delays 5 and 10, and Bi-LSTM networks to minimize the mean squared error (MSE). The LSTM and d-LSTM have 100 hidden units and the Bi-LSTM has 70 per direction, matching the numbers of parameters. A linear layer after the recurrent layer outputs a single value per timestep. Models are trained in batches of 100 sequences for 1,000 epochs.
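The target generation just described can be sketched as follows. This is illustrative NumPy code; the padding convention (c - 1 zeros before and a zeros after the sequence) is our reading of "zero padding at the borders":

```python
import numpy as np

def make_targets(x, w, gamma, a, c):
    """y_t = sin(gamma * sum_{j=-c+1}^{a} w_{j+c} x_{t+j}), zero-padded borders.

    a acausal taps and c causal taps; the filter w has length a + c.
    """
    T = len(x)
    # pad with c - 1 zeros before the sequence and a zeros after it
    xp = np.concatenate([np.zeros(c - 1), x, np.zeros(a)])
    # the window for output t covers inputs x_{t-c+1} .. x_{t+a}
    return np.array([np.sin(gamma * np.dot(w, xp[t:t + a + c]))
                     for t in range(T)])

# Dataset parameters from the experiment: T = 50, a + c = 20.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
w = rng.uniform(0.0, 1.0, 20)
y = make_targets(x, w, gamma=0.5, a=5, c=15)
assert y.shape == (50,) and np.all(np.abs(y) <= 1.0)
```

With a = 0 the filter is purely causal; larger a requires the network to look further into the future.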
Training is stopped if the validation MSE falls below 1e-5. Training is repeated five times for each (γ, a) value.

Figure 5. Error maps for the sine function experiment with different degrees of non-linearity (horizontal axis) and amounts of acausality of the filter (vertical axis). Tested architectures: (a) LSTM, (b) Bi-LSTM, (c) d-LSTM with delay=5, and (d) d-LSTM with delay=10. Dark blue regions depict perfect filtering (low error), transitioning to yellow regions with high error.

Figure 5 shows the average test MSE for each model as a function of γ (degree of input non-linearity) and a (acausality). LSTM performance (Fig. 5(a)) is poor everywhere except where the filter is purely causal. Surprisingly, the network performs quite well even when the amount of non-linearity (γ) is quite high. The reason for this seems to be that temporal depth enables the LSTM to approximate this function well. Bi-LSTM performance (Fig. 5(b)) follows a similar trend for the causal case (a = 0) as the forward LSTM, but also has good performance for acausal filters (a > 0) when the function is nearly linear (γ is small). As the non-linearity of the function increases, however, Bi-LSTM performance suffers. This occurs because the Bi-LSTM needs to approximate a highly non-linear function with a linear combination of its forward and backward outputs, which cannot be done with small error. Improving performance would require stacked Bi-LSTM layers.

In contrast, d-LSTM networks have excellent performance for both non-linear and acausal functions. The d-LSTM with delay 5 (Fig. 5(c)) shows a clear switch in performance from acausality a = 5 to 6. This perfectly matches the limit of acausal elements that the network has access to. For the d-LSTM with delay 10 (Fig. 5(d)), the network performs well for acausality values a up to 10.

An interesting outcome of this experiment is the better performance observed for the d-LSTM over the Bi-LSTM. This shows that the d-LSTM can be a better fit than a Bi-LSTM for the right task. Furthermore, the d-LSTM network seems to approximate the functionality of a stacked Bi-LSTM by approximating highly non-linear functions. In practice, this could be a great benefit for applications where there is no need to process the whole sequence, or where doing so is impossible, as with streamed data. In such cases, the d-LSTM would shine over bidirectional architectures. On the other hand, we expect the Bi-LSTM to perform better when the acausality needs of the task are longer than the delay, i.e., a > d.

4.3. Masked Character-Level Language Modeling

Next we examine a language task which should benefit from acausal information: masked character-level language modeling. This task is adapted from previous work on training bidirectional language models (Devlin et al., 2019). To generate masked sequences, we randomly replace each character with a mask token ('[MASK]') with 20% probability. The task of the network is to predict the correct character when it encounters a mask token.
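The masking procedure can be sketched as follows. This is illustrative Python; the 20% masking rate comes from the text, while all names and the seeding are assumptions:

```python
import random

MASK = "[MASK]"

def mask_sequence(chars, p=0.2, seed=0):
    """Replace each character with the mask token with probability p.

    Returns the degraded input and the masked positions; only masked
    positions contribute to the training loss.
    """
    rng = random.Random(seed)
    masked = [MASK if rng.random() < p else ch for ch in chars]
    loss_positions = [i for i, ch in enumerate(masked) if ch == MASK]
    return masked, loss_positions

inp, positions = mask_sequence(list("hiking in the hills"))
assert all(inp[i] == MASK for i in positions)
assert all(inp[i] != MASK for i in range(len(inp)) if i not in positions)
```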
Because each sequence contains multiple mask tokens, the network will need to fill in some mask tokens conditioned on an input sequence that already contains one or more mask tokens. This can be thought of as a signal reconstruction task: when sequential inputs are randomly degraded, how well can the network recover the true signal? Acausal information clearly helps with this reconstruction. For example, the missing letter in the sequence "hik[MASK]ng" is easier to predict than in the sequence "hik[MASK]".

We used text8, a clean 100MB sample of English Wikipedia text (Mahoney, 2006) which consists of 27 characters (the English alphabet and spaces). The input data contained an extra 28th mask character. These 28 characters were mapped to an input embedding layer of dimension 10. The output layer was independent of the input embedding, and only consisted of the 27 non-mask characters. Following previous work (Mikolov et al., 2012), the first 90M characters formed the training set, the next 5M the validation set, and the last 5M the test set. All models were trained with a sequence length of 180 characters, in mini-batches of 128 sequences for a total of 20 epochs. Success on the task is measured by calculating bits-per-character (BPC) for the mask tokens only. We measured forward-pass runtimes on an Nvidia Titan V GPU and report the average time to process a mini-batch.

The results are summarized in Table 1. As expected, the stacked Bi-LSTMs achieve the lowest BPC. However, as the number of layers increases, the inference runtime also increases because of the synchronization needed between layers. Notably, d-LSTMs with intermediate delays achieve a BPC that is within 5% of the Bi-LSTM with at least 4× faster runtime.
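The evaluation metric can be made concrete with a short sketch: BPC over mask tokens only is the mean negative log2-probability assigned to the true character at each masked position. The function below is illustrative (not the paper's code) and assumes the model emits natural-log probabilities per character:

```python
import math

def masked_bpc(char_log_probs, targets, mask_positions):
    """Bits-per-character over mask tokens only: average -log2 p(true char)
    at each masked position. char_log_probs[t][c] is the model's natural-log
    probability of character c at position t."""
    bits = 0.0
    for t in mask_positions:
        bits -= char_log_probs[t][targets[t]] / math.log(2)  # nats -> bits
    return bits / len(mask_positions)

# a model that is uniform over the 27 output characters scores log2(27) bits
uniform = [{c: math.log(1 / 27) for c in "abcdefghijklmnopqrstuvwxyz "}] * 3
bpc = masked_bpc(uniform, targets=["a", "b", "c"], mask_positions=[0, 2])
```

A uniform predictor therefore scores about 4.75 BPC, which puts the sub-1.0 values in Table 1 in context.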
Since all of the d-LSTMs have a single layer, inference runtime remains constant as the delay and the capacity of these networks increase. We find similar results for other network capacities (see supplementary material).

Table 1. Performance of different networks on the masked character-level language modeling task in bits per character (BPC); lower is better. Mean and standard deviation are computed over 5 repetitions of training, with inference runtime measured on the test set.

MODEL     LAYERS  DELAY  UNITS/LAYER  PARAMS.  VAL. BPC        TEST BPC        RUNTIME
LSTM      1       -      1024         4271411  2.003 ± 0.003   2.075 ± 0.002   3.44 ms ± 0.09
LSTM      2       -      594          4283641  2.015 ± 0.005   2.087 ± 0.005   4.93 ms ± 0.13
LSTM      5       -      343          4272372  2.091 ± 0.016   2.155 ± 0.014   17.22 ms ± 0.62
BI-LSTM   1       -      722          4278879  0.977 ± 0.004   1.037 ± 0.004   4.97 ms ± 0.07
BI-LSTM   2       -      363          4277173  0.633 ± 0.003   0.677 ± 0.002   13.72 ms ± 0.31
BI-LSTM   5       -      202          4287151  0.637 ± 0.003   0.677 ± 0.004   29.18 ms ± 0.23
D-LSTM    1       1      1024         4271411  1.332 ± 0.001   1.390 ± 0.001   3.29 ms ± 0.22
D-LSTM    1       5      1024         4271411  0.708 ± 0.005   0.755 ± 0.004   3.39 ms ± 0.08
D-LSTM    1       8      1024         4271411  0.662 ± 0.002   0.706 ± 0.003   3.36 ms ± 0.08
D-LSTM    1       10     1024         4271411  0.666 ± 0.004   0.709 ± 0.004   3.56 ms ± 0.10

Table 2. Part-of-Speech performance for the German, English, and French languages. The models are composed of two subnetworks at the character level and word level. The best bidirectional network and the best forward-only network are marked for each language.
LANGUAGE  CHAR-LEVEL NETWORK  WORD-LEVEL NETWORK  VALIDATION ACC.  TEST ACC.
GERMAN    LSTM                LSTM                92.05 ± 0.16     91.58 ± 0.11
GERMAN    D-LSTM (DELAY=1)    D-LSTM (DELAY=1)    93.48 ± 0.31     92.87 ± 0.24
GERMAN    D-LSTM (DELAY=1)    BI-LSTM             93.93 ± 0.06     93.39 ± 0.18
GERMAN    BI-LSTM             BI-LSTM             93.88 ± 0.13     93.15 ± 0.08
ENGLISH   LSTM                LSTM                92.05 ± 0.13     92.14 ± 0.10
ENGLISH   D-LSTM (DELAY=1)    D-LSTM (DELAY=1)    94.57 ± 0.08     94.57 ± 0.14
ENGLISH   D-LSTM (DELAY=1)    BI-LSTM             94.94 ± 0.07     94.95 ± 0.06
ENGLISH   BI-LSTM             BI-LSTM             94.85 ± 0.05     94.84 ± 0.08
FRENCH    LSTM                LSTM                96.67 ± 0.07     96.10 ± 0.11
FRENCH    D-LSTM (DELAY=1)    D-LSTM (DELAY=1)    97.49 ± 0.04     97.04 ± 0.13
FRENCH    D-LSTM (DELAY=1)    BI-LSTM             97.67 ± 0.07     97.23 ± 0.12
FRENCH    BI-LSTM             BI-LSTM             97.63 ± 0.06     97.22 ± 0.11

4.4. Real-World Part-of-Speech Tagging

In the previous experiments, we showed that the d-LSTM is capable of approximating and even outperforming a Bi-LSTM in some cases. In practice, however, the elements in a sequence may have different forward and backward relations. This poses a challenge for delayed networks that are constrained to a specific delay. If the delay is too low, it may not be enough to capture some long dependencies between elements. If it is too high, the network may forget information and require higher capacity (and perhaps more training data). This is prevalent in several NLP tasks. Therefore we compare the performance of the d-LSTM with a Bi-LSTM on an NLP task where Bi-LSTMs achieve state-of-the-art performance: Part-of-Speech (POS) tagging (Ling et al., 2015; Ballesteros et al., 2015; Plank et al., 2016). The task involves processing a variable-length sequence to predict a POS tag (e.g.
Noun, Verb) per word, using the Universal Dependencies (UD) (Nivre et al., 2016) dataset. More details can be found in the supplementary material.

We follow the dual Bi-LSTM architecture proposed by Plank et al. (2016) to test the approximation capacity of the d-LSTMs. In this model, a word is encoded using a combination of word embeddings and a character-level encoding. The encoded word is fed to a Bi-LSTM followed by a linear layer with softmax to produce POS tags. The character-level encoding is produced by first computing the embedding of each character and then feeding it to a Bi-LSTM. The last hidden state in each direction is concatenated with the word embedding to form the character-level encoding. The character-level Bi-LSTM has 100 units in each direction, and the LSTM/d-LSTMs have 200 units to generate encodings of the same size. For the word-level subnetwork, the hidden state is of size 188 for the Bi-LSTM, and 300 units for the LSTM/d-LSTM to match the number of parameters. The networks are trained for 20 epochs with cross-entropy loss. We train combinations of networks with delays 0 (LSTM), 1, 3, and 5 for the character-level subnetwork, and delays 0 through 4 for the word-level subnetwork. Each network is trained 5 times with random initializations.

Results are presented in Table 2. For brevity, we include a subset of the combinations for each language (the complete table can be found in the supplementary material). For the character-level model, LSTMs without delay yield reduced performance. However, replacing only the character-level Bi-LSTM with an LSTM does not affect the performance (supplementary material). This suggests that only the word-level subnetwork benefits from acausal elements in the sentence.
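The word representation described above can be sketched at the shape level as follows; this is an illustration under the paper's reported sizes, not the authors' implementation:

```python
import numpy as np

def word_representation(word_emb, char_h_fwd, char_h_bwd):
    """Plank et al. (2016)-style encoding: the final state of each direction
    of the 100-unit character Bi-LSTM, concatenated with the word embedding."""
    return np.concatenate([char_h_fwd[-1], char_h_bwd[-1], word_emb])

def word_representation_dlstm(word_emb, char_h_delayed):
    """d-LSTM variant: a single 200-unit delayed network replaces the Bi-LSTM,
    so its final state alone matches the 200-dim character encoding."""
    return np.concatenate([char_h_delayed[-1], word_emb])

word_emb = np.zeros(64)                       # illustrative embedding size
bi = word_representation(word_emb, np.zeros((5, 100)), np.zeros((5, 100)))
dl = word_representation_dlstm(word_emb, np.zeros((5 + 1, 200)))  # delay 1 adds a step
```

Both variants produce encodings of the same dimensionality, which is what lets the d-LSTM drop into the Bi-LSTM's place without changing the word-level subnetwork.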
Interestingly, using a d-LSTM with delay 1 for the character-level network achieves a small improvement over the double-bidirectional model in English and German. Replacing the word-level Bi-LSTM with an LSTM decreases performance significantly. However, using even a d-LSTM with delay 1 improves performance to within 0.3% of the original Bi-LSTM model.

5. Conclusions

In this paper we analyzed the d-RNN, a single-layer RNN where the output is delayed relative to the input. We showed that this simple modification to the classical RNN adds both depth in time and acausal processing. We proved that the d-RNN is a superset of stacked RNNs, which are frequently used for sequence problems: a d-RNN with output delay d and specific constraints on its weights is exactly equivalent to a stacked RNN with d + 1 layers. We also showed that the d-RNN can approximate bidirectional RNNs and stacked bidirectional RNNs because the delay allows the model to look at future as well as past inputs. In sum, we found that d-RNNs are a simple, elegant, and computationally efficient alternative that captures many of the best features of different RNN architectures while avoiding many downsides.

References

Al-Rfou', R., Perozzi, B., and Skiena, S. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 183-192, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W13-3520.

Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173-182, 2016.

Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X.
, Miller, J., Ng, A., Raiman, J., Sengupta, S., and Shoeybi, M. Deep Voice: Real-time neural text-to-speech. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML '17, pp. 195-204. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305381.3305402.

Ballesteros, M., Dyer, C., and Smith, N. A. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 349-359, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1041. URL https://www.aclweb.org/anthology/D15-1041.

Baziotis, C., Pelekis, N., and Doulkeridis, C. DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 747-754, 2017.

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009. ISSN 1935-8237. doi: 10.1561/2200000006. URL http://dx.doi.org/10.1561/2200000006.

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, March 1994. ISSN 1045-9227. doi: 10.1109/72.279181.

Bengio, Y., LeCun, Y., et al. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 34(5):1-41, 2007.

Bianchini, M. and Scarselli, F. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553-1565, Aug 2014. ISSN 2162-237X. doi: 10.1109/TNNLS.2013.2293637.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Gated feedback recurrent neural networks.
In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML '15, pp. 2067-2075. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045338.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs], May 2019. URL http://arxiv.org/abs/1810.04805.

Graves, A. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.

Graves, A. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.

Graves, A. and Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602-610, 2005. ISSN 0893-6080. doi: 10.1016/j.neunet.2005.06.042. URL http://www.sciencedirect.com/science/article/pii/S0893608005001206. IJCNN 2005.

Guo, T., Xu, Z., Yao, X., Chen, H., Aberer, K., and Funaya, K. Robust online time series prediction with recurrent neural networks. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 816-825, Oct 2016. doi: 10.1109/DSAA.2016.92.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

Li, S., Li, W., Cook, C., Zhu, C., and Gao, Y.
Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Ling, W., Dyer, C., Black, A. W., Trancoso, I., Fermandez, R., Amir, S., Marujo, L., and Luis, T. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1520-1530, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1176. URL https://www.aclweb.org/anthology/D15-1176.

Mahoney, M. Relationship of Wikipedia text to clean text, June 2006. URL http://mattmahoney.net/dc/textdata.html.

Mikolov, T., Sutskever, I., Deoras, A., Le, H.-S., and Kombrink, S. Subword language modeling with neural networks. Preprint, 2012. URL http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf.

Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., and Zeman, D. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 1659-1666, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). URL https://www.aclweb.org/anthology/L16-1262.

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML '13, pp. III-1310-III-1318. JMLR.org, 2013. URL http://dl.acm.org/citation.cfm?id=3042817.3043083.

Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. How to construct deep recurrent neural networks.
In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), 2014.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS Workshop on the Future of Gradient-Based Machine Learning Software & Techniques, 2017.

Plank, B., Søgaard, A., and Goldberg, Y. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 412-418, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-2067. URL https://www.aclweb.org/anthology/P16-2067.

Raiko, T., Valpola, H., and LeCun, Y. Deep learning made easier by linear transformations in perceptrons. In Lawrence, N. D. and Girolami, M. (eds.), Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pp. 924-932, La Palma, Canary Islands, 21-23 Apr 2012. PMLR. URL http://proceedings.mlr.press/v22/raiko12.html.

Schmidhuber, J. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992.

Schuster, M. and Paliwal, K. K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673-2681, Nov 1997. ISSN 1053-587X. doi: 10.1109/78.650093.

Silveira, N., Dozat, T., de Marneffe, M.-C., Bowman, S., Connor, M., Bauer, J., and Manning, C. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 2897-2904, Reykjavik, Iceland, May 2014.
European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2014/pdf/1089_Paper.pdf.

Sutskever, I., Martens, J., and Hinton, G. E. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017-1024, 2011.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.

Zhang, S., Wu, Y., Che, T., Lin, Z., Memisevic, R., Salakhutdinov, R. R., and Bengio, Y. Architectural complexity measures of recurrent neural networks. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 1822-1830. Curran Associates, Inc., 2016.

Zilly, J. G., Srivastava, R. K., Koutník, J., and Schmidhuber, J. Recurrent highway networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 4189-4198, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/zilly17a.html.

A. Theorem 1 Proof

Let us recall the notation introduced in the main paper. We use superscript $(i)$ to refer to a weight matrix or vector related to layer $i$ in a stacked network, e.g., $W_h^{(i)}$ or $h_t^{(i)}$. For a single-layer d-RNN, we refer to weight matrices and related vectors with a "hat", e.g., $\hat{W}_h$ or $\hat{h}_t$.
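Before walking through the proof, Theorem 1 can be checked numerically. The sketch below (illustrative, not the paper's code) builds a random $k$-layer stacked RNN and the corresponding weight-constrained single-layer d-RNN: the big recurrent matrix is block lower-bidiagonal, with $W_h^{(i)}$ on the diagonal blocks and $W_x^{(i)}$ on the subdiagonal blocks, and the input feeds only the first block. Biases are set to zero so that, with zero initial states, the theorem's initialization condition $\hat{h}_{i-1}^{\{i\}} = h_0^{(i)}$ holds automatically (since $\tanh(0) = 0$), sidestepping the general construction of Lemma 1:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, q, m, T = 3, 4, 5, 2, 8  # layers, hidden, input, output dims, length

# random stacked-RNN weights (zero biases; see lead-in)
Wx = [0.5 * rng.standard_normal((n, q if i == 0 else n)) for i in range(k)]
Wh = [0.5 * rng.standard_normal((n, n)) for i in range(k)]
Wo = 0.5 * rng.standard_normal((m, n))
x = rng.standard_normal((T, q))

# run the stacked RNN
h = [np.zeros(n) for _ in range(k)]
y = []
for t in range(T):
    inp = x[t]
    for i in range(k):
        h[i] = np.tanh(Wx[i] @ inp + Wh[i] @ h[i])  # layer i at time t
        inp = h[i]
    y.append(Wo @ h[-1])

# constrained single-layer d-RNN with delay d = k - 1
Wx_hat = np.zeros((k * n, q)); Wx_hat[:n] = Wx[0]   # input reaches block 1 only
Wh_hat = np.zeros((k * n, k * n))
for i in range(k):
    Wh_hat[i*n:(i+1)*n, i*n:(i+1)*n] = Wh[i]        # diagonal: recurrent weights
    if i > 0:
        Wh_hat[i*n:(i+1)*n, (i-1)*n:i*n] = Wx[i]    # subdiagonal: layer inputs
h_hat = np.zeros(k * n)
y_hat = []
for t in range(T + k - 1):
    xt = x[t] if t < T else np.zeros(q)             # pad the tail with zeros
    h_hat = np.tanh(Wx_hat @ xt + Wh_hat @ h_hat)
    y_hat.append(Wo @ h_hat[-n:])                   # read the last block

# the d-RNN output matches the stacked RNN, delayed by exactly k - 1 steps
assert all(np.allclose(y[t], y_hat[t + k - 1]) for t in range(T))
```

The zero-padded tail steps only pollute the lower blocks at times that never reach the read-out positions, so the equality is exact up to floating-point tolerance.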
Additionally, we define the block notation as follows: the subvector $\hat{v}_t^{\{i\}}$ refers to the $i$-th block of a vector $\hat{v}_t$ composed of $k$ blocks. The blocks follow the definition in Equations (3)-(5).

Proof of Theorem 1. We prove Theorem 1 by induction on the sequence length $t$. First, we show that for $t = 1$ the stacked RNN and the d-RNN with the constrained weights are equivalent. Namely, for $t = 1$ we show that the outputs and the hidden states are the same, i.e., $\hat{y}_k = y_1$ and $\hat{h}_i^{\{i\}} = h_1^{(i)}$, respectively. Without loss of generality, we have for any $i$ in $1 \ldots k$ the following:

$$
\begin{aligned}
\hat{h}_i^{\{i\}} &= f^{\{i\}}\!\left(\hat{W}_x x_i + \hat{W}_h \hat{h}_{i-1} + \hat{b}_h\right) \\
&= f\!\left(\hat{W}_x^{\{i\}} x_i + W_x^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_h^{(i)} \hat{h}_{i-1}^{\{i\}} + b_h^{(i)}\right) \\
&= f\!\left(0 + W_x^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= f\!\left(W_x^{(i)} \cdot f\!\left(W_x^{(i-1)} \hat{h}_{i-2}^{\{i-2\}} + W_h^{(i-1)} \hat{h}_{i-2}^{\{i-1\}} + b_h^{(i-1)}\right) + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= f\!\left(W_x^{(i)} \cdot f\!\left(W_x^{(i-1)} \hat{h}_{i-2}^{\{i-2\}} + W_h^{(i-1)} h_0^{(i-1)} + b_h^{(i-1)}\right) + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(2)} \cdot f\!\left(W_x^{(1)} x_1 + W_h^{(1)} h_0^{(1)} + b_h^{(1)}\right) + W_h^{(2)} h_0^{(2)} + b_h^{(2)}\right) \cdots + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(2)} h_1^{(1)} + W_h^{(2)} h_0^{(2)} + b_h^{(2)}\right) \cdots + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(j)} h_1^{(j-1)} + W_h^{(j)} h_0^{(j)} + b_h^{(j)}\right) \cdots + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} h_1^{(i-1)} + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) = h_1^{(i)},
\end{aligned}
$$

where we used the initialization assumption $\hat{h}_{i-1}^{\{i\}} = h_0^{(i)}$ for all $i = 1 \ldots k$, and the definition of the hidden state in Equations (3)-(4) for $j - 1$ blocks, in the previous steps.
In particular, we have for $j = k$ that $\hat{h}_k^{\{k\}} = h_1^{(k)}$. Plugging this result and the definition of the output weights and biases in Equation (8) into Equation (2) for computing the output, we obtain

$$
\hat{y}_k = g\!\left(\hat{W}_o \hat{h}_k + \hat{b}_o\right) = g\!\left(W_o \hat{h}_k^{\{k\}} + b_o\right) = g\!\left(W_o h_1^{(k)} + b_o\right) = y_1, \tag{A.9}
$$

which concludes the basis of the induction.

Next, we assume that $\hat{h}_{t+i-1}^{\{i\}} = h_t^{(i)}$ for all $1 \le i \le k$ and $t \le T - 1$, and prove that it holds for the hidden states for all layers when $t = T$: $\hat{h}_{T+i-1}^{\{i\}} = h_T^{(i)}$, $\forall\, 1 \le i \le k$. Without loss of generality, we have for the hidden state $\hat{h}_{T+i-1}^{\{i\}}$ in the constrained-weights single-layer d-RNN that

$$
\begin{aligned}
\hat{h}_{T+i-1}^{\{i\}} &= f^{\{i\}}\!\left(\hat{W}_x x_{T+i-1} + \hat{W}_h \hat{h}_{T+i-2} + \hat{b}_h\right) \\
&= f\!\left(\hat{W}_x^{\{i\}} x_{T+i-1} + W_x^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_h^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_h^{(i)}\right) \\
&= f\!\left(0 + W_x^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_h^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_h^{(i)}\right) \\
&= f\!\left(W_x^{(i)} \cdot f\!\left(W_x^{(i-1)} \hat{h}_{T+i-3}^{\{i-2\}} + W_h^{(i-1)} \hat{h}_{T+i-3}^{\{i-1\}} + b_h^{(i-1)}\right) + W_h^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(j)} \cdots f\!\left(W_x^{(2)} f\!\left(W_x^{(1)} x_T + W_h^{(1)} \hat{h}_{T-1}^{\{1\}} + b_h^{(1)}\right) + W_h^{(2)} \hat{h}_T^{\{2\}} + b_h^{(2)}\right) \cdots + W_h^{(j)} \hat{h}_{T+j-2}^{\{j\}} + b_h^{(j)}\right) \cdots + W_h^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_h^{(i)}\right).
\end{aligned}
$$

From the inductive assumption we have $\hat{h}_{T+j-2}^{\{j\}} = h_{T-1}^{(j)}$ for all $1 \le j \le k$; then it follows that

$$
\begin{aligned}
\hat{h}_{T+i-1}^{\{i\}} &= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(j)} \cdots f\!\left(W_x^{(2)} f\!\left(W_x^{(1)} x_T + W_h^{(1)} h_{T-1}^{(1)} + b_h^{(1)}\right) + W_h^{(2)} h_{T-1}^{(2)} + b_h^{(2)}\right) \cdots + W_h^{(j)} h_{T-1}^{(j)} + b_h^{(j)}\right) \cdots + W_h^{(i)} h_{T-1}^{(i)} + b_h^{(i)}\right) \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(j)} \cdots f\!\left(W_x^{(2)} h_T^{(1)} + W_h^{(2)} h_{T-1}^{(2)} + b_h^{(2)}\right) \cdots + W_h^{(j)} h_{T-1}^{(j)} + b_h^{(j)}\right) \cdots + W_h^{(i)} h_{T-1}^{(i)} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(j)} h_T^{(j-1)} + W_h^{(j)} h_{T-1}^{(j)} + b_h^{(j)}\right) \cdots + W_h^{(i)} h_{T-1}^{(i)} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} h_T^{(i-1)} + W_h^{(i)} h_{T-1}^{(i)} + b_h^{(i)}\right) = h_T^{(i)},
\end{aligned}
$$

where we used the definition of the hidden states in Equations (3)-(4). In particular, we have for $i = k$ that $\hat{h}_{T+k-1}^{\{k\}} = h_T^{(k)}$. Now, we show that $\hat{y}_{T+k-1} = y_T$. By the definition of the output weights and biases in Equation (8), and by the fact that $\hat{h}_{T+k-1}^{\{k\}} = h_T^{(k)}$, we obtain

$$
\hat{y}_{T+k-1} = g\!\left(\hat{W}_o \hat{h}_{T+k-1} + \hat{b}_o\right) = g\!\left(W_o \hat{h}_{T+k-1}^{\{k\}} + b_o\right) = g\!\left(W_o h_T^{(k)} + b_o\right) = y_T, \tag{A.10}
$$

which completes the proof. □

B. Lemma 1 Proof

We show next that there exists an initialization vector that allows us to initialize the equivalent single-layer weight-constrained d-RNN as defined in Theorem 1.

Proof of Lemma 1. From the surjectivity of the activation function $f(\cdot)$, we know that $f(\cdot)$ is right-invertible. Namely, there is a function $r : D \to \mathbb{R}$ such that for any $d \in D$, $r(\cdot)$ satisfies $f(r(d)) = d$. First, we note that for $i = 1$, we have $\hat{h}_0^{\{1\}} = h_0^{(1)}$. When $i = 2$, we have

$$
h_0^{(2)} = \hat{h}_1^{\{2\}} = f\!\left(W_x^{(2)} h_0^{(1)} + W_h^{(2)} \hat{h}_0^{\{2\}} + b_h^{(2)}\right). \tag{B.11}
$$

From (B.11) and the right inverse $r(\cdot)$, which satisfies $h_0^{(2)} = f\!\left(r\!\left(h_0^{(2)}\right)\right)$, we obtain

$$
r\!\left(h_0^{(2)}\right) = W_x^{(2)} h_0^{(1)} + W_h^{(2)} \hat{h}_0^{\{2\}} + b_h^{(2)}
\;\implies\;
\hat{h}_0^{\{2\}} = {W_h^{(2)}}^{\dagger}\!\left[r\!\left(h_0^{(2)}\right) - W_x^{(2)} h_0^{(1)} - b_h^{(2)}\right], \tag{B.12}
$$

where $A^{\dagger}$ is the pseudoinverse of matrix $A$.
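The construction in Equation (B.12) is easy to exercise numerically when $f = \tanh$ (so $r = \operatorname{arctanh}$, a right inverse of $\tanh$ on $(-1, 1)$) and $W_h^{(2)}$ is square and, almost surely for a random draw, invertible. The sketch below is illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
Wx2 = rng.standard_normal((n, n))      # W_x^{(2)}
Wh2 = rng.standard_normal((n, n))      # W_h^{(2)}, square and a.s. invertible
b2 = rng.standard_normal(n)            # b_h^{(2)}
h1_0 = 0.1 * rng.standard_normal(n)    # h_0^{(1)}
h2_0 = rng.uniform(-0.9, 0.9, n)       # target h_0^{(2)}, inside tanh's range

# Eq. (B.12): invert the tanh update through the pseudoinverse of W_h^{(2)}
h2_hat_0 = np.linalg.pinv(Wh2) @ (np.arctanh(h2_0) - Wx2 @ h1_0 - b2)

# one constrained d-RNN step from this initialization recovers h_0^{(2)}
recovered = np.tanh(Wx2 @ h1_0 + Wh2 @ h2_hat_0 + b2)
assert np.allclose(recovered, h2_0, atol=1e-6)
```

When $W_h^{(2)}$ is invertible the pseudoinverse coincides with the inverse and the recovery is exact up to floating-point error; the pseudoinverse form is what the lemma uses in general.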
We assume that we have obtained the initializations for block $i - 1$ and compute the initialization for block $i$. In general, for block $i$ we have

$$
h_0^{(i)} = \hat{h}_{i-1}^{\{i\}} = f\!\left(W_x^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_h^{(i)} \hat{h}_{i-2}^{\{i\}} + b_h^{(i)}\right).
$$

We can plug in the initialization and the intermediate computed hidden states for block $i - 1$ to obtain

$$
\hat{h}_{i-2}^{\{i\}} = {W_h^{(i)}}^{\dagger}\!\left[r\!\left(h_0^{(i)}\right) - W_x^{(i)} \hat{h}_{i-1}^{\{i-1\}} - b_h^{(i)}\right].
$$

We continue to reapply the recursive formula one step at a time until we reach the last step before the initialization $\hat{h}_0^{\{i\}}$:

$$
\begin{aligned}
\hat{h}_{i-j}^{\{i\}} &= {W_h^{(i)}}^{\dagger}\!\left[r\!\left(\hat{h}_{i-j+1}^{\{i\}}\right) - W_x^{(i)} \hat{h}_{i-j+1}^{\{i-1\}} - b_h^{(i)}\right] \\
&\;\;\vdots \\
\hat{h}_1^{\{i\}} &= f\!\left(W_x^{(i)} \hat{h}_1^{\{i-1\}} + W_h^{(i)} \hat{h}_0^{\{i\}} + b_h^{(i)}\right)
\;\implies\;
\hat{h}_0^{\{i\}} = {W_h^{(i)}}^{\dagger}\!\left[r\!\left(\hat{h}_1^{\{i\}}\right) - W_x^{(i)} \hat{h}_1^{\{i-1\}} - b_h^{(i)}\right]. \tag{B.13}
\end{aligned}
$$

Following these steps from $h_0^{(i)}$ to obtain $\hat{h}_0^{\{i\}}$, we have constructed the initialization of the weight-constrained d-RNN that accurately mimics the initialization of the stacked RNN. □

C. Extension to d-LSTMs

A Long Short-Term Memory recurrent cell (Hochreiter & Schmidhuber, 1997) is given by the introduction of a cell state and a series of gates that control the updates of the states. The cell state together with the gates aims to solve the vanishing gradients problem in the RNN.
The LSTM cell is highly popular, and we refer to the following implementation:

$$
\begin{aligned}
\hat{e}_t &= \sigma\!\left(\hat{W}_{xe} x_t + \hat{W}_{he} \hat{h}_{t-1} + \hat{b}_e\right), && \text{(C.14)} \\
\hat{f}_t &= \sigma\!\left(\hat{W}_{xf} x_t + \hat{W}_{hf} \hat{h}_{t-1} + \hat{b}_f\right), && \text{(C.15)} \\
\hat{o}_t &= \sigma\!\left(\hat{W}_{xo} x_t + \hat{W}_{ho} \hat{h}_{t-1} + \hat{b}_o\right), && \text{(C.16)} \\
\hat{g}_t &= \tanh\!\left(\hat{W}_{xc} x_t + \hat{W}_{hc} \hat{h}_{t-1} + \hat{b}_c\right), && \text{(C.17)} \\
\hat{c}_t &= \hat{f}_t \odot \hat{c}_{t-1} + \hat{e}_t \odot \hat{g}_t, && \text{(C.18)} \\
\hat{h}_t &= \hat{o}_t \odot \tanh(\hat{c}_t), && \text{(C.19)}
\end{aligned}
$$

where $\hat{e}_t$ is the input gate, $\hat{f}_t$ the forget gate, $\hat{o}_t$ the output gate, $\hat{g}_t$ the cell gate, $\hat{c}_t$ the cell state, and $\hat{h}_t$ the hidden state. The weight matrices are denoted $\hat{W}_{xa}$ and $\hat{W}_{ha}$, and the biases $\hat{b}_a$, with $a \in \{e, c, f, o\}$ being the respective gate. The symbol $\odot$ represents an element-wise product and $\sigma(\cdot)$ is the sigmoid function.

First, we note that the set of Equations (C.14)-(C.19) can be collapsed into the following two equations:

$$
\begin{aligned}
\hat{c}_t &= \sigma\!\left(\hat{W}_{xf} x_t + \hat{W}_{hf} \hat{h}_{t-1} + \hat{b}_f\right) \odot \hat{c}_{t-1}
+ \sigma\!\left(\hat{W}_{xe} x_t + \hat{W}_{he} \hat{h}_{t-1} + \hat{b}_e\right) \odot \tanh\!\left(\hat{W}_{xc} x_t + \hat{W}_{hc} \hat{h}_{t-1} + \hat{b}_c\right), && \text{(C.20)} \\
\hat{h}_t &= \sigma\!\left(\hat{W}_{xo} x_t + \hat{W}_{ho} \hat{h}_{t-1} + \hat{b}_o\right) \odot \tanh(\hat{c}_t). && \text{(C.21)}
\end{aligned}
$$

Rewriting the LSTM Equations (C.14)-(C.19) in this form leaves us with the recurrent equations in which both $\hat{h}_t$ and $\hat{c}_t$ depend on the previous hidden and cell states, $\hat{h}_{t-1}$ and $\hat{c}_{t-1}$, and the current input $x_t$.

Next, we describe the weight matrices for the single-layer d-LSTM that matches a stacked LSTM with $k$ layers. The matrices and biases follow the exact same pattern as in the RNN proof, and are the same for all gates:

$$
\hat{W}_{ha} = \begin{pmatrix}
W_{ha}^{(1)} & 0 & \cdots & & 0 \\
W_{xa}^{(2)} & W_{ha}^{(2)} & 0 & & \vdots \\
& \ddots & \ddots & & \\
& & W_{xa}^{(i)} & W_{ha}^{(i)} & \\
0 & \cdots & 0 & W_{xa}^{(k)} & W_{ha}^{(k)}
\end{pmatrix} \tag{C.22}
$$

$$
\hat{b}_{ha} = \begin{pmatrix} b_{ha}^{(1)} \\ \vdots \\ b_{ha}^{(k)} \end{pmatrix}, \qquad
\hat{W}_{xa} = \begin{pmatrix} W_{xa}^{(1)} \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \tag{C.23}
$$

where $\hat{W}_{xa} \in \mathbb{R}^{kn \times q}$ are the input weights, $\hat{W}_{ha} \in \mathbb{R}^{kn \times kn}$ the recurrent weights, and $\hat{b}_{ha} \in \mathbb{R}^{kn}$ the biases, for gate $a \in \{e, c, o, f\}$. We follow the same notation for blocks and layers introduced with Theorem 1. We omit the equations for the output element $\hat{y}_t$ as they are exactly the same as for the RNN in Theorem 1, and thus require the same steps for proving that the outputs are equal, i.e., $\hat{y}_{T+k-1} = y_T$. Therefore, for the LSTM theorem we focus on the hidden and cell states.

Theorem 2. Given an input sequence $\{x_t\}_{t=1 \ldots T}$ and a stacked LSTM with $k$ layers and initial states $\{h_0^{(i)}, c_0^{(i)}\}_{i=1 \ldots k}$, the d-LSTM with delay $d = k - 1$, defined by Equations (C.22)-(C.23) and initialized with $\hat{h}_0$ such that $\hat{h}_{i-1}^{\{i\}} = h_0^{(i)}$, $\forall\, i = 1 \ldots k$, and $\hat{c}_0$ such that $\hat{c}_{i-1}^{\{i\}} = c_0^{(i)}$, $\forall\, i = 1 \ldots k$, produces the same output sequence but delayed by $k - 1$ timesteps, i.e., $\hat{y}_{t+k-1} = y_t$ for all $t = 1 \ldots T$. Further, the sequences of hidden and cell states at each layer $i$ are equivalent with delay $i - 1$, i.e., $\hat{h}_{t+i-1}^{\{i\}} = h_t^{(i)}$ and $\hat{c}_{t+i-1}^{\{i\}} = c_t^{(i)}$ for all $1 \le i \le k$ and $t \ge 1$.

Proof. We prove Theorem 2 by induction on the sequence length $t$. First, we show that for $t = 1$ the stacked LSTM and the d-LSTM with the constrained weights are equivalent. Namely, for $t = 1$ we show that the outputs, hidden states, and cell states are the same, i.e., $\hat{y}_k = y_1$, $\hat{h}_i^{\{i\}} = h_1^{(i)}$, and $\hat{c}_i^{\{i\}} = c_1^{(i)}$, respectively. Without loss of generality, we have for any $i$ in $1 \ldots k$ the following:

$$
\begin{aligned}
\hat{h}_i^{\{i\}} &= \sigma\!\left(\hat{W}_{xo}^{\{i\}} x_i + \hat{W}_{ho}^{\{i\}} \hat{h}_{i-1} + \hat{b}_o^{\{i\}}\right) \odot \tanh\!\left(\hat{c}_i^{\{i\}}\right) \\
&= \sigma\!\left(W_{xo}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{ho}^{(i)} \hat{h}_{i-1}^{\{i\}} + b_o^{(i)}\right) \odot \tanh\!\left(\hat{c}_i^{\{i\}}\right) \\
&= \sigma\!\left(W_{xo}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{ho}^{(i)} h_0^{(i)} + b_o^{(i)}\right) \odot \tanh\!\Big(
\sigma\!\left(W_{xf}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{hf}^{(i)} h_0^{(i)} + b_f^{(i)}\right) \odot \hat{c}_{i-1}^{\{i\}} \\
&\qquad + \sigma\!\left(W_{xe}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{he}^{(i)} h_0^{(i)} + b_e^{(i)}\right) \odot \tanh\!\left(W_{xc}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{hc}^{(i)} h_0^{(i)} + b_c^{(i)}\right)\Big) \\
&= \cdots
\end{aligned}
$$

Unrolling the recursion block by block, substituting the initialization assumptions $\hat{h}_{i-1}^{\{i\}} = h_0^{(i)}$ and $\hat{c}_{i-1}^{\{i\}} = c_0^{(i)}$ at each level, and recognizing at each level $j$ the stacked-LSTM update of Equations (C.20)-(C.21) for $h_1^{(j)}$, the chain collapses level by level, exactly as in the RNN case, to

$$
\begin{aligned}
\hat{h}_i^{\{i\}} &= \sigma\!\left(W_{xo}^{(i)} h_1^{(i-1)} + W_{ho}^{(i)} h_0^{(i)} + b_o^{(i)}\right) \odot \tanh\!\Big(
\sigma\!\left(W_{xf}^{(i)} h_1^{(i-1)} + W_{hf}^{(i)} h_0^{(i)} + b_f^{(i)}\right) \odot c_0^{(i)} \\
&\qquad + \sigma\!\left(W_{xe}^{(i)} h_1^{(i-1)} + W_{he}^{(i)} h_0^{(i)} + b_e^{(i)}\right) \odot \tanh\!\left(W_{xc}^{(i)} h_1^{(i-1)} + W_{hc}^{(i)} h_0^{(i)} + b_c^{(i)}\right)\Big) = h_1^{(i)},
\end{aligned}
$$

where we used the initialization assumptions $\hat{h}_{i-1}^{\{i\}} = h_0^{(i)}$ and $\hat{c}_{i-1}^{\{i\}} = c_0^{(i)}$ for all $i = 1 \ldots k$, and the definition of the hidden and cell states in Equations (C.20) and (C.21) for the previous blocks. In particular, we have for layer $k$ that $\hat{h}_k^{\{k\}} = h_1^{(k)}$, and using the same transformations as in (A.9) for RNNs, we obtain $\hat{y}_k = y_1$.

Furthermore, we obtain for the cell state:

$$
\begin{aligned}
\hat{c}_i^{\{i\}} &= \sigma\!\left(\hat{W}_{xf}^{\{i\}} x_i + \hat{W}_{hf}^{\{i\}} \hat{h}_{i-1} + \hat{b}_f^{\{i\}}\right) \odot \hat{c}_{i-1}^{\{i\}}
+ \sigma\!\left(\hat{W}_{xe}^{\{i\}} x_i + \hat{W}_{he}^{\{i\}} \hat{h}_{i-1} + \hat{b}_e^{\{i\}}\right) \odot \tanh\!\left(\hat{W}_{xc}^{\{i\}} x_i + \hat{W}_{hc}^{\{i\}} \hat{h}_{i-1} + \hat{b}_c^{\{i\}}\right) \\
&= \sigma\!\left(W_{xf}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{hf}^{(i)} \hat{h}_{i-1}^{\{i\}} + b_f^{(i)}\right) \odot c_0^{(i)}
+ \sigma\!\left(W_{xe}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{he}^{(i)} \hat{h}_{i-1}^{\{i\}} + b_e^{(i)}\right) \odot \tanh\!\left(W_{xc}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{hc}^{(i)} \hat{h}_{i-1}^{\{i\}} + b_c^{(i)}\right) \\
&= \sigma\!\left(W_{xf}^{(i)} h_1^{(i-1)} + W_{hf}^{(i)} h_0^{(i)} + b_f^{(i)}\right) \odot c_0^{(i)}
+ \sigma\!\left(W_{xe}^{(i)} h_1^{(i-1)} + W_{he}^{(i)} h_0^{(i)} + b_e^{(i)}\right) \odot \tanh\!\left(W_{xc}^{(i)} h_1^{(i-1)} + W_{hc}^{(i)} h_0^{(i)} + b_c^{(i)}\right) = c_1^{(i)},
\end{aligned}
$$

which concludes the basis of the induction.

Next, we assume that $\hat{h}_{t+i-1}^{\{i\}} = h_t^{(i)}$ and $\hat{c}_{t+i-1}^{\{i\}} = c_t^{(i)}$ for all $1 \le i \le k$ and $t \le T - 1$, and prove that the claim holds for the hidden and cell states for all layers when $t = T$: $\hat{h}_{T+i-1}^{\{i\}} = h_T^{(i)}$, $\forall\, 1 \le i \le k$. Without loss of generality, we have for the hidden state $\hat{h}_{T+i-1}^{\{i\}}$ in the constrained-weights single-layer d-LSTM that

$$
\begin{aligned}
\hat{h}_{T+i-1}^{\{i\}} &= \sigma\!\left(\hat{W}_{xo}^{\{i\}} x_{T+i-1} + \hat{W}_{ho}^{\{i\}} \hat{h}_{T+i-2}^{\{i\}} + \hat{b}_o^{\{i\}}\right) \odot \tanh\!\left(\hat{c}_{T+i-1}^{\{i\}}\right) \\
&= \sigma\!\left(W_{xo}^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_{ho}^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_o^{(i)}\right) \odot \tanh\!\Big(
\sigma\!\left(W_{xf}^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_{hf}^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_f^{(i)}\right) \odot \hat{c}_{T+i-2}^{\{i\}} \\
&\qquad + \sigma\!\left(W_{xe}^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_{he}^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_e^{(i)}\right) \odot \tanh\!\left(W_{xc}^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_{hc}^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_c^{(i)}\right)\Big) \\
&= \cdots
\end{aligned}
$$
σ  W (2) xo σ  W (1) xo x T + W (1) ho ˆ h { 1 } T − 1 + b (1) o  ⊙ tanh  σ  W (1) xf x T + W (1) hf ˆ h { 1 } T − 1 + b (1) f  ⊙ ˆ c { 1 } T − 1 + σ  W (1) xe x T + W (1) he ˆ h { 1 } T − 1 + b (1) e  ⊙ tanh  W (1) xc x T + W (1) hc ˆ h { 1 } T − 1 + b (1) c  + W (2) ho ˆ h { 2 } T + b (2) o  ⊙ tanh  σ  W (2) xf ( . . . ) + W (2) hf ˆ h { 2 } T + b (2) f  ⊙ ˆ c { 2 } T + σ  W (2) xe ( . . . ) + W (2) he ˆ h { 2 } T + b (2) e  ⊙ tanh  W (2) xc ( . . . ) + W (2) hc ˆ h { 2 } T + b (2) c  . . . ] + W ( j ) ho ˆ h { j } T + j − 2 + b ( j ) o  ⊙ tanh  σ  W ( j ) xf [ . . . ] + W ( j ) hf ˆ h { j } T + j − 2 + b ( j ) f  ⊙ ˆ c { j } T + j − 2 + σ  W ( j ) xe [ . . . ] + W ( j ) he ˆ h { j } T + j − 2 + b ( j ) e  ⊙ tanh  W ( j ) xc [ . . . ] + W ( j ) hc ˆ h { j } T + j − 2 + b ( j ) c o · · · + W ( i ) ho ˆ h { i } T + i − 2 + b ( i ) o ⊙ tanh  σ  W ( i ) xf ( . . . ) + W ( i ) hf ˆ h { i } T + i − 2 + b ( i ) f  ⊙ ˆ c { i } T + i − 2 + σ  W ( i ) xe ( . . . ) + W ( i ) he ˆ h { i } T + i − 2 + b ( i ) e  ⊙ tanh  W ( i ) xc ( . . . ) + W ( i ) hc ˆ h { i } T + i − 2 + b ( i ) c  From the indu ctiv e assumptio n we have that ˆ h { j } T + j − 2 = h ( j ) T − 1 and ˆ c { j } T + j − 2 = c ( j ) T − 1 for all 1 ≤ j ≤ k , then it Appr oximating Stacked and Bidirectional Recurr ent Architectures with the Del ayed Recurrent Neural Networ k follows that = σ  W ( i ) xo . . . n σ  W ( j ) xo [ . . . σ  W (2) xo σ  W (1) xo x T + W (1) ho h (1) T − 1 + b (1) o  ⊙ tanh  σ  W (1) xf x T + W (1) hf h (1) T − 1 + b (1) f  ⊙ c (1) T − 1 + σ  W (1) xe x T + W (1) he h (1) T − 1 + b (1) e  ⊙ tanh  W (1) xc x T + W (1) hc h (1) T − 1 + b (1) c  + W (2) ho h (2) T − 1 + b (2) o  ⊙ tanh  σ  W (2) xf ( . . . ) + W (2) hf h (2) T − 1 + b (2) f  ⊙ c (2) T − 1 + σ  W (2) xe ( . . . ) + W (2) he h (2) T − 1 + b (2) e  ⊙ tanh  W (2) xc ( . . . ) + W (2) hc h (2) T − 1 + b (2) c  . . . 
] + W ( j ) ho h ( j ) T − 1 + b ( j ) o  ⊙ tanh  σ  W ( j ) xf [ . . . ] + W ( j ) hf h ( j ) T − 1 + b ( j ) f  ⊙ c ( j ) T − 1 + σ  W ( j ) xe [ . . . ] + W ( j ) he h ( j ) T − 1 + b ( j ) e  ⊙ tanh  W ( j ) xc [ . . . ] + W ( j ) hc h ( j ) T − 1 + b ( j ) c o · · · + W ( i ) ho h ( i ) T − 1 + b ( i ) o ⊙ tanh  σ  W ( i ) xf ( . . . ) + W ( i ) hf h ( i ) T − 1 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe ( . . . ) + W ( i ) he h ( i ) T − 1 + b ( i ) e  ⊙ tanh  W ( i ) xc ( . . . ) + W ( i ) hc h ( i ) T − 1 + b ( i ) c  = σ  W ( i ) xo . . . n σ  W ( j ) xo [ . . . σ  W (2) xo h (1) T + W (2) ho h (2) T − 1 + b (2) o  ⊙ tanh  σ  W (2) xf h (1) T + W (2) hf h (2) T − 1 + b (2) f  ⊙ c (2) T − 1 + σ  W (2) xe h (1) T + W (2) he h (2) T − 1 + b (2) e  ⊙ tanh  W (2) xc h (1) T + W (2) hc h (2) T − 1 + b (2) c  . . . ] + W ( j ) ho h ( j ) T − 1 + b ( j ) o  ⊙ tanh  σ  W ( j ) xf [ . . . ] + W ( j ) hf h ( j ) T − 1 + b ( j ) f  ⊙ c ( j ) T − 1 + σ  W ( j ) xe [ . . . ] + W ( j ) he h ( j ) T − 1 + b ( j ) e  ⊙ tanh  W ( j ) xc [ . . . ] + W ( j ) hc h ( j ) T − 1 + b ( j ) c o · · · + W ( i ) ho h ( i ) T − 1 + b ( i ) o ⊙ tanh  σ  W ( i ) xf ( . . . ) + W ( i ) hf h ( i ) T − 1 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe ( . . . ) + W ( i ) he h ( i ) T − 1 + b ( i ) e  ⊙ tanh  W ( i ) xc ( . . . ) + W ( i ) hc h ( i ) T − 1 + b ( i ) c  = . . . = σ  W ( i ) xo . . . n σ  W ( j ) xo h ( j − 1) T + W ( j ) ho h ( j ) T − 1 + b ( j ) o  ⊙ tanh  σ  W ( j ) xf h ( j − 1) T + W ( j ) hf h ( j ) T − 1 + b ( j ) f  ⊙ c ( j ) T − 1 + σ  W ( j ) xe h ( j − 1) T + W ( j ) he h ( j ) T − 1 + b ( j ) e  ⊙ tanh  W ( j ) xc h ( j − 1) T + W ( j ) hc h ( j ) T − 1 + b ( j ) c o · · · + W ( i ) ho h ( i ) T − 1 + b ( i ) o ⊙ tanh  σ  W ( i ) xf ( . . . ) + W ( i ) hf h ( i ) T − 1 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe ( . . . ) + W ( i ) he h ( i ) T − 1 + b ( i ) e  ⊙ tanh  W ( i ) xc ( . . . 
) + W ( i ) hc h ( i ) T − 1 + b ( i ) c  = . . . = σ  W ( i ) xo h ( i − 1) T + W ( i ) ho h ( i ) T − 1 + b ( i ) o  ⊙ tanh  σ  W ( i ) xf h ( i − 1) T + W ( i ) hf h ( i ) T − 1 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe h ( i − 1) T + W ( i ) he h ( i ) T − 1 + b ( i ) e  ⊙ tanh  W ( i ) xc h ( i − 1) T + W ( i ) hc h ( i ) T − 1 + b ( i ) c  = h ( i ) T , where we u se the recurrent definition of th e hidden an d cell states in Equation s ( C.20 ) and ( C.2 1 ). In particular, we obtained for i = k that ˆ h { k } T + k − 1 = h ( k ) T . Ap plying the same steps as in th e d-RNN proo f in Eq. ( A.10 ), we ob tain ˆ y T + k − 1 = y T . Last, we obtain for the cell state that ˆ c { i } T + i − 1 = σ  ˆ W { i } xf x T + i − 1 + ˆ W { i } hf ˆ h T + i − 2 + ˆ b { i } f  ⊙ ˆ c { i } T + i − 2 + σ  ˆ W { i } xe x T + i − 1 + ˆ W { i } he ˆ h T + i − 2 + ˆ b { i } e  ⊙ tanh  ˆ W { i } xc x T + i − 1 + ˆ W { i } hc ˆ h T + i − 2 + ˆ b { i } c  = σ  W ( i ) xf ˆ h { i − 1 } T + i − 2 + W ( i ) hf ˆ h { i } T + i − 2 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe ˆ h { i − 1 } T + i − 2 + W ( i ) he ˆ h { i } T + i − 2 + b ( i ) e  ⊙ tanh  W ( i ) xc ˆ h { i − 1 } T + i − 2 + W ( i ) hc ˆ h { i } T + i − 2 + b ( i ) c  = σ  W ( i ) xf h ( i − 1) T + W ( i ) hf h ( i ) T − 1 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe h ( i − 1) T + W ( i ) he h ( i ) T − 1 + b ( i ) e  ⊙ tanh  W ( i ) xc h ( i − 1) T + W ( i ) hc h ( i ) T − 1 + b ( i ) c  = c ( i ) T Which completes the proof.  Appr oximating Stacked and Bidirectional Recurr ent Architectures with the Del ayed Recurrent Neural Networ k D. W ei g ht Constraints and Connections in d-RNN Figure 6 shows the weight con straints imposed to achieve equiv alence b etween the stacked RNN and single - layer d- RNN, and a v isualization of the d-RNN as co nnection s in the stacked RNN. 
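The constrained-weight construction can also be checked numerically. The following is a minimal NumPy sketch (not code from the paper; it assumes standard LSTM gate equations with input gate $e$, forget gate $f$, output gate $o$, and candidate $c$, and zero initial states): it builds a random $k$-layer stacked LSTM, assembles the corresponding block-constrained single-layer d-LSTM, and verifies the state equivalence $\hat{h}^{\{i\}}_{t+i-1} = h^{(i)}_t$ of Theorem 2.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, q, T = 3, 4, 5, 6          # layers, units per layer, input dim, sequence length
gates = "ecof"                   # e: input gate, c: candidate, o: output, f: forget
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Random stacked-LSTM weights: layer 0 reads the q-dim input, layers >0 read n-dim states.
Wx = {a: [0.5 * rng.normal(size=(n, q if i == 0 else n)) for i in range(k)] for a in gates}
Wh = {a: [0.5 * rng.normal(size=(n, n)) for i in range(k)] for a in gates}
b  = {a: [0.1 * rng.normal(size=n) for i in range(k)] for a in gates}

def lstm_cell(x, h, c, wx, wh, bb):
    e = sigmoid(wx["e"] @ x + wh["e"] @ h + bb["e"])  # input gate
    f = sigmoid(wx["f"] @ x + wh["f"] @ h + bb["f"])  # forget gate
    o = sigmoid(wx["o"] @ x + wh["o"] @ h + bb["o"])  # output gate
    g = np.tanh(wx["c"] @ x + wh["c"] @ h + bb["c"])  # candidate
    c_new = f * c + e * g
    return o * np.tanh(c_new), c_new

xs = rng.normal(size=(T, q))

# Run the stacked LSTM with zero initial states.
H = np.zeros((k, T + 1, n)); C = np.zeros((k, T + 1, n))
for t in range(1, T + 1):
    inp = xs[t - 1]
    for i in range(k):
        wx = {a: Wx[a][i] for a in gates}; wh = {a: Wh[a][i] for a in gates}
        bb = {a: b[a][i] for a in gates}
        H[i, t], C[i, t] = lstm_cell(inp, H[i, t - 1], C[i, t - 1], wx, wh, bb)
        inp = H[i, t]

# Assemble the block-constrained d-LSTM of size k*n (Equations (C.22)-(C.23)):
# only block 1 reads the input; block i receives block i-1 through the recurrent matrix.
Wx_hat = {a: np.zeros((k * n, q)) for a in gates}
Wh_hat = {a: np.zeros((k * n, k * n)) for a in gates}
b_hat  = {a: np.concatenate([b[a][i] for i in range(k)]) for a in gates}
for a in gates:
    Wx_hat[a][:n] = Wx[a][0]
    for i in range(k):
        s = slice(i * n, (i + 1) * n)
        Wh_hat[a][s, s] = Wh[a][i]                       # within-block recurrence
        if i > 0:
            Wh_hat[a][s, (i - 1) * n: i * n] = Wx[a][i]  # previous block feeds block i

h_hat = np.zeros(k * n); c_hat = np.zeros(k * n)
H_hat = [h_hat.copy()]
for t in range(1, T + k):
    x = xs[t - 1] if t <= T else np.zeros(q)             # pad past the sequence end
    h_hat, c_hat = lstm_cell(x, h_hat, c_hat, Wx_hat, Wh_hat, b_hat)
    # Hold block i at its initial (zero) state until time i-1; this realizes the
    # theorem's initialization condition h_hat{i}_{i-1} = h(i)_0, c_hat{i}_{i-1} = c(i)_0.
    for bi in range(k):
        if t <= bi:
            h_hat[bi * n:(bi + 1) * n] = 0.0
            c_hat[bi * n:(bi + 1) * n] = 0.0
    H_hat.append(h_hat.copy())

# Block i of the d-LSTM reproduces layer i of the stack, delayed by i-1 timesteps.
for i in range(1, k + 1):
    for t in range(1, T + 1):
        assert np.allclose(H_hat[t + i - 1][(i - 1) * n: i * n], H[i - 1, t])
```

In particular, block $k$ at time $t+k-1$ matches the top stacked layer at time $t$, which is the delayed-output statement $\hat{y}_{t+k-1} = y_t$.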
Figure 6(b) depicts the delay (or "shift") of all the hidden states as they would be computed in the stacked RNN. Each layer is equivalent to a shift by one timestep.

E. Additional Plots for Error Maps

Figure 7 presents the standard deviation diagrams for the error maps in Figure 5.

F. Masked Character-Level Language Modeling: Additional Results

In Table 3, we include additional results for smaller networks on the masked language modeling task. We sampled more delay values for d-LSTMs, but the general conclusions remain the same: intermediate values of delay achieve the lowest BPC. Forward-pass runtimes across delay values show a small increase with larger delays, but the increment is relatively flat compared to stacked LSTMs or (stacked) Bi-LSTMs as they increase in depth. For these experiments, we also used a batch of 128 sequences and an embedding of dimension 10.

G. Part-of-Speech Tagging: Additional Details and Results

In this section, we include more details about the dataset and the results of all the combinations for the Part-of-Speech experiment. We used treebanks from Universal Dependencies (UD) (Nivre et al., 2016) version 2.3. We selected the English EWT treebank² (Silveira et al., 2014) (254,854 words), the French GSD treebank³ (411,465 words), and the German GSD treebank⁴ (297,836 words) based on the quality assigned by the UD authors. We follow the partitioning into training, validation, and test datasets as predefined in UD. All treebanks use the same POS tag set containing 17 tags. We use the Polyglot project (Al-Rfou' et al., 2013) word embeddings (64 dimensions). We build our own alphabets based on the 100 most frequent characters in the vocabularies. All the networks have a 100-dimensional character-level embedding, which is trained with the network. We use a batch size of 32 sentences.
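In practice, training a d-LSTM for tasks like these requires no specialized cell: one can run a standard LSTM, pad the input with $d$ extra steps, and realign the targets so that the output at step $t+d$ is scored against target $t$. The helper below is a hedged sketch (the function name and array layout are our own illustration, assuming NumPy arrays with time as the leading axis), not code from the paper:

```python
import numpy as np

def delay_targets(inputs, targets, d, pad_value=0.0):
    """Align a length-T sequence task for a delayed RNN: pad the inputs with d
    trailing steps and shift the targets so the model's output at step t+d is
    scored against target t. Returns (padded inputs, shifted targets, loss mask)."""
    T = inputs.shape[0]
    pad = np.full((d,) + inputs.shape[1:], pad_value)
    padded_in = np.concatenate([inputs, pad])
    # The first d outputs are warm-up steps and receive no loss.
    loss_mask = np.concatenate([np.zeros(d, dtype=bool), np.ones(T, dtype=bool)])
    shifted_tg = np.concatenate(
        [np.zeros((d,) + targets.shape[1:], dtype=targets.dtype), targets])
    return padded_in, shifted_tg, loss_mask
```

With $d = k-1$ this reproduces the output alignment of Theorem 2, $\hat{y}_{t+d} = y_t$; varying $d$ independently of depth is what the delay sweeps in Tables 3–6 explore.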
² https://github.com/UniversalDependencies/UD_English-EWT/tree/r2.3
³ https://github.com/UniversalDependencies/UD_French-GSD/tree/r2.3
⁴ https://github.com/UniversalDependencies/UD_German-GSD/tree/r2.3

Results for German, English, and French can be found in Tables 4, 5, and 6, respectively. The best result that does not use a bidirectional network is marked in bold for each language.

Figure 6. (a) Weights of the single-layer, weight-constrained d-RNN that are equivalent to connections in the stacked RNN from Figure 2. (b) Connections in the d-RNN based on the weight matrix in (a). The d-RNN is depicted as it would be in the stacked RNN. The hidden states are delayed in time with respect to the stacked network.

Table 3. Performance for smaller networks on the masked character-level language modeling task. Mean and standard deviation values are computed over 5 repetitions of training and inference runtime on the test set.

| Model   | Layers | Delay | Units | Params.   | Val. BPC      | Test BPC      | Runtime         |
|---------|--------|-------|-------|-----------|---------------|---------------|-----------------|
| LSTM    | 1      | –     | 512   | 1,087,283 | 2.139 ± 0.005 | 2.195 ± 0.002 | 2.85 ms ± 0.14  |
| LSTM    | 2      | –     | 298   | 1,090,689 | 2.156 ± 0.003 | 2.215 ± 0.002 | 6.69 ms ± 0.27  |
| LSTM    | 5      | –     | 172   | 1,083,735 | 2.199 ± 0.016 | 2.255 ± 0.015 | 11.32 ms ± 0.05 |
| Bi-LSTM | 1      | –     | 360   | 1,091,107 | 1.130 ± 0.003 | 1.187 ± 0.004 | 5.82 ms ± 0.18  |
| Bi-LSTM | 2      | –     | 182   | 1,090,487 | 0.800 ± 0.004 | 0.846 ± 0.005 | 11.08 ms ± 0.59 |
| Bi-LSTM | 5      | –     | 102   | 1,104,151 | 0.796 ± 0.007 | 0.841 ± 0.006 | 23.94 ms ± 0.17 |
| d-LSTM  | 1      | 1     | 512   | 1,087,283 | 1.470 ± 0.002 | 1.518 ± 0.003 | 2.80 ms ± 0.02  |
| d-LSTM  | 1      | 2     | 512   | 1,087,283 | 1.162 ± 0.004 | 1.208 ± 0.003 | 2.81 ms ± 0.01  |
| d-LSTM  | 1      | 3     | 512   | 1,087,283 | 0.995 ± 0.002 | 1.039 ± 0.002 | 3.02 ms ± 0.23  |
| d-LSTM  | 1      | 5     | 512   | 1,087,283 | 0.877 ± 0.001 | 0.920 ± 0.003 | 3.01 ms ± 0.22  |
| d-LSTM  | 1      | 8     | 512   | 1,087,283 | 0.859 ± 0.002 | 0.905 ± 0.003 | 3.04 ms ± 0.19  |
| d-LSTM  | 1      | 10    | 512   | 1,087,283 | 0.889 ± 0.004 | 0.935 ± 0.005 | 3.22 ms ± 0.18  |
| d-LSTM  | 1      | 15    | 512   | 1,087,283 | 0.971 ± 0.004 | 1.014 ± 0.002 | 3.17 ms ± 0.05  |

[Figure 7: four pairs of error-map panels for (a) LSTM, (b) Bi-LSTM, (c) d-LSTM with delay=5, and (d) d-LSTM with delay=10, each plotting filter acausality a against scale γ.]

Figure 7. Error maps presented in Figure 4 (left column) together with their standard deviation figures.

Table 4. Part-of-Speech results for German. The table shows all possible combinations of delays or bidirectional LSTM networks. The best forward-only network is marked in bold.

| Character-level network | Word-level network | Validation accuracy | Test accuracy |
|-------------------------|--------------------|---------------------|---------------|
| Bi-LSTM                 | Bi-LSTM            | 93.88 ± 0.13        | 93.15 ± 0.08  |
| Bi-LSTM                 | LSTM               | 92.00 ± 0.16        | 91.50 ± 0.05  |
| Bi-LSTM                 | d-LSTM (delay=1)   | 93.32 ± 0.23        | 92.81 ± 0.14  |
| Bi-LSTM                 | d-LSTM (delay=2)   | 93.15 ± 0.06        | 92.67 ± 0.08  |
| Bi-LSTM                 | d-LSTM (delay=3)   | 92.82 ± 0.14        | 92.25 ± 0.16  |
| Bi-LSTM                 | d-LSTM (delay=4)   | 92.41 ± 0.12        | 91.95 ± 0.17  |
| Bi-LSTM                 | d-LSTM (delay=5)   | 91.86 ± 0.11        | 91.57 ± 0.20  |
| LSTM                    | Bi-LSTM            | 93.96 ± 0.12        | 93.43 ± 0.07  |
| LSTM                    | LSTM               | 92.05 ± 0.16        | 91.58 ± 0.11  |
| LSTM                    | d-LSTM (delay=1)   | 93.46 ± 0.16        | 92.71 ± 0.11  |
| LSTM                    | d-LSTM (delay=2)   | 93.13 ± 0.10        | 92.61 ± 0.26  |
| LSTM                    | d-LSTM (delay=3)   | 92.91 ± 0.13        | 92.38 ± 0.15  |
| LSTM                    | d-LSTM (delay=4)   | 92.56 ± 0.17        | 92.06 ± 0.19  |
| d-LSTM (delay=1)        | Bi-LSTM            | 93.93 ± 0.06        | 93.39 ± 0.18  |
| d-LSTM (delay=1)        | LSTM               | 92.04 ± 0.11        | 91.58 ± 0.14  |
| d-LSTM (delay=1)        | d-LSTM (delay=1)   | 93.48 ± 0.31        | 92.87 ± 0.24  |
| d-LSTM (delay=1)        | d-LSTM (delay=2)   | 93.11 ± 0.18        | 92.54 ± 0.08  |
| d-LSTM (delay=1)        | d-LSTM (delay=3)   | 92.85 ± 0.14        | 92.28 ± 0.19  |
| d-LSTM (delay=1)        | d-LSTM (delay=4)   | 92.50 ± 0.12        | 92.11 ± 0.19  |
| d-LSTM (delay=3)        | Bi-LSTM            | 94.00 ± 0.17        | 93.32 ± 0.18  |
| d-LSTM (delay=3)        | LSTM               | 92.10 ± 0.24        | 91.61 ± 0.18  |
| d-LSTM (delay=3)        | d-LSTM (delay=1)   | 93.29 ± 0.09        | 92.68 ± 0.09  |
| d-LSTM (delay=3)        | d-LSTM (delay=2)   | 93.09 ± 0.21        | 92.59 ± 0.16  |
| d-LSTM (delay=3)        | d-LSTM (delay=3)   | 92.86 ± 0.24        | 92.42 ± 0.16  |
| d-LSTM (delay=3)        | d-LSTM (delay=4)   | 92.53 ± 0.17        | 92.08 ± 0.18  |
| d-LSTM (delay=5)        | Bi-LSTM            | 93.88 ± 0.17        | 93.27 ± 0.06  |
| d-LSTM (delay=5)        | LSTM               | 91.88 ± 0.18        | 91.54 ± 0.11  |
| d-LSTM (delay=5)        | d-LSTM (delay=1)   | 93.31 ± 0.14        | 92.74 ± 0.10  |
| d-LSTM (delay=5)        | d-LSTM (delay=2)   | 93.17 ± 0.13        | 92.57 ± 0.17  |
| d-LSTM (delay=5)        | d-LSTM (delay=3)   | 92.84 ± 0.19        | 92.25 ± 0.10  |
| d-LSTM (delay=5)        | d-LSTM (delay=4)   | 92.50 ± 0.22        | 91.96 ± 0.19  |

Table 5. Part-of-Speech results for English. The table shows all possible combinations of delays or bidirectional LSTM networks. The best forward-only network is marked in bold.

| Character-level network | Word-level network | Validation accuracy | Test accuracy |
|-------------------------|--------------------|---------------------|---------------|
| Bi-LSTM                 | Bi-LSTM            | 94.85 ± 0.05        | 94.84 ± 0.08  |
| Bi-LSTM                 | LSTM               | 91.90 ± 0.12        | 92.05 ± 0.09  |
| Bi-LSTM                 | d-LSTM (delay=1)   | 94.47 ± 0.06        | 94.41 ± 0.05  |
| Bi-LSTM                 | d-LSTM (delay=2)   | 94.17 ± 0.13        | 94.14 ± 0.10  |
| Bi-LSTM                 | d-LSTM (delay=3)   | 93.70 ± 0.07        | 93.87 ± 0.07  |
| Bi-LSTM                 | d-LSTM (delay=4)   | 93.11 ± 0.14        | 93.26 ± 0.08  |
| Bi-LSTM                 | d-LSTM (delay=5)   | 92.54 ± 0.16        | 92.70 ± 0.10  |
| LSTM                    | Bi-LSTM            | 95.03 ± 0.14        | 94.99 ± 0.15  |
| LSTM                    | LSTM               | 92.05 ± 0.13        | 92.14 ± 0.10  |
| LSTM                    | d-LSTM (delay=1)   | 94.53 ± 0.08        | 94.58 ± 0.11  |
| LSTM                    | d-LSTM (delay=2)   | 94.29 ± 0.05        | 94.28 ± 0.05  |
| LSTM                    | d-LSTM (delay=3)   | 93.81 ± 0.11        | 93.85 ± 0.12  |
| LSTM                    | d-LSTM (delay=4)   | 93.39 ± 0.12        | 93.55 ± 0.10  |
| d-LSTM (delay=1)        | Bi-LSTM            | 94.94 ± 0.07        | 94.95 ± 0.06  |
| d-LSTM (delay=1)        | LSTM               | 91.96 ± 0.16        | 92.09 ± 0.10  |
| d-LSTM (delay=1)        | d-LSTM (delay=1)   | 94.57 ± 0.08        | 94.57 ± 0.14  |
| d-LSTM (delay=1)        | d-LSTM (delay=2)   | 94.29 ± 0.12        | 94.37 ± 0.08  |
| d-LSTM (delay=1)        | d-LSTM (delay=3)   | 93.86 ± 0.05        | 93.84 ± 0.10  |
| d-LSTM (delay=1)        | d-LSTM (delay=4)   | 93.35 ± 0.10        | 93.56 ± 0.13  |
| d-LSTM (delay=3)        | Bi-LSTM            | 94.98 ± 0.09        | 94.91 ± 0.10  |
| d-LSTM (delay=3)        | LSTM               | 91.96 ± 0.08        | 92.08 ± 0.10  |
| d-LSTM (delay=3)        | d-LSTM (delay=1)   | 94.47 ± 0.03        | 94.51 ± 0.10  |
| d-LSTM (delay=3)        | d-LSTM (delay=2)   | 94.21 ± 0.05        | 94.18 ± 0.03  |
| d-LSTM (delay=3)        | d-LSTM (delay=3)   | 93.80 ± 0.13        | 93.88 ± 0.13  |
| d-LSTM (delay=3)        | d-LSTM (delay=4)   | 93.23 ± 0.13        | 93.38 ± 0.11  |
| d-LSTM (delay=5)        | Bi-LSTM            | 94.90 ± 0.07        | 94.87 ± 0.09  |
| d-LSTM (delay=5)        | LSTM               | 91.84 ± 0.11        | 91.98 ± 0.20  |
| d-LSTM (delay=5)        | d-LSTM (delay=1)   | 94.36 ± 0.09        | 94.44 ± 0.08  |
| d-LSTM (delay=5)        | d-LSTM (delay=2)   | 94.05 ± 0.07        | 94.19 ± 0.05  |
| d-LSTM (delay=5)        | d-LSTM (delay=3)   | 93.61 ± 0.07        | 93.76 ± 0.05  |
| d-LSTM (delay=5)        | d-LSTM (delay=4)   | 93.14 ± 0.04        | 93.27 ± 0.12  |

Table 6. Part-of-Speech results for French. The table shows all possible combinations of delays or bidirectional LSTM networks. The best forward-only network is marked in bold.

| Character-level network | Word-level network | Validation accuracy | Test accuracy |
|-------------------------|--------------------|---------------------|---------------|
| Bi-LSTM                 | Bi-LSTM            | 97.63 ± 0.06        | 97.22 ± 0.11  |
| Bi-LSTM                 | LSTM               | 96.67 ± 0.05        | 96.15 ± 0.17  |
| Bi-LSTM                 | d-LSTM (delay=1)   | 97.48 ± 0.02        | 96.98 ± 0.05  |
| Bi-LSTM                 | d-LSTM (delay=2)   | 97.41 ± 0.02        | 96.91 ± 0.12  |
| Bi-LSTM                 | d-LSTM (delay=3)   | 97.31 ± 0.05        | 96.84 ± 0.09  |
| Bi-LSTM                 | d-LSTM (delay=4)   | 97.12 ± 0.05        | 96.61 ± 0.06  |
| Bi-LSTM                 | d-LSTM (delay=5)   | 96.88 ± 0.10        | 96.20 ± 0.14  |
| LSTM                    | Bi-LSTM            | 97.70 ± 0.07        | 97.19 ± 0.09  |
| LSTM                    | LSTM               | 96.67 ± 0.07        | 96.10 ± 0.11  |
| LSTM                    | d-LSTM (delay=1)   | 97.49 ± 0.07        | 97.03 ± 0.07  |
| LSTM                    | d-LSTM (delay=2)   | 97.49 ± 0.05        | 97.00 ± 0.06  |
| LSTM                    | d-LSTM (delay=3)   | 97.34 ± 0.04        | 96.89 ± 0.09  |
| LSTM                    | d-LSTM (delay=4)   | 97.16 ± 0.06        | 96.66 ± 0.15  |
| d-LSTM (delay=1)        | Bi-LSTM            | 97.67 ± 0.07        | 97.23 ± 0.12  |
| d-LSTM (delay=1)        | LSTM               | 96.66 ± 0.06        | 95.97 ± 0.07  |
| d-LSTM (delay=1)        | d-LSTM (delay=1)   | 97.49 ± 0.04        | 97.04 ± 0.13  |
| d-LSTM (delay=1)        | d-LSTM (delay=2)   | 97.43 ± 0.05        | 96.98 ± 0.05  |
| d-LSTM (delay=1)        | d-LSTM (delay=3)   | 97.36 ± 0.08        | 96.80 ± 0.10  |
| d-LSTM (delay=1)        | d-LSTM (delay=4)   | 97.22 ± 0.06        | 96.57 ± 0.10  |
| d-LSTM (delay=3)        | Bi-LSTM            | 97.67 ± 0.08        | 97.21 ± 0.08  |
| d-LSTM (delay=3)        | LSTM               | 96.67 ± 0.07        | 95.98 ± 0.14  |
| d-LSTM (delay=3)        | d-LSTM (delay=1)   | 97.52 ± 0.04        | 97.02 ± 0.09  |
| d-LSTM (delay=3)        | d-LSTM (delay=2)   | 97.44 ± 0.02        | 96.97 ± 0.12  |
| d-LSTM (delay=3)        | d-LSTM (delay=3)   | 97.28 ± 0.04        | 96.74 ± 0.07  |
| d-LSTM (delay=3)        | d-LSTM (delay=4)   | 97.13 ± 0.05        | 96.57 ± 0.09  |
| d-LSTM (delay=5)        | Bi-LSTM            | 97.61 ± 0.03        | 97.12 ± 0.06  |
| d-LSTM (delay=5)        | LSTM               | 96.64 ± 0.06        | 96.08 ± 0.08  |
| d-LSTM (delay=5)        | d-LSTM (delay=1)   | 97.46 ± 0.02        | 96.96 ± 0.13  |
| d-LSTM (delay=5)        | d-LSTM (delay=2)   | 97.41 ± 0.06        | 96.87 ± 0.06  |
| d-LSTM (delay=5)        | d-LSTM (delay=3)   | 97.36 ± 0.05        | 96.82 ± 0.07  |
| d-LSTM (delay=5)        | d-LSTM (delay=4)   | 97.15 ± 0.05        | 96.51 ± 0.07  |
