Approximating Stacked and Bidirectional Recurrent Architectures with the Delayed Recurrent Neural Network



Javier S. Turek (1), Shailee Jain (2), Vy A. Vo (1), Mihai Capotă (1), Alexander G. Huth (2,3), Theodore L. Willke (1)

(1) Intel Labs, Hillsboro, Oregon, USA. (2) Department of Computer Science, The University of Texas at Austin, Austin, Texas, USA. (3) Department of Neuroscience, The University of Texas at Austin, Austin, Texas, USA. Correspondence to: Javier S. Turek.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

Abstract

Recent work has shown that topological enhancements to recurrent neural networks (RNNs) can increase their expressiveness and representational capacity. Two popular enhancements are stacked RNNs, which increase the capacity for learning non-linear functions, and bidirectional processing, which exploits acausal information in a sequence. In this work, we explore the delayed-RNN, a single-layer RNN that has a delay between the input and output. We prove that a weight-constrained version of the delayed-RNN is equivalent to a stacked RNN. We also show that the delay gives rise to partial acausality, much like bidirectional networks. Synthetic experiments confirm that the delayed-RNN can mimic bidirectional networks, solving some acausal tasks similarly and outperforming them in others. Moreover, we show similar performance to bidirectional networks in a real-world natural language processing task. These results suggest that delayed-RNNs can approximate topologies including stacked RNNs, bidirectional RNNs, and stacked bidirectional RNNs, but with equivalent or faster runtimes for the delayed-RNNs.

1. Introduction

Recurrent neural networks (RNNs) have successfully been used for sequential tasks like language modeling (Sutskever et al., 2011), machine translation (Sutskever et al., 2014), and speech recognition (Amodei et al., 2016). They approximate complex, non-linear temporal relationships by maintaining and updating an internal state for every input element. However, they face several challenges while modeling long-term dependencies, motivating work on variant architectures. Firstly, due to the long credit assignment paths in RNNs, the gradients might vanish or explode (Bengio et al., 1994). This has led to gated variants like the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) that can retain information over long timescales. Secondly, it is well known that deeper networks can more efficiently approximate a broader range of functions (Bengio et al., 2007; Bianchini & Scarselli, 2014). While RNNs are deep in time, they are limited in the number of non-linearities applied to recent inputs.

To increase depth, there has been extensive work on stacking RNNs into multiple layers (Schmidhuber, 1992; Bengio, 2009). In vanilla stacked RNNs, each layer applies a non-linearity and passes information to the next layer, while also maintaining a recurrent connection to itself. To effectively propagate gradients across the hierarchy, skip or shortcut connections can be used (Raiko et al., 2012; Graves, 2013). Alternatives like recurrent highway networks (Zilly et al., 2017) introduce non-linearities between timesteps through "micro-ticks" (Graves, 2016). Pascanu et al. (2014) increase depth by adding feedforward layers between state-to-state transitions. Gated feedback networks (Chung et al., 2015) allow for layer-to-layer interactions between adjacent timesteps. All these variants thus introduce topological modifications to retain information over longer timescales and model hierarchical temporal dependencies.
Another development is the bidirectional RNN (Bi-RNN) (Schuster & Paliwal, 1997; Graves & Schmidhuber, 2005). While RNNs are inherently causal, Bi-RNNs model acausal interactions by processing sequences in both forward and backward directions. They achieve state-of-the-art performance on part-of-speech tagging (Plank et al., 2016) and sentiment analysis (Baziotis et al., 2017), demonstrating that some natural language processing (NLP) tasks benefit greatly from combining past and future inputs.

The successes of these RNN architectural variants seem to derive from two common properties: depth and acausality. In this paper we investigate the delayed-recurrent neural network (d-RNN), an extremely simple variant that adds both depth and acausality to the RNN. The d-RNN is a single-layer RNN that imposes depth in time by delaying the output of the model. We analyze the d-RNN and prove that when it is constrained with sparse weights, the model is equivalent to a stacked RNN. Further, noting that the delay introduces acausal processing, we use a d-RNN to approximate bidirectional recurrent networks. We show empirically that a d-RNN has the capability to solve some tasks similarly to stacked and bidirectional RNNs, and outperform them in others. Additionally, we show that even if the d-RNN approximation carries some error, this model can provide much faster runtimes than alternatives.

2. Background

Given a sequential input \{x_t\}_{t=1...T}, x_t \in \mathbb{R}^q, a single-layer RNN is defined by:

\hat{h}_t = f(\hat{W}_x x_t + \hat{W}_h \hat{h}_{t-1} + \hat{b}_h),   (1)
\hat{y}_t = g(\hat{W}_o \hat{h}_t + \hat{b}_o),   (2)

where f(·) and g(·) are element-wise activation functions such as tanh and softmax, \hat{h}_t \in \mathbb{R}^n is the hidden state at timestep t with n units, and \hat{y}_t \in \mathbb{R}^m is the network output. Learned parameters include the input weights \hat{W}_x, recurrent weights \hat{W}_h, bias term \hat{b}_h, output weights \hat{W}_o, and bias term \hat{b}_o. The initial hidden state is denoted \hat{h}_0.

Stacked recurrent units are typically used to provide depth in RNNs (Schmidhuber, 1992; Bengio, 2009). Based on Eq. (1) and (2), a stacked RNN with k layers is given by:

h^{(1)}_t = f(W^{(1)}_x x_t + W^{(1)}_h h^{(1)}_{t-1} + b^{(1)}_h),   i = 1   (3)
h^{(i)}_t = f(W^{(i)}_x h^{(i-1)}_t + W^{(i)}_h h^{(i)}_{t-1} + b^{(i)}_h),   i = 2...k   (4)
y_t = g(W_o h^{(k)}_t + b_o),   (5)

where the activation functions and parameterization follow the single-layer RNN. Separate weights and bias terms for each layer i are given by W^{(i)}_x, W^{(i)}_h, and b^{(i)}_h. The hidden state for this layer at timestep t is h^{(i)}_t. The stacked RNN has initial hidden state vectors h^{(1)}_0 ... h^{(k)}_0 corresponding to the k layers. The hat operator is used for vectors and matrices in the single-layer RNN, while those without are for the stacked RNN.

3. Delayed-Recurrent Neural Network

One way to increase depth in RNNs is to stack recurrent layers, as suggested above. An alternative is to consider time as a means to increase depth within a single-layer RNN. However, single-layer RNNs are limited in the number of non-linearities applied to recent inputs: there is a single non-linearity between the most recent input x_t and its respective output \hat{y}_t. Previous efforts (Pascanu et al., 2014; Graves, 2016; Zilly et al., 2017) overcame this limitation by incorporating intermediate non-linearities between input elements in different ways. These solutions add computational steps between elements in the sequence, greatly increasing runtime complexity. In this work, we explore the delayed-recurrent neural network (d-RNN), in which effective depth is increased by introducing a "delay" between the input and output.

Formally, we define a d-RNN to be a single-layer recurrent neural network as in Equations (1) and (2), such that for any input x_t the respective output is obtained in \hat{y}_{t+d}, i.e., d timesteps later (Figure 1). We refer to d as the "delay" of the network. The initial hidden state, \hat{h}_0, for a d-RNN is initialized in the same manner as an RNN.

Figure 1. A delayed-recurrent neural network (d-RNN) processing a sequence of T elements. The output is delayed by d = 2 timesteps. The first output element is in \hat{y}_3 and the last in \hat{y}_{T+d}. The input sequence has d additional elements, such as '[NULL]' symbols. During training, the outputs are compared with the T elements of the labeled sequence \{z_j\}_j.

Delaying the output requires special considerations on the data that differ slightly from an RNN. Input sequences need to have T + d elements instead of T. Depending on the task being solved, this can be achieved by adding a "null" input element (e.g., the zero vector), or by including d additional elements in the input sequence. When doing a forward pass over the d-RNN for inference, outputs from t = 1 to d are discarded, as we expect the output for x_1 to be at \hat{y}_{1+d}. The output sequence goes from \hat{y}_{1+d} to \hat{y}_{T+d}, and has T elements.
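As a concrete illustration of the definitions above, the following NumPy sketch runs a single-layer RNN (Eq. (1)-(2), here with tanh for f and the identity for g) over an input padded with d zero '[NULL]' vectors, and discards the first d outputs. All names and dimensions are illustrative, not taken from the authors' code:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, bh):
    # Eq. (1): h_t = f(Wx x_t + Wh h_{t-1} + bh), with f = tanh
    return np.tanh(Wx @ x + Wh @ h + bh)

def drnn_forward(xs, d, Wx, Wh, bh, Wo, bo, h0):
    """Run a single-layer d-RNN over T inputs.

    The input is padded with d zero ('[NULL]') vectors, the network is
    unrolled for T + d steps, and the first d outputs are discarded so
    that the output for x_t is read at step t + d.
    """
    T, q = xs.shape
    padded = np.vstack([xs, np.zeros((d, q))])   # T + d input elements
    h, ys = h0, []
    for t in range(T + d):
        h = rnn_step(padded[t], h, Wx, Wh, bh)
        ys.append(Wo @ h + bo)                   # Eq. (2), g = identity
    return np.array(ys[d:])                      # outputs y_{1+d} .. y_{T+d}

# Toy dimensions: q = 3 inputs, n = 5 hidden units, m = 2 outputs, delay d = 2.
rng = np.random.default_rng(0)
q, n, m, T, d = 3, 5, 2, 7, 2
Wx, Wh, bh = rng.normal(size=(n, q)), rng.normal(size=(n, n)), np.zeros(n)
Wo, bo, h0 = rng.normal(size=(m, n)), np.zeros(m), np.zeros(n)
xs = rng.normal(size=(T, q))
ys = drnn_forward(xs, d, Wx, Wh, bh, Wo, bo, h0)
assert ys.shape == (T, m)   # one output per labeled element z_1 .. z_T
```

With d = 0 the function reduces to an ordinary single-layer RNN forward pass.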
Training loss is computed by comparing z_t, the expected output for input x_t, with \hat{y}_{t+d}. Thus, gradients are backpropagated only from the delayed outputs \hat{y}_{1+d}, ..., \hat{y}_{T+d}. In this way, any modified recurrent cell, such as an LSTM or GRU, can be trained with delayed output to obtain a delayed version of the architecture, e.g., d-LSTM or d-GRU.

3.1. Complexity

Consider an RNN with n units, where input elements have dimension q and output elements have dimension m. Computing one timestep of this RNN requires three matrix-vector multiplications with complexity O(nq + nm + n^2). Applying the non-linear functions f(·) and g(·) requires O(m + n). Hence, each step of this RNN has runtime complexity O(nq + nm + n^2). For a sequence of length T, the overall computational effort is O(T(nq + nm + n^2)). For a d-RNN, the number of timesteps is increased by the delay d, giving a total runtime complexity of O((T + d)(nq + nm + n^2)).

While the d-RNN incurs some cost, it is cheaper than alternative methods such as micro-steps (Graves, 2016; Zilly et al., 2017), where additional timesteps are inserted between each pair of elements in both the input and output sequences. The runtime complexity of each micro-step is similar to an RNN step, leading the micro-step model complexity to grow with the number of micro-steps d proportionally to O(dT). In contrast, the d-RNN model complexity only grows proportionally to O(d + T).

3.2. Stacked RNNs are d-RNNs

The mathematical structure of a stacked RNN is similar to a single-layer RNN with the addition of between-layer connections that add depth.
Here we show that any stacked RNN can be flattened into a single-layer d-RNN that produces the exact sequence of hidden states and outputs. We exchange the depth from the between-layer connections with temporal depth applied through a delay in the output. To illustrate this, we rewrite the parameters of a single-layer RNN using the weights and bias terms of a k-layer stacked RNN from Equations (3)-(5):

\hat{W}_h =
\begin{bmatrix}
W^{(1)}_h & 0 & \cdots & \cdots & 0 \\
W^{(2)}_x & W^{(2)}_h & 0 & & \vdots \\
0 & \ddots & \ddots & \ddots & \vdots \\
\vdots & & W^{(i)}_x & W^{(i)}_h & 0 \\
0 & \cdots & 0 & W^{(k)}_x & W^{(k)}_h
\end{bmatrix},   (6)

\hat{b}_h = \begin{bmatrix} b^{(1)}_h \\ \vdots \\ b^{(k)}_h \end{bmatrix}, \quad
\hat{W}_x = \begin{bmatrix} W^{(1)}_x \\ 0 \\ \vdots \\ 0 \end{bmatrix},   (7)

\hat{W}_o = \begin{bmatrix} 0 & \cdots & 0 & W_o \end{bmatrix}, \quad
\hat{b}_o = b_o,   (8)

where \hat{W}_x \in \mathbb{R}^{kn \times q} are the input weights, \hat{W}_h \in \mathbb{R}^{kn \times kn} the recurrent weights, \hat{b}_h \in \mathbb{R}^{kn} the biases, \hat{W}_o \in \mathbb{R}^{m \times kn} the output weights, and \hat{b}_o \in \mathbb{R}^m the output biases.

One can see from Eq. (6)-(8) that each layer in the stacked RNN is converted into a group of units in the single-layer RNN. The block bidiagonal structure of the recurrent weight matrix \hat{W}_h makes the hidden state act as a buffer, where each group of units only receives input from itself and the previous group. Information processed through this buffering mechanism eventually arrives at the output after k - 1 timesteps. In fact, the obtained model is a d-RNN with delay d = k - 1 and sparsely constrained weights. Note that the d-RNN performs the same computations as the stacked version by trading depth in layers for depth in time.

Next, we define the following notation: for a vector v \in \mathbb{R}^{kn} with k blocks, the subvector v^{\{i\}} \in \mathbb{R}^n refers to its i-th block following the partition from Equations (6)-(8). We now prove that a d-RNN parametrized by Eq. (6)-(8) is exactly equivalent to the stacked RNN in Eqs. (3)-(5).
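Before the formal statement, the flattening in Eq. (6)-(8) can be checked numerically. The sketch below builds the block-bidiagonal \hat{W}_h from a random stacked RNN and verifies that the flattened network reproduces the stacked outputs with delay k - 1. This is a toy NumPy illustration with tanh activations; zero biases and zero initial states are assumed so that the initialization condition of Theorem 1 holds trivially (the general case uses Lemma 1):

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, q, m, T = 3, 4, 2, 2, 6          # layers, units/layer, in/out dims, length
f = np.tanh

# Random stacked-RNN weights; biases and initial states are zero.
Wx = [rng.normal(size=(n, q if i == 0 else n)) for i in range(k)]
Wh = [rng.normal(size=(n, n)) for i in range(k)]
Wo, xs = rng.normal(size=(m, n)), rng.normal(size=(T, q))

# --- stacked RNN, Eqs. (3)-(5) ---
h = [np.zeros(n) for _ in range(k)]
ys = []
for t in range(T):
    inp = xs[t]
    for i in range(k):
        h[i] = f(Wx[i] @ inp + Wh[i] @ h[i])
        inp = h[i]
    ys.append(Wo @ h[-1])

# --- flattened d-RNN, Eqs. (6)-(8): block-bidiagonal recurrent matrix ---
Wh_hat = np.zeros((k * n, k * n))
for i in range(k):
    Wh_hat[i*n:(i+1)*n, i*n:(i+1)*n] = Wh[i]       # diagonal blocks: W_h^(i)
    if i > 0:
        Wh_hat[i*n:(i+1)*n, (i-1)*n:i*n] = Wx[i]   # subdiagonal blocks: W_x^(i)
Wx_hat = np.vstack([Wx[0]] + [np.zeros((n, q))] * (k - 1))
Wo_hat = np.hstack([np.zeros((m, (k - 1) * n)), Wo])

d = k - 1
h_hat, ys_hat = np.zeros(k * n), []
for t in range(T + d):
    x = xs[t] if t < T else np.zeros(q)            # null padding at the end
    h_hat = f(Wx_hat @ x + Wh_hat @ h_hat)
    ys_hat.append(Wo_hat @ h_hat)

# The d-RNN output equals the stacked output, delayed by k - 1 steps.
assert np.allclose(ys, ys_hat[d:])
```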
The proof can be extended to more complex recurrent cells; we include a proof for LSTMs in the supplementary material.

Theorem 1. Given an input sequence \{x_t\}_{t=1...T} and a stacked RNN with k layers defined by Equations (3)-(5) with activation functions f(·) and g(·), and initial states \{h^{(i)}_0\}_{i=1...k}, the d-RNN with delay d = k - 1, defined by Equations (6)-(8) and initialized with \hat{h}_0 such that \hat{h}^{\{i\}}_{i-1} = h^{(i)}_0, \forall i = 1...k, produces the same output sequence but delayed by k - 1 timesteps, i.e., \hat{y}_{t+k-1} = y_t for all t = 1...T. Further, the sequences of hidden states at each layer i are equivalent with delay i - 1, i.e., \hat{h}^{\{i\}}_{t+i-1} = h^{(i)}_t for all 1 ≤ i ≤ k and t ≥ 1.

Proof. See Section 1 of the supplementary material.

Theorem 1 makes the assumption that \hat{h}_0 in the d-RNN can be initialized such that it achieves \hat{h}^{\{i\}}_{i-1} = h^{(i)}_0 for all blocks. Lemma 1 below implies that the initialization for the d-RNN with constrained weights can always be computed from the stacked RNN. The intuition behind it is that we can compute recursively from \hat{h}^{\{i\}}_{i-1} = h^{(i)}_0 back to \hat{h}^{\{i\}}_0 for block i, while inverting the activation function. All commonly used activation functions are surjective, thus it is enough to know the right-inverse of the activation function f(·) (see the proof of the Lemma). For example, when f(·) is the ReLU, the right-inverse is the identity function r(d) = d.

Lemma 1. Let f : \mathbb{R} \to D be a surjective activation function that maps elements in \mathbb{R} to elements in an interval D. Also, let h^{(i)}_0 \in D^n for i = 1...k be the hidden state initialization for a stacked RNN with k layers as defined in (3)-(4).
Then, there exists an initial hidden state vector \hat{h}_0 \in \mathbb{R}^{kn} for a single-layer network in Equations (6)-(7) such that \hat{h}^{\{i\}}_{i-1} = h^{(i)}_0 for all i = 1...k.

Proof. See Section 2 of the supplementary material.

From this theorem we see that k-layer stacked RNNs can be perfectly expressed as a single-layer d-RNN. In this case, the d-RNN has a specific sparsity structure in its weight matrices that is not present in the generic RNN or d-RNN. As the stacked RNN and the d-RNN with sparsely constrained weights are equivalent, there is no difference in favor of which one is used in practice, and their runtime complexities are the same: we can always obtain a version with reduced computational effort for one model by executing the other and translating the result. Moreover, they are interchangeable using the weight matrix definitions in Equations (6)-(8).

3.2.1. Relation to Other Topologies

Suppose one takes a weight-constrained d-RNN and adds non-zero elements to regions not populated by weights in Eq. (6). These non-zero weights do not correspond to existing connections in the stacked RNN. So what do they correspond to?

To explore this question we illustrate a 4-layer stacked RNN in Figure 2(a). Here, solid arrows show the standard stacked RNN connections. The d-RNN weight matrices \hat{W}_h, \hat{W}_x, and \hat{W}_o are shown in Figure 2(b), where the color of each block matches the corresponding arrow in Figure 2(a). Blocks on the main diagonal of \hat{W}_h connect groups of units to themselves recurrently, while blocks on the subdiagonal correspond to connections between layers in the stacked RNN. More generally, block (i, j) in \hat{W}_h corresponds to a connection from h^{(j)}_t to h^{(i)}_{t+j-i+1} in the stacked RNN. Thus, blocks in the lower triangle (i.e., i > j + 1) correspond to connections that point backwards in time, and from a lower layer to a higher layer. For example, the orange block (3, 1) in Figure 2(b) (and the dashed orange lines in Figure 2(a)) connects layer 1 at time t to layer 3 at time t - 1. Conversely, blocks in the upper triangle (i.e., j > i) point forward in time and from a higher layer to a lower layer. For example, the red block (3, 4) in Figure 2(b) (and the dashed red lines in Figure 2(a)) connects layer 4 at time t to layer 3 at time t + 2.

Thus we see that adding weights to empty regions in the weight-constrained d-RNN can mimic the behavior of many stacked recurrent architectures that have previously been proposed. Among others, it can approximate the IndRNN (Li et al., 2018), td-RNN (Zhang et al., 2016), skip-connections (Graves, 2013), and all-to-all layer networks (Chung et al., 2015). Simply removing the constraints on \hat{W}_h during training will enable a d-RNN to learn the necessary stacked architecture. However, unlike an ordinary RNN, this requires the output to be delayed based on the desired stacking depth. Further, while the single-layer network has the same total number of units as the corresponding stacked RNN, relaxing the constraints on \hat{W}_h means that the single-layer network has many more parameters.

3.3. Approximating Bidirectional RNNs

We previously showed how a d-RNN can be made equivalent to a stacked RNN by constraining its weight matrices. Without these constraints, the d-RNN has the ability to peek at "future" inputs: it computes the delayed output for time t at \hat{y}_{t+d} using also the inputs x_{t+1}, ..., x_{t+d} that are beyond timestep t. A similar idea was used in the past as a baseline for bidirectional recurrent neural networks (Bi-RNNs) (Schuster & Paliwal, 1997; Graves & Schmidhuber, 2005).
These papers showed that Bi-RNNs were superior to d-RNNs for relatively simple problems, but it is not clear that this comparison holds true for problems that require more non-linear solutions. If a recurrent network can compute the output for time t by exploiting future input elements, what conditions are necessary to approximate its Bi-RNN counterpart? Moreover, can the d-RNN obtain the same results? And, given these conditions, is there a benefit to using the d-RNN instead of the Bi-RNN?

Figure 2. A stacked RNN is equivalent to a single-layer d-RNN under the given sparse weight constraints. The d-RNN produces the same representations as the stacked network. (a) Stacked RNN with k = 4 layers where connections show the different weight parameters. (b) Weights of the d-RNN that are equivalent to connections in the stacked RNN.

Figure 3. Number of non-linearities that can be applied to past and future sequence elements with respect to the current input (∆t = 0). The d-RNN only sees d steps into the future.

Figure 3 shows the number of non-linear transformations that each network can apply to any input element before computing the output at timestep t_0. The generic RNN processes only past inputs (t ≤ t_0), and the number of non-linearities decreases for inputs closer to timestep t_0. The Bi-RNN has identical behavior for causal inputs but is augmented symmetrically for acausal inputs. In contrast, the d-RNN has similar behavior for the causal inputs but with a higher number of non-linearities. This trend continues for the first d acausal inputs, with a decreasing number of non-linearities until the number reaches zero at t = t_0 + d + 1. In order for a d-RNN to have at least as many non-linearities as a Bi-RNN for every element in a sequence, it would need a delay that is twice the sequence length. However, a d-RNN could beat a Bi-RNN when the non-linear influence of nearby acausal inputs on the learned function is larger than that of elements farther in the future. In these cases, stacking Bi-RNNs would be needed to achieve the same objective.

Using a d-RNN to approximate a Bi-RNN can also decrease computational cost. For a sequence of length T, a stacked Bi-RNN needs to compute both forward and backward RNNs for each layer before it can compute the next one. This synchronization requirement hinders parallelization and increases runtime. In contrast, the forward pass for the d-RNN takes T + d steps, but does not suffer from synchronization. Thus, on highly parallel hardware such as CPUs and GPUs, the runtime of a k-layer stacked Bi-RNN should be at least k times slower than an RNN or d-RNN. Beyond computational costs, d-RNNs can also be used where it is critical to output values in (near) real-time applications (Guo et al., 2016; Arik et al., 2017). A d-RNN requires only the last d elements and a hidden state to compute a new value, whereas bidirectional architectures need to process an entire backward pass of the sequence.

4. Experiments

We test the capabilities of the d-RNN in four experiments designed to shed more light on the relationships between d-RNNs, RNNs, Bi-RNNs, and stacked networks. For this purpose, the RNN implementation we use is an LSTM network, which avoids vanishing gradients and retains more information over long periods. The delayed LSTM networks are denoted as d-LSTMs.
To train each d-LSTM, the input sequences are padded at the end with zero vectors, and the loss is computed by ignoring the first "delay" timesteps, as explained in Section 3. All models are trained using the Adam optimization algorithm (Kingma & Ba, 2015) with learning rate 0.001, β1 = 0.9, and β2 = 0.999. During training, the gradients are clipped (Pascanu et al., 2013) at 1.0 to avoid explosions. Experiments were implemented using PyTorch 1.1.0 (Paszke et al., 2017), and code can be found at http://www.anonymous.com/anonymous.

4.1. Sequence Reversal

First, we propose a simple test to illustrate how the d-LSTM can interpolate between a regular LSTM and a Bi-LSTM. In this test we require the recurrent architectures to output a sequence in reverse order while reading it, i.e., y_t = x_{T-t+1} for t = 1, ..., T. Solving this task perfectly is only possible when a network has acausal access to the sequence. Moreover, depending on how many acausal elements a network can access, it is possible to analytically calculate the expected maximum performance that the network can achieve. Given a sequence of length T with elements from a vocabulary \{1, ..., V\}, a causal network such as the regular LSTM can output the second half of the elements correctly and guess those in the first half with probability 1/V. When a network has access to d acausal elements, it can start outputting correct elements before reaching the halfway point, and can achieve an expected true positive rate (TPR) of

\frac{1}{2}\left(1 + \frac{1}{V}\right) + \left\lfloor \frac{d+1}{2} \right\rfloor \frac{1}{T}\left(1 - \frac{1}{V}\right).

We generate data sequences of length T = 20 by uniformly sampling integer values between 1 and V = 4. The training set consists of 10,000 sequences, the validation set of 2,000, and the test set of 2,000.
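The expected-TPR bound above can be evaluated directly. A small sketch (the floor term reflects that every two extra steps of delay make one more position recoverable; at d = 0 it reduces to the causal baseline, and at d = T - 1 the whole sequence is visible before any output is produced):

```python
from math import floor

def expected_tpr(T, V, d):
    """Expected true-positive rate for sequence reversal with delay d.

    A causal network (d = 0) outputs the second half correctly and
    guesses the first half with probability 1/V; each two extra steps
    of delay make one more position recoverable.
    """
    return 0.5 * (1 + 1 / V) + floor((d + 1) / 2) * (1 / T) * (1 - 1 / V)

# T = 20, V = 4 as in the experiment:
assert abs(expected_tpr(20, 4, 0) - 0.625) < 1e-12   # causal LSTM baseline
assert abs(expected_tpr(20, 4, 19) - 1.0) < 1e-12    # full lookahead
```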
Output sequences are the input sequences reversed. Values in the input sequences are fed as one-hot vector representations. All networks output via a linear layer with a softmax function that converts to a vector of V probabilities, to which the cross-entropy loss is applied. The LSTM and d-LSTM networks have 100 hidden units, while the Bi-LSTM has 70 in each direction in order to keep the total number of parameters constant. We use batches of 100 sequences and train for 1,000 epochs with early stopping after 10 epochs and ∆ = 1e-3.

Figure 4 shows accuracy on this task as a function of the applied delay. The LSTM does not use acausal information and is unable to reverse more than half of the input sequence. Conversely, the Bi-LSTM has full access to every element in the sequence, and can perfectly solve the task. For the d-LSTM network, performance increases as we increase the delay in the output, reaching the same level as the Bi-LSTM once the network has access to the entire sequence before being required to produce any output (delay 19). This experiment demonstrates that the d-LSTM can "interpolate" between the LSTM and Bi-LSTM by choosing a delay that ranges between zero and the length of the input sequence.

4.2. Evaluating Network Capabilities

The first experiment showed how a d-LSTM with sufficient delay can mimic a Bi-LSTM. In the next experiment we aim to compare how well d-LSTM, LSTM, and Bi-LSTM networks approximate functions with varying degrees of non-linearity and acausality. Drawing inspiration from Schuster & Paliwal (1997), we require each recurrent network to learn the function

y_t = \sin\left(\gamma \sum_{j=-c+1}^{a} w_{j+c}\, x_{t+j}\right),

where w is a linear filter.
The parameter γ scales the argument of the sine function and thus controls the degree of non-linearity of the function: for small γ the function is roughly linear, while for large γ the function is highly non-linear. Integers a ≥ 0 (acausal) and c ≥ 0 (causal) control the lengths of the acausal and causal portions of the linear filter w that is applied to the input x.

Figure 4. Comparison of different delay values for a d-LSTM network for reversing a sequence. LSTM and Bi-LSTM networks are shown for reference. The network is capable of achieving the expected statistical bound. The d-LSTM with the highest delay is capable of solving the task as well as the Bi-LSTM.

We generate datasets with different combinations of γ ∈ [0.1, ..., 5.0] and a ∈ [0, ..., 10], choosing c such that a + c = 20. For each combination, we generate a filter w with 20 elements drawn uniformly from [0.0, 1.0), and random input sequences with T = 50 elements drawn from a uniform distribution on [0.0, 1.0). In total, there are 10,000 generated sequences for training, 2,000 for validation, and 2,000 for testing with each set of parameter values. The output is computed following the previous formula, with zero padding at the borders. We generate 5 repetitions of each dataset with different filters w and inputs x.

We train LSTM, d-LSTM with delays 5 and 10, and Bi-LSTM networks to minimize the mean squared error (MSE). The LSTM and d-LSTM have 100 hidden units and the Bi-LSTM has 70 per direction, matching the numbers of parameters. A linear layer after the recurrent layer outputs a single value per timestep. Models are trained in batches of 100 sequences for 1,000 epochs.
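The target generation just described can be sketched as follows. This is illustrative NumPy code; the padding convention (c - 1 zeros before and a zeros after the sequence) is our reading of "zero padding at the borders":

```python
import numpy as np

def make_targets(x, w, gamma, a, c):
    """y_t = sin(gamma * sum_{j=-c+1}^{a} w_{j+c} x_{t+j}), zero-padded borders.

    a acausal taps and c causal taps; the filter w has length a + c.
    """
    T = len(x)
    # pad with c - 1 zeros before the sequence and a zeros after it
    xp = np.concatenate([np.zeros(c - 1), x, np.zeros(a)])
    # the window for output t covers inputs x_{t-c+1} .. x_{t+a}
    return np.array([np.sin(gamma * np.dot(w, xp[t:t + a + c]))
                     for t in range(T)])

# Dataset parameters from the experiment: T = 50, a + c = 20.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
w = rng.uniform(0.0, 1.0, 20)
y = make_targets(x, w, gamma=0.5, a=5, c=15)
assert y.shape == (50,) and np.all(np.abs(y) <= 1.0)
```

With a = 0 the filter is purely causal; larger a requires the network to look further into the future.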
Training is stopped if the validation MSE falls below 1e-5. Training is repeated five times for each (γ, a) value.

Figure 5. Error maps for the sine function experiment with different degrees of non-linearity (horizontal axis) and amounts of acausality of the filter (vertical axis). Tested architectures: (a) LSTM, (b) Bi-LSTM, (c) d-LSTM with delay=5, and (d) d-LSTM with delay=10. Dark blue regions depict perfect filtering (low error), transitioning to yellow regions with high error.

Figure 5 shows the average test MSE for each model as a function of γ (degree of input non-linearity) and a (acausality). LSTM performance (Fig. 5(a)) is poor everywhere except where the filter is purely causal. Surprisingly, the network performs quite well even when the amount of non-linearity (γ) is quite high. The reason for this seems to be that temporal depth enables the LSTM to approximate this function well. Bi-LSTM performance (Fig. 5(b)) follows a similar trend for the causal case (a = 0) as the forward LSTM, but also has good performance for acausal filters (a > 0) when the function is nearly linear (γ is small). As the non-linearity of the function increases, however, Bi-LSTM performance suffers. This occurs because the Bi-LSTM needs to approximate a highly non-linear function with a linear combination of its forward and backward outputs, which cannot be done with small error. Improving performance would require stacked Bi-LSTM layers.

In contrast, d-LSTM networks have excellent performance for both non-linear and acausal functions. The d-LSTM with delay 5 (Fig. 5(c)) shows a clear switch in performance from acausality a = 5 to 6. This perfectly matches the limit of acausal elements that the network has access to. For the d-LSTM with delay 10 (Fig. 5(d)), the network performs well for acausality values a up to 10.

An interesting outcome of this experiment is the better performance observed for the d-LSTM over the Bi-LSTM. This shows that the d-LSTM can be a better fit than a Bi-LSTM for the right task. Furthermore, the d-LSTM network seems to approximate the functionality of a stacked Bi-LSTM by approximating highly non-linear functions. In practice, this could be a great benefit for applications where there is no need to process the whole sequence, or where doing so is impossible, as with streamed data. In such cases, the d-LSTM would shine over bidirectional architectures. On the other hand, we expect the Bi-LSTM to perform better when the acausality needs of the task are longer than the delay, i.e., a > d.

4.3. Masked Character-Level Language Modeling

Next we examine a language task which should benefit from acausal information: masked character-level language modeling. This task is adapted from previous work on training bidirectional language models (Devlin et al., 2019). To generate masked sequences, we randomly replace each character with a mask token ('[MASK]') with 20% probability. The task of the network is to predict the correct character when it encounters a mask token.
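The masking procedure can be sketched as follows. This is illustrative Python; the 20% masking rate comes from the text, while all names and the seeding are assumptions:

```python
import random

MASK = "[MASK]"

def mask_sequence(chars, p=0.2, seed=0):
    """Replace each character with the mask token with probability p.

    Returns the degraded input and the masked positions; only masked
    positions contribute to the training loss.
    """
    rng = random.Random(seed)
    masked = [MASK if rng.random() < p else ch for ch in chars]
    loss_positions = [i for i, ch in enumerate(masked) if ch == MASK]
    return masked, loss_positions

inp, positions = mask_sequence(list("hiking in the hills"))
assert all(inp[i] == MASK for i in positions)
assert all(inp[i] != MASK for i in range(len(inp)) if i not in positions)
```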
Because each sequence contains multiple mask tokens, the network will need to fill in some mask tokens conditioned on an input sequence that already contains one or more mask tokens. This can be thought of as a signal reconstruction task: when sequential inputs are randomly degraded, how well can the network recover the true signal? Acausal information clearly helps with this reconstruction. For example, the missing letter in the sequence "hik[MASK]ng" is easier to predict than in the sequence "hik[MASK]".

We used text8, a clean 100MB sample of English Wikipedia text (Mahoney, 2006) which consists of 27 characters (the English alphabet and spaces). The input data contained an extra 28th mask character. These 28 characters were mapped to an input embedding layer of dimension 10. The output layer was independent of the input embedding, and only consisted of the 27 non-mask characters. Following previous work (Mikolov et al., 2012), the first 90M characters formed the training set, the next 5M the validation set, and the last 5M the test set. All models were trained with a sequence length of 180 characters, in mini-batches of 128 sequences for a total of 20 epochs. Success on the task is measured by calculating bits-per-character (BPC) for the mask tokens only. We measured forward-pass runtimes on an Nvidia Titan V GPU and report the average time to process a mini-batch.

The results are summarized in Table 1. As expected, the stacked Bi-LSTMs achieve the lowest BPC. However, as the number of layers increases, the inference runtime also increases because of the synchronization needed between layers. Notably, d-LSTMs with intermediate delays achieve a BPC that is within 5% of the Bi-LSTM with at least 4× faster runtime.
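The evaluation metric can be made concrete with a short sketch: BPC over mask tokens only is the mean negative log2-probability assigned to the true character at each masked position. The function below is illustrative (not the paper's code) and assumes the model emits natural-log probabilities per character:

```python
import math

def masked_bpc(char_log_probs, targets, mask_positions):
    """Bits-per-character over mask tokens only: average -log2 p(true char)
    at each masked position. char_log_probs[t][c] is the model's natural-log
    probability of character c at position t."""
    bits = 0.0
    for t in mask_positions:
        bits -= char_log_probs[t][targets[t]] / math.log(2)  # nats -> bits
    return bits / len(mask_positions)

# a model that is uniform over the 27 output characters scores log2(27) bits
uniform = [{c: math.log(1 / 27) for c in "abcdefghijklmnopqrstuvwxyz "}] * 3
bpc = masked_bpc(uniform, targets=["a", "b", "c"], mask_positions=[0, 2])
```

A uniform predictor therefore scores about 4.75 BPC, which puts the sub-1.0 values in Table 1 in context.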
Since all of the d-LSTMs have a single layer, inference runtime remains constant as the delay and the capacity of these networks increase. We find similar results for other network capacities (see supplementary material).

Table 1. Performance of different networks on the masked character-level language modeling task in bits per character (BPC); lower is better. Mean and standard deviation are computed over 5 repetitions of training, with inference runtime measured on the test set.

MODEL     LAYERS  DELAY  UNITS/LAYER  PARAMS.  VAL. BPC        TEST BPC        RUNTIME
LSTM      1       -      1024         4271411  2.003 ± 0.003   2.075 ± 0.002   3.44 ms ± 0.09
LSTM      2       -      594          4283641  2.015 ± 0.005   2.087 ± 0.005   4.93 ms ± 0.13
LSTM      5       -      343          4272372  2.091 ± 0.016   2.155 ± 0.014   17.22 ms ± 0.62
BI-LSTM   1       -      722          4278879  0.977 ± 0.004   1.037 ± 0.004   4.97 ms ± 0.07
BI-LSTM   2       -      363          4277173  0.633 ± 0.003   0.677 ± 0.002   13.72 ms ± 0.31
BI-LSTM   5       -      202          4287151  0.637 ± 0.003   0.677 ± 0.004   29.18 ms ± 0.23
D-LSTM    1       1      1024         4271411  1.332 ± 0.001   1.390 ± 0.001   3.29 ms ± 0.22
D-LSTM    1       5      1024         4271411  0.708 ± 0.005   0.755 ± 0.004   3.39 ms ± 0.08
D-LSTM    1       8      1024         4271411  0.662 ± 0.002   0.706 ± 0.003   3.36 ms ± 0.08
D-LSTM    1       10     1024         4271411  0.666 ± 0.004   0.709 ± 0.004   3.56 ms ± 0.10

Table 2. Part-of-Speech performance for the German, English, and French languages. The models are composed of two subnetworks at the character level and word level. The best bidirectional network and the best forward-only network are marked for each language.
LANGUAGE  CHAR-LEVEL NETWORK  WORD-LEVEL NETWORK  VALIDATION ACC.  TEST ACC.
GERMAN    LSTM                LSTM                92.05 ± 0.16     91.58 ± 0.11
GERMAN    D-LSTM (DELAY=1)    D-LSTM (DELAY=1)    93.48 ± 0.31     92.87 ± 0.24
GERMAN    D-LSTM (DELAY=1)    BI-LSTM             93.93 ± 0.06     93.39 ± 0.18
GERMAN    BI-LSTM             BI-LSTM             93.88 ± 0.13     93.15 ± 0.08
ENGLISH   LSTM                LSTM                92.05 ± 0.13     92.14 ± 0.10
ENGLISH   D-LSTM (DELAY=1)    D-LSTM (DELAY=1)    94.57 ± 0.08     94.57 ± 0.14
ENGLISH   D-LSTM (DELAY=1)    BI-LSTM             94.94 ± 0.07     94.95 ± 0.06
ENGLISH   BI-LSTM             BI-LSTM             94.85 ± 0.05     94.84 ± 0.08
FRENCH    LSTM                LSTM                96.67 ± 0.07     96.10 ± 0.11
FRENCH    D-LSTM (DELAY=1)    D-LSTM (DELAY=1)    97.49 ± 0.04     97.04 ± 0.13
FRENCH    D-LSTM (DELAY=1)    BI-LSTM             97.67 ± 0.07     97.23 ± 0.12
FRENCH    BI-LSTM             BI-LSTM             97.63 ± 0.06     97.22 ± 0.11

4.4. Real-World Part-of-Speech Tagging

In the previous experiments, we showed that the d-LSTM is capable of approximating and even outperforming a Bi-LSTM in some cases. In practice, however, the elements in a sequence may have different forward and backward relations. This poses a challenge for delayed networks that are constrained to a specific delay. If the delay is too low, it may not be enough to capture some long dependencies between elements. If it is too high, the network may forget information and require higher capacity (and perhaps more training data). This is prevalent in several NLP tasks. Therefore we compare the performance of the d-LSTM with a Bi-LSTM on an NLP task where Bi-LSTMs achieve state-of-the-art performance: Part-of-Speech (POS) tagging (Ling et al., 2015; Ballesteros et al., 2015; Plank et al., 2016). The task involves processing a variable-length sequence to predict a POS tag (e.g.
Noun, Verb) per word, using the Universal Dependencies (UD) (Nivre et al., 2016) dataset. More details can be found in the supplementary material.

We follow the dual Bi-LSTM architecture proposed by Plank et al. (2016) to test the approximation capacity of the d-LSTMs. In this model, a word is encoded using a combination of word embeddings and a character-level encoding. The encoded word is fed to a Bi-LSTM followed by a linear layer with softmax to produce POS tags. The character-level encoding is produced by first computing the embedding of each character and then feeding it to a Bi-LSTM. The last hidden state in each direction is concatenated with the word embedding to form the character-level encoding. The character-level Bi-LSTM has 100 units in each direction, and the LSTM/d-LSTMs have 200 units to generate encodings of the same size. For the word-level subnetwork, the hidden state is of size 188 for the Bi-LSTM, and 300 units for the LSTM/d-LSTM to match the number of parameters. The networks are trained for 20 epochs with cross-entropy loss. We train combinations of networks with delays 0 (LSTM), 1, 3, and 5 for the character-level subnetwork, and delays 0 through 4 for the word-level subnetwork. Each network is trained 5 times with random initializations.

Results are presented in Table 2. For brevity, we include a subset of the combinations for each language (the complete table can be found in the supplementary material). For the character-level model, LSTMs without delay yield reduced performance. However, replacing only the character-level Bi-LSTM with an LSTM does not affect the performance (supplementary material). This suggests that only the word-level subnetwork benefits from acausal elements in the sentence.
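The word representation described above can be sketched at the shape level as follows; this is an illustration under the paper's reported sizes, not the authors' implementation:

```python
import numpy as np

def word_representation(word_emb, char_h_fwd, char_h_bwd):
    """Plank et al. (2016)-style encoding: the final state of each direction
    of the 100-unit character Bi-LSTM, concatenated with the word embedding."""
    return np.concatenate([char_h_fwd[-1], char_h_bwd[-1], word_emb])

def word_representation_dlstm(word_emb, char_h_delayed):
    """d-LSTM variant: a single 200-unit delayed network replaces the Bi-LSTM,
    so its final state alone matches the 200-dim character encoding."""
    return np.concatenate([char_h_delayed[-1], word_emb])

word_emb = np.zeros(64)                       # illustrative embedding size
bi = word_representation(word_emb, np.zeros((5, 100)), np.zeros((5, 100)))
dl = word_representation_dlstm(word_emb, np.zeros((5 + 1, 200)))  # delay 1 adds a step
```

Both variants produce encodings of the same dimensionality, which is what lets the d-LSTM drop into the Bi-LSTM's place without changing the word-level subnetwork.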
Interestingly, using a d-LSTM with delay 1 for the character-level network achieves a small improvement over the double-bidirectional model in English and German. Replacing the word-level Bi-LSTM with an LSTM decreases performance significantly. However, using even a d-LSTM with delay 1 improves performance to within 0.3% of the original Bi-LSTM model.

5. Conclusions

In this paper we analyzed the d-RNN, a single-layer RNN where the output is delayed relative to the input. We showed that this simple modification to the classical RNN adds both depth in time and acausal processing. We proved that the d-RNN is a superset of stacked RNNs, which are frequently used for sequence problems: a d-RNN with output delay d and specific constraints on its weights is exactly equivalent to a stacked RNN with d + 1 layers. We also showed that the d-RNN can approximate bidirectional RNNs and stacked bidirectional RNNs because the delay allows the model to look at future as well as past inputs. In sum, we found that d-RNNs are a simple, elegant, and computationally efficient alternative that captures many of the best features of different RNN architectures while avoiding many downsides.

References

Al-Rfou', R., Perozzi, B., and Skiena, S. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 183-192, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W13-3520.

Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173-182, 2016.

Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X.
, Miller, J., Ng, A., Raiman, J., Sengupta, S., and Shoeybi, M. Deep Voice: Real-time neural text-to-speech. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML '17, pp. 195-204. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305381.3305402.

Ballesteros, M., Dyer, C., and Smith, N. A. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 349-359, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1041. URL https://www.aclweb.org/anthology/D15-1041.

Baziotis, C., Pelekis, N., and Doulkeridis, C. DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 747-754, 2017.

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009. ISSN 1935-8237. doi: 10.1561/2200000006. URL http://dx.doi.org/10.1561/2200000006.

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, March 1994. ISSN 1045-9227. doi: 10.1109/72.279181.

Bengio, Y., LeCun, Y., et al. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 34(5):1-41, 2007.

Bianchini, M. and Scarselli, F. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553-1565, Aug 2014. ISSN 2162-237X. doi: 10.1109/TNNLS.2013.2293637.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Gated feedback recurrent neural networks.
In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML '15, pp. 2067-2075. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045338.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs], May 2019. URL http://arxiv.org/abs/1810.04805.

Graves, A. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.

Graves, A. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.

Graves, A. and Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602-610, 2005. ISSN 0893-6080. doi: 10.1016/j.neunet.2005.06.042. URL http://www.sciencedirect.com/science/article/pii/S0893608005001206. IJCNN 2005.

Guo, T., Xu, Z., Yao, X., Chen, H., Aberer, K., and Funaya, K. Robust online time series prediction with recurrent neural networks. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 816-825, Oct 2016. doi: 10.1109/DSAA.2016.92.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

Li, S., Li, W., Cook, C., Zhu, C., and Gao, Y.
Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Ling, W., Dyer, C., Black, A. W., Trancoso, I., Fermandez, R., Amir, S., Marujo, L., and Luis, T. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1520-1530, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1176. URL https://www.aclweb.org/anthology/D15-1176.

Mahoney, M. Relationship of Wikipedia text to clean text, June 2006. URL http://mattmahoney.net/dc/textdata.html.

Mikolov, T., Sutskever, I., Deoras, A., Le, H.-S., and Kombrink, S. Subword language modeling with neural networks. Preprint, 2012. URL http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf.

Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., and Zeman, D. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 1659-1666, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). URL https://www.aclweb.org/anthology/L16-1262.

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML '13, pp. III-1310-III-1318. JMLR.org, 2013. URL http://dl.acm.org/citation.cfm?id=3042817.3043083.

Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. How to construct deep recurrent neural networks.
In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), 2014.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS Workshop on the Future of Gradient-Based Machine Learning Software & Techniques, 2017.

Plank, B., Søgaard, A., and Goldberg, Y. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 412-418, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-2067. URL https://www.aclweb.org/anthology/P16-2067.

Raiko, T., Valpola, H., and LeCun, Y. Deep learning made easier by linear transformations in perceptrons. In Lawrence, N. D. and Girolami, M. (eds.), Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pp. 924-932, La Palma, Canary Islands, 21-23 Apr 2012. PMLR. URL http://proceedings.mlr.press/v22/raiko12.html.

Schmidhuber, J. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992.

Schuster, M. and Paliwal, K. K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673-2681, Nov 1997. ISSN 1053-587X. doi: 10.1109/78.650093.

Silveira, N., Dozat, T., de Marneffe, M.-C., Bowman, S., Connor, M., Bauer, J., and Manning, C. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 2897-2904, Reykjavik, Iceland, May 2014.
European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2014/pdf/1089_Paper.pdf.

Sutskever, I., Martens, J., and Hinton, G. E. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017-1024, 2011.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.

Zhang, S., Wu, Y., Che, T., Lin, Z., Memisevic, R., Salakhutdinov, R. R., and Bengio, Y. Architectural complexity measures of recurrent neural networks. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 1822-1830. Curran Associates, Inc., 2016.

Zilly, J. G., Srivastava, R. K., Koutník, J., and Schmidhuber, J. Recurrent highway networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 4189-4198, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/zilly17a.html.

A. Theorem 1 Proof

Let us recall the notation introduced in the main paper. We use superscript $(i)$ to refer to a weight matrix or vector related to layer $i$ in a stacked network, e.g., $W_h^{(i)}$ or $h_t^{(i)}$. For a single-layer d-RNN, we refer to weight matrices and related vectors with a "hat", e.g., $\hat{W}_h$ or $\hat{h}_t$.
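Before walking through the proof, Theorem 1 can be checked numerically. The sketch below (illustrative, not the paper's code) builds a random $k$-layer stacked RNN and the corresponding weight-constrained single-layer d-RNN: the big recurrent matrix is block lower-bidiagonal, with $W_h^{(i)}$ on the diagonal blocks and $W_x^{(i)}$ on the subdiagonal blocks, and the input feeds only the first block. Biases are set to zero so that, with zero initial states, the theorem's initialization condition $\hat{h}_{i-1}^{\{i\}} = h_0^{(i)}$ holds automatically (since $\tanh(0) = 0$), sidestepping the general construction of Lemma 1:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, q, m, T = 3, 4, 5, 2, 8  # layers, hidden, input, output dims, length

# random stacked-RNN weights (zero biases; see lead-in)
Wx = [0.5 * rng.standard_normal((n, q if i == 0 else n)) for i in range(k)]
Wh = [0.5 * rng.standard_normal((n, n)) for i in range(k)]
Wo = 0.5 * rng.standard_normal((m, n))
x = rng.standard_normal((T, q))

# run the stacked RNN
h = [np.zeros(n) for _ in range(k)]
y = []
for t in range(T):
    inp = x[t]
    for i in range(k):
        h[i] = np.tanh(Wx[i] @ inp + Wh[i] @ h[i])  # layer i at time t
        inp = h[i]
    y.append(Wo @ h[-1])

# constrained single-layer d-RNN with delay d = k - 1
Wx_hat = np.zeros((k * n, q)); Wx_hat[:n] = Wx[0]   # input reaches block 1 only
Wh_hat = np.zeros((k * n, k * n))
for i in range(k):
    Wh_hat[i*n:(i+1)*n, i*n:(i+1)*n] = Wh[i]        # diagonal: recurrent weights
    if i > 0:
        Wh_hat[i*n:(i+1)*n, (i-1)*n:i*n] = Wx[i]    # subdiagonal: layer inputs
h_hat = np.zeros(k * n)
y_hat = []
for t in range(T + k - 1):
    xt = x[t] if t < T else np.zeros(q)             # pad the tail with zeros
    h_hat = np.tanh(Wx_hat @ xt + Wh_hat @ h_hat)
    y_hat.append(Wo @ h_hat[-n:])                   # read the last block

# the d-RNN output matches the stacked RNN, delayed by exactly k - 1 steps
assert all(np.allclose(y[t], y_hat[t + k - 1]) for t in range(T))
```

The zero-padded tail steps only pollute the lower blocks at times that never reach the read-out positions, so the equality is exact up to floating-point tolerance.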
Additionally, we define the block notation as follows: the subvector $\hat{v}_t^{\{i\}}$ refers to the $i$-th block of a vector $\hat{v}_t$ composed of $k$ blocks. The blocks follow the definition in Equations (3)-(5).

Proof of Theorem 1. We prove Theorem 1 by induction on the sequence length $t$. First, we show that for $t = 1$ the stacked RNN and the d-RNN with the constrained weights are equivalent. Namely, for $t = 1$ we show that the outputs and the hidden states are the same, i.e., $\hat{y}_k = y_1$ and $\hat{h}_i^{\{i\}} = h_1^{(i)}$, respectively. Without loss of generality, we have for any $i$ in $1 \ldots k$ the following:

$$
\begin{aligned}
\hat{h}_i^{\{i\}} &= f^{\{i\}}\!\left(\hat{W}_x x_i + \hat{W}_h \hat{h}_{i-1} + \hat{b}_h\right) \\
&= f\!\left(\hat{W}_x^{\{i\}} x_i + W_x^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_h^{(i)} \hat{h}_{i-1}^{\{i\}} + b_h^{(i)}\right) \\
&= f\!\left(0 + W_x^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= f\!\left(W_x^{(i)} \cdot f\!\left(W_x^{(i-1)} \hat{h}_{i-2}^{\{i-2\}} + W_h^{(i-1)} \hat{h}_{i-2}^{\{i-1\}} + b_h^{(i-1)}\right) + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= f\!\left(W_x^{(i)} \cdot f\!\left(W_x^{(i-1)} \hat{h}_{i-2}^{\{i-2\}} + W_h^{(i-1)} h_0^{(i-1)} + b_h^{(i-1)}\right) + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(2)} \cdot f\!\left(W_x^{(1)} x_1 + W_h^{(1)} h_0^{(1)} + b_h^{(1)}\right) + W_h^{(2)} h_0^{(2)} + b_h^{(2)}\right) \cdots + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(2)} h_1^{(1)} + W_h^{(2)} h_0^{(2)} + b_h^{(2)}\right) \cdots + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(j)} h_1^{(j-1)} + W_h^{(j)} h_0^{(j)} + b_h^{(j)}\right) \cdots + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} h_1^{(i-1)} + W_h^{(i)} h_0^{(i)} + b_h^{(i)}\right) = h_1^{(i)},
\end{aligned}
$$

where we used the initialization assumption $\hat{h}_{i-1}^{\{i\}} = h_0^{(i)}$ for all $i = 1 \ldots k$, and the definition of the hidden state in Equations (3)-(4) for $j - 1$ blocks, in the previous steps.
In particular, we have for $j = k$ that $\hat{h}_k^{\{k\}} = h_1^{(k)}$. Plugging this result and the definition of the output weights and biases in Equation (8) into Equation (2) for computing the output, we obtain

$$
\hat{y}_k = g\!\left(\hat{W}_o \hat{h}_k + \hat{b}_o\right) = g\!\left(W_o \hat{h}_k^{\{k\}} + b_o\right) = g\!\left(W_o h_1^{(k)} + b_o\right) = y_1, \tag{A.9}
$$

which concludes the basis of the induction.

Next, we assume that $\hat{h}_{t+i-1}^{\{i\}} = h_t^{(i)}$ for all $1 \le i \le k$ and $t \le T - 1$, and prove that it holds for the hidden states for all layers when $t = T$: $\hat{h}_{T+i-1}^{\{i\}} = h_T^{(i)}$, $\forall\, 1 \le i \le k$. Without loss of generality, we have for the hidden state $\hat{h}_{T+i-1}^{\{i\}}$ in the constrained-weights single-layer d-RNN that

$$
\begin{aligned}
\hat{h}_{T+i-1}^{\{i\}} &= f^{\{i\}}\!\left(\hat{W}_x x_{T+i-1} + \hat{W}_h \hat{h}_{T+i-2} + \hat{b}_h\right) \\
&= f\!\left(\hat{W}_x^{\{i\}} x_{T+i-1} + W_x^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_h^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_h^{(i)}\right) \\
&= f\!\left(0 + W_x^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_h^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_h^{(i)}\right) \\
&= f\!\left(W_x^{(i)} \cdot f\!\left(W_x^{(i-1)} \hat{h}_{T+i-3}^{\{i-2\}} + W_h^{(i-1)} \hat{h}_{T+i-3}^{\{i-1\}} + b_h^{(i-1)}\right) + W_h^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(j)} \cdots f\!\left(W_x^{(2)} f\!\left(W_x^{(1)} x_T + W_h^{(1)} \hat{h}_{T-1}^{\{1\}} + b_h^{(1)}\right) + W_h^{(2)} \hat{h}_T^{\{2\}} + b_h^{(2)}\right) \cdots + W_h^{(j)} \hat{h}_{T+j-2}^{\{j\}} + b_h^{(j)}\right) \cdots + W_h^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_h^{(i)}\right).
\end{aligned}
$$

From the inductive assumption we have $\hat{h}_{T+j-2}^{\{j\}} = h_{T-1}^{(j)}$ for all $1 \le j \le k$; then it follows that

$$
\begin{aligned}
\hat{h}_{T+i-1}^{\{i\}} &= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(j)} \cdots f\!\left(W_x^{(2)} f\!\left(W_x^{(1)} x_T + W_h^{(1)} h_{T-1}^{(1)} + b_h^{(1)}\right) + W_h^{(2)} h_{T-1}^{(2)} + b_h^{(2)}\right) \cdots + W_h^{(j)} h_{T-1}^{(j)} + b_h^{(j)}\right) \cdots + W_h^{(i)} h_{T-1}^{(i)} + b_h^{(i)}\right) \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(j)} \cdots f\!\left(W_x^{(2)} h_T^{(1)} + W_h^{(2)} h_{T-1}^{(2)} + b_h^{(2)}\right) \cdots + W_h^{(j)} h_{T-1}^{(j)} + b_h^{(j)}\right) \cdots + W_h^{(i)} h_{T-1}^{(i)} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} \cdots f\!\left(W_x^{(j)} h_T^{(j-1)} + W_h^{(j)} h_{T-1}^{(j)} + b_h^{(j)}\right) \cdots + W_h^{(i)} h_{T-1}^{(i)} + b_h^{(i)}\right) \\
&= \cdots \\
&= f\!\left(W_x^{(i)} h_T^{(i-1)} + W_h^{(i)} h_{T-1}^{(i)} + b_h^{(i)}\right) = h_T^{(i)},
\end{aligned}
$$

where we used the definition of the hidden states in Equations (3)-(4). In particular, we have for $i = k$ that $\hat{h}_{T+k-1}^{\{k\}} = h_T^{(k)}$. Now, we show that $\hat{y}_{T+k-1} = y_T$. By the definition of the output weights and biases in Equation (8), and by the fact that $\hat{h}_{T+k-1}^{\{k\}} = h_T^{(k)}$, we obtain

$$
\hat{y}_{T+k-1} = g\!\left(\hat{W}_o \hat{h}_{T+k-1} + \hat{b}_o\right) = g\!\left(W_o \hat{h}_{T+k-1}^{\{k\}} + b_o\right) = g\!\left(W_o h_T^{(k)} + b_o\right) = y_T, \tag{A.10}
$$

which completes the proof. □

B. Lemma 1 Proof

We show next that there exists an initialization vector that allows us to initialize the equivalent single-layer weight-constrained d-RNN as defined in Theorem 1.

Proof of Lemma 1. From the surjectivity of the activation function $f(\cdot)$, we know that $f(\cdot)$ is right-invertible. Namely, there is a function $r : D \to \mathbb{R}$ such that for any $d \in D$, $r(\cdot)$ satisfies $f(r(d)) = d$. First, we note that for $i = 1$, we have $\hat{h}_0^{\{1\}} = h_0^{(1)}$. When $i = 2$, we have

$$
h_0^{(2)} = \hat{h}_1^{\{2\}} = f\!\left(W_x^{(2)} h_0^{(1)} + W_h^{(2)} \hat{h}_0^{\{2\}} + b_h^{(2)}\right). \tag{B.11}
$$

From (B.11) and the right inverse $r(\cdot)$, which satisfies $h_0^{(2)} = f\!\left(r\!\left(h_0^{(2)}\right)\right)$, we obtain

$$
r\!\left(h_0^{(2)}\right) = W_x^{(2)} h_0^{(1)} + W_h^{(2)} \hat{h}_0^{\{2\}} + b_h^{(2)}
\;\implies\;
\hat{h}_0^{\{2\}} = {W_h^{(2)}}^{\dagger}\!\left[r\!\left(h_0^{(2)}\right) - W_x^{(2)} h_0^{(1)} - b_h^{(2)}\right], \tag{B.12}
$$

where $A^{\dagger}$ is the pseudoinverse of matrix $A$.
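The construction in Equation (B.12) is easy to exercise numerically when $f = \tanh$ (so $r = \operatorname{arctanh}$, a right inverse of $\tanh$ on $(-1, 1)$) and $W_h^{(2)}$ is square and, almost surely for a random draw, invertible. The sketch below is illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
Wx2 = rng.standard_normal((n, n))      # W_x^{(2)}
Wh2 = rng.standard_normal((n, n))      # W_h^{(2)}, square and a.s. invertible
b2 = rng.standard_normal(n)            # b_h^{(2)}
h1_0 = 0.1 * rng.standard_normal(n)    # h_0^{(1)}
h2_0 = rng.uniform(-0.9, 0.9, n)       # target h_0^{(2)}, inside tanh's range

# Eq. (B.12): invert the tanh update through the pseudoinverse of W_h^{(2)}
h2_hat_0 = np.linalg.pinv(Wh2) @ (np.arctanh(h2_0) - Wx2 @ h1_0 - b2)

# one constrained d-RNN step from this initialization recovers h_0^{(2)}
recovered = np.tanh(Wx2 @ h1_0 + Wh2 @ h2_hat_0 + b2)
assert np.allclose(recovered, h2_0, atol=1e-6)
```

When $W_h^{(2)}$ is invertible the pseudoinverse coincides with the inverse and the recovery is exact up to floating-point error; the pseudoinverse form is what the lemma uses in general.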
We assume that we have obtained the initializations for block $i - 1$ and compute the initialization for block $i$. In general, for block $i$ we have

$$
h_0^{(i)} = \hat{h}_{i-1}^{\{i\}} = f\!\left(W_x^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_h^{(i)} \hat{h}_{i-2}^{\{i\}} + b_h^{(i)}\right).
$$

We can plug in the initialization and the intermediate computed hidden states for block $i - 1$ to obtain

$$
\hat{h}_{i-2}^{\{i\}} = {W_h^{(i)}}^{\dagger}\!\left[r\!\left(h_0^{(i)}\right) - W_x^{(i)} \hat{h}_{i-1}^{\{i-1\}} - b_h^{(i)}\right].
$$

We continue to reapply the recursive formula one step at a time until we reach the last step before the initialization $\hat{h}_0^{\{i\}}$:

$$
\begin{aligned}
\hat{h}_{i-j}^{\{i\}} &= {W_h^{(i)}}^{\dagger}\!\left[r\!\left(\hat{h}_{i-j+1}^{\{i\}}\right) - W_x^{(i)} \hat{h}_{i-j+1}^{\{i-1\}} - b_h^{(i)}\right] \\
&\;\;\vdots \\
\hat{h}_1^{\{i\}} &= f\!\left(W_x^{(i)} \hat{h}_1^{\{i-1\}} + W_h^{(i)} \hat{h}_0^{\{i\}} + b_h^{(i)}\right)
\;\implies\;
\hat{h}_0^{\{i\}} = {W_h^{(i)}}^{\dagger}\!\left[r\!\left(\hat{h}_1^{\{i\}}\right) - W_x^{(i)} \hat{h}_1^{\{i-1\}} - b_h^{(i)}\right]. \tag{B.13}
\end{aligned}
$$

Following these steps from $h_0^{(i)}$ to obtain $\hat{h}_0^{\{i\}}$, we have constructed the initialization of the weight-constrained d-RNN that accurately mimics the initialization of the stacked RNN. □

C. Extension to d-LSTMs

A Long Short-Term Memory recurrent cell (Hochreiter & Schmidhuber, 1997) is given by the introduction of a cell state and a series of gates that control the updates of the states. The cell state together with the gates aims to solve the vanishing gradients problem in the RNN.
The LSTM cell is highly popular, and we refer to the following implementation:

$$
\begin{aligned}
\hat{e}_t &= \sigma\!\left(\hat{W}_{xe} x_t + \hat{W}_{he} \hat{h}_{t-1} + \hat{b}_e\right), && \text{(C.14)} \\
\hat{f}_t &= \sigma\!\left(\hat{W}_{xf} x_t + \hat{W}_{hf} \hat{h}_{t-1} + \hat{b}_f\right), && \text{(C.15)} \\
\hat{o}_t &= \sigma\!\left(\hat{W}_{xo} x_t + \hat{W}_{ho} \hat{h}_{t-1} + \hat{b}_o\right), && \text{(C.16)} \\
\hat{g}_t &= \tanh\!\left(\hat{W}_{xc} x_t + \hat{W}_{hc} \hat{h}_{t-1} + \hat{b}_c\right), && \text{(C.17)} \\
\hat{c}_t &= \hat{f}_t \odot \hat{c}_{t-1} + \hat{e}_t \odot \hat{g}_t, && \text{(C.18)} \\
\hat{h}_t &= \hat{o}_t \odot \tanh(\hat{c}_t), && \text{(C.19)}
\end{aligned}
$$

where $\hat{e}_t$ is the input gate, $\hat{f}_t$ the forget gate, $\hat{o}_t$ the output gate, $\hat{g}_t$ the cell gate, $\hat{c}_t$ the cell state, and $\hat{h}_t$ the hidden state. The weight matrices are denoted $\hat{W}_{xa}$ and $\hat{W}_{ha}$, and the biases $\hat{b}_a$, with $a \in \{e, c, f, o\}$ being the respective gate. The symbol $\odot$ represents an element-wise product and $\sigma(\cdot)$ is the sigmoid function.

First, we note that the set of Equations (C.14)-(C.19) can be collapsed into the following two equations:

$$
\begin{aligned}
\hat{c}_t &= \sigma\!\left(\hat{W}_{xf} x_t + \hat{W}_{hf} \hat{h}_{t-1} + \hat{b}_f\right) \odot \hat{c}_{t-1}
+ \sigma\!\left(\hat{W}_{xe} x_t + \hat{W}_{he} \hat{h}_{t-1} + \hat{b}_e\right) \odot \tanh\!\left(\hat{W}_{xc} x_t + \hat{W}_{hc} \hat{h}_{t-1} + \hat{b}_c\right), && \text{(C.20)} \\
\hat{h}_t &= \sigma\!\left(\hat{W}_{xo} x_t + \hat{W}_{ho} \hat{h}_{t-1} + \hat{b}_o\right) \odot \tanh(\hat{c}_t). && \text{(C.21)}
\end{aligned}
$$

Rewriting the LSTM Equations (C.14)-(C.19) in this form leaves us with the recurrent equations in which both $\hat{h}_t$ and $\hat{c}_t$ depend on the previous hidden and cell states, $\hat{h}_{t-1}$ and $\hat{c}_{t-1}$, and the current input $x_t$.

Next, we describe the weight matrices for the single-layer d-LSTM that matches a stacked LSTM with $k$ layers. The matrices and biases follow the exact same pattern as in the RNN proof, and are the same for all gates:

$$
\hat{W}_{ha} = \begin{pmatrix}
W_{ha}^{(1)} & 0 & \cdots & & 0 \\
W_{xa}^{(2)} & W_{ha}^{(2)} & 0 & & \vdots \\
& \ddots & \ddots & & \\
& & W_{xa}^{(i)} & W_{ha}^{(i)} & \\
0 & \cdots & 0 & W_{xa}^{(k)} & W_{ha}^{(k)}
\end{pmatrix} \tag{C.22}
$$

$$
\hat{b}_{ha} = \begin{pmatrix} b_{ha}^{(1)} \\ \vdots \\ b_{ha}^{(k)} \end{pmatrix}, \qquad
\hat{W}_{xa} = \begin{pmatrix} W_{xa}^{(1)} \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \tag{C.23}
$$

where $\hat{W}_{xa} \in \mathbb{R}^{kn \times q}$ are the input weights, $\hat{W}_{ha} \in \mathbb{R}^{kn \times kn}$ the recurrent weights, and $\hat{b}_{ha} \in \mathbb{R}^{kn}$ the biases, for gate $a \in \{e, c, o, f\}$. We follow the same notation for blocks and layers introduced with Theorem 1. We omit the equations for the output element $\hat{y}_t$ as they are exactly the same as for the RNN in Theorem 1, and thus require the same steps for proving that the outputs are equal, i.e., $\hat{y}_{T+k-1} = y_T$. Therefore, for the LSTM theorem we focus on the hidden and cell states.

Theorem 2. Given an input sequence $\{x_t\}_{t=1 \ldots T}$ and a stacked LSTM with $k$ layers and initial states $\{h_0^{(i)}, c_0^{(i)}\}_{i=1 \ldots k}$, the d-LSTM with delay $d = k - 1$, defined by Equations (C.22)-(C.23) and initialized with $\hat{h}_0$ such that $\hat{h}_{i-1}^{\{i\}} = h_0^{(i)}$, $\forall\, i = 1 \ldots k$, and $\hat{c}_0$ such that $\hat{c}_{i-1}^{\{i\}} = c_0^{(i)}$, $\forall\, i = 1 \ldots k$, produces the same output sequence but delayed by $k - 1$ timesteps, i.e., $\hat{y}_{t+k-1} = y_t$ for all $t = 1 \ldots T$. Further, the sequences of hidden and cell states at each layer $i$ are equivalent with delay $i - 1$, i.e., $\hat{h}_{t+i-1}^{\{i\}} = h_t^{(i)}$ and $\hat{c}_{t+i-1}^{\{i\}} = c_t^{(i)}$ for all $1 \le i \le k$ and $t \ge 1$.

Proof. We prove Theorem 2 by induction on the sequence length $t$. First, we show that for $t = 1$ the stacked LSTM and the d-LSTM with the constrained weights are equivalent. Namely, for $t = 1$ we show that the outputs, hidden states, and cell states are the same, i.e., $\hat{y}_k = y_1$, $\hat{h}_i^{\{i\}} = h_1^{(i)}$, and $\hat{c}_i^{\{i\}} = c_1^{(i)}$, respectively. Without loss of generality, we have for any $i$ in $1 \ldots k$ the following:

$$
\begin{aligned}
\hat{h}_i^{\{i\}} &= \sigma\!\left(\hat{W}_{xo}^{\{i\}} x_i + \hat{W}_{ho}^{\{i\}} \hat{h}_{i-1} + \hat{b}_o^{\{i\}}\right) \odot \tanh\!\left(\hat{c}_i^{\{i\}}\right) \\
&= \sigma\!\left(W_{xo}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{ho}^{(i)} \hat{h}_{i-1}^{\{i\}} + b_o^{(i)}\right) \odot \tanh\!\left(\hat{c}_i^{\{i\}}\right) \\
&= \sigma\!\left(W_{xo}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{ho}^{(i)} h_0^{(i)} + b_o^{(i)}\right) \odot \tanh\!\Big(
\sigma\!\left(W_{xf}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{hf}^{(i)} h_0^{(i)} + b_f^{(i)}\right) \odot \hat{c}_{i-1}^{\{i\}} \\
&\qquad + \sigma\!\left(W_{xe}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{he}^{(i)} h_0^{(i)} + b_e^{(i)}\right) \odot \tanh\!\left(W_{xc}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{hc}^{(i)} h_0^{(i)} + b_c^{(i)}\right)\Big) \\
&= \cdots
\end{aligned}
$$

Unrolling the recursion block by block, substituting the initialization assumptions $\hat{h}_{i-1}^{\{i\}} = h_0^{(i)}$ and $\hat{c}_{i-1}^{\{i\}} = c_0^{(i)}$ at each level, and recognizing at each level $j$ the stacked-LSTM update of Equations (C.20)-(C.21) for $h_1^{(j)}$, the chain collapses level by level, exactly as in the RNN case, to

$$
\begin{aligned}
\hat{h}_i^{\{i\}} &= \sigma\!\left(W_{xo}^{(i)} h_1^{(i-1)} + W_{ho}^{(i)} h_0^{(i)} + b_o^{(i)}\right) \odot \tanh\!\Big(
\sigma\!\left(W_{xf}^{(i)} h_1^{(i-1)} + W_{hf}^{(i)} h_0^{(i)} + b_f^{(i)}\right) \odot c_0^{(i)} \\
&\qquad + \sigma\!\left(W_{xe}^{(i)} h_1^{(i-1)} + W_{he}^{(i)} h_0^{(i)} + b_e^{(i)}\right) \odot \tanh\!\left(W_{xc}^{(i)} h_1^{(i-1)} + W_{hc}^{(i)} h_0^{(i)} + b_c^{(i)}\right)\Big) = h_1^{(i)},
\end{aligned}
$$

where we used the initialization assumptions $\hat{h}_{i-1}^{\{i\}} = h_0^{(i)}$ and $\hat{c}_{i-1}^{\{i\}} = c_0^{(i)}$ for all $i = 1 \ldots k$, and the definition of the hidden and cell states in Equations (C.20) and (C.21) for the previous blocks. In particular, we have for layer $k$ that $\hat{h}_k^{\{k\}} = h_1^{(k)}$, and using the same transformations as in (A.9) for RNNs, we obtain $\hat{y}_k = y_1$.

Furthermore, we obtain for the cell state:

$$
\begin{aligned}
\hat{c}_i^{\{i\}} &= \sigma\!\left(\hat{W}_{xf}^{\{i\}} x_i + \hat{W}_{hf}^{\{i\}} \hat{h}_{i-1} + \hat{b}_f^{\{i\}}\right) \odot \hat{c}_{i-1}^{\{i\}}
+ \sigma\!\left(\hat{W}_{xe}^{\{i\}} x_i + \hat{W}_{he}^{\{i\}} \hat{h}_{i-1} + \hat{b}_e^{\{i\}}\right) \odot \tanh\!\left(\hat{W}_{xc}^{\{i\}} x_i + \hat{W}_{hc}^{\{i\}} \hat{h}_{i-1} + \hat{b}_c^{\{i\}}\right) \\
&= \sigma\!\left(W_{xf}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{hf}^{(i)} \hat{h}_{i-1}^{\{i\}} + b_f^{(i)}\right) \odot c_0^{(i)}
+ \sigma\!\left(W_{xe}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{he}^{(i)} \hat{h}_{i-1}^{\{i\}} + b_e^{(i)}\right) \odot \tanh\!\left(W_{xc}^{(i)} \hat{h}_{i-1}^{\{i-1\}} + W_{hc}^{(i)} \hat{h}_{i-1}^{\{i\}} + b_c^{(i)}\right) \\
&= \sigma\!\left(W_{xf}^{(i)} h_1^{(i-1)} + W_{hf}^{(i)} h_0^{(i)} + b_f^{(i)}\right) \odot c_0^{(i)}
+ \sigma\!\left(W_{xe}^{(i)} h_1^{(i-1)} + W_{he}^{(i)} h_0^{(i)} + b_e^{(i)}\right) \odot \tanh\!\left(W_{xc}^{(i)} h_1^{(i-1)} + W_{hc}^{(i)} h_0^{(i)} + b_c^{(i)}\right) = c_1^{(i)},
\end{aligned}
$$

which concludes the basis of the induction.

Next, we assume that $\hat{h}_{t+i-1}^{\{i\}} = h_t^{(i)}$ and $\hat{c}_{t+i-1}^{\{i\}} = c_t^{(i)}$ for all $1 \le i \le k$ and $t \le T - 1$, and prove that the claim holds for the hidden and cell states for all layers when $t = T$: $\hat{h}_{T+i-1}^{\{i\}} = h_T^{(i)}$, $\forall\, 1 \le i \le k$. Without loss of generality, we have for the hidden state $\hat{h}_{T+i-1}^{\{i\}}$ in the constrained-weights single-layer d-LSTM that

$$
\begin{aligned}
\hat{h}_{T+i-1}^{\{i\}} &= \sigma\!\left(\hat{W}_{xo}^{\{i\}} x_{T+i-1} + \hat{W}_{ho}^{\{i\}} \hat{h}_{T+i-2}^{\{i\}} + \hat{b}_o^{\{i\}}\right) \odot \tanh\!\left(\hat{c}_{T+i-1}^{\{i\}}\right) \\
&= \sigma\!\left(W_{xo}^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_{ho}^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_o^{(i)}\right) \odot \tanh\!\Big(
\sigma\!\left(W_{xf}^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_{hf}^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_f^{(i)}\right) \odot \hat{c}_{T+i-2}^{\{i\}} \\
&\qquad + \sigma\!\left(W_{xe}^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_{he}^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_e^{(i)}\right) \odot \tanh\!\left(W_{xc}^{(i)} \hat{h}_{T+i-2}^{\{i-1\}} + W_{hc}^{(i)} \hat{h}_{T+i-2}^{\{i\}} + b_c^{(i)}\right)\Big) \\
&= \cdots
\end{aligned}
$$
σ  W (2) xo σ  W (1) xo x T + W (1) ho ˆ h { 1 } T − 1 + b (1) o  ⊙ tanh  σ  W (1) xf x T + W (1) hf ˆ h { 1 } T − 1 + b (1) f  ⊙ ˆ c { 1 } T − 1 + σ  W (1) xe x T + W (1) he ˆ h { 1 } T − 1 + b (1) e  ⊙ tanh  W (1) xc x T + W (1) hc ˆ h { 1 } T − 1 + b (1) c  + W (2) ho ˆ h { 2 } T + b (2) o  ⊙ tanh  σ  W (2) xf ( . . . ) + W (2) hf ˆ h { 2 } T + b (2) f  ⊙ ˆ c { 2 } T + σ  W (2) xe ( . . . ) + W (2) he ˆ h { 2 } T + b (2) e  ⊙ tanh  W (2) xc ( . . . ) + W (2) hc ˆ h { 2 } T + b (2) c  . . . ] + W ( j ) ho ˆ h { j } T + j − 2 + b ( j ) o  ⊙ tanh  σ  W ( j ) xf [ . . . ] + W ( j ) hf ˆ h { j } T + j − 2 + b ( j ) f  ⊙ ˆ c { j } T + j − 2 + σ  W ( j ) xe [ . . . ] + W ( j ) he ˆ h { j } T + j − 2 + b ( j ) e  ⊙ tanh  W ( j ) xc [ . . . ] + W ( j ) hc ˆ h { j } T + j − 2 + b ( j ) c o · · · + W ( i ) ho ˆ h { i } T + i − 2 + b ( i ) o ⊙ tanh  σ  W ( i ) xf ( . . . ) + W ( i ) hf ˆ h { i } T + i − 2 + b ( i ) f  ⊙ ˆ c { i } T + i − 2 + σ  W ( i ) xe ( . . . ) + W ( i ) he ˆ h { i } T + i − 2 + b ( i ) e  ⊙ tanh  W ( i ) xc ( . . . ) + W ( i ) hc ˆ h { i } T + i − 2 + b ( i ) c  From the indu ctiv e assumptio n we have that ˆ h { j } T + j − 2 = h ( j ) T − 1 and ˆ c { j } T + j − 2 = c ( j ) T − 1 for all 1 ≤ j ≤ k , then it Appr oximating Stacked and Bidirectional Recurr ent Architectures with the Del ayed Recurrent Neural Networ k follows that = σ  W ( i ) xo . . . n σ  W ( j ) xo [ . . . σ  W (2) xo σ  W (1) xo x T + W (1) ho h (1) T − 1 + b (1) o  ⊙ tanh  σ  W (1) xf x T + W (1) hf h (1) T − 1 + b (1) f  ⊙ c (1) T − 1 + σ  W (1) xe x T + W (1) he h (1) T − 1 + b (1) e  ⊙ tanh  W (1) xc x T + W (1) hc h (1) T − 1 + b (1) c  + W (2) ho h (2) T − 1 + b (2) o  ⊙ tanh  σ  W (2) xf ( . . . ) + W (2) hf h (2) T − 1 + b (2) f  ⊙ c (2) T − 1 + σ  W (2) xe ( . . . ) + W (2) he h (2) T − 1 + b (2) e  ⊙ tanh  W (2) xc ( . . . ) + W (2) hc h (2) T − 1 + b (2) c  . . . 
] + W ( j ) ho h ( j ) T − 1 + b ( j ) o  ⊙ tanh  σ  W ( j ) xf [ . . . ] + W ( j ) hf h ( j ) T − 1 + b ( j ) f  ⊙ c ( j ) T − 1 + σ  W ( j ) xe [ . . . ] + W ( j ) he h ( j ) T − 1 + b ( j ) e  ⊙ tanh  W ( j ) xc [ . . . ] + W ( j ) hc h ( j ) T − 1 + b ( j ) c o · · · + W ( i ) ho h ( i ) T − 1 + b ( i ) o ⊙ tanh  σ  W ( i ) xf ( . . . ) + W ( i ) hf h ( i ) T − 1 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe ( . . . ) + W ( i ) he h ( i ) T − 1 + b ( i ) e  ⊙ tanh  W ( i ) xc ( . . . ) + W ( i ) hc h ( i ) T − 1 + b ( i ) c  = σ  W ( i ) xo . . . n σ  W ( j ) xo [ . . . σ  W (2) xo h (1) T + W (2) ho h (2) T − 1 + b (2) o  ⊙ tanh  σ  W (2) xf h (1) T + W (2) hf h (2) T − 1 + b (2) f  ⊙ c (2) T − 1 + σ  W (2) xe h (1) T + W (2) he h (2) T − 1 + b (2) e  ⊙ tanh  W (2) xc h (1) T + W (2) hc h (2) T − 1 + b (2) c  . . . ] + W ( j ) ho h ( j ) T − 1 + b ( j ) o  ⊙ tanh  σ  W ( j ) xf [ . . . ] + W ( j ) hf h ( j ) T − 1 + b ( j ) f  ⊙ c ( j ) T − 1 + σ  W ( j ) xe [ . . . ] + W ( j ) he h ( j ) T − 1 + b ( j ) e  ⊙ tanh  W ( j ) xc [ . . . ] + W ( j ) hc h ( j ) T − 1 + b ( j ) c o · · · + W ( i ) ho h ( i ) T − 1 + b ( i ) o ⊙ tanh  σ  W ( i ) xf ( . . . ) + W ( i ) hf h ( i ) T − 1 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe ( . . . ) + W ( i ) he h ( i ) T − 1 + b ( i ) e  ⊙ tanh  W ( i ) xc ( . . . ) + W ( i ) hc h ( i ) T − 1 + b ( i ) c  = . . . = σ  W ( i ) xo . . . n σ  W ( j ) xo h ( j − 1) T + W ( j ) ho h ( j ) T − 1 + b ( j ) o  ⊙ tanh  σ  W ( j ) xf h ( j − 1) T + W ( j ) hf h ( j ) T − 1 + b ( j ) f  ⊙ c ( j ) T − 1 + σ  W ( j ) xe h ( j − 1) T + W ( j ) he h ( j ) T − 1 + b ( j ) e  ⊙ tanh  W ( j ) xc h ( j − 1) T + W ( j ) hc h ( j ) T − 1 + b ( j ) c o · · · + W ( i ) ho h ( i ) T − 1 + b ( i ) o ⊙ tanh  σ  W ( i ) xf ( . . . ) + W ( i ) hf h ( i ) T − 1 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe ( . . . ) + W ( i ) he h ( i ) T − 1 + b ( i ) e  ⊙ tanh  W ( i ) xc ( . . . 
) + W ( i ) hc h ( i ) T − 1 + b ( i ) c  = . . . = σ  W ( i ) xo h ( i − 1) T + W ( i ) ho h ( i ) T − 1 + b ( i ) o  ⊙ tanh  σ  W ( i ) xf h ( i − 1) T + W ( i ) hf h ( i ) T − 1 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe h ( i − 1) T + W ( i ) he h ( i ) T − 1 + b ( i ) e  ⊙ tanh  W ( i ) xc h ( i − 1) T + W ( i ) hc h ( i ) T − 1 + b ( i ) c  = h ( i ) T , where we u se the recurrent definition of th e hidden an d cell states in Equation s ( C.20 ) and ( C.2 1 ). In particular, we obtained for i = k that ˆ h { k } T + k − 1 = h ( k ) T . Ap plying the same steps as in th e d-RNN proo f in Eq. ( A.10 ), we ob tain ˆ y T + k − 1 = y T . Last, we obtain for the cell state that ˆ c { i } T + i − 1 = σ  ˆ W { i } xf x T + i − 1 + ˆ W { i } hf ˆ h T + i − 2 + ˆ b { i } f  ⊙ ˆ c { i } T + i − 2 + σ  ˆ W { i } xe x T + i − 1 + ˆ W { i } he ˆ h T + i − 2 + ˆ b { i } e  ⊙ tanh  ˆ W { i } xc x T + i − 1 + ˆ W { i } hc ˆ h T + i − 2 + ˆ b { i } c  = σ  W ( i ) xf ˆ h { i − 1 } T + i − 2 + W ( i ) hf ˆ h { i } T + i − 2 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe ˆ h { i − 1 } T + i − 2 + W ( i ) he ˆ h { i } T + i − 2 + b ( i ) e  ⊙ tanh  W ( i ) xc ˆ h { i − 1 } T + i − 2 + W ( i ) hc ˆ h { i } T + i − 2 + b ( i ) c  = σ  W ( i ) xf h ( i − 1) T + W ( i ) hf h ( i ) T − 1 + b ( i ) f  ⊙ c ( i ) T − 1 + σ  W ( i ) xe h ( i − 1) T + W ( i ) he h ( i ) T − 1 + b ( i ) e  ⊙ tanh  W ( i ) xc h ( i − 1) T + W ( i ) hc h ( i ) T − 1 + b ( i ) c  = c ( i ) T Which completes the proof.  Appr oximating Stacked and Bidirectional Recurr ent Architectures with the Del ayed Recurrent Neural Networ k D. W ei g ht Constraints and Connections in d-RNN Figure 6 shows the weight con straints imposed to achieve equiv alence b etween the stacked RNN and single - layer d- RNN, and a v isualization of the d-RNN as co nnection s in the stacked RNN. 
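The constrained-weight construction can also be checked numerically. The following is a minimal NumPy sketch (not code from the paper; it assumes standard LSTM gate equations with input gate $e$, forget gate $f$, output gate $o$, and candidate $c$, and zero initial states): it builds a random $k$-layer stacked LSTM, assembles the corresponding block-constrained single-layer d-LSTM, and verifies the state equivalence $\hat{h}^{\{i\}}_{t+i-1} = h^{(i)}_t$ of Theorem 2.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, q, T = 3, 4, 5, 6          # layers, units per layer, input dim, sequence length
gates = "ecof"                   # e: input gate, c: candidate, o: output, f: forget
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Random stacked-LSTM weights: layer 0 reads the q-dim input, layers >0 read n-dim states.
Wx = {a: [0.5 * rng.normal(size=(n, q if i == 0 else n)) for i in range(k)] for a in gates}
Wh = {a: [0.5 * rng.normal(size=(n, n)) for i in range(k)] for a in gates}
b  = {a: [0.1 * rng.normal(size=n) for i in range(k)] for a in gates}

def lstm_cell(x, h, c, wx, wh, bb):
    e = sigmoid(wx["e"] @ x + wh["e"] @ h + bb["e"])  # input gate
    f = sigmoid(wx["f"] @ x + wh["f"] @ h + bb["f"])  # forget gate
    o = sigmoid(wx["o"] @ x + wh["o"] @ h + bb["o"])  # output gate
    g = np.tanh(wx["c"] @ x + wh["c"] @ h + bb["c"])  # candidate
    c_new = f * c + e * g
    return o * np.tanh(c_new), c_new

xs = rng.normal(size=(T, q))

# Run the stacked LSTM with zero initial states.
H = np.zeros((k, T + 1, n)); C = np.zeros((k, T + 1, n))
for t in range(1, T + 1):
    inp = xs[t - 1]
    for i in range(k):
        wx = {a: Wx[a][i] for a in gates}; wh = {a: Wh[a][i] for a in gates}
        bb = {a: b[a][i] for a in gates}
        H[i, t], C[i, t] = lstm_cell(inp, H[i, t - 1], C[i, t - 1], wx, wh, bb)
        inp = H[i, t]

# Assemble the block-constrained d-LSTM of size k*n (Equations (C.22)-(C.23)):
# only block 1 reads the input; block i receives block i-1 through the recurrent matrix.
Wx_hat = {a: np.zeros((k * n, q)) for a in gates}
Wh_hat = {a: np.zeros((k * n, k * n)) for a in gates}
b_hat  = {a: np.concatenate([b[a][i] for i in range(k)]) for a in gates}
for a in gates:
    Wx_hat[a][:n] = Wx[a][0]
    for i in range(k):
        s = slice(i * n, (i + 1) * n)
        Wh_hat[a][s, s] = Wh[a][i]                       # within-block recurrence
        if i > 0:
            Wh_hat[a][s, (i - 1) * n: i * n] = Wx[a][i]  # previous block feeds block i

h_hat = np.zeros(k * n); c_hat = np.zeros(k * n)
H_hat = [h_hat.copy()]
for t in range(1, T + k):
    x = xs[t - 1] if t <= T else np.zeros(q)             # pad past the sequence end
    h_hat, c_hat = lstm_cell(x, h_hat, c_hat, Wx_hat, Wh_hat, b_hat)
    # Hold block i at its initial (zero) state until time i-1; this realizes the
    # theorem's initialization condition h_hat{i}_{i-1} = h(i)_0, c_hat{i}_{i-1} = c(i)_0.
    for bi in range(k):
        if t <= bi:
            h_hat[bi * n:(bi + 1) * n] = 0.0
            c_hat[bi * n:(bi + 1) * n] = 0.0
    H_hat.append(h_hat.copy())

# Block i of the d-LSTM reproduces layer i of the stack, delayed by i-1 timesteps.
for i in range(1, k + 1):
    for t in range(1, T + 1):
        assert np.allclose(H_hat[t + i - 1][(i - 1) * n: i * n], H[i - 1, t])
```

In particular, block $k$ at time $t+k-1$ matches the top stacked layer at time $t$, which is the delayed-output statement $\hat{y}_{t+k-1} = y_t$.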
Figure 6(b) depicts the delay (or "shift") of all the hidden states as they would be computed in the stacked RNN. Each layer is equivalent to a shift by one timestep.

E. Additional Plots for Error Maps

Figure 7 presents the standard deviation diagrams for the error maps in Figure 5.

F. Masked Character-Level Language Modeling: Additional Results

In Table 3, we include additional results for smaller networks on the masked language modeling task. We sampled more delay values for d-LSTMs, but the general conclusions remain the same: intermediate values of delay achieve the lowest BPC. Forward-pass runtimes across delay values show a small increase with larger delays, but the increment is relatively flat compared to stacked LSTMs or (stacked) Bi-LSTMs as they increase in depth. For these experiments, we also used a batch of 128 sequences and an embedding of dimension 10.

G. Part-of-Speech Tagging: Additional Details and Results

In this section, we include more details about the dataset and the results of all the combinations for the Part-of-Speech experiment. We used treebanks from Universal Dependencies (UD) (Nivre et al., 2016) version 2.3. We selected the English EWT treebank² (Silveira et al., 2014) (254,854 words), the French GSD treebank³ (411,465 words), and the German GSD treebank⁴ (297,836 words) based on the quality assigned by the UD authors. We follow the partitioning into training, validation, and test datasets as predefined in UD. All treebanks use the same POS tag set containing 17 tags. We use the Polyglot project (Al-Rfou' et al., 2013) word embeddings (64 dimensions). We build our own alphabets based on the 100 most frequent characters in the vocabularies. All the networks have a 100-dimensional character-level embedding, which is trained with the network. We use a batch size of 32 sentences.
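In practice, training a d-LSTM for tasks like these requires no specialized cell: one can run a standard LSTM, pad the input with $d$ extra steps, and realign the targets so that the output at step $t+d$ is scored against target $t$. The helper below is a hedged sketch (the function name and array layout are our own illustration, assuming NumPy arrays with time as the leading axis), not code from the paper:

```python
import numpy as np

def delay_targets(inputs, targets, d, pad_value=0.0):
    """Align a length-T sequence task for a delayed RNN: pad the inputs with d
    trailing steps and shift the targets so the model's output at step t+d is
    scored against target t. Returns (padded inputs, shifted targets, loss mask)."""
    T = inputs.shape[0]
    pad = np.full((d,) + inputs.shape[1:], pad_value)
    padded_in = np.concatenate([inputs, pad])
    # The first d outputs are warm-up steps and receive no loss.
    loss_mask = np.concatenate([np.zeros(d, dtype=bool), np.ones(T, dtype=bool)])
    shifted_tg = np.concatenate(
        [np.zeros((d,) + targets.shape[1:], dtype=targets.dtype), targets])
    return padded_in, shifted_tg, loss_mask
```

With $d = k-1$ this reproduces the output alignment of Theorem 2, $\hat{y}_{t+d} = y_t$; varying $d$ independently of depth is what the delay sweeps in Tables 3–6 explore.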
² https://github.com/UniversalDependencies/UD_English-EWT/tree/r2.3
³ https://github.com/UniversalDependencies/UD_French-GSD/tree/r2.3
⁴ https://github.com/UniversalDependencies/UD_German-GSD/tree/r2.3

Results for German, English, and French can be found in Tables 4, 5, and 6, respectively. The best result that does not use a bidirectional network is marked in bold for each language.

Figure 6. (a) Weights of the single-layer, weight-constrained d-RNN that are equivalent to connections in the stacked RNN from Figure 2. (b) Connections in the d-RNN based on the weight matrix in (a). The d-RNN is depicted as it would be in the stacked RNN. The hidden states are delayed in time with respect to the stacked network.

Table 3. Performance for smaller networks on the masked character-level language modeling task. Mean and standard deviation values are computed over 5 repetitions of training and inference runtime on the test set.

| Model   | Layers | Delay | Units | Params.   | Val. BPC      | Test BPC      | Runtime         |
|---------|--------|-------|-------|-----------|---------------|---------------|-----------------|
| LSTM    | 1      | –     | 512   | 1,087,283 | 2.139 ± 0.005 | 2.195 ± 0.002 | 2.85 ms ± 0.14  |
| LSTM    | 2      | –     | 298   | 1,090,689 | 2.156 ± 0.003 | 2.215 ± 0.002 | 6.69 ms ± 0.27  |
| LSTM    | 5      | –     | 172   | 1,083,735 | 2.199 ± 0.016 | 2.255 ± 0.015 | 11.32 ms ± 0.05 |
| Bi-LSTM | 1      | –     | 360   | 1,091,107 | 1.130 ± 0.003 | 1.187 ± 0.004 | 5.82 ms ± 0.18  |
| Bi-LSTM | 2      | –     | 182   | 1,090,487 | 0.800 ± 0.004 | 0.846 ± 0.005 | 11.08 ms ± 0.59 |
| Bi-LSTM | 5      | –     | 102   | 1,104,151 | 0.796 ± 0.007 | 0.841 ± 0.006 | 23.94 ms ± 0.17 |
| d-LSTM  | 1      | 1     | 512   | 1,087,283 | 1.470 ± 0.002 | 1.518 ± 0.003 | 2.80 ms ± 0.02  |
| d-LSTM  | 1      | 2     | 512   | 1,087,283 | 1.162 ± 0.004 | 1.208 ± 0.003 | 2.81 ms ± 0.01  |
| d-LSTM  | 1      | 3     | 512   | 1,087,283 | 0.995 ± 0.002 | 1.039 ± 0.002 | 3.02 ms ± 0.23  |
| d-LSTM  | 1      | 5     | 512   | 1,087,283 | 0.877 ± 0.001 | 0.920 ± 0.003 | 3.01 ms ± 0.22  |
| d-LSTM  | 1      | 8     | 512   | 1,087,283 | 0.859 ± 0.002 | 0.905 ± 0.003 | 3.04 ms ± 0.19  |
| d-LSTM  | 1      | 10    | 512   | 1,087,283 | 0.889 ± 0.004 | 0.935 ± 0.005 | 3.22 ms ± 0.18  |
| d-LSTM  | 1      | 15    | 512   | 1,087,283 | 0.971 ± 0.004 | 1.014 ± 0.002 | 3.17 ms ± 0.05  |

[Figure 7: four pairs of error-map panels for (a) LSTM, (b) Bi-LSTM, (c) d-LSTM with delay=5, and (d) d-LSTM with delay=10, each plotting filter acausality a against scale γ.]

Figure 7. Error maps presented in Figure 4 (left column) together with their standard deviation figures.

Table 4. Part-of-Speech results for German. The table shows all possible combinations of delays or bidirectional LSTM networks. The best forward-only network is marked in bold.

| Character-level network | Word-level network | Validation accuracy | Test accuracy |
|-------------------------|--------------------|---------------------|---------------|
| Bi-LSTM                 | Bi-LSTM            | 93.88 ± 0.13        | 93.15 ± 0.08  |
| Bi-LSTM                 | LSTM               | 92.00 ± 0.16        | 91.50 ± 0.05  |
| Bi-LSTM                 | d-LSTM (delay=1)   | 93.32 ± 0.23        | 92.81 ± 0.14  |
| Bi-LSTM                 | d-LSTM (delay=2)   | 93.15 ± 0.06        | 92.67 ± 0.08  |
| Bi-LSTM                 | d-LSTM (delay=3)   | 92.82 ± 0.14        | 92.25 ± 0.16  |
| Bi-LSTM                 | d-LSTM (delay=4)   | 92.41 ± 0.12        | 91.95 ± 0.17  |
| Bi-LSTM                 | d-LSTM (delay=5)   | 91.86 ± 0.11        | 91.57 ± 0.20  |
| LSTM                    | Bi-LSTM            | 93.96 ± 0.12        | 93.43 ± 0.07  |
| LSTM                    | LSTM               | 92.05 ± 0.16        | 91.58 ± 0.11  |
| LSTM                    | d-LSTM (delay=1)   | 93.46 ± 0.16        | 92.71 ± 0.11  |
| LSTM                    | d-LSTM (delay=2)   | 93.13 ± 0.10        | 92.61 ± 0.26  |
| LSTM                    | d-LSTM (delay=3)   | 92.91 ± 0.13        | 92.38 ± 0.15  |
| LSTM                    | d-LSTM (delay=4)   | 92.56 ± 0.17        | 92.06 ± 0.19  |
| d-LSTM (delay=1)        | Bi-LSTM            | 93.93 ± 0.06        | 93.39 ± 0.18  |
| d-LSTM (delay=1)        | LSTM               | 92.04 ± 0.11        | 91.58 ± 0.14  |
| d-LSTM (delay=1)        | d-LSTM (delay=1)   | 93.48 ± 0.31        | 92.87 ± 0.24  |
| d-LSTM (delay=1)        | d-LSTM (delay=2)   | 93.11 ± 0.18        | 92.54 ± 0.08  |
| d-LSTM (delay=1)        | d-LSTM (delay=3)   | 92.85 ± 0.14        | 92.28 ± 0.19  |
| d-LSTM (delay=1)        | d-LSTM (delay=4)   | 92.50 ± 0.12        | 92.11 ± 0.19  |
| d-LSTM (delay=3)        | Bi-LSTM            | 94.00 ± 0.17        | 93.32 ± 0.18  |
| d-LSTM (delay=3)        | LSTM               | 92.10 ± 0.24        | 91.61 ± 0.18  |
| d-LSTM (delay=3)        | d-LSTM (delay=1)   | 93.29 ± 0.09        | 92.68 ± 0.09  |
| d-LSTM (delay=3)        | d-LSTM (delay=2)   | 93.09 ± 0.21        | 92.59 ± 0.16  |
| d-LSTM (delay=3)        | d-LSTM (delay=3)   | 92.86 ± 0.24        | 92.42 ± 0.16  |
| d-LSTM (delay=3)        | d-LSTM (delay=4)   | 92.53 ± 0.17        | 92.08 ± 0.18  |
| d-LSTM (delay=5)        | Bi-LSTM            | 93.88 ± 0.17        | 93.27 ± 0.06  |
| d-LSTM (delay=5)        | LSTM               | 91.88 ± 0.18        | 91.54 ± 0.11  |
| d-LSTM (delay=5)        | d-LSTM (delay=1)   | 93.31 ± 0.14        | 92.74 ± 0.10  |
| d-LSTM (delay=5)        | d-LSTM (delay=2)   | 93.17 ± 0.13        | 92.57 ± 0.17  |
| d-LSTM (delay=5)        | d-LSTM (delay=3)   | 92.84 ± 0.19        | 92.25 ± 0.10  |
| d-LSTM (delay=5)        | d-LSTM (delay=4)   | 92.50 ± 0.22        | 91.96 ± 0.19  |

Table 5. Part-of-Speech results for English. The table shows all possible combinations of delays or bidirectional LSTM networks. The best forward-only network is marked in bold.

| Character-level network | Word-level network | Validation accuracy | Test accuracy |
|-------------------------|--------------------|---------------------|---------------|
| Bi-LSTM                 | Bi-LSTM            | 94.85 ± 0.05        | 94.84 ± 0.08  |
| Bi-LSTM                 | LSTM               | 91.90 ± 0.12        | 92.05 ± 0.09  |
| Bi-LSTM                 | d-LSTM (delay=1)   | 94.47 ± 0.06        | 94.41 ± 0.05  |
| Bi-LSTM                 | d-LSTM (delay=2)   | 94.17 ± 0.13        | 94.14 ± 0.10  |
| Bi-LSTM                 | d-LSTM (delay=3)   | 93.70 ± 0.07        | 93.87 ± 0.07  |
| Bi-LSTM                 | d-LSTM (delay=4)   | 93.11 ± 0.14        | 93.26 ± 0.08  |
| Bi-LSTM                 | d-LSTM (delay=5)   | 92.54 ± 0.16        | 92.70 ± 0.10  |
| LSTM                    | Bi-LSTM            | 95.03 ± 0.14        | 94.99 ± 0.15  |
| LSTM                    | LSTM               | 92.05 ± 0.13        | 92.14 ± 0.10  |
| LSTM                    | d-LSTM (delay=1)   | 94.53 ± 0.08        | 94.58 ± 0.11  |
| LSTM                    | d-LSTM (delay=2)   | 94.29 ± 0.05        | 94.28 ± 0.05  |
| LSTM                    | d-LSTM (delay=3)   | 93.81 ± 0.11        | 93.85 ± 0.12  |
| LSTM                    | d-LSTM (delay=4)   | 93.39 ± 0.12        | 93.55 ± 0.10  |
| d-LSTM (delay=1)        | Bi-LSTM            | 94.94 ± 0.07        | 94.95 ± 0.06  |
| d-LSTM (delay=1)        | LSTM               | 91.96 ± 0.16        | 92.09 ± 0.10  |
| d-LSTM (delay=1)        | d-LSTM (delay=1)   | 94.57 ± 0.08        | 94.57 ± 0.14  |
| d-LSTM (delay=1)        | d-LSTM (delay=2)   | 94.29 ± 0.12        | 94.37 ± 0.08  |
| d-LSTM (delay=1)        | d-LSTM (delay=3)   | 93.86 ± 0.05        | 93.84 ± 0.10  |
| d-LSTM (delay=1)        | d-LSTM (delay=4)   | 93.35 ± 0.10        | 93.56 ± 0.13  |
| d-LSTM (delay=3)        | Bi-LSTM            | 94.98 ± 0.09        | 94.91 ± 0.10  |
| d-LSTM (delay=3)        | LSTM               | 91.96 ± 0.08        | 92.08 ± 0.10  |
| d-LSTM (delay=3)        | d-LSTM (delay=1)   | 94.47 ± 0.03        | 94.51 ± 0.10  |
| d-LSTM (delay=3)        | d-LSTM (delay=2)   | 94.21 ± 0.05        | 94.18 ± 0.03  |
| d-LSTM (delay=3)        | d-LSTM (delay=3)   | 93.80 ± 0.13        | 93.88 ± 0.13  |
| d-LSTM (delay=3)        | d-LSTM (delay=4)   | 93.23 ± 0.13        | 93.38 ± 0.11  |
| d-LSTM (delay=5)        | Bi-LSTM            | 94.90 ± 0.07        | 94.87 ± 0.09  |
| d-LSTM (delay=5)        | LSTM               | 91.84 ± 0.11        | 91.98 ± 0.20  |
| d-LSTM (delay=5)        | d-LSTM (delay=1)   | 94.36 ± 0.09        | 94.44 ± 0.08  |
| d-LSTM (delay=5)        | d-LSTM (delay=2)   | 94.05 ± 0.07        | 94.19 ± 0.05  |
| d-LSTM (delay=5)        | d-LSTM (delay=3)   | 93.61 ± 0.07        | 93.76 ± 0.05  |
| d-LSTM (delay=5)        | d-LSTM (delay=4)   | 93.14 ± 0.04        | 93.27 ± 0.12  |

Table 6. Part-of-Speech results for French. The table shows all possible combinations of delays or bidirectional LSTM networks. The best forward-only network is marked in bold.

| Character-level network | Word-level network | Validation accuracy | Test accuracy |
|-------------------------|--------------------|---------------------|---------------|
| Bi-LSTM                 | Bi-LSTM            | 97.63 ± 0.06        | 97.22 ± 0.11  |
| Bi-LSTM                 | LSTM               | 96.67 ± 0.05        | 96.15 ± 0.17  |
| Bi-LSTM                 | d-LSTM (delay=1)   | 97.48 ± 0.02        | 96.98 ± 0.05  |
| Bi-LSTM                 | d-LSTM (delay=2)   | 97.41 ± 0.02        | 96.91 ± 0.12  |
| Bi-LSTM                 | d-LSTM (delay=3)   | 97.31 ± 0.05        | 96.84 ± 0.09  |
| Bi-LSTM                 | d-LSTM (delay=4)   | 97.12 ± 0.05        | 96.61 ± 0.06  |
| Bi-LSTM                 | d-LSTM (delay=5)   | 96.88 ± 0.10        | 96.20 ± 0.14  |
| LSTM                    | Bi-LSTM            | 97.70 ± 0.07        | 97.19 ± 0.09  |
| LSTM                    | LSTM               | 96.67 ± 0.07        | 96.10 ± 0.11  |
| LSTM                    | d-LSTM (delay=1)   | 97.49 ± 0.07        | 97.03 ± 0.07  |
| LSTM                    | d-LSTM (delay=2)   | 97.49 ± 0.05        | 97.00 ± 0.06  |
| LSTM                    | d-LSTM (delay=3)   | 97.34 ± 0.04        | 96.89 ± 0.09  |
| LSTM                    | d-LSTM (delay=4)   | 97.16 ± 0.06        | 96.66 ± 0.15  |
| d-LSTM (delay=1)        | Bi-LSTM            | 97.67 ± 0.07        | 97.23 ± 0.12  |
| d-LSTM (delay=1)        | LSTM               | 96.66 ± 0.06        | 95.97 ± 0.07  |
| d-LSTM (delay=1)        | d-LSTM (delay=1)   | 97.49 ± 0.04        | 97.04 ± 0.13  |
| d-LSTM (delay=1)        | d-LSTM (delay=2)   | 97.43 ± 0.05        | 96.98 ± 0.05  |
| d-LSTM (delay=1)        | d-LSTM (delay=3)   | 97.36 ± 0.08        | 96.80 ± 0.10  |
| d-LSTM (delay=1)        | d-LSTM (delay=4)   | 97.22 ± 0.06        | 96.57 ± 0.10  |
| d-LSTM (delay=3)        | Bi-LSTM            | 97.67 ± 0.08        | 97.21 ± 0.08  |
| d-LSTM (delay=3)        | LSTM               | 96.67 ± 0.07        | 95.98 ± 0.14  |
| d-LSTM (delay=3)        | d-LSTM (delay=1)   | 97.52 ± 0.04        | 97.02 ± 0.09  |
| d-LSTM (delay=3)        | d-LSTM (delay=2)   | 97.44 ± 0.02        | 96.97 ± 0.12  |
| d-LSTM (delay=3)        | d-LSTM (delay=3)   | 97.28 ± 0.04        | 96.74 ± 0.07  |
| d-LSTM (delay=3)        | d-LSTM (delay=4)   | 97.13 ± 0.05        | 96.57 ± 0.09  |
| d-LSTM (delay=5)        | Bi-LSTM            | 97.61 ± 0.03        | 97.12 ± 0.06  |
| d-LSTM (delay=5)        | LSTM               | 96.64 ± 0.06        | 96.08 ± 0.08  |
| d-LSTM (delay=5)        | d-LSTM (delay=1)   | 97.46 ± 0.02        | 96.96 ± 0.13  |
| d-LSTM (delay=5)        | d-LSTM (delay=2)   | 97.41 ± 0.06        | 96.87 ± 0.06  |
| d-LSTM (delay=5)        | d-LSTM (delay=3)   | 97.36 ± 0.05        | 96.82 ± 0.07  |
| d-LSTM (delay=5)        | d-LSTM (delay=4)   | 97.15 ± 0.05        | 96.51 ± 0.07  |
