Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model


Authors: Kyunghyun Cho (New York University, kyunghyun.cho@nyu.edu)

Abstract

Recent advances in conditional recurrent language modelling have mainly focused on network architectures (e.g., attention mechanism), learning algorithms (e.g., scheduled sampling and sequence-level training) and novel applications (e.g., image/video description generation, speech recognition, etc.). On the other hand, we notice that decoding algorithms/strategies have not been investigated as much, and it has become standard to use greedy or beam search. In this paper, we propose a novel decoding strategy motivated by an earlier observation that nonlinear hidden layers of a deep neural network stretch the data manifold. The proposed strategy is embarrassingly parallelizable without any communication overhead, while improving an existing decoding algorithm. We extensively evaluate it with attention-based neural machine translation on the task of En→Cz translation.

1 Introduction

Since its first use as a language model in 2010 [19], a recurrent neural network has become a de facto choice for implementing a language model [28, 25]. One of the appealing properties of this approach to language modelling, to which we refer as recurrent language modelling, is that a recurrent language model can generate a long, coherent sentence [26]. This is due to the ability of a recurrent neural network to capture long-term dependencies. This property has come under the spotlight in recent years as the conditional version of a recurrent language model began to be used in many different problems that require generating a natural language description of a high-dimensional, complex input. These tasks include machine translation, speech recognition, image/video description generation and many more (see [9] and references therein).
Many of the recent advances in conditional recurrent language models have focused either on network architectures (e.g., [1]), learning algorithms (e.g., [4, 22, 2]) or novel applications (see [9] and references therein). On the other hand, we notice that there has not been much research on decoding algorithms for conditional recurrent language models. In most work using recurrent language models, it is common practice to use either greedy or beam search to find the most likely natural language description given an input.

In this paper, we investigate whether it is possible to decode better from a conditional recurrent language model. More specifically, we propose a decoding strategy motivated by earlier observations that nonlinear hidden layers of a deep neural network stretch the data manifold such that a neighbourhood in the hidden state space corresponds to a set of semantically similar configurations in the input space [6]. This observation is exploited in the proposed strategy by injecting noise into the hidden transition function of a recurrent language model.

The proposed strategy, called noisy parallel approximate decoding (NPAD), is a meta-algorithm that runs in parallel many chains of a noisy version of an inner decoding algorithm, such as greedy or beam search. Once those parallel chains generate candidates, NPAD selects the one with the highest score. As there is effectively no communication overhead during decoding, the wall-clock performance of NPAD in a distributed setting is comparable to a single run of the inner decoding algorithm, while it improves the inner decoding algorithm's performance.

We empirically evaluate the proposed NPAD against greedy search, beam search as well as stochastic sampling and diverse decoding [16] in attention-based neural machine translation.
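The NPAD meta-algorithm described above can be sketched in a few lines. The following is a minimal, illustrative Python version, not the paper's implementation: the toy SCORES table, the symbols, and the helper names (noisy_greedy, true_score, npad) are all inventions of this sketch, and per-step Gaussian score perturbation stands in for noise injected into the hidden transition function.

```python
import random

# Toy "model": log-scores for the next symbol given the current one.
# (The table, symbols and score values are made up for illustration.)
SCORES = {
    "a": {"b": 0.0, "c": -0.1},
    "b": {"c": -0.5, "<eos>": -1.0},
    "c": {"b": -0.2, "<eos>": -0.3},
}

def noisy_greedy(noise_std, rng, start="a", max_len=5):
    """Greedy search whose per-step scores are perturbed by Gaussian noise,
    standing in for noise injected into the hidden transition."""
    seq, sym = [start], start
    for _ in range(max_len):
        nxt = max(SCORES[sym],
                  key=lambda s: SCORES[sym][s] + rng.gauss(0.0, noise_std))
        if nxt == "<eos>":
            break
        seq.append(nxt)
        sym = nxt
    return tuple(seq)

def true_score(seq):
    """Noise-free score used to rank the finished candidates."""
    return sum(SCORES[a][b] for a, b in zip(seq, seq[1:]))

def npad(inner_decode, score, n_chains=8, noise_std=0.3, seed=0):
    """Run many noisy chains of the inner decoder; keep the best candidate.
    In practice each chain runs on its own worker with no communication;
    here the chains are simulated sequentially."""
    master = random.Random(seed)
    candidates = [inner_decode(noise_std, random.Random(master.random()))
                  for _ in range(n_chains)]
    return max(candidates, key=score)
```

With zero noise every chain reproduces the single greedy trajectory; with noise, different chains explore different candidates, and taking the maximum of the noise-free score over chains can beat the one noiseless greedy trajectory, which is the intuition behind NPAD.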
2 Conditional Recurrent Language Model

A language model aims at modelling a probability distribution over natural language text. A recurrent language model is a language model implemented as a recurrent neural network [18]. Let us define the probability of a given natural language sentence, which we represent as a sequence of linguistic symbols X = (x_1, x_2, ..., x_T), as

p(X) = p(x_1, x_2, ..., x_T) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) · · · p(x_T | x_1, ..., x_{T-1}).
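To make this factorization concrete, here is a self-contained toy recurrent language model that computes log p(X) as the sum of per-step conditional log-probabilities, with the hidden state summarising the prefix x_1, ..., x_{t-1}. The weight values and names below are hand-picked assumptions of this sketch, not a trained model.

```python
import math

# Toy recurrent language model with hand-picked weights (illustrative
# values only; a real model would learn these).  Vocabulary: {0, 1}.
V, H = 2, 3
W_xh = [[0.5, -0.3, 0.1],
        [-0.2, 0.4, 0.3]]        # input symbol -> hidden
W_hh = [[0.1, -0.2, 0.0],
        [0.3, 0.1, -0.1],
        [0.0, 0.2, 0.1]]         # hidden -> hidden
W_hy = [[0.7, -0.5],
        [0.2, 0.6],
        [-0.4, 0.1]]             # hidden -> output logits

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def step(h, x):
    """Hidden transition h_t = tanh(W_xh[x] + W_hh^T h_{t-1})."""
    return [math.tanh(W_xh[x][j] + sum(W_hh[i][j] * h[i] for i in range(H)))
            for j in range(H)]

def log_prob(seq):
    """log p(X) = sum_t log p(x_t | x_1, ..., x_{t-1})."""
    h = [0.0] * H                # h_0 summarises the empty prefix
    lp = 0.0
    for x in seq:
        logits = [sum(W_hy[i][k] * h[i] for i in range(H)) for k in range(V)]
        lp += math.log(softmax(logits)[x])
        h = step(h, x)           # absorb x_t before predicting x_{t+1}
    return lp
```

Because each per-step softmax sums to one, the chain-rule products over all sequences of a fixed length sum to one, which is a quick sanity check that the factorization defines a valid distribution over those sequences.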