Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model


Authors: Kyunghyun Cho (New York University, kyunghyun.cho@nyu.edu)

Abstract

Recent advances in conditional recurrent language modelling have mainly focused on network architectures (e.g., attention mechanism), learning algorithms (e.g., scheduled sampling and sequence-level training) and novel applications (e.g., image/video description generation, speech recognition, etc.). On the other hand, we notice that decoding algorithms/strategies have not been investigated as much, and it has become standard to use greedy or beam search. In this paper, we propose a novel decoding strategy motivated by an earlier observation that nonlinear hidden layers of a deep neural network stretch the data manifold. The proposed strategy is embarrassingly parallelizable without any communication overhead, while improving an existing decoding algorithm. We extensively evaluate it with attention-based neural machine translation on the task of En→Cz translation.

1 Introduction

Since its first use as a language model in 2010 [19], a recurrent neural network has become a de facto choice for implementing a language model [28, 25]. One of the appealing properties of this approach to language modelling, to which we refer as recurrent language modelling, is that a recurrent language model can generate a long, coherent sentence [26]. This is due to the ability of a recurrent neural network to capture long-term dependencies. This property has come under the spotlight in recent years as the conditional version of a recurrent language model began to be used in many different problems that require generating a natural language description of a high-dimensional, complex input. These tasks include machine translation, speech recognition, image/video description generation and many more (see [9] and references therein).
Many of the recent advances in conditional recurrent language models have focused either on network architectures (e.g., [1]), learning algorithms (e.g., [4, 22, 2]) or novel applications (see [9] and references therein). On the other hand, we notice that there has not been much research on decoding algorithms for conditional recurrent language models. In most work using recurrent language models, it is common practice to use either greedy or beam search to find the most likely natural language description given an input.

In this paper, we investigate whether it is possible to decode better from a conditional recurrent language model. More specifically, we propose a decoding strategy motivated by earlier observations that nonlinear hidden layers of a deep neural network stretch the data manifold such that a neighbourhood in the hidden state space corresponds to a set of semantically similar configurations in the input space [6]. This observation is exploited in the proposed strategy by injecting noise into the hidden transition function of a recurrent language model.

The proposed strategy, called noisy parallel approximate decoding (NPAD), is a meta-algorithm that runs in parallel many chains of a noisy version of an inner decoding algorithm, such as greedy or beam search. Once those parallel chains generate candidates, NPAD selects the one with the highest score. As there is effectively no communication overhead during decoding, the wall-clock performance of NPAD in a distributed setting is comparable to a single run of the inner decoding algorithm, while it improves the inner decoding algorithm's performance.

We empirically evaluate the proposed NPAD against greedy search, beam search as well as stochastic sampling and diverse decoding [16] in attention-based neural machine translation.
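The NPAD meta-algorithm described above can be sketched in a few lines. The following is a minimal, illustrative Python version, not the paper's implementation: the toy SCORES table, the symbols, and the helper names (noisy_greedy, true_score, npad) are all inventions of this sketch, and per-step Gaussian score perturbation stands in for noise injected into the hidden transition function.

```python
import random

# Toy "model": log-scores for the next symbol given the current one.
# (The table, symbols and score values are made up for illustration.)
SCORES = {
    "a": {"b": 0.0, "c": -0.1},
    "b": {"c": -0.5, "<eos>": -1.0},
    "c": {"b": -0.2, "<eos>": -0.3},
}

def noisy_greedy(noise_std, rng, start="a", max_len=5):
    """Greedy search whose per-step scores are perturbed by Gaussian noise,
    standing in for noise injected into the hidden transition."""
    seq, sym = [start], start
    for _ in range(max_len):
        nxt = max(SCORES[sym],
                  key=lambda s: SCORES[sym][s] + rng.gauss(0.0, noise_std))
        if nxt == "<eos>":
            break
        seq.append(nxt)
        sym = nxt
    return tuple(seq)

def true_score(seq):
    """Noise-free score used to rank the finished candidates."""
    return sum(SCORES[a][b] for a, b in zip(seq, seq[1:]))

def npad(inner_decode, score, n_chains=8, noise_std=0.3, seed=0):
    """Run many noisy chains of the inner decoder; keep the best candidate.
    In practice each chain runs on its own worker with no communication;
    here the chains are simulated sequentially."""
    master = random.Random(seed)
    candidates = [inner_decode(noise_std, random.Random(master.random()))
                  for _ in range(n_chains)]
    return max(candidates, key=score)
```

With zero noise every chain reproduces the single greedy trajectory; with noise, different chains explore different candidates, and taking the maximum of the noise-free score over chains can beat the one noiseless greedy trajectory, which is the intuition behind NPAD.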
2 Conditional Recurrent Language Model

A language model aims at modelling a probability distribution over natural language text. A recurrent language model is a language model implemented as a recurrent neural network [18]. Let us define the probability of a given natural language sentence, which we represent as a sequence of linguistic symbols X = (x_1, x_2, ..., x_T), as

p(X) = p(x_1, x_2, ..., x_T) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) · · · p(x_T | x_1, ..., x_{T-1}).
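To make this factorization concrete, here is a self-contained toy recurrent language model that computes log p(X) as the sum of per-step conditional log-probabilities, with the hidden state summarising the prefix x_1, ..., x_{t-1}. The weight values and names below are hand-picked assumptions of this sketch, not a trained model.

```python
import math

# Toy recurrent language model with hand-picked weights (illustrative
# values only; a real model would learn these).  Vocabulary: {0, 1}.
V, H = 2, 3
W_xh = [[0.5, -0.3, 0.1],
        [-0.2, 0.4, 0.3]]        # input symbol -> hidden
W_hh = [[0.1, -0.2, 0.0],
        [0.3, 0.1, -0.1],
        [0.0, 0.2, 0.1]]         # hidden -> hidden
W_hy = [[0.7, -0.5],
        [0.2, 0.6],
        [-0.4, 0.1]]             # hidden -> output logits

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def step(h, x):
    """Hidden transition h_t = tanh(W_xh[x] + W_hh^T h_{t-1})."""
    return [math.tanh(W_xh[x][j] + sum(W_hh[i][j] * h[i] for i in range(H)))
            for j in range(H)]

def log_prob(seq):
    """log p(X) = sum_t log p(x_t | x_1, ..., x_{t-1})."""
    h = [0.0] * H                # h_0 summarises the empty prefix
    lp = 0.0
    for x in seq:
        logits = [sum(W_hy[i][k] * h[i] for i in range(H)) for k in range(V)]
        lp += math.log(softmax(logits)[x])
        h = step(h, x)           # absorb x_t before predicting x_{t+1}
    return lp
```

Because each per-step softmax sums to one, the chain-rule products over all sequences of a fixed length sum to one, which is a quick sanity check that the factorization defines a valid distribution over those sequences.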