Deep Learning with Dynamic Computation Graphs


Authors: Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, Peter Norvig

Published as a conference paper at ICLR 2017

DEEP LEARNING WITH DYNAMIC COMPUTATION GRAPHS

Moshe Looks, Marcello Herreshoff, DeLesley Hutchins & Peter Norvig
Google Inc.
{madscience,marcelloh,delesley,pnorvig}@google.com

ABSTRACT

Neural networks that compute over graph structures are a natural fit for problems in a variety of domains, including natural language (parse trees) and cheminformatics (molecular graphs). However, since the computation graph has a different shape and size for every input, such networks do not directly support batched training or inference. They are also difficult to implement in popular deep learning libraries, which are based on static data-flow graphs. We introduce a technique called dynamic batching, which not only batches together operations between different input graphs of dissimilar shape, but also between different nodes within a single input graph. The technique allows us to create static graphs, using popular libraries, that emulate dynamic computation graphs of arbitrary shape and size. We further present a high-level library¹ of compositional blocks that simplifies the creation of dynamic graph models. Using the library, we demonstrate concise and batch-wise parallel implementations for a variety of models from the literature.

1 INTRODUCTION

Training deep neural networks directly on minimally pre-processed corpora has led to many recent performance breakthroughs, mainly on problems in domains such as vision (Krizhevsky et al., 2012) and natural language (Bahdanau et al., 2015) where the inputs can be cast as dense n-dimensional arrays (henceforth tensors), or sequences of tensors. These successes exploit the effectiveness of training via gradient descent on mini-batches of tens to hundreds of inputs, implemented using the parallel SIMD capabilities of modern GPUs (Oh & Jung, 2004) and multi-core CPUs (Vanhoucke et al., 2011).
This, in turn, has led to a proliferation of libraries that make it easier to train and deploy such models by expressing them in terms of differentiable data-flow graphs over tensors (Abadi et al., 2016; Theano Development Team, 2016; Collobert et al., 2011). However, there is also a long history of neural networks that compute over structures such as parse trees (Pollack, 1990), logical terms (Goller & Kuchler, 1996), and molecular graphs (Bianucci et al., 2000). In these models, each distinct input has a different computation graph structure; we say that they use dynamic computation graphs (DCGs). Such models continue to be developed and have recently yielded superior results on problems such as sentiment classification and semantic relatedness (Tai et al., 2015; Li et al., 2015), question-answering (Andreas et al., 2016), and screening of chemical compounds (Kearnes et al., 2016). Despite these successes, most practitioners avoid DCGs for implementation reasons. For example, Bowman et al. (2016) assert that "because TreeRNNs use a different model structure for each sentence ... efficient batching is impossible in standard implementations". Moreover, even if efficient batching were possible in principle, current libraries such as TensorFlow (Abadi et al., 2016) assume that the data-flow graph is static (i.e. the same for each input) and impose a significant cost on graph construction, which makes it infeasible to build a new graph for each input.

Section 2 introduces dynamic batching, which enables efficient batching for training and inference with DCGs. Dynamic batching runs DCGs efficiently with existing libraries that only support static data-flow graphs; e.g. the same static graph can run a TreeRNN over any parse tree. We present empirical results for our implementation in TensorFlow. Section 3 presents a combinator library for concisely implementing models with DCGs using dynamic batching. Section 4 concludes.
¹ The library is called TensorFlow Fold and lives at http://github.com/tensorflow/fold.

2 DYNAMIC BATCHING

In deep learning libraries like TensorFlow, computations are manually batched. The computation is expressed as a static graph of mathematical operations, such as y = σ(x · w + c), which are polymorphic in batch size; an input x of dimensions (b, n) will yield an output of dimensions (b, m), where b is the batch size. With DCGs, the graph of operations is not static, but is assumed to be different for every input, so multiple inputs no longer naturally batch together in the same way. The dynamic batching algorithm overcomes this difficulty. Given a set of computation graphs as input, each of which has a different size and topology, it will rewrite the graphs by batching together all instances of the same operation that occur at the same depth in the graph. The rewriting process inserts additional concat and gather operations to move data between the batched operations; the indices to gather encode the topology of the original input graphs.

We distinguish between individual operations appearing as nodes in the underlying data-flow graph, such as addition or matrix-multiply, and small sub-graphs that conceptually act as functions over tensors, such as a feed-forward layer or LSTM cell. We refer to the former as "ops", and to the latter as "operations." Operations (i.e. sub-graphs) form the building blocks from which neural networks with DCGs are composed; dynamic batching schedules operations, not ops. Our algorithm requires that all operations which might be used be specified in advance, and it enumerates them for scheduling purposes. For example, a binary TreeRNN for NLP parse trees has two operations: embedding table lookups for words at the leaves of the tree, and RNN cells for the non-terminals. The inputs and outputs of operations have tensor types.
Each input or output may have a different type, but all types must be fixed and fully specified in advance. A tensor type consists of a shape, x_1, ..., x_n, together with a scalar data type (e.g. float32). The inputs to an operation shall be tensors of dimension (b, x_1, ..., x_n), where b is the batch size and x_1, ..., x_n is the shape of the corresponding input tensor type. The outputs must all be tensors of dimension (b, y_1, ..., y_m), where y_1, ..., y_m is the shape of the corresponding output tensor type. Operations must be polymorphic with respect to the batch size, because the batch size will change each time the operation is invoked, depending on the topologies of the input graphs. However, their tensor types are fixed, so that it is possible to assign a known tensor type to each edge in the input computation graph.

The dynamic batching algorithm takes a directed acyclic computation graph as input. A batch of multiple input graphs can be treated as a single disconnected graph. Source nodes are constant tensors, and non-source nodes are operations. Edges connect one of the outputs of a node to one of the inputs of another node. Scheduling is performed using a greedy algorithm:

• Assign a depth to each node in the graph. Nodes with no dependencies (constants) are assigned depth zero. Nodes with only dependencies of depth zero are assigned depth one, nodes whose dependencies have a maximum depth of one get assigned depth two, etc.

• Insert pass-through (identity) operations so that an operation at depth d + 1 only refers to results at depth d.

• Batch together all nodes invoking the same operation at the same depth into a single node.

• Concatenate all outputs which have the same depth and tensor type. The order of concatenation corresponds to the order in which the dynamic batching operations were enumerated.
• Assign a label (d, t, i) to each edge in the original graph, where d is the depth, t is the tensor type, and i is the integer index for that edge into the (concatenated) outputs for d, t. The schedule for the graph consists of the indices i for all edges, which are grouped together by depth and operation.

In our TensorFlow implementation, each dynamic operation is instantiated once in the static data-flow graph. The inputs to each operation are tf.gather ops, and the outputs are fed into tf.concat ops, as described above. These TensorFlow ops are then placed within a tf.while_loop. Each iteration of the loop will evaluate all of the operations at a particular depth. The loop maintains state variables for each tensor type t, and feeds the output of concat for tensor type t and iteration d into the input of the gathers at tensor type t and iteration d + 1. The indices for gather at iteration d are drawn from the edge labels i for depth d in the schedule. The initial values for the state variables at iteration/depth 0 are the constants in the input graph.

Figure 1: The static data-flow graph created by dynamic batching for a binary TreeRNN over parse trees (left), and the input graph corresponding to the parse tree ((word1, word3), word5) (right).

Dynamic batching allows us to construct a static TensorFlow graph that contains a single instance of each operation, yet can emulate input graphs of arbitrary size and topology where operations may appear an arbitrary number of times. The TensorFlow concat, gather, and while_loop ops are all differentiable, so gradient calculations and back-propagation do not require any additional code.
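The depth assignment and batching steps of the greedy algorithm can be sketched in a few lines of plain Python. This is an illustrative sketch, not the TensorFlow Fold implementation: pass-through insertion and the concat/gather index bookkeeping are omitted, and for brevity the constant word ids are folded into the leaf lookups.

```python
from collections import defaultdict

def schedule(graph):
    """Greedy dynamic-batching scheduler (sketch).

    `graph` maps a node id to (operation name, list of input node ids);
    source nodes have no inputs.  Returns a dict mapping
    (depth, operation) to the list of nodes batched together there.
    """
    depth = {}

    def node_depth(n):
        if n not in depth:
            _, inputs = graph[n]
            # Sources get depth 0; otherwise 1 + max depth of the inputs.
            depth[n] = 1 + max((node_depth(i) for i in inputs), default=-1)
        return depth[n]

    batches = defaultdict(list)
    for n, (op, _) in graph.items():
        batches[(node_depth(n), op)].append(n)
    return dict(batches)

# The parse tree ((word1, word3), word5): embedding lookups at the
# leaves, RNN cells at the non-terminals.
tree = {
    'word1': ('embed', []), 'word3': ('embed', []), 'word5': ('embed', []),
    'cell1': ('cell', ['word1', 'word3']),
    'cell2': ('cell', ['cell1', 'word5']),
}
```

All three embedding lookups are batched into one invocation at depth 0; the full algorithm would additionally insert a pass-through at depth 1 to carry word5's embedding up to cell2.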
For example, a binary TreeRNN as described above yields a TensorFlow data-flow graph with a tf.while_loop whose body is shown on the left of Figure 1. Here each gather has an additional input (the indices for the given op at the given depth) which picks out which elements the operations are to be called with. The long downward arrows are the pass-throughs. The algorithm consumes a tree such as the one shown on the right of Figure 1 and turns it into inputs for the gather operations at each depth (here depth is the loop counter for the tf.while_loop).

2.1 EXPERIMENTAL RESULTS

We have implemented dynamic batching as part of a new library, TensorFlow Fold, and designed a synthetic speed benchmark to compare it with manual batching in native TensorFlow. The benchmark uses the same underlying kernels and execution engine in both cases. Native TensorFlow cannot batch together trees of different shapes so, for testing purposes, we use a batch of random binary trees, all of which have the same shape. These test results thus represent a best-case scenario, in which all operations can be batched together perfectly. For the manual batching tests, we construct a static data-flow graph of operations corresponding to the shape of the tree. For the dynamic batching tests, we traverse each tree to construct a schedule, as described above.

The leaves of the tree are lookups into an embedding table, while the non-terminals implement a variant of the Tree-LSTM (Tai et al., 2015) equations. The tree size is 128, with a state size of 1024 for the LSTM. The CPU tests were run on a Dell z620 workstation with dual 8-core Intel Xeon processors (32 hardware threads), and the GPU tests were done using a consumer Nvidia GeForce GTX-1080 card. We compare manual batching, dynamic batching where all trees have the same shape, and dynamic batching where each tree has a different shape (the column marked "full dynamic").
There is no measurable penalty for dealing with trees of different shapes.

The test results shown in Table 1 emphasize the importance of batching, especially on GPUs. TensorFlow will launch a GPU kernel for every node in the tree, so there is a fixed overhead, proportional to the size of the tree, that dominates execution for small batch sizes. TensorFlow does not begin to saturate the GPU until relatively large batch sizes, 1024 or higher. The difference in speed between fully-batched and unbatched is over 160x.

Dynamic batching has less kernel invocation overhead because the data-flow graph is smaller. Dynamic batching instantiates each operation only once, and invokes it once for each depth, so the number of kernel invocations is log(n), rather than n, where n is the tree size. Dynamic batching thus achieves substantial speedups even at batch size 1, because it batches operations at the same depth within a single tree.

Table 1: Inference timing benchmark; times are wall-clock averages in seconds.

            manual          dynamic         full dynamic    cost    speedup
batch-size  batch   tree    batch   tree    batch   tree    ratio   ratio
(CPU)
1024        14.62   0.014   18.68   0.018   18.37   0.017   1.27    28.86
512          7.54   0.014    9.84   0.019    9.57   0.018   1.30    27.68
256          4.14   0.016    5.22   0.020    5.25   0.020   1.26    25.23
128          2.48   0.019    2.95   0.023    3.08   0.024   1.18    21.47
64           1.64   0.025    1.76   0.027    1.78   0.027   1.06    18.55
32           1.27   0.039    1.05   0.032    1.10   0.034   0.82    14.94
1            0.52   0.517    0.26   0.258    0.26   0.262   0.49     1.97
(GPU)
1024        0.978   0.0009  1.590   0.0015  1.617   0.0015  1.62   101.79
512         0.530   0.0010  0.715   0.0013  0.721   0.0014  1.34   114.15
256         0.312   0.0012  0.323   0.0012  0.340   0.0013  1.03   120.86
128         0.236   0.0018  0.164   0.0012  0.178   0.0013  0.69   115.05
64          0.193   0.0030  0.093   0.0014  0.106   0.0016  0.48    96.40
32          0.153   0.0047  0.061   0.0019  0.074   0.0023  0.40    68.79
1           0.161   0.1608  0.038   0.0376  0.036   0.0359  0.23     4.47

However, the extra concat and gather ops that dynamic batching inserts do have a cost.
The "cost ratio" column above shows the ratio between dynamic and manual batching, in the case where all trees in the batch have the same shape. The cost is only 20% for inference on GPUs with batch-size 1, but rises to 60% for training with backpropagation. The cost is mainly visible at large batch sizes, because at smaller sizes it is balanced by the benefit of within-tree batching.

Even with the cost, dynamic batching yields a 120x speedup over using a batch size of 1 on GPU, and 28x on CPU. The "speedup ratio" column above shows the ratio between the per-tree time for dynamic batching on random shapes ("full dynamic") and manual batching with a batch size of 1. Note that running at batch size 1 is not actually feasible in TensorFlow, because of its large graph construction overhead (not included in these measurements); the comparison may, however, apply to other libraries that lack such overhead.

3 A COMBINATOR LIBRARY FOR NEURAL NETWORKS

In addition to dynamic batching, the TensorFlow Fold library provides a set of combinators that simplify the task of constructing neural networks for DCGs. Our goal here is to show how dynamic batching enables implementing deep learning models (which are growing ever more complex) at a higher level of abstraction than manual batching. This in turn facilitates a more rapid feedback loop for trying out novel model variants, and thus obtaining superior results.

The design of the library was inspired by functional programming techniques such as parser combinators (Hutton & Meijer, 1996) and arrows (Hughes, 2000). In a combinator library, computations are structured compositionally, by plugging together simpler computations in various ways. The basic unit of computation in TensorFlow Fold is a block, essentially a function from input to output.
In a typical DCG model, the input is a graph or tree of some kind, and the output is a vector, which can be attached to a loss for training. For example, consider a model where the inputs are sequences of words, of varying lengths, and the output is a sentence vector. Our library provides several different ways of handling sequences. Given a simpler block f that operates on elements of the sequence, or g on pairs of elements, we define the following combinators:

• Map(f): yields [f(x_1), f(x_2), ... f(x_n)]. Applies f to each element of the sequence, e.g. embedding each of the words of a sentence into R^N.

• Fold(g, z): yields g(... g(g(z, x_1), x_2), ... x_n). Applies g sequentially in a leftward chain, e.g. running an RNN over a sequence. By default z = 0.

• Reduce(g): yields g(Reduce([x_1, ... x_⌊n/2⌋]), Reduce([x_⌊n/2⌋+1, ... x_n])). Applies g in a balanced tree,² e.g. max or sum-pooling over the elements.

Note that it is not necessary to pad or truncate sequences to the same length; dynamic batching handles sequences of differing lengths.

3.1 TYPE SYSTEM

Blocks are statically typed; each block has an input type and an output type. Types are inferred where possible, but must be explicitly specified in some cases. A type is one of the following:

• Input denotes objects in the host language (Python), such as trees and dictionaries.

• Tensor⟨dtype, shape⟩ denotes tensors of a particular dtype and shape.³

• Tuple(t_1, ... t_n) denotes a tuple of values of types t_1, ... t_n.

• Sequence(t) denotes a sequence of elements of type t, of any length.

• Void is the unit type.

For example, Sequence(Sequence(Tuple(Tensor⟨float32, []⟩, Tensor⟨int8, [3, 4]⟩))) denotes jagged arrays whose elements are pairs (float32, int8^{3×4}).
3.2 BLOCKS AND COMBINATORS

Blocks are composed hierarchically; a block expression is always a tree. The non-terminals in the tree are combinators such as Map and Fold, which take simpler blocks as arguments. The leaves of the tree are atomic blocks, which include the following:

• Scalar: Input → Tensor. Converts a Python scalar to a tensor.

• Tensor: Input → Tensor. Converts a NumPy array to a tensor.

• Function(h): [Tensor or Tuple(Tensor, ...)] → [Tensor or Tuple(Tensor, ...)]. Defines an operation h (see Section 2) over tensors. Operations with multiple inputs and outputs use tuples of tensors.

• InputTransform(h): Input → Input. Applies a user-defined Python function h to pre-process the input.

In addition to the sequence combinators described above, important combinators in the library include the following:

• b_1 >> b_2: Function composition; the output of b_1 is fed to the input of b_2.

• Record({l_1: b_1, ... l_n: b_n}): Input → Tuple(t_1, ... t_n). Takes a Python dictionary or tuple as input, and applies each block b_i to the field labeled l_i, to yield an object of type t_i. Returns a tuple of the results for all fields.

• OneOf(b_1, ... b_n): Input → t. Conditionally dispatches on its input to one of the blocks b_1, ... b_n.

• Optional(b): Input → t. Applies b if the input is not None, otherwise returns zeros. A special case of OneOf.

• AllOf(b_1, ... b_n): t_0 → Tuple(t_1, ... t_n). Passes its input of type t_0 to each of the blocks b_1, ... b_n, returning a tuple of results.

² Reduce uses a balanced tree rather than a chain in order to minimize computation depth and provide more opportunities for batching.

³ Note that the leading batch size for tensors is not part of the shape of the corresponding Tensor type.
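As a reading aid, the three sequence combinators admit short reference semantics in plain Python. These hypothetical functions only illustrate the definitions; actual Fold blocks carry types and are executed with dynamic batching.

```python
from functools import reduce

def map_block(f, xs):
    """Map(f): [f(x_1), f(x_2), ... f(x_n)]."""
    return [f(x) for x in xs]

def fold_block(g, z, xs):
    """Fold(g, z): g(... g(g(z, x_1), x_2), ... x_n), a leftward chain."""
    return reduce(g, xs, z)

def reduce_block(g, xs):
    """Reduce(g): applies g over a balanced tree of the elements."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2  # split as [x_1 .. x_floor(n/2)] and the rest
    return g(reduce_block(g, xs[:mid]), reduce_block(g, xs[mid:]))
```

Note that reduce_block applies g to depth about log2(n) rather than n - 1, which is what creates the extra batching opportunities mentioned in footnote 2.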
Figure 2: Block architectures for a pipeline (Section 3.3), feed-forward attention (Section 3.4), binary Tree-LSTMs (Section 3.5), and the weave module for molecule graphs (Section 3.6).

3.3 PIPELINES

Assume we have a set of (text, label) pairs as input and wish to predict the label from the text. The text consists of words, and we want to use an array of pretrained word embeddings (word_matrix) and a corresponding dictionary mapping words to indices (word_idx). We call word_idx.get(word) to obtain the index of word in word_matrix, or None if the word is unknown.

We start by creating a block which embeds each word into a continuous space:

word2vec = (InputTransform(word_idx.get) >>
            Optional(Scalar('int32')) >>
            Function(Embedding(initializer=word_matrix)))

This block uses an InputTransform to get the index of a word, which is passed to an Optional block that converts the scalar index to a tensor (or 0 if None). This in turn gets passed to an Embedding operation, which performs a lookup into an embedding table.

With word2vec in hand, we can define text2vec, which embeds sentences:

split = InputTransform(str.split)
rnn_cell = Concat() >> Function(FC(d, activation=tf.nn.relu))
text2vec = split >> Map(word2vec) >> Fold(rnn_cell, Zeros(d))

We use an InputTransform to split the string into words. Then we map the words to vectors with word2vec, and combine the word vectors with a simple RNN, which uses a single fully connected layer FC with d hidden units. The Zeros block defines the initial state for the RNN.
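To make the data flow concrete, here is a plain-NumPy analogue of what text2vec computes on a single sentence. This is a toy sketch: the two-word vocabulary, the embeddings, and the fixed weight matrix standing in for the learnable FC layer are all made up, and the Fold version additionally batches many sentences at once.

```python
import numpy as np

d = 3  # embedding and RNN state size
word_idx = {'the': 0, 'cat': 1}           # toy vocabulary
word_matrix = np.array([[0.1, 0.2, 0.3],  # made-up embeddings
                        [0.4, 0.5, 0.6]])

def word2vec(word):
    # InputTransform(word_idx.get) >> Optional(...) >> embedding lookup;
    # unknown words take the zero path of the Optional block.
    i = word_idx.get(word)
    return np.zeros(d) if i is None else word_matrix[i]

# Stand-in for FC(d, relu) applied to the concatenated (state, input)
# pair; this particular fixed weight matrix just adds the two halves.
w = np.vstack([np.eye(d), np.eye(d)])

def rnn_cell(state, x):
    return np.maximum(0.0, np.concatenate([state, x]) @ w)

def text2vec(text):
    # split >> Map(word2vec) >> Fold(rnn_cell, Zeros(d))
    state = np.zeros(d)
    for vec in map(word2vec, text.split()):
        state = rnn_cell(state, vec)
    return state
```

With these weights, text2vec('the cat') is simply the elementwise sum of the two word embeddings passed through a ReLU.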
Assume there are n labels; we use a linear layer with n outputs to get unscaled logits:

text2logits = text2vec >> Function(FC(n, activation=None))

For training, we create a Record block to convert the label to a tensor as well, and calculate the loss:

record = Record([('text', text2logits), ('label', Scalar('int32'))])
loss = record >> Function(tf.nn.sparse_softmax_cross_entropy)

Finally, we create a Compiler, which validates a block, performs type-checking, and sets up dynamic batching in TensorFlow. Outputs of a compiled block are available as TensorFlow tensors, so training now proceeds as it would for any other TensorFlow model:

compiler = Compiler.create(loss)
cross_entropy = compiler.output_tensors[0]
train_op = tf.train.AdamOptimizer().minimize(cross_entropy)

3.4 COMPLEX COMPOSITIONS

Recently, Raffel & Ellis (2016) have introduced an attention model for feed-forward neural networks. The model generalizes average-pooling and is defined as:

e_t = a(h_t),   α_t = exp(e_t) / Σ_{k=1}^{T} exp(e_k),   c = Σ_{t=1}^{T} α_t h_t   (1)

where a is a learnable function. In this model, the block architecture is not a simple pipeline (i.e. a composition using >>) but instead forms a directed acyclic graph, as illustrated in Figure 2. A Composition block allows blocks to be composed into DAGs. The model code and details may be found in Appendix A.

3.5 RECURSIVE DEFINITIONS

N-ary Tree-LSTMs (Tai et al., 2015, sec. 3.2) generalize LSTMs from 1 to N previous states. In Tai et al. (2015, sec. 5.1) they are applied to classify sentences from the Stanford Sentiment Treebank. This corpus consists of binarized constituency parse trees of one-sentence movie reviews, where every node has a sentiment label. At the leaves of the tree, words are mapped to word-embedding vectors which serve as the input to a binary tree-LSTM with 0 for the previous states.
At the internal nodes, the LSTM takes 0 as input, and previous states from its two children. More formally,

h_word = TreeLSTM(Embedding(word), 0, 0)   (2)
h_left,right = TreeLSTM(0, h_left, h_right)   (3)

where TreeLSTM(x, h_left, h_right) is a learnable function corresponding to Tai et al. (2015) eqs. 9-14 with N = 2. Since a tree is a recursive data type, a model that processes trees must be recursively defined, as illustrated by the cycle in Figure 2. A ForwardDeclaration allows the creation of recursive models:

expr = ForwardDeclaration()
word = AllOf(Record([('word', word2vec)]),
             Zeros((state_size, state_size)))
pair = AllOf(Zeros(embedding_size),
             Record([('left', expr()), ('right', expr())]))
expr_def = (OneOf(key_fn=len, case_blocks=[(1, word), (2, pair)]) >>
            TreeLSTM(state_size))
expr.resolve_to(expr_def)

A forward declaration like expr is not itself a block, but may be called (using the expr() syntax) to create references, i.e. blocks which refer to the declaration. The subsequent call to resolve_to then updates all the references to refer to expr_def. The word2vec block is as defined in Section 3.3.

3.5.1 EXPERIMENTAL RESULTS

Here we briefly report on some experiments with our implementation of N-ary Tree-LSTMs for sentiment analysis. While we set a new state-of-the-art, that is not really the point here. Our models are not particularly original, and could certainly be implemented without using TensorFlow Fold. What Fold does is to enable simpler and more concise definitions (see Table 3), along with faster execution, thus making it easier to rapidly explore novel model variants.

We used constituency Tree-LSTMs with tuned GloVe vectors for word embedding, which achieved the best results of all sentiment models presented in Tai et al. (2015). In addition to this specific model, we have explored several novel variants.⁴ In particular, Tai et al.
(2015) employed non-recurrent dropout and L2 weight regularization. We eliminated weight regularization in favor of the recurrent dropout scheme introduced by Semeniuta et al. (2016) and increased the LSTM state size from 150 to 300, leaving all other hyperparameters unchanged.

Results are shown in Table 2, including the best previously reported results. Fine-grained accuracy is measured for all trees and calculated based on the five possible labels. Binary accuracy is measured only for trees with non-neutral sentiment, and is based on negative vs. positive classification. The numbers in parentheses are standard deviations. Tai et al. (2015) report five independent runs; our results are based on thirty independent runs.⁵ Noting the small size of this dataset (8544/1101/2210 trees for train/dev/test), we further evaluated an ensemble consisting of these thirty independently trained models; this variant sets a new state-of-the-art on both subtasks.

Table 2: Test set accuracies on the Stanford Sentiment Treebank

model                      fine-grained    binary
Tai et al. (2015)          51.0 (0.5)      88.0 (0.3)
Munkhdalai & Yu (2016a)    52.8            89.7
Munkhdalai & Yu (2016b)    53.1            89.3
Ours (Single Model)        52.3 (0.7)      89.4 (0.4)
Ours (Ensemble)            53.6            90.2

Table 3: Lines of code comparison

model                    ours    original    ratio
Feed-Forward Attention     26          71     0.37
Tree-LSTM                 119         219     0.54
Graph Convolutions         32          44     0.73

⁴ Unsuccessful variants included standard LSTMs (i.e. having only a single forget gate) accepting pooled histories from their children, and models based on character rather than word-level embeddings.

3.6 GRAPH CONVOLUTIONS

As a final example, we have used the Fold library to implement the graph convolution model introduced by Kearnes et al. (2016) for molecules, which are represented as undirected graphs of atoms.
The code is more complex than our previous examples because it involves nested Composition blocks, and is given in Appendix B.

4 DISCUSSION

Neural architectures with dynamic computation graphs suffer from inefficient batching and poor tooling. Dynamic batching solves the former problem in full generality, we believe for the first time. The SPINN architecture (Bowman et al., 2016) is an alternative stack-based approach that also enables efficient batching with DCGs, but it is limited to binary trees and requires padding/truncation to handle trees of different sizes. The Fold library addresses the tooling problem by providing a high-level combinator library which is intended to make it easy for practitioners to rapidly develop and iterate on architectures with DCGs.

The experimental results presented in Section 2.1 quantify the impact of dynamic batching. The impact of the combinator library is harder to demonstrate quantitatively. One way to approach this (with a large grain of salt) is by comparing lines of code against the original authors' sources, which we do in Table 3. See Appendix C for details on the comparison protocol. Of course, a very short implementation is suboptimal if it comes at the cost of flexibility. The results in Section 3.5.1 show that models from the literature can be reimplemented in Fold, then extended to achieve superior performance. We suspect that other models with DCGs will have quite a bit of "head room" as well, simply because less work has gone into tuning them compared with more mainstream architectures.

⁵ Munkhdalai & Yu (2016a;b) do not report standard deviations or number of runs.

REFERENCES

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems. arXiv, 1603.04467, 2016.
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. In NAACL, 2016.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

Anna Maria Bianucci, Alessio Micheli, Alessandro Sperduti, and Antonina Starita. Application of cascade correlation networks for structures to chemistry. Applied Intelligence, 2000.

Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. A fast unified model for parsing and sentence understanding. In NAACL, 2016.

Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

Christoph Goller and Andreas Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In ICNN, 1996.

John Hughes. Generalising monads to arrows. Science of Computer Programming, 2000.

Graham Hutton and Erik Meijer. Monadic parser combinators. Technical Report NOTTCS-TR-96-4, 1996.

Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 2016.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

Jiwei Li, Minh-Thang Luong, Dan Jurafsky, and Eduard Hovy. When are tree structures necessary for deep learning of representations? arXiv, 1503.00185, 2015.

Tsendsuren Munkhdalai and Hong Yu. Neural semantic encoders. arXiv, 1607.04315, 2016a.

Tsendsuren Munkhdalai and Hong Yu. Neural tree indexers for text understanding. arXiv, 1607.04492, 2016b.

Kyoung-Su Oh and Keechul Jung. GPU implementation of neural networks. Pattern Recognition, 2004.

Jordan B. Pollack.
Recursive distributed representations. Artificial Intelligence, 1990.

Colin Raffel and Daniel P. W. Ellis. Feed-forward networks with attention can solve some long-term memory problems. In ICLR (Workshop Track), 2016.

Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. arXiv, 1603.05118, 2016.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In NAACL, 2015.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv, 1605.02688, 2016.

Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning, NIPS Workshop, 2011.

A FEED-FORWARD ATTENTION

The feed-forward attention model from Section 3.4 may be implemented in Fold as follows:

attention = Composition()
with attention.scope():
    h = attention.input
    exp_e = Map(a >> Function(tf.exp)).reads(h)
    z = (Sum() >> Broadcast()).reads(exp_e)
    alpha = ZipWith(Function(tf.div)).reads(exp_e, z)
    c = (ZipWith(Function(tf.mul)) >> Sum()).reads(alpha, h)
    attention.output.reads(c)

Within a composition scope, blocks may be wired together with reads, provided no directed cycles are formed. The input and output properties are used to define the overall inputs and outputs of the composition block. This example introduces several additional block types:

• Sum is a specialization of Reduce that performs elementwise addition.

• ZipWith is a variant of Map that accepts n sequences as input and applies an n-ary function f elementwise (stopping when the end of the shortest input sequence is reached).

• Broadcast creates a Sequence(t) from a single t, repeating the same element endlessly.
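The composition above can be checked against a direct NumPy transcription of Equation 1. Here a is the model's learnable function; for illustration we pass in an arbitrary fixed linear map.

```python
import numpy as np

def attention(h, a):
    """Feed-forward attention (Equation 1):
    e_t = a(h_t), alpha_t = exp(e_t) / sum_k exp(e_k), c = sum_t alpha_t h_t.
    h is a (T, d) array of hidden states; a maps a d-vector to a scalar."""
    e = np.array([a(h_t) for h_t in h])
    alpha = np.exp(e) / np.exp(e).sum()  # softmax over the T steps
    return alpha @ h                     # attention-weighted sum, shape (d,)

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))              # T = 5 steps, d = 4 state size
v = rng.normal(size=4)
c = attention(h, lambda h_t: v @ h_t)    # a fixed linear stand-in for a
```

When a is constant, every alpha_t equals 1/T and attention reduces to average-pooling, which is the sense in which the model generalizes it.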
B GRAPH CONVOLUTIONS

This section implements the graph convolution model introduced by Kearnes et al. (2016), for molecules represented as undirected graphs of atoms. There are real-valued feature vectors for each atom and for each distinct pair of atoms. For a molecule having N atoms, we index its atom feature vectors as $a_i \in \mathbb{R}^n$ for $1 \le i \le N$. We index its pair feature vectors as $p_{i,j} \in \mathbb{R}^m$ for $1 \le i, j \le N$, where $p_{i,j} = p_{j,i}$ and $p_{i,i} = 0$.

The core of the graph convolution model is the weave module, which combines atom-level and pair-level features using six learnable functions (typically fully connected ReLU layers). The weave module can be stacked arbitrarily to create deep graph convolution models. Denoting inputs and outputs by x and y superscripts respectively, the weave module is:

$$a^y_i = f_A\Big(f_{A \to A}(a^x_i),\ \sum_{j=1}^{N} f_{P \to A}(p^x_{i,j})\Big) \qquad (4)$$

$$p^y_{i,j} = f_P\Big(f_{A \to P}(a^x_i, a^x_j) + f_{A \to P}(a^x_j, a^x_i),\ f_{P \to P}(p^x_{i,j})\Big) \qquad (5)$$

where $f_A$, $f_P$, $f_{A \to A}$, $f_{A \to P}$, $f_{P \to A}$ and $f_{P \to P}$ are learnable functions.

It is noteworthy that the $a^x \to p^y$ calculation involves a nested scan over the atoms; for each $a_i$ we must calculate $f_{A \to P}(a^x_i, a^x_j) + f_{A \to P}(a^x_j, a^x_i)$ for all $1 \le j \le N$:

    a_i_to_p = Composition()
    with a_i_to_p.scope():
        a_x_i = Broadcast().reads(a_i_to_p.input[0])
        a_x = a_i_to_p.input[1]
        f_i_j = ZipWith(Concat() >> f_a_p).reads(a_x_i, a_x)
        f_j_i = ZipWith(Concat() >> f_a_p).reads(a_x, a_x_i)
        p = ZipWith(Sum()).reads(f_i_j, f_j_i)
        a_i_to_p.output.reads(p)

The input to the a_i_to_p composition block is $(a^x_i, a^x)$. It has the type Tuple(Tensor_{float32,[n]}, Sequence(Tensor_{float32,[n]})). We broadcast $a^x_i$ over $a^x$ twice in succession to compute $f_{A \to P}(a^x_i, a^x_j)$ and $f_{A \to P}(a^x_j, a^x_i)$ for all $1 \le j \le N$, yielding f_i_j and f_j_i, which are length-N sequences of vectors.
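The nested scan can be sketched in plain NumPy as follows. This is an analogy for the semantics, not Fold code; the toy f_a_p below is a hypothetical stand-in for the learnable fully connected layer.

```python
import numpy as np

def a_i_to_p(a_x_i, a_x, f_a_p):
    """NumPy sketch of the a_i_to_p block: given one atom vector a_x_i
    and the sequence a_x of all atom vectors, compute
    f_AP(a_i, a_j) + f_AP(a_j, a_i) for every j. Like the Fold block,
    f_a_p takes the concatenation of two atom vectors."""
    f_i_j = [f_a_p(np.concatenate([a_x_i, a_j])) for a_j in a_x]  # Broadcast + ZipWith
    f_j_i = [f_a_p(np.concatenate([a_j, a_x_i])) for a_j in a_x]
    return [u + v for u, v in zip(f_i_j, f_j_i)]                  # ZipWith(Sum())

# The symmetrized sum is invariant under swapping i and j, which is what
# preserves p_{i,j} = p_{j,i}; an antisymmetric toy f_a_p cancels exactly:
a_x = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
out = a_i_to_p(a_x[0], a_x, lambda v: v[:2] - v[2:])
```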
We join and sum each of these vectors elementwise to obtain the ultimate output of the block, which is also a length-N sequence of vectors. The overall weave module may now be implemented as follows:

    weave = Composition()
    with weave.scope():
        a_x = weave.input[0]
        p_x = weave.input[1]
        a_to_a = Map(f_a_a).reads(a_x)
        p_to_a = Map(Map(f_p_a) >> Sum()).reads(p_x)
        a_y = ZipWith(Concat() >> f_a).reads(a_to_a, p_to_a)
        a_to_p = ZipWith(a_i_to_p).reads(a_x, Broadcast().reads(a_x))
        p_to_p = Map(Map(f_p_p)).reads(p_x)
        p_y = ZipWith(ZipWith(Concat() >> f_p)).reads(a_to_p, p_to_p)
        weave.output.reads(a_y, p_y)

The input to weave is $(a^x, p^x)$. It has the type Tuple(Sequence(Tensor_{float32,[n]}), Sequence(Sequence(Tensor_{float32,[m]}))). The calculation may be understood as follows:

• a_to_a maps over $a^x$ with $f_{A \to A}$, going from Sequence(Tensor) to Sequence(Tensor).
• p_to_a maps over $p^x$ with $f_{P \to A}$ and sums along the inner dimension, reducing from Sequence(Sequence(Tensor)) to Sequence(Tensor).
• a_y zips a_to_a and p_to_a with $f_A$, going from Tuple(Sequence(Tensor), Sequence(Tensor)) to Sequence(Tensor).
• a_to_p broadcasts $a^x$ over itself with a_i_to_p, expanding from Sequence(Tensor) to Sequence(Sequence(Tensor)).
• p_to_p maps over $p^x$ with $f_{P \to P}$, going from Sequence(Sequence(Tensor)) to Sequence(Sequence(Tensor)).
• p_y zips a_to_p and p_to_p with $f_P$, going from Tuple(Sequence(Sequence(Tensor)), Sequence(Sequence(Tensor))) to Sequence(Sequence(Tensor)).

C CALCULATING LINES OF CODE

Our protocol for calculating lines of code[6] is as follows:

• Define the functional unit of comparison as an input-output mapping.
• Prepare a single file that implements this functionality and nothing else.
• Remove import statements, abstract base classes, logging, file I/O, and validation logic.
• Count lines of code, ignoring blank lines and comments.[7]

FEED-FORWARD ATTENTION

The functional unit of comparison is creating the model for the variable-length experiment described in Raffel & Ellis (2016, sec. 2.3). This includes the loss and accuracy calculations, but does not include the training loop or the creation of training data. The original implementation[8] is in Python and uses Theano and Lasagne. The TensorFlow Fold implementation is more concise, partly due to differences between TensorFlow and Lasagne. Fold itself reduces implementation complexity by eliminating the need for manual batching, e.g. x.sum(axis=1), where batching is explicit over axis 0, vs. x >> Sum(), which is implicitly batched.

[6] All of the implementations we examine are formatted with 80-column lines, excepting the Tree-LSTM implementation, which has a few lines that are slightly longer; we still count these as single lines.
[7] The calculations were performed with cloc (https://github.com/AlDanial/cloc).
[8] Commit e8fce3e from https://github.com/craffel/ff-attention.

TREE-LSTM

The functional unit of comparison is creating a (binary) constituency Tree-LSTM and running an epoch of training for the fine-grained sentiment classification task as described in Tai et al. (2015, sec. 5.1). This does not include loading the word embeddings or dataset, which are provided as inputs. The original implementation[9] is in Lua and uses Torch. Lua terminates blocks with the end keyword; we do not count these lines. Here, the use of Python and TensorFlow leads to substantially more concise code than with Lua and Torch. Unlike the previous example, manual batching plays no role here, because the original implementation computes gradients and losses one tree at a time.
Fold reduces complexity here by using a OneOf block to distinguish between leaves and internal nodes, rather than a recursive function that explicitly traverses the tree.

GRAPH CONVOLUTION

The functional unit of comparison is creating a single weave module as described in Kearnes et al. (2016, sec. 3.3). The original implementation[10] is in Python and uses TensorFlow. Here, both implementations use the same language and deep learning library. Fold helps by eliminating the need for manual batching, as in the first example. This is particularly apparent in the atoms-to-pairs calculation, which requires making n "copies" of an n × d matrix x to get an n × n × d tensor. In native TensorFlow the first dimension is batch, and the copying is explicit, as reshape(tile(x, [1, n, 1]), [batch_size, n, n, d]). In Fold, x >> Broadcast() suffices, because the number of copies needed is determined lazily by subsequent computations.

[9] Commit b02ad49 from https://github.com/stanfordnlp/treelstm.
[10] Provided by Kearnes et al. (2016).
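The contrast between eager copying and lazy broadcasting in the atoms-to-pairs calculation has a direct NumPy analogy. The shapes and array below are illustrative (and omit the leading batch dimension), not taken from either implementation.

```python
import numpy as np

# Turn an n x d matrix of atom features into an n x n x d tensor
# containing n "copies" of it, one per atom.
n, d = 3, 2
x = np.arange(n * d, dtype=np.float64).reshape(n, d)

# Eager copying, analogous to tile + reshape in native TensorFlow:
explicit = np.tile(x[None, :, :], (n, 1, 1))

# Lazy broadcasting, analogous to Fold's Broadcast(): a stride-0 view,
# so no data is copied until a downstream computation needs it.
lazy = np.broadcast_to(x, (n, n, d))
```

The two results are elementwise identical; the lazy version simply defers (and here avoids) materializing the n copies, which is what lets Fold determine the number of copies from subsequent computations.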
