Hierarchical Memory Networks

Sarath Chandar∗ (1), Sungjin Ahn (1), Hugo Larochelle (2, 4), Pascal Vincent (1, 4), Gerald Tesauro (3), Yoshua Bengio (1, 4)
(1) Université de Montréal, Canada. (2) Twitter Cortex, USA. (3) IBM Watson Research Center, USA. (4) CIFAR, Canada.

Abstract

Memory networks are neural networks with an explicit memory component that can be both read and written to by the network. The memory is often addressed in a soft way using a softmax function, making end-to-end training with backpropagation possible. However, this is not computationally scalable for applications which require the network to read from extremely large memories. On the other hand, it is well known that hard attention mechanisms based on reinforcement learning are challenging to train successfully. In this paper, we explore a form of hierarchical memory network, which can be considered as a hybrid between hard and soft attention memory networks. The memory is organized in a hierarchical structure such that reading from it is done with less computation than soft attention over a flat memory, while also being easier to train than hard attention over a flat memory. Specifically, we propose to incorporate Maximum Inner Product Search (MIPS) in the training and inference procedures for our hierarchical memory network. We explore the use of various state-of-the-art approximate MIPS techniques and report results on SimpleQuestions, a challenging large-scale factoid question answering task.

1 Introduction

Until recently, traditional machine learning approaches for challenging tasks such as image captioning, object detection, or machine translation have consisted of complex pipelines of algorithms, each being separately tuned for better performance. With the recent success of neural networks and deep learning research, it has now become possible to train a single model end-to-end, using backpropagation.
Such end-to-end systems often outperform traditional approaches, since the entire model is directly optimized with respect to the final task at hand. However, simple encode-decode style neural networks often underperform on knowledge-based reasoning tasks like question answering or dialog systems. Indeed, in such cases it is nearly impossible for regular neural networks to store all the necessary knowledge in their parameters.

Neural networks with memory [1, 2] can deal with knowledge bases by having an external memory component which can be used to explicitly store knowledge. The memory is accessed by reader and writer functions, which are both made differentiable so that the entire architecture (neural network, reader, writer and memory components) can be trained end-to-end using backpropagation. Memory-based architectures can also be considered as generalizations of RNNs and LSTMs, where the memory is analogous to recurrent hidden states. However, they are much richer in structure and can handle very long-term dependencies because once a vector (i.e., a memory) is stored, it is copied from time step to time step and can thus stay there for a very long time (and gradients correspondingly flow back in time unhampered).

∗ Corresponding author: apsarathchandar@gmail.com

There exist several variants of neural networks with a memory component: Memory Networks [2], Neural Turing Machines (NTM) [1], and Dynamic Memory Networks (DMN) [3]. They all share five major components: memory, input module, reader, writer, and output module.

Memory: The memory is an array of cells, each capable of storing a vector. The memory is often initialized with external data (e.g. a database of facts), by filling in its cells with pre-trained vector representations of that data.

Input module: The input module computes a representation of the input that can be used by other modules.

Writer: The writer takes the input representation and updates the memory based on it.
The writer can be as simple as filling the slots in the memory with input vectors in a sequential way (as often done in memory networks). If the memory is bounded, instead of sequential writing, the writer has to decide where to write and when to rewrite cells (as often done in NTMs).

Reader: Given an input and the current state of the memory, the reader retrieves content from the memory, which will then be used by an output module. This often requires comparing the input's representation or a function of the recurrent state with memory cells using some scoring function such as a dot product.

Output module: Given the content retrieved by the reader, the output module generates a prediction, which often takes the form of a conditional distribution over multiple labels for the output.

For the rest of the paper, we will use the name memory network to describe any model which has any form of these five components. We would like to highlight that all the components except the memory are learnable. Depending on the application, any of these components can also be fixed. In this paper, we will focus on the situation where a network does not write and only reads from the memory.

In this paper, we focus on the application of memory networks to large-scale tasks. Specifically, we focus on large-scale factoid question answering. For this problem, given a large set of facts and a natural language question, the goal of the system is to answer the question by retrieving the supporting fact for that question, from which the answer can be derived. Application of memory networks to this task has been studied in [4]. However, [4] depended on keyword-based heuristics to filter the facts to a smaller set which is manageable for training. Heuristics, however, are invariably dataset dependent, and we are interested in a more general solution which can be used when the facts are of any structure.
One can design soft attention retrieval mechanisms, where a convex combination of all the cells is retrieved, or design hard attention retrieval mechanisms, where one or few cells from the memory are retrieved. Soft attention is achieved by using softmax over the memory, which makes the reader differentiable, and hence learning can be done using gradient descent. Hard attention is achieved by using methods like REINFORCE [5], which provides a noisy gradient estimate when discrete stochastic decisions are made by a model.

Both soft attention and hard attention have limitations. As the size of the memory grows, soft attention using softmax weighting is not scalable. It is computationally very expensive, since its complexity is linear in the size of the memory. Also, at initialization, gradients are dispersed so much that it can reduce the effectiveness of gradient descent. These problems can be alleviated by a hard attention mechanism, for which the training method of choice is REINFORCE. However, REINFORCE can be brittle due to its high variance, and existing variance reduction techniques are complex. Thus, it is rarely used in memory networks (even in cases of a small memory).

In this paper, we propose a new memory selection mechanism based on Maximum Inner Product Search (MIPS) which is both scalable and easy to train. This can be considered as a hybrid of soft and hard attention mechanisms. The key idea is to structure the memory in a hierarchical way such that it is easy to perform MIPS, hence the name Hierarchical Memory Network (HMN). HMNs are scalable at both training and inference time. The main contributions of the paper are as follows:

• We explore hierarchical memory networks, where the memory is organized in a hierarchical fashion, which allows the reader to efficiently access only a subset of the memory.
• While there are several ways to decide which subset to access, we propose to pose memory access as a maximum inner product search (MIPS) problem.
• We empirically show that exact MIPS-based algorithms not only enjoy similar convergence as soft attention models, but can even improve the performance of the memory network.
• Since exact MIPS is as computationally expensive as a full soft attention model, we propose to train the memory networks using approximate MIPS techniques for scalable memory access.
• We empirically show that unlike exact MIPS, approximate MIPS algorithms provide a speedup and scalability of training, though at the cost of some performance.

2 Hierarchical Memory Networks

In this section, we describe the proposed Hierarchical Memory Network (HMN). In this paper, HMNs differ from regular memory networks in only two of their components: the memory and the reader.

Memory: Instead of a flat array of cells for the memory structure, HMNs leverage a hierarchical memory structure. Memory cells are organized into groups and the groups can further be organized into higher-level groups. The choice of memory structure is tightly coupled with the choice of reader, which is essential for fast memory access. We consider three classes of approaches for the memory's structure: hashing-based approaches, tree-based approaches, and clustering-based approaches. This is explained in detail in the next section.

Reader: The reader in the HMN is different from the readers in flat memory networks. Flat memory-based readers use either soft attention over the entire memory or hard attention that retrieves a single cell. While these mechanisms might work with small memories, with HMNs we are more interested in achieving scalability towards very large memories. So instead, HMN readers use soft attention only over a selected subset of the memory.
Selecting memory subsets is guided by a maximum inner product search algorithm, which can exploit the hierarchical structure of the organized memory to retrieve the most relevant facts in sub-linear time. The MIPS-based reader is explained in more detail in the next section.

In HMNs, the reader is thus trained to create MIPS queries such that it can retrieve a sufficient set of facts. While most of the standard applications of MIPS [6–8] so far have focused on settings where both query vector and database (memory) vectors are precomputed and fixed, memory readers in HMNs are learning to do MIPS by updating the input representation such that the result of MIPS retrieval contains the correct fact(s).

3 Memory Reader with K-MIPS attention

In this section, we describe how the HMN memory reader uses Maximum Inner Product Search (MIPS) during learning and inference.

We begin with a formal definition of $K$-MIPS. Given a set of points $\mathcal{X} = \{x_1, \ldots, x_n\}$ and a query vector $q$, our goal is to find

$$\operatorname{argmax}^{(K)}_{i \in \mathcal{X}} \; q^\top x_i \tag{1}$$

where $\operatorname{argmax}^{(K)}$ returns the indices of the top-$K$ maximum values. In the case of HMNs, $\mathcal{X}$ corresponds to the memory and $q$ corresponds to the vector computed by the input module.

A simple but inefficient solution for $K$-MIPS involves a linear search over the cells in memory by performing the dot product of $q$ with all the memory cells. While this will return the exact result for $K$-MIPS, it is too costly to perform when we deal with a large-scale memory. However, in many practical applications, it is often sufficient to have an approximate result for $K$-MIPS, trading some accuracy for speed. There exist several approximate $K$-MIPS solutions in the literature [7, 8, 9, 10].

All the approximate $K$-MIPS solutions add a form of hierarchical structure to the memory and visit only a subset of the memory cells to find the maximum inner product for a given query.
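
As a point of reference, the exact linear-scan solution to Equation 1 can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiments: the function names `dot` and `exact_k_mips` and the toy vectors are assumptions made here, and only the Python standard library is used.

```python
import heapq

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def exact_k_mips(query, memory, k):
    """Indices of the k memory cells with the largest inner product
    with `query`, best first. Every cell is scored, so the cost is
    linear in the number of cells."""
    scores = [dot(query, cell) for cell in memory]
    return heapq.nlargest(k, range(len(memory)), key=scores.__getitem__)

# Toy memory of four 2-dimensional cells (illustrative values only).
memory = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [-1.0, -1.0]]
query = [1.0, 1.0]
top2 = exact_k_mips(query, memory, 2)  # inner products are 1, 2, 4, -2
```

The approximate methods discussed next aim to return a similar index set without scoring every cell.
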
Hashing-based approaches [8–10] hash cells into multiple bins, and given a query they search for $K$-MIPS cell vectors only in bins that are close to the bin associated with the query. Tree-based approaches [6, 7] create search trees with cells in the leaves of the tree. Given a query, a path in the tree is followed and MIPS is performed only for the leaf for the chosen path. Clustering-based approaches [11] cluster cells into multiple clusters (or a hierarchy of clusters) and given a query, they perform MIPS on the centroids of the top few clusters. We refer the readers to [11] for an extensive comparison of various state-of-the-art approaches for approximate $K$-MIPS.

Our proposal is to exploit this rich approximate $K$-MIPS literature to achieve scalable training and inference in HMNs. Instead of filtering the memory with heuristics, we propose to organize the memory based on approximate $K$-MIPS algorithms and then train the reader to learn to perform MIPS. Specifically, consider the following softmax over the memory which the reader has to perform for every reading step to retrieve a set of relevant candidates:

$$R_{out} = \operatorname{softmax}\big(h(q) M^\top\big) \tag{2}$$

where $h(q) \in \mathbb{R}^d$ is the representation of the query and $M \in \mathbb{R}^{N \times d}$ is the memory, with $N$ being the total number of cells in the memory. We propose to replace this softmax with $\operatorname{softmax}^{(K)}$, which is defined as follows:

$$C = \operatorname{argmax}^{(K)} \; h(q) M^\top \tag{3}$$

$$R_{out} = \operatorname{softmax}^{(K)}\big(h(q) M^\top\big) = \operatorname{softmax}\big(h(q) M[C]^\top\big) \tag{4}$$

where $C$ is the set of indices of the top-$K$ MIP candidate cells and $M[C]$ is a sub-matrix of $M$ whose rows are indexed by $C$.

One advantage of using the $\operatorname{softmax}^{(K)}$ is that it naturally focuses on cells that would normally receive the strongest gradients during learning. That is, in a full softmax, the gradients are otherwise more dispersed across cells, given the large number of cells and despite many contributing a small gradient.
As our experiments will show, this results in slower training.

One problematic situation when learning with the $\operatorname{softmax}^{(K)}$ is when we are at the initial stages of training and the $K$-MIPS reader is not including the correct fact candidate. To avoid this issue, we always add the correct candidate to the top-$K$ candidates retrieved by the $K$-MIPS algorithm, effectively performing a fully supervised form of learning.

During training, the reader is updated by backpropagation from the output module, through the subset of memory cells. Additionally, the log-likelihood of the correct fact computed using $K$-softmax is also maximized. This second supervision helps the reader learn to modify the query such that the maximum inner product of the query with respect to the memory will yield the correct supporting fact in the top-$K$ candidate set.

Until now, we described the exact $K$-MIPS-based learning framework, which still requires a linear look-up over all memory cells and would be prohibitive for large-scale memories. In such scenarios, we can replace the exact $K$-MIPS in the training procedure with the approximate $K$-MIPS. This is achieved by deploying a suitable hierarchical memory structure. The same approximate $K$-MIPS-based reader can be used during the inference stage as well. Of course, approximate $K$-MIPS algorithms might not return the exact MIPS candidates and will likely hurt performance, but with the benefit of achieving scalability.

While the memory representation is fixed in this paper, updating the memory along with the query representation should improve the likelihood of choosing the correct fact. However, updating the memory will reduce the precision of the approximate $K$-MIPS algorithms, since all of them assume that the vectors in the memory are static. Designing efficient dynamic $K$-MIPS should improve the performance of HMNs even further, a challenge that we hope to address in future work.
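
To make the training-time read concrete, the following sketch computes the $\operatorname{softmax}^{(K)}$ of Equations 3 and 4 for a single query, forces the correct fact into the candidate set, and returns its negative log-likelihood. It is an illustration under stated assumptions: the names `k_softmax_nll`, `softmax`, and `dot` are hypothetical, the correct fact is appended to the top-$K$ set rather than replacing a candidate (the text does not say which), and the top-$K$ step uses an exact linear scan where an approximate $K$-MIPS index would be used at scale.

```python
import heapq
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def k_softmax_nll(h_q, memory, k, correct_idx):
    """Softmax restricted to the top-k cells (Eq. 3 and 4).

    Returns (candidate indices, probabilities over the candidates,
    negative log-likelihood of the correct fact)."""
    scores = [dot(h_q, cell) for cell in memory]  # exact scan (assumption)
    cand = heapq.nlargest(k, range(len(memory)), key=scores.__getitem__)
    if correct_idx not in cand:
        # Assumption: the correct fact is appended, so the set may hold k + 1 cells.
        cand.append(correct_idx)
    probs = softmax([scores[i] for i in cand])
    nll = -math.log(probs[cand.index(correct_idx)])
    return cand, probs, nll

# Toy example: the correct fact (index 0) is not in the top-2 set.
memory = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [-1.0, -1.0]]
cand, probs, nll = k_softmax_nll([1.0, 1.0], memory, k=2, correct_idx=0)
```

In this toy case the candidate set grows from two cells to three, so the loss is always defined even when retrieval misses the supporting fact early in training.
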
3.1 Reader with Clustering-based approximate K-MIPS

Clustering-based approximate $K$-MIPS was proposed in [11] and it has been shown to outperform various other state-of-the-art data-dependent and data-independent approximate $K$-MIPS approaches for inference tasks. As we will show in the experiments section, clustering-based MIPS also performs better when used to train HMNs. Hence, we focus our presentation on the clustering-based approach and propose changes that were found to be helpful for learning HMNs.

Following most of the other approximate $K$-MIPS algorithms, [11] converts MIPS to a Maximum Cosine Similarity Search (MCSS) problem:

$$\operatorname{argmax}^{(K)}_{i \in \mathcal{X}} \; \frac{q^\top x_i}{\|q\| \, \|x_i\|} = \operatorname{argmax}^{(K)}_{i \in \mathcal{X}} \; \frac{q^\top x_i}{\|x_i\|} \tag{5}$$

When all the data vectors $x_i$ have the same norm, MCSS is equivalent to MIPS. However, it is often restrictive to have this additional constraint. Instead, [11] appends additional dimensions to both query and data vectors to convert MIPS to MCSS. In HMN terminology, this would correspond to adding a few more dimensions to the memory cells and input representations.

The algorithm introduces two hyper-parameters, $U < 1$ and $m \in \mathbb{N}^*$. The first step is to scale all the vectors in the memory by the same factor, such that $\max_i \|x_i\|_2 = U$. We then apply two mappings, $P$ and $Q$, on the memory cells and on the input vector, respectively. These two mappings simply concatenate $m$ new components to the vectors and make the norms of the data points all roughly the same [9]. The mappings are defined as follows:

$$P(x) = \big[x,\; 1/2 - \|x\|_2^{2},\; 1/2 - \|x\|_2^{4},\; \ldots,\; 1/2 - \|x\|_2^{2^m}\big] \tag{6}$$

$$Q(x) = [x,\; 0,\; 0,\; \ldots,\; 0] \tag{7}$$

We thus have the following approximation of MIPS by MCSS for any query vector $q$:

$$\operatorname{argmax}^{(K)}_{i} \; q^\top x_i \simeq \operatorname{argmax}^{(K)}_{i} \; \frac{Q(q)^\top P(x_i)}{\|Q(q)\|_2 \cdot \|P(x_i)\|_2} \tag{8}$$

Once we convert MIPS to MCSS, we can use spherical $K$-means [12] or its hierarchical version to approximate and speed up the cosine similarity search. Once the memory is clustered, every read operation requires only $K$ dot products, where $K$ here denotes the number of cluster centroids.

Since this is an approximation, it is error-prone. As we are using this approximation for the learning process, this introduces some bias in gradients, which can affect the overall performance of HMN. To alleviate this bias, we propose three simple strategies.

• Instead of using only the top-$K$ candidates for a single read query, we also add the top-$K$ candidates retrieved for every other read query in the mini-batch. This serves two purposes. First, we can do efficient matrix multiplications by leveraging GPUs, since all the $K$-softmaxes in a mini-batch are over the same set of elements. Second, this also helps to decrease the bias introduced by the approximation error.
• For every read access, instead of only using the top few clusters that have the largest inner product with the read query, we also sample some clusters from the rest, based on a probability distribution log-proportional to the dot product with the cluster centroids. This also decreases the bias.
• We can also sample random blocks of memory and add them to the top-$K$ candidates.

We empirically investigate the effect of these variations in Section 5.5.

4 Related Work

Memory networks were introduced in [2] and have so far been applied to comprehension-based question answering [13, 14], large-scale question answering [4], and dialogue systems [15].
While [2] considered supervised memory networks in which the correct supporting fact is given during the training stage, [14] introduced semi-supervised memory networks that can learn the supporting fact by themselves. [3, 16] introduced Dynamic Memory Networks (DMNs), which can be considered as a memory network with two types of memory: a regular large memory and an episodic memory. Another related class of model is the Neural Turing Machine [1], which uses softmax-based soft attention. Later, [17] extended NTM to hard attention using reinforcement learning. [15, 4] alleviate the problem of the scalability of soft attention by having an initial keyword-based filtering stage, which reduces the number of facts being considered. Our work generalizes this filtering by using MIPS for filtering. This is desirable because MIPS can be applied for any modality of data or even when there is no overlap between the words in a question and the words in facts.

The softmax arises in various situations and most relevant to this work are scaling methods for large-vocabulary neural language modeling. In neural language modeling, the final layer is a softmax distribution over the next word and there exist several approaches to achieve scalability. [18] proposes a hierarchical softmax based on prior clustering of the words into a binary, or more generally $n$-ary, tree that serves as a fixed structure for the learning process of the model. The complexity of training is reduced from $O(n)$ to $O(\log n)$. Due to its clustering and tree structure, it resembles the clustering-based MIPS techniques we explore in this paper. However, the approaches differ at a fundamental level. Hierarchical softmax defines the probability of a leaf node as the product of all the probabilities computed by all the intermediate softmaxes on the way to that leaf node.
By contrast, an approximate MIPS search imposes no such constraining structure on the probabilistic model, and is better thought of as efficiently searching for the top winners of what amounts to a large ordinary flat softmax.

Other methods such as Noise Contrastive Estimation [19] and Negative Sampling [20] avoid an expensive normalization constant by sampling negative samples from some marginal distribution. By contrast, our approach approximates the softmax by explicitly including in its negative samples candidates that would likely have a large softmax value. [21] introduces an importance sampling approach that considers all the words in a mini-batch as the candidate set. This in general might also not include the MIPS candidates with the highest softmax values.

[22] is the only work that we know of proposing to use MIPS during learning. It proposes hashing-based MIPS to sort the hidden layer activations and reduce the computation in every layer. However, a small-scale application was considered, and data-independent methods like hashing will likely suffer as dimensionality increases.

5 Experiments

In this section, we report experiments on factoid question answering using hierarchical memory networks. Specifically, we use the SimpleQuestions dataset [4]. The aim of these experiments is not to achieve state-of-the-art results on this dataset. Rather, we aim to propose and analyze various approaches to make memory networks more scalable and explore the achieved tradeoffs between speed and accuracy.

5.1 Dataset

We use SimpleQuestions [4], which is a large-scale factoid question answering dataset. SimpleQuestions consists of 108,442 natural language questions, each paired with a corresponding fact from Freebase. Each fact is a triple (subject, relation, object) and the answer to the question is always the object. The dataset is divided into training (75,910), validation (10,845), and test (21,687) sets.
Unlike [4], who additionally considered FB2M (10M facts) or FB5M (12M facts) with keyword-based heuristics for filtering most of the facts for each question, we only use SimpleQuestions, with no keyword-based heuristics. This allows us to do a direct comparison with the full softmax approach in a reasonable amount of time. Moreover, we would like to highlight that for this dataset, keyword-based filtering is a very efficient heuristic since all questions have an appropriate source entity with a matching word. Nevertheless, our goal is to design a general-purpose architecture without such strong assumptions on the nature of the data.

5.2 Model

Let $V_q$ be the vocabulary of all words in the natural language questions. Let $W_q$ be a $|V_q| \times m$ matrix where each row is some $m$-dimensional embedding for a word in the question vocabulary. This matrix is initialized with random values and learned during training. Given any question, we represent it with a bag-of-words representation by summing the vector representation of each word in the question. Let $q = \{w_i\}_{i=1}^{p}$,

$$h(q) = \sum_{i=1}^{p} W_q[w_i]$$

Then, to find the relevant fact from the memory $M$, we call the $K$-MIPS-based reader module with $h(q)$ as the query. This uses Equations 3 and 4 to compute the output of the reader $R_{out}$. The reader is trained by minimizing the Negative Log Likelihood (NLL) of the correct fact:

$$J_\theta = \sum_{i=1}^{N} -\log\big(R_{out}[f_i]\big)$$

where $f_i$ is the index of the correct fact in the memory embedding matrix $W_m$ (the memory $M$ above). We fix the memory embeddings to the TransE [23] embeddings and learn only the question embeddings. This model is simpler than the one reported in [4] so that it is easy to analyze the effect of various memory reading strategies.

5.3 Training Details

We trained the model with the Adam optimizer [24], with a fixed learning rate of 0.001. We used mini-batches of size 128.
We used 200-dimensional embeddings for the TransE entities, yielding 600-dimensional embeddings for facts by concatenating the embeddings of the subject, relation and object. We also experimented with summing the entities in the triple instead of concatenating, but we found that it was difficult for the model to differentiate facts this way. The only parameters learned by the HMN model are the question word embeddings. The entity distribution in SimpleQuestions is extremely sparse and hence, following [4], we also add artificial questions for all the facts for which we do not have natural language questions. Unlike [4], we do not add any other additional tasks like paraphrase detection to the model, mainly to study the effect of the reader. We stopped training for all the models when the validation accuracy consistently decreased for 3 epochs.

5.4 Exact K-MIPS improves accuracy

In this section, we compare the performance of the full soft attention reader and exact $K$-MIPS attention readers. Our goal is to verify that $K$-MIPS attention is in fact a valid and useful attention mechanism and see how it fares when compared to full soft attention. For $K$-MIPS attention, we tried $K \in \{10, 50, 100, 1000\}$. We would like to emphasize that, at training time, along with the $K$ candidates for a particular question, we also add the $K$ candidates for each question in the mini-batch. So the exact size of the softmax layer would be higher than $K$ during training. In Table 1, we report the test performance of memory networks using the soft attention reader and the $K$-MIPS attention reader. We also report the average softmax size during training. From the table, it is clear that the $K$-MIPS attention readers improve the performance of the network compared to the soft attention reader. In fact, the smaller the value of $K$, the better the performance. This result suggests that it is better to use a $K$-MIPS layer instead of a softmax layer whenever possible.
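
The reason the training-time softmax is larger than $K$ can be seen in a small sketch of the shared candidate set: the top-$K$ sets of all queries in a mini-batch are merged, and every query is then scored against the same merged set. The function names `top_k`, `batch_candidates`, and `batch_scores` and the toy values are assumptions for illustration, and nested lists stand in for the dense matrix product that would run on a GPU.

```python
import heapq

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query, memory, k):
    """Indices of the k cells with the largest inner product with `query`."""
    scores = [dot(query, cell) for cell in memory]
    return heapq.nlargest(k, range(len(memory)), key=scores.__getitem__)

def batch_candidates(queries, memory, k):
    """Union of the per-query top-k sets, as a sorted list of indices."""
    shared = set()
    for q in queries:
        shared.update(top_k(q, memory, k))
    return sorted(shared)

def batch_scores(queries, memory, cand):
    """Score matrix of shape (len(queries), len(cand)); each row would be
    passed through a softmax over the shared candidate set."""
    return [[dot(q, memory[i]) for i in cand] for q in queries]

# Toy mini-batch of two queries with K = 1 (illustrative values only).
memory = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [-1.0, -1.0]]
queries = [[1.0, 0.0], [-1.0, -1.0]]
cand = batch_candidates(queries, memory, k=1)
scores = batch_scores(queries, memory, cand)
```

Here each query contributes one candidate, yet both are scored against two cells, so each softmax also sees the other query's candidate as an extra negative.
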
It is interesting to see that the convergence of the model is not slowed down due to this change in softmax computation (as shown in Figure 1).

| Model | Test Acc. | Avg. Softmax Size |
|---|---|---|
| Full-softmax | 59.5 | 108442 |
| 10-MIPS | 62.2 | 1290 |
| 50-MIPS | 61.2 | 6180 |
| 100-MIPS | 60.6 | 11928 |
| 1000-MIPS | 59.6 | 70941 |
| Clustering | 51.5 | 20006 |
| PCA-Tree | 32.4 | 21108 |
| WTA-Hash | 40.2 | 20008 |

Table 1: Accuracy on the SQ test set and average size of memory used. 10-softmax achieves high performance while using only a small amount of memory.

[Figure 1 plot: validation error (y-axis) against epochs (x-axis) for softmax, 10-softmax, 50-softmax, 100-softmax, and 1000-softmax.]

Figure 1: Validation curve for various models. Convergence is not slowed down by k-softmax.

This experiment confirms the usefulness of $K$-MIPS attention. However, exact $K$-MIPS has the same complexity as a full softmax. Hence, to scale up the training, we need more efficient forms of $K$-MIPS attention, which is the focus of the next experiment.

5.5 Approximate K-MIPS based learning

As mentioned previously, designing faster algorithms for $K$-MIPS is an active area of research. [11] compared several state-of-the-art data-dependent and data-independent methods for faster approximate $K$-MIPS and it was found that clustering-based MIPS performs significantly better than other approaches. However, the focus of the comparison was on performance during the inference stage. In HMNs, $K$-MIPS must be used at both the training and inference stages. To verify whether the same trend can be seen during the learning stage as well, we compared three different approaches:

Clustering: This was explained in detail in Section 3.1.

WTA-Hash: Winner Takes All hashing [25] is a hashing-based $K$-MIPS algorithm which also converts MIPS to MCSS by augmenting the vectors with additional dimensions. This method uses $n$ hash functions and each hash function does $p$ different random permutations of the vector.
Then the prefix constituted by the first $k$ elements of each permuted vector is used to construct the hash for the vector.

PCA-Tree: PCA-Tree [7] is the state-of-the-art tree-based method, which converts MIPS to nearest neighbor search (NNS) by vector augmentation. It uses the principal components of the data to construct a balanced binary tree with data residing in the leaves.

For a fair comparison, we varied the hyper-parameters of each algorithm in such a way that the average speedup is approximately the same. Table 1 shows the performance of all three methods, compared to a full softmax. From the table, it is clear that the clustering-based method performs significantly better than the other two methods. However, performances are lower when compared to the performance of the full softmax.

As a next experiment, we analyze the various strategies proposed in Section 3.1 to reduce the approximation bias of clustering-based $K$-MIPS:

Top-K: This strategy picks the vectors in the top $K$ clusters as candidates.

Sample-K: This strategy samples $K$ clusters, without replacement, based on a probability distribution based on the dot product of the query with the cluster centroids. When combined with the Top-K strategy, we ignore clusters selected by the Top-K strategy for sampling.

Rand-block: This strategy divides the memory into several blocks and uniformly samples a random block as candidate.

We experimented with 1000 clusters and 2000 clusters. While comparing various training strategies, we made sure that the effective speedup is approximately the same. Memory access to facts per query for all the models is approximately 20,000, hence yielding a 5X speedup.

| Top-K | Sample-K | Rand-block | Test Acc. (1000 clusters) | Epochs (1000 clusters) | Test Acc. (2000 clusters) | Epochs (2000 clusters) |
|---|---|---|---|---|---|---|
| Yes | No | No | 50.2 | 16 | 51.5 | 22 |
| No | Yes | No | 52.5 | 68 | 52.8 | 63 |
| Yes | Yes | No | 52.8 | 31 | 53.1 | 26 |
| Yes | No | Yes | 51.8 | 32 | 52.3 | 26 |
| Yes | Yes | Yes | 52.5 | 38 | 52.7 | 19 |

Table 2: Accuracy on the SQ test set and number of epochs for convergence.
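
For readers who want to see how the Top-K and Sample-K strategies compose, the sketch below selects clusters for one query. It rests on assumptions that go beyond the text: the sampling distribution is taken to be a softmax over the query-centroid dot products (one reading of "log-proportional" in Section 3.1), the name `select_clusters` is hypothetical, and the centroids are toy values. The candidate cells for the read would then be the members of the returned clusters.

```python
import math
import random

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def select_clusters(query, centroids, n_top, n_sample, rng):
    """Top-K plus Sample-K: take the n_top best-scoring clusters, then draw
    n_sample more, without replacement, from the remaining clusters."""
    scores = [dot(query, c) for c in centroids]
    order = sorted(range(len(centroids)), key=scores.__getitem__, reverse=True)
    top, rest = order[:n_top], order[n_top:]
    # Assumed distribution: weight of a cluster is exp(score), shifted for stability.
    m = max(scores)
    weights = {i: math.exp(scores[i] - m) for i in rest}
    sampled = []
    for _ in range(min(n_sample, len(rest))):
        pool = list(weights)
        pick = rng.choices(pool, weights=[weights[i] for i in pool])[0]
        sampled.append(pick)
        del weights[pick]  # without replacement
    return top, sampled

# Five toy centroids in 2 dimensions (illustrative values only).
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.7, 0.7]]
rng = random.Random(0)
top, sampled = select_clusters([1.0, 0.2], centroids, n_top=2, n_sample=2, rng=rng)
```

Excluding the Top-K clusters from the sampling pool mirrors the combination rule stated above, so the two strategies never select the same cluster twice.
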
Results are gi ven in T able 2. W e observe that the best approach is to combine the T op-K and Sample-K strategies, with Rand-block not being beneﬁcial. Interestingly , the worst performances correspond to cases where the Sample-K strategy is ignored. 6 Conclusion In this paper, we proposed a hierarchical memory network that exploits K -MIPS for its attention- based reader . Unlike soft attention readers, K -MIPS attention reader is easily scalable to larger memories. This is achie ved by organizing the memory in a hierarchical way . Experiments on the SimpleQuestions dataset demonstrate that e xact K -MIPS attention is better than soft attention. Howe ver , existing state-of-the-art approximate K -MIPS techniques provide a speedup at the cost of some accuracy . Future research will inv estigate designing efﬁcient dynamic K -MIPS algorithms, where the memory can be dynamically updated during training. This should reduce the approximation bias and hence improv e the ov erall performance. 8 References [1] Alex Gra ves, Gre g W ayne, and Ivo Danihelka. Neural turing machines. arXiv pr eprint arXiv:1410.5401 , 2014. [2] Jason W eston, Sumit Chopra, and Antoine Bordes. Memory netw orks. In Pr oceedings Of The International Confer ence on Repr esentation Learning (ICLR 2015) , 2015. In Press. [3] Ankit Kumar et al. Ask me anything: Dynamic memory networks for natural language processing. CoRR , abs/1506.07285, 2015. [4] Antoine Bordes, Nicolas Usunier , Sumit Chopra, and Jason W eston. Large-scale simple question answering with memory networks. arXiv pr eprint arXiv:1506.02075 , 2015. [5] Ronald J. W illiams. Simple statistical gradient-follo wing algorithms for connectionist reinforcement learning. Machine Learning , 8:229–256, 1992. [6] Parikshit Ram and Alexander G. Gray . Maximum inner-product search using cone trees. KDD ’12, pages 931–939, 2012. [7] Y oram Bachrach et al. 
Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces. RecSys '14, pages 257–264, 2014.
[8] Anshumali Shrivastava and Ping Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems 27, pages 2321–2329, 2014.
[9] Anshumali Shrivastava and Ping Li. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In Proceedings of Conference on Uncertainty in Artificial Intelligence (UAI), 2015.
[10] Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric LSHs for inner product search. In Proceedings of the 31st International Conference on Machine Learning, 2015.
[11] Alex Auvolat, Sarath Chandar, Pascal Vincent, Hugo Larochelle, and Yoshua Bengio. Clustering is efficient for approximate maximum inner product search. arXiv preprint arXiv:1507.05910, 2015.
[12] Shi Zhong. Efficient online spherical k-means clustering. In Neural Networks, 2005. IJCNN'05. Proceedings. 2005 IEEE International Joint Conference on, volume 5, pages 3180–3185. IEEE, 2005.
[13] Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. Towards AI-complete question answering: a set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
[14] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. arXiv preprint arXiv:1503.08895, 2015.
[15] Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. Evaluating prerequisite qualities for learning end-to-end dialog systems. CoRR, abs/1511.06931, 2015.
[16] Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. CoRR, abs/1603.01417, 2016.
[17] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural Turing machines. CoRR, abs/1505.00521, 2015.
[18] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of AISTATS, pages 246–252, 2005.
[19] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
[20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, Workshop Track, 2013.
[21] Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of ACL 2015, pages 1–10, 2015.
[22] Ryan Spring and Anshumali Shrivastava. Scalable and sustainable deep learning via randomized hashing. CoRR, abs/1602.08194, 2016.
[23] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in NIPS, pages 2787–2795, 2013.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[25] Sudheendra Vijayanarasimhan, Jon Shlens, Rajat Monga, and Jay Yagnik. Deep networks with large output spaces. arXiv preprint arXiv:1412.7479, 2014.
