Compositional Vector Space Models for Knowledge Base Completion

Arvind Neelakantan, Benjamin Roth, Andrew McCallum
Department of Computer Science
University of Massachusetts, Amherst
Amherst, MA, 01003
{arvind,beroth,mccallum}@cs.umass.edu

Abstract

Knowledge base (KB) completion adds new facts to a KB by making inferences from existing facts, for example by inferring with high likelihood nationality(X,Y) from bornIn(X,Y). Most previous methods infer simple one-hop relational synonyms like this, or use as evidence a multi-hop relational path treated as an atomic feature, like bornIn(X,Z) → containedIn(Z,Y). This paper presents an approach that reasons about conjunctions of multi-hop relations non-atomically, composing the implications of a path using a recurrent neural network (RNN) that takes as inputs vector embeddings of the binary relations in the path. Not only does this allow us to generalize to paths unseen at training time, but also, with a single high-capacity RNN, to predict new relation types not seen when the compositional model was trained (zero-shot learning). We assemble a new dataset of over 52M relational triples, and show that our method improves over a traditional classifier by 11%, and over a method leveraging pre-trained embeddings by 7%.

1 Introduction

Constructing large knowledge bases (KBs) supports downstream reasoning about resolved entities and their relations, rather than the noisy textual evidence surrounding their natural language mentions. For this reason KBs have been of increasing interest in both industry and academia (Bollacker et al., 2008; Suchanek et al., 2007; Carlson et al., 2010). Such KBs typically contain many millions of facts, most of them (entity1, relation, entity2) "triples" (also known as binary relations) such as (Barack Obama, presidentOf, USA) and (Brad Pitt, marriedTo, Angelina Jolie).

However, even the largest KBs are woefully incomplete (Min et al., 2013), missing many important facts, and therefore damaging their usefulness in downstream tasks. Ironically, these missing facts can frequently be inferred from other facts already in the KB, thus representing a sort of inconsistency that can be repaired by the application of an automated process. The addition of new triples by leveraging existing triples is typically known as KB completion.

Early work on this problem focused on learning symbolic rules. For example, Schoenmackers et al. (2010) learn Horn clauses predictive of new binary relations by exhaustively exploring relational paths of increasing length, and selecting those surpassing an accuracy threshold. (A "path" is a sequence of triples in which the second entity of each triple matches the first entity of the next triple.) Lao et al. (2011) introduced the Path Ranking Algorithm (PRA), which greatly improves efficiency and robustness by replacing exhaustive search with random walks, and using unique paths as features in a per-target-relation binary classifier. A typical predictive feature learned by PRA is that CountryOfHeadquarters(X, Y) is implied by IsBasedIn(X, A) and StateLocatedIn(A, B) and CountryLocatedIn(B, Y). Given IsBasedIn(Microsoft, Seattle), StateLocatedIn(Seattle, Washington) and CountryLocatedIn(Washington, USA), we can infer the fact CountryOfHeadquarters(Microsoft, USA) using the predictive feature. In later work, Lao et al. (2012) greatly increase the available raw material for paths by augmenting KB-schema relations with relations defined by the text connecting mentions of entities in a large corpus (also known as OpenIE relations (Banko et al., 2007)).

However, these symbolic methods can produce many millions of distinct paths, each of which is categorically distinct, treated by PRA as a distinct feature. (See Figure 1.)
Even putting aside the OpenIE relations, this limits the applicability of these methods to modern KBs that have thousands of relation types, since the number of distinct paths increases rapidly with the number of relation types. If textually-defined OpenIE relations are included, the problem is far more severe.

Better generalization can be gained by operating on embedded vector representations of relations, in which vector similarity can be interpreted as semantic similarity. For example, Bordes et al. (2013) learn low-dimensional vector representations of entities and KB relations, such that vector differences between two entities should be close to the vectors associated with their relations. This approach can find relation synonyms, and thus perform a kind of one-to-one, non-path-based relation prediction for KB completion. Similarly, Nickel et al. (2011) and Socher et al. (2013a) perform KB completion by learning embeddings of relations, but based on matrices or tensors. Universal schema (Riedel et al., 2013) learns to perform relation prediction cast as matrix completion (likewise using vector embeddings), but predicts textually-defined OpenIE relations as well as KB relations, and embeds entity-pairs in addition to individual entities. Like all of the above, it also reasons about individual relations, not the evidence of a connected path of relations.

This paper proposes an approach combining the advantages of (a) reasoning about conjunctions of relations connected in a path, (b) generalization through vector embeddings, and (c) reasoning non-atomically and compositionally about the elements of the path, for further generalization.

Our method uses recurrent neural networks (RNNs) (Werbos, 1990) to compose the semantics of relations in an arbitrary-length path.
At each path-step it consumes both the vector embedding of the next relation and the vector representing the path-so-far, then outputs a composed vector (representing the extended path-so-far), which will be the input to the next step. After consuming a path, the RNN should output a vector in the semantic neighborhood of the relation between the first and last entity of the path. For example, after consuming the relation vectors along the path Melinda Gates → Bill Gates → Microsoft → Seattle, our method produces a vector very close to the relation livesIn.

[Figure 1: Semantically similar paths connecting entity pair (Microsoft, USA).]

Our compositional approach allows us at test time to make predictions from paths that were unseen during training, because of the generalization provided by vector neighborhoods, and because they are composed in non-atomic fashion. This allows our model to seamlessly perform inference on many millions of paths in the KB graph. In most of our experiments, we learn a separate RNN for predicting each relation type, but alternatively, by learning a single high-capacity composition function for all relation types, our method can perform zero-shot learning, that is, predicting new relation types for which the composition function was never explicitly trained.

Related to our work, new versions of PRA (Gardner et al., 2013; Gardner et al., 2014) use pre-trained vector representations of relations to alleviate its feature explosion problem, but the core mechanism continues to be a classifier based on atomic-path features.
In the 2013 work many paths are collapsed by clustering paths according to their relations' embeddings, and substituting cluster ids for the original relation types. In the 2014 work unseen paths are mapped to nearby paths seen at training time, where nearness is measured using the embeddings. Neither is able to perform zero-shot learning since there must be a classifier for each predicted relation type. Furthermore, their pre-trained vectors do not have the opportunity to be tuned to the KB completion task because the two sub-tasks are completely disentangled.

An additional contribution of our work is a new large-scale dataset of over 52 million triples, and its preprocessing for purposes of path-based KB completion (it can be downloaded from http://iesl.cs.umass.edu/downloads/inferencerules/release.tar.gz). The dataset is built from the combination of Freebase (Bollacker et al., 2008) and Google's entity linking in ClueWeb (Orr et al., 2013). Rather than Gardner's 1000 distinct paths per relation type, we have over 2 million. Rather than Gardner's 200 entity pairs, we use over 10k. All experimental comparisons below are performed on this new dataset.

[Figure 2: Vector representations of the paths are computed by applying the composition function recursively.]

On this challenging large-scale dataset our compositional method outperforms PRA (Lao et al., 2012) and Cluster PRA (Gardner et al., 2013) by 11% and 7% respectively. A further contribution of our work is a new, surprisingly strong baseline method using classifiers of path bigram features, which beats PRA and Cluster PRA, and statistically ties our compositional method. Our analysis shows that our method has substantially different strengths than the new baseline, and the combination of the two yields a 15% improvement over Gardner et al. (2013).
We also show that our zero-shot model is indeed capable of predicting new unseen relation types.

2 Background

We give background on PRA, which we use to obtain a set of paths connecting the entity pairs, and on the RNN model, which we employ to model the composition function.

2.1 Path Ranking Algorithm

Since it is impractical to exhaustively obtain the set of all paths connecting an entity pair in the large KB graph, we use PRA (Lao et al., 2011) to obtain a set of paths connecting the entity pairs. Given a training set of entity pairs for a relation, PRA heuristically finds a set of paths by performing random walks from the source and target nodes, keeping the most common paths. We use PRA to find millions of distinct paths per relation type. We do not use the random walk probabilities given by PRA since using them did not yield improvements in our experiments.

2.2 Recurrent Neural Networks

A recurrent neural network (RNN) (Werbos, 1990) is a neural network that constructs vector representations for sequences (of any length). For example, an RNN model can be used to construct vector representations for phrases or sentences (of any length) in natural language by applying a composition function (Mikolov et al., 2010; Sutskever et al., 2014; Vinyals et al., 2014). The vector representation of a phrase $(w_1, w_2)$ consisting of words $w_1$ and $w_2$ is given by

$$f(W[v(w_1); v(w_2)])$$

where $v(w) \in \mathbb{R}^d$ is the vector representation of $w$, $f$ is an element-wise non-linearity function, $[a; b]$ represents the concatenation of two vectors $a$ and $b$ along with a bias term, and $W \in \mathbb{R}^{d \times (2d+1)}$ is the composition matrix. This operation can be repeated to construct vector representations of longer phrases.

3 Recurrent Neural Networks for KB Completion

This paper proposes an RNN model for KB completion that reasons on the paths connecting an entity pair to predict missing relation types.
The vector representations of the paths (of any length) in the KB graph are computed by applying the composition function recursively as shown in Figure 2. To compute the vector representations for the higher nodes in the tree, the composition function consumes the vector representation of the node's two children nodes and outputs a new vector of the same dimension. Predictions about missing relation types are made by comparing the vector representation of the path with the vector representation of the relation using the sigmoid function.

We represent each binary relation using a $d$-dimensional real valued vector. We model composition using recurrent neural networks (Werbos, 1990). We learn a separate composition matrix for every relation that is predicted.

Let $v_r(\delta) \in \mathbb{R}^d$ be the vector representation of relation $\delta$ and $v_p(\pi) \in \mathbb{R}^d$ be the vector representation of path $\pi$. $v_p(\pi)$ denotes the relation vector if path $\pi$ is of length one. To predict relation $\delta$ = CountryOfHeadquarters, the vector representation of the path $\pi$ = IsBasedIn → StateLocatedIn containing two relations IsBasedIn and StateLocatedIn is computed by (Figure 2),

$$v_p(\pi) = f(W_\delta[v_r(\text{IsBasedIn}); v_r(\text{StateLocatedIn})])$$

where $f$ = sigmoid is the element-wise non-linearity function, $W_\delta \in \mathbb{R}^{d \times (2d+1)}$ is the composition matrix for $\delta$ = CountryOfHeadquarters and $[a; b]$ represents the concatenation of two vectors $a \in \mathbb{R}^d$, $b \in \mathbb{R}^d$ along with a bias feature to get a new vector $[a; b] \in \mathbb{R}^{2d+1}$.

The vector representation of the path $\Pi$ = IsBasedIn → StateLocatedIn → CountryLocatedIn in Figure 2 is computed similarly by,

$$v_p(\Pi) = f(W_\delta[v_p(\pi); v_r(\text{CountryLocatedIn})])$$

where $v_p(\pi)$ is the vector representation of path IsBasedIn → StateLocatedIn. While computing the vector representation of a path we always traverse left to right, composing the relation vector on the right with the accumulated path vector on the left [footnote 1].
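
The left-to-right composition and the sigmoid comparison described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `compose`, `path_vector` and `score` are hypothetical, vectors are ordinary Python lists rather than arrays from a numerical library, and the comparison is written as the sigmoid of a dot product, which is the form the paper gives for the fact probability in Section 3.1.

```python
import math


def sigmoid(x):
    """Logistic function, used here as the element-wise non-linearity f."""
    return 1.0 / (1.0 + math.exp(-x))


def compose(W, left, right):
    """One composition step: f(W [left; right; 1]).

    W is a d x (2d + 1) matrix given as a list of rows; the trailing 1.0
    is the bias feature appended to the concatenation.
    """
    x = list(left) + list(right) + [1.0]
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]


def path_vector(W, relation_vectors):
    """Compose the relation vectors of a path from left to right."""
    v = list(relation_vectors[0])  # a length-one path is its relation vector
    for r in relation_vectors[1:]:
        v = compose(W, v, r)       # accumulated path on the left, next relation on the right
    return v


def score(path_vec, relation_vec):
    """Sigmoid of the dot product between a path vector and a relation vector."""
    return sigmoid(sum(p * r for p, r in zip(path_vec, relation_vec)))
```

With d = 2 and an all-zero 2 x 5 matrix, every composition step returns [0.5, 0.5], which makes the recurrence easy to check by hand before plugging in learned parameters.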
This makes our model a recurrent neural network (Werbos, 1990).

Finally, we make a prediction regarding CountryOfHeadquarters(Microsoft, USA) using the path $\Pi$ = IsBasedIn → StateLocatedIn → CountryLocatedIn by comparing the vector representation of the path ($v_p(\Pi)$) with the vector representation of the relation CountryOfHeadquarters ($v_r(\text{CountryOfHeadquarters})$) using the sigmoid function.

3.1 Model Training

We train the model using the existing facts in a KB as positive examples; negative examples are obtained by treating the unobserved instances as negative (Mintz et al., 2009; Lao et al., 2011; Riedel et al., 2013; Bordes et al., 2013). Unlike previous work that uses RNNs (Socher et al., 2011; Iyyer et al., 2014; Irsoy and Cardie, 2014), a challenge with using them for our task is that among the set of paths connecting an entity pair, we do not observe which of the paths is predictive of a relation. We select the path that is closest to the relation type to be predicted in the vector space. This not only allows for faster training (compared to marginalization) but also gives improved performance. This technique has been successfully used in models other than RNNs previously (Weston et al., 2013; Neelakantan et al., 2014).

Footnote 1: We did not get significant improvements when we tried more sophisticated ordering schemes for computing the path representations.

Algorithm 1: Training algorithm of the RNN model for relation $\delta$

    Input: $\Lambda_\delta = \Lambda_\delta^+ \cup \Lambda_\delta^-$, $\Phi_\delta$, number of iterations $T$, mini-batch size $B$
    Initialize $v_r$, $W_\delta$ randomly
    for $t = 1, 2, \ldots, T$ do
        $\nabla v_r = 0$, $\nabla W_\delta = 0$ and $b = 0$
        for $\lambda = (\gamma, \delta) \in \Lambda_\delta$ do
            $\mu_\lambda = \arg\max_{\pi \in \Phi_\delta(\gamma)} v_p(\pi) \cdot v_r(\delta)$
            Accumulate gradients to $\nabla v_r$, $\nabla W_\delta$ using path $\mu_\lambda$
            $b = b + 1$
            if $b = B$ then
                Gradient update for $v_r$, $W_\delta$
                $\nabla v_r = 0$, $\nabla W_\delta = 0$ and $b = 0$
            end if
        end for
        if $b > 0$ then
            Gradient update for $v_r$, $W_\delta$
        end if
    end for
    Output: $v_r$, $W_\delta$

We assume that we are given a KB (for example, Freebase enriched with SVO triples) containing a set of entity pairs $\Gamma$, a set of relations $\Delta$ and a set of observed facts $\Lambda^+$, where each $\lambda = (\gamma, \delta) \in \Lambda^+$ ($\gamma \in \Gamma$, $\delta \in \Delta$) indicates a positive fact that entity pair $\gamma$ is in relation $\delta$. Let $\Phi_\delta(\gamma)$ denote the set of paths connecting entity pair $\gamma$ given by PRA for predicting relation $\delta$. In our task, we only observe the set of paths connecting an entity pair but we do not observe which of the paths is predictive of the fact. We treat this as a latent variable ($\mu_\lambda$ for the fact $\lambda$) and we assign $\mu_\lambda$ the path whose vector representation has maximum dot product with the vector representation of the relation to be predicted. For example, $\mu_\lambda$ for the fact $\lambda = (\gamma, \delta) \in \Lambda^+$ is given by,

$$\mu_\lambda = \arg\max_{\pi \in \Phi_\delta(\gamma)} v_p(\pi) \cdot v_r(\delta)$$

During training, we assign $\mu_\lambda$ using the current parameter estimates. We use the same procedure to assign $\mu_\lambda$ for unobserved facts that are used as negative examples during training.

We train a separate RNN model for predicting each relation, and the parameters of the model for predicting relation $\delta \in \Delta$ are $\Theta = \{v_r(\omega)\ \forall \omega \in \Delta,\ W_\delta\}$. Given a training set consisting of positive ($\Lambda_\delta^+$) and negative ($\Lambda_\delta^-$) instances [footnote 2] for relation $\delta$, the parameters are trained to maximize the log likelihood of the training set with L2 regularization:

$$\Theta^* = \arg\max_{\Theta} \sum_{\lambda = (\gamma, \delta) \in \Lambda_\delta^+} P(y_\lambda = 1; \Theta) + \sum_{\lambda = (\gamma, \delta) \in \Lambda_\delta^-} P(y_\lambda = 0; \Theta) - \rho \|\Theta\|^2$$

where $y_\lambda$ is a binary random variable which takes the value 1 if the fact $\lambda$ is true and 0 otherwise, and the probability of a fact $P(y_\lambda = 1; \Theta)$ is given by,

$$P(y_\lambda = 1; \Theta) = \text{sigmoid}(v_p(\mu_\lambda) \cdot v_r(\delta)), \quad \text{where } \mu_\lambda = \arg\max_{\pi \in \Phi_\delta(\gamma)} v_p(\pi) \cdot v_r(\delta)$$

and $P(y_\lambda = 0; \Theta) = 1 - P(y_\lambda = 1; \Theta)$. The relation vectors and the composition matrix are initialized randomly. We train the network using backpropagation through structure (Goller and Küchler, 1996).

Footnote 2: We sub-sample a portion of the set of all unobserved instances.

4 Zero-shot KB Completion

The KB completion task involves predicting facts on thousands of relation types, and it is highly desirable that a method can infer facts about relation types without directly training for them. Given the vector representation of the relations, we show that our model described in the previous section is capable of predicting relational facts without explicitly training for the target (or test) relation types (zero-shot learning).

In zero-shot or zero-data learning (Larochelle et al., 2008; Palatucci et al., 2009), some labels or classes are not available while training the model and only a description of those classes is given at prediction time. We make two modifications to the model described in the previous section, (1) learn a general composition matrix, and (2) fix relation vectors with pre-trained vectors, so that we can predict relations that are unseen during training. This ability of the model to generalize to unseen relations is beyond the capabilities of all previous methods for KB inference (Schoenmackers et al., 2010; Lao et al., 2011; Gardner et al., 2013; Gardner et al., 2014).

We learn a general composition matrix for all relations instead of learning a separate composition matrix for every relation to be predicted. So,
for example, the vector representation of the path $\pi$ = IsBasedIn → StateLocatedIn containing two relations IsBasedIn and StateLocatedIn is computed by (Figure 2),

$$v_p(\pi) = f(W[v_r(\text{IsBasedIn}); v_r(\text{StateLocatedIn})])$$

where $W \in \mathbb{R}^{d \times (2d+1)}$ is the general composition matrix.

We initialize the vector representations of the binary relations ($v_r$) using the representations learned in Riedel et al. (2013) and do not update them during training. The relation vectors are not updated because at prediction time we would be predicting relation types which are never seen during training, and hence their vectors would never get updated. We learn only the general composition matrix in this model. We train a single model for a set of relation types by replacing the sigmoid function with a softmax function while computing probabilities, and the parameters of the composition matrix are learned using the available training data containing instances of a few relations. The other aspects of the model remain unchanged.

To predict facts whose relation types are unseen during training, we compute the vector representation of the path using the general composition matrix and compute the probability of the fact using the pre-trained relation vector. For example, using the vector representation of the path $\Pi$ = IsBasedIn → StateLocatedIn → CountryLocatedIn in Figure 2, we can predict any relation, irrespective of whether it was seen during training, by comparing it with the pre-trained relation vectors.

5 Experiments

The hyperparameters of all the models were tuned on the same held-out development data. All the neural network models are trained for 150 iterations using 50-dimensional relation vectors, and we set the L2-regularizer and learning rate to 0.0001 and 0.1 respectively. We halved the learning rate after every 60 iterations and use mini-batches of size 20.
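
As a small worked illustration of the step schedule just described, the sketch below computes the learning rate in effect at a given iteration. The function name `learning_rate` and the 1-based iteration indexing are assumptions made for the example; the paper only states the initial rate (0.1), the halving interval (60 iterations) and the total number of iterations (150).

```python
def learning_rate(iteration, base_lr=0.1, halve_every=60):
    """Learning rate at a 1-indexed iteration when the rate is halved
    after every `halve_every` iterations (indexing is an assumption)."""
    return base_lr * 0.5 ** ((iteration - 1) // halve_every)


# Distinct rates used over a 150-iteration run, in order of first use.
rates_used = sorted({learning_rate(t) for t in range(1, 151)}, reverse=True)
```

Under these assumptions a 150-iteration run uses three rates: 0.1 for iterations 1 to 60, 0.05 for iterations 61 to 120, and 0.025 for the remaining 30 iterations.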
The neural networks and the classifiers were optimized using AdaGrad (Duchi et al., 2011).

5.1 Data

We ran experiments on Freebase (Bollacker et al., 2008) enriched with information from ClueWeb. We use the publicly available entity links to Freebase in the ClueWeb dataset (Orr et al., 2013). Hence, we create nodes only for Freebase entities in our KB graph. We remove facts containing /type/object/type as they do not give useful predictive information for our task. We get triples from ClueWeb by considering sentences that contain two entities linked to Freebase. We extract the phrase between the two entities and treat it as the relation type. For phrases that are of length greater than four we keep only the first and last two words. This helps us to avoid the time-consuming step of dependency parsing the sentence to get the relation type. These triples are similar to facts obtained by OpenIE (Banko et al., 2007). To reduce noise, we select relation types that occur at least 50 times. We evaluate on the 46 relation types in Freebase that have the largest number of instances. The methods are evaluated on a subset of facts in Freebase that were hidden during training. Table 1 shows important statistics of our dataset.

    Entities                                   18M
    Freebase triples                           40M
    ClueWeb triples                            12M
    Relations                                  25,994
    Relation types tested                      46
    Avg. paths/relation                        2.3M
    Avg. training facts/relation               6638
    Avg. positive test instances/relation      3492
    Avg. negative test instances/relation      43,160

Table 1: Statistics of our dataset.

5.2 Predictive Paths

Table 2 shows predictive paths for 4 relations learned by the RNN model. The high quality of unseen paths is indicative of the fact that the RNN model is able to generalize to paths that are never seen during training.

5.3 Results

Using our dataset, we compare the performance of the following methods:

PRA Classifier is the method in Lao et al.
(2012), which trains a logistic regression classifier by creating a feature for every path type.

Cluster PRA Classifier is the method in Gardner et al. (2013), which replaces relation types from ClueWeb triples with their cluster membership in the KB graph before the path finding step. After this step, their method proceeds in exactly the same manner as Lao et al. (2012), training a logistic regression classifier by creating a feature for every path type. We use pre-trained relation vectors from Riedel et al. (2013) and use k-means clustering to cluster the relation types into 25 clusters as done in Gardner et al. (2013).

Composition-Add uses a simple element-wise addition followed by sigmoid non-linearity as the composition function, similar to Yang et al. (2014).

RNN-random is the supervised RNN model described in Section 3 with the relation vectors initialized randomly.

RNN is the supervised RNN model described in Section 3 with the relation vectors initialized using the method in Riedel et al. (2013).

PRA Classifier-b is our simple extension to the method in Lao et al. (2012), which additionally uses bigrams in the path as features. We add a special start and stop symbol to the path before computing the bigram features.

Cluster PRA Classifier-b is our simple extension to the method in Gardner et al. (2013), which additionally uses bigram features computed as previously described.

RNN + PRA Classifier combines the predictions of RNN and PRA Classifier. We combine the predictions by assigning the score of a fact as the sum of its ranks in the two models after sorting the facts in ascending order.

RNN + PRA Classifier-b combines the predictions of RNN and PRA Classifier-b using the technique described previously.

Table 3 shows the results of our experiments. The method described in Gardner et al. (2014) is not included in the table since the publicly available implementation does not scale to our large dataset.
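
The rank-sum combination used for RNN + PRA Classifier can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function names are hypothetical, each model's output is assumed to be a dictionary mapping a fact to its score, both models are assumed to score the same set of facts, and ties are broken by insertion order, which the paper does not specify.

```python
def ranks_ascending(scores):
    """Map each fact to its 1-based rank after sorting by score, lowest first."""
    ordered = sorted(scores, key=lambda fact: scores[fact])
    return {fact: position + 1 for position, fact in enumerate(ordered)}


def combine_by_rank(scores_a, scores_b):
    """Combined score of a fact = its rank under model A + its rank under model B.

    Because ranks are taken in ascending order of score, a larger combined
    value means both models placed the fact nearer the top.
    """
    ranks_a = ranks_ascending(scores_a)
    ranks_b = ranks_ascending(scores_b)
    return {fact: ranks_a[fact] + ranks_b[fact] for fact in scores_a}
```

Summing ranks rather than raw scores sidesteps the fact that the two models' outputs are not on a comparable scale.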
First, we show that it is better to train the models using all the path types instead of using only the top 1,000 path types as done in previous work (Gardner et al., 2013; Gardner et al., 2014). We can see that the RNN model performs significantly better than the baseline methods of Lao et al. (2012) and Gardner et al. (2013). The performance of the RNN model is not affected by initialization since using random vectors and pre-trained vectors results in similar performance.

Relation: /book/written_work/original_language/ (book "x" written in language "y")
    Seen paths:
        /book/written_work/previous_in_series → /book/written_work/author → /people/person/nationality → /people/person/nationality^-1 → /people/person/languages
        /book/written_work/author → /people/ethnicity/people^-1 → /people/ethnicity/languages_spoken
    Unseen paths:
        "in"^-1 → "writer"^-1 → /people/person/nationality^-1 → /people/person/languages
        /book/written_work/author → addresses → /people/person/nationality^-1 → /people/person/languages

Relation: /people/person/place_of_birth/ (person "x" born in place "y")
    Seen paths:
        "was,born,in" → /location/mailing_address/citytown^-1 → /location/mailing_address/state_province_region
        "from" → /location/location/contains^-1
    Unseen paths:
        "born,in" → /location/location/contains → "near"^-1
        "was,born,in" → commonly,known,as^-1

Relation: /geography/river/cities/ (river "x" flows through or borders "y")
    Seen paths:
        "at" → /location/location/contains^-1
        "meets,the" → /transportation/bridge/body_of_water_spanned^-1 → /location/location/contains^-1 → "in"
    Unseen paths:
        /geography/lake/outflow^-1 → /location/location/contains^-1
        /geography/lake/outflow^-1 → /location/location/contains^-1 → "near"

Relation: /people/family/members/ (person "y" part of family "x")
    Seen paths:
        /royalty/monarch/royal_line^-1 → /people/person/children → /royalty/monarch/royal_line → /royalty/royal_line/monarchs_from_this_line
        /royalty/royal_line/monarchs_from_this_line → /people/person/parents^-1 → /people/person/parents^-1 → /people/person/parents^-1
    Unseen paths:
        /royalty/monarch/royal_line^-1 → "leader"^-1 → "king" → "was,married,to"^-1
        "of,the"^-1 → "but,also,of" → "married" → "defended"^-1

Table 2: Predictive paths, according to the RNN model, for 4 target relations. Two examples of seen and unseen paths are shown for each target relation. Inverse relations are marked by ^-1, i.e., $r(x, y) \Rightarrow r^{-1}(y, x)$, $\forall (x, y) \in r$. Relations within quotes are OpenIE (textual) relation types.

    Method                      MAP (train with top 1000 paths)    MAP (train with all paths)
    PRA Classifier              43.46                              51.31
    Cluster PRA Classifier      46.26                              53.23
    Composition-Add             40.23                              45.37
    RNN-random                  45.52                              56.91
    RNN                         46.61                              56.95
    PRA Classifier-b            48.09                              58.13
    Cluster PRA Classifier-b    48.72                              58.02
    RNN + PRA Classifier        49.92                              58.42
    RNN + PRA Classifier-b      51.94                              61.17

Table 3: Results comparing different methods on 46 types. All the methods perform better when trained using all the paths than when trained using the top 1,000 paths. When training with all the paths, RNN performs significantly ($p < 0.005$) better than PRA Classifier and Cluster PRA Classifier. The small difference in performance between RNN and both PRA Classifier-b and Cluster PRA Classifier-b is not statistically significant. The best results are obtained by combining the predictions of RNN with PRA Classifier-b, which performs significantly ($p < 10^{-5}$) better than both PRA Classifier-b and Cluster PRA Classifier-b.

A surprising result is the impressive performance of our simple extension to the classifier approach. After the addition of bigram features, the naive PRA method is as effective as the Cluster PRA method. The small difference in performance between RNN and both PRA Classifier-b and Cluster PRA Classifier-b is not statistically significant. We conjecture that our method has substantially different strengths than the new baseline.
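
The bigram features behind PRA Classifier-b, as described in the method list above, can be sketched like this. The start and stop symbols `<s>` and `</s>`, the function names, and the use of counts rather than binary indicators are assumptions for illustration; the paper states only that special start and stop symbols are added to the path before the bigrams are computed.

```python
from collections import Counter


def path_bigrams(path, start="<s>", stop="</s>"):
    """Bigrams of a relation path after padding with start and stop symbols."""
    padded = [start] + list(path) + [stop]
    return list(zip(padded, padded[1:]))


def bigram_features(paths):
    """Count bigrams over all paths that connect one entity pair."""
    counts = Counter()
    for path in paths:
        counts.update(path_bigrams(path))
    return counts
```

For the path IsBasedIn → StateLocatedIn → CountryLocatedIn this yields four bigrams, two of which involve the padding symbols. Two paths that differ in a single relation still share most of their bigrams, which is one way to read the "local structure" the classifier is said to memorize.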
While the classifier with bigram features has an ability to accurately memorize important local structure, the RNN model generalizes better to unseen paths that are very different from the paths seen in training. Empirically, combining the predictions of RNN and PRA Classifier-b achieves a statistically significant gain over PRA Classifier-b.

5.3.1 Zero-shot

Table 4 shows the results of the zero-shot model described in Section 4 compared with the fully supervised RNN model (Section 3) and a baseline that produces a random ordering of the test facts. We evaluate on 10 randomly selected (out of 46) relation types, hence for the fully supervised version we train 10 RNNs, one for each relation type. For evaluating the zero-shot model, we randomly split the relations into two sets of equal size, train a zero-shot model on one set and test on the other set. So, in this case we have two RNNs making predictions on relation types that they have never seen during training. As expected, the fully supervised RNN outperforms the zero-shot model by a large margin, but the zero-shot model, without using any direct supervision, clearly performs much better than a random baseline.

    Method       MAP (train with top 1000 paths)    MAP (train with all paths)
    RNN          43.82                              50.10
    zero-shot    19.28                              20.61
    Random                          7.59

Table 4: Results comparing the zero-shot model with supervised RNN and a random baseline on 10 types. RNN is the fully supervised model described in Section 3 while zero-shot is the model described in Section 4. The zero-shot model, without explicitly training for the target relation types, achieves impressive results by performing significantly ($p < 0.05$) better than a random baseline.

5.3.2 Discussion

To investigate whether the performance of the RNNs was affected by multiple local optima issues, we combined the predictions of five different RNNs trained using all the paths.
Apart from RNN and RNN-random, we trained three more RNNs with different random initializations; the individual performance of these three RNNs is 57.09, 57.11 and 56.91. The performance of the ensemble is 59.16, and it stopped improving after using three RNNs. So, this indicates that even though multiple local optima affect the performance, they are likely not the only issue, since the performance of the ensemble is still less than the performance of RNN + PRA Classifier-b.

We suspect the RNN model does not capture some of the important local structure as well as the classifier using bigram features. To overcome this drawback, in future work, we plan to explore compositional models that have a longer memory (Hochreiter and Schmidhuber, 1997; Cho et al., 2014; Mikolov et al., 2014). We also plan to include vector representations for the entities and develop models that address the issue of polysemy in verb phrases (Cheng et al., 2014).

6 Related Work

KB Completion includes methods such as Lin and Pantel (2001), Yates and Etzioni (2007) and Berant et al. (2011) that learn inference rules of length one. Schoenmackers et al. (2010) learn general inference rules by considering the set of all paths in the KB and selecting paths that satisfy a certain precision threshold. Their method does not scale well to modern KBs and also depends on carefully tuned thresholds. Lao et al. (2011) train a simple logistic regression classifier with NELL KB paths as features to perform KB completion, while Gardner et al. (2013) and Gardner et al. (2014) extend it by using pre-trained relation vectors to overcome feature sparsity. Recently, Yang et al. (2014) learn inference rules using simple element-wise addition or multiplication as the composition function.
Compositional Vector Space Models have been developed to represent phrases and sentences in natural language as vectors (Mitchell and Lapata, 2008; Baroni and Zamparelli, 2010; Yessenalina and Cardie, 2011). Neural networks have been successfully used to learn vector representations of phrases using the vector representations of the words in that phrase. Recurrent neural networks have been used for many tasks such as language modeling (Mikolov et al., 2010), machine translation (Sutskever et al., 2014) and parsing (Vinyals et al., 2014). Recursive neural networks, a more general version of the recurrent neural networks, have been used for many tasks like parsing (Socher et al., 2011), sentiment classification (Socher et al., 2012; Socher et al., 2013c; Irsoy and Cardie, 2014), question answering (Iyyer et al., 2014) and natural language logical semantics (Bowman et al., 2014). Our overall approach is similar to RNNs with attention (Bahdanau et al., 2014; Graves, 2013) since we select a path among the set of paths connecting the entity pair to make the final prediction.

Zero-shot or zero-data learning was introduced in Larochelle et al. (2008) for character recognition and drug discovery. Palatucci et al. (2009) perform zero-shot learning for neural decoding, while there has been plenty of work in this direction for image recognition (Socher et al., 2013b; Frome et al., 2013; Norouzi et al., 2014).

7 Conclusion

We develop a compositional vector space model for knowledge base completion using recurrent neural networks. On our challenging large-scale dataset, available at http://iesl.cs.umass.edu/downloads/inferencerules/release.tar.gz, our method outperforms two baseline methods and performs competitively with a modified stronger baseline. The best results are obtained by combining the predictions of our model with the predictions of the modified baseline, which achieves a 15% improvement over Gardner et al. (2013).
We also show that our model has the ability to perform zero-shot inference.

Acknowledgments

We thank Matt Gardner for releasing the PRA code, and for answering numerous questions about the code and data. We also thank the Stanford NLP group for releasing the neural networks code. This work was supported in part by the Center for Intelligent Information Retrieval, in part by DARPA under agreement number FA8750-13-2-0020, in part by an award from Google, and in part by NSF grant #CNS-0958392. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ArXiv.

Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In International Joint Conference on Artificial Intelligence.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Empirical Methods in Natural Language Processing.

Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2011. Global learning of typed entailment rules. In Association for Computational Linguistics.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems.

Samuel R. Bowman, Christopher Potts, and Christopher D Manning. 2014. Recursive neural networks for learning logical semantics. In CoRR.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka, and A. 2010. Toward an architecture for never-ending language learning. In AAAI.

Cheng, Jianpeng Kartsaklis, and Edward Grefenstette. 2014. Investigating the role of prior disambiguation in deep-learning compositional models of meaning. In Learning Semantics workshop NIPS.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Workshop on Syntax, Semantics and Structure in Statistical Translation.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. In Journal of Machine Learning Research.

Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. Devise: A deep visual-semantic embedding model. In Neural Information Processing Systems.

Matt Gardner, Partha Pratim Talukdar, Bryan Kisiel, and Tom M. Mitchell. 2013. Improving learning and inference in a large knowledge-base using latent syntactic cues. In Empirical Methods in Natural Language Processing.

Matt Gardner, Partha Talukdar, Jayant Krishnamurthy, and Tom Mitchell. 2014. Incorporating vector space similarity in random walk inference over knowledge bases. In Empirical Methods in Natural Language Processing.

Christoph Goller and Andreas Küchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In IEEE Transactions on Neural Networks.

Alex Graves. 2013. Generating sequences with recurrent neural networks. In ArXiv.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. In Neural Computation.

Ozan Irsoy and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Neural Information Processing Systems.

Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A neural network for factoid question answering over paragraphs. In Empirical Methods in Natural Language Processing.

Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Conference on Empirical Methods in Natural Language Processing.

Ni Lao, Amarnag Subramanya, Fernando Pereira, and William W. Cohen. 2012. Reading the web with learned syntactic-semantic inference rules. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. 2008. Zero-data learning of new tasks. In National Conference on Artificial Intelligence.

Dekang Lin and Patrick Pantel. 2001. Dirt - discovery of inference rules from text. In International Conference on Knowledge Discovery and Data Mining.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Annual Conference of the International Speech Communication Association.

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michaël Mathieu, and Marc'Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. In CoRR.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In HLT-NAACL, pages 777–782.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Association for Computational Linguistics.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Empirical Methods in Natural Language Processing.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In International Conference on Machine Learning.

Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg Corrado, and Jeffrey Dean. 2014. Zero-shot learning by convex combination of semantic embeddings. In International Conference on Learning Representations.

Dave Orr, Amarnag Subramanya, Evgeniy Gabrilovich, and Michael Ringgaard. 2013. 11 billion clues in 800 million documents: A web research corpus annotated with freebase concepts. http://googleresearch.blogspot.com/2013/07/11-billion-clues-in-800-million.html.

Mark Palatucci, Dean Pomerleau, Geoffrey Hinton, and Tom Mitchell. 2009. Zero-shot learning with semantic output codes. In Neural Information Processing Systems.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In HLT-NAACL.

Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order horn clauses from web text. In Empirical Methods in Natural Language Processing.

Richard Socher, Cliff Chiung-Yu Lin, Christopher D. Manning, and Andrew Y. Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 26th International Conference on Machine Learning (ICML).

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013a. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems.

Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. 2013b. Zero-shot learning through cross-modal transfer. In Neural Information Processing Systems.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013c. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2014. Grammar as a foreign language. In CoRR.

Paul Werbos. 1990. Backpropagation through time: what it does and how to do it. In IEEE.

Jason Weston, Ron Weiss, and Hector Yee. 2013. Nonlinear latent factorization by embedding multiple user interests. In ACM International Conference on Recommender Systems.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. In CoRR.

Alexander Yates and Oren Etzioni. 2007. Unsupervised resolution of objects and relations on the web. In North American Chapter of the Association for Computational Linguistics.

Ainur Yessenalina and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Empirical Methods in Natural Language Processing.
