Learning Multi-Relational Semantics Using Neural-Embedding Models

In this paper we present a unified framework for modeling multi-relational representations, scoring, and learning, and conduct an empirical study of several recent multi-relational embedding models under the framework. We investigate the different choices of relation operators based on linear and bilinear transformations, and also the effects of entity representations obtained by incorporating unsupervised vectors pre-trained on extra textual resources.

Authors: Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, Li Deng

Bishan Yang* (Cornell University, Ithaca, NY 14850, USA) bishan@cs.cornell.edu
Wen-tau Yih, Xiaodong He, Jianfeng Gao, Li Deng (Microsoft Research, Redmond, WA 98052, USA) {scottyih, xiaohe, jfgao, deng}@microsoft.com

Real-world entities (e.g., people and places) are often connected via relations, forming multi-relational data. Modeling multi-relational data is important in many research areas, from natural language processing to biological data mining [6]. Prior work on multi-relational learning can be grouped into three categories: (1) statistical relational learning (SRL) [10], such as Markov logic networks [21], which directly encode multi-relational graphs using probabilistic models; (2) path ranking methods [16, 7], which explicitly explore the large relational feature space of relations with random walks; and (3) embedding-based models, which embed multi-relational knowledge into low-dimensional representations of entities and relations via tensor/matrix factorization [24, 18, 19], Bayesian clustering frameworks [15, 26], and neural networks [20, 1, 2, 25]. Our work focuses on neural-embedding models, where the representations are learned in a neural network architecture. These models have proven to be powerful tools for multi-relational learning and inference, owing to their high scalability and strong generalization ability.

A number of techniques have recently been proposed to learn entity and relation representations using neural networks [1, 2, 25]. They all represent entities as low-dimensional vectors and represent relations as operators that combine the representations of two entities. The main difference among these techniques lies in the parametrization of the relation operators. For instance, given two entity vectors, the Neural Tensor Network (NTN) [25] represents each relation as a bilinear tensor operator together with a linear matrix operator.
The model of TransE [2], on the other hand, represents each relation as a single vector that linearly interacts with the entity vectors. Both models report promising performance on predicting unseen relationships in knowledge bases. However, they have not been directly compared in terms of their different choices of relation operators and the resulting effectiveness. Nor has the design of entity representations in these recent studies been carefully explored. For example, NTN [25] first showed the benefits of representing entities as an average of word vectors and of initializing word vectors with vectors pre-trained on large text corpora. This idea is promising, as pre-trained vectors tend to capture syntactic and semantic information from natural language and can help entity embeddings generalize better. However, many real-world entities are expressed as non-compositional phrases (e.g., person names, movie names), whose meaning cannot be composed from their constituent words. Averaging word vectors may therefore not provide an appropriate representation for such entities.

In this paper, we examine and compare different types of relation operators and entity vector representations under a general framework for multi-relational learning. Specifically, we derive several recently proposed embedding models, including TransE [2] and NTN [25], and their variants under the same framework. We empirically evaluate their performance on a knowledge base completion task using various real-world datasets in a controlled experimental setting and present several interesting findings. First, models with fewer parameters tend to outperform more complex models in terms of both performance and scalability. Second, the bilinear operator plays an important role in capturing entity interactions. Third, at the same model complexity, multiplicative operations are superior to additive operations in modeling relations.
Finally, initializing entity vectors with pre-trained phrase vectors can significantly boost performance, whereas representing entity vectors as an average of word vectors that are initialized with pre-trained vectors may hurt performance. These findings have further inspired us to design a simple knowledge base embedding model that significantly outperforms existing models in predicting unseen relationships, with a top-10 accuracy of 73.2% (vs. 54.7% by TransE) evaluated on Freebase.

*Work conducted while interning at Microsoft Research.

1 A General Framework for Multi-Relational Representation Learning

Most existing neural embedding models for multi-relational learning can be derived from a general framework. The input is a relation triplet (e_1, r, e_2) describing a subject e_1 and an object e_2 that are in a certain relation r. The output is a scalar measuring the validity of the relationship. Each input entity can be represented as a high-dimensional sparse vector (a "one-hot" index vector or an "n-hot" feature vector). The first neural network layer projects the input vectors to low-dimensional vectors, and the second layer projects these vectors to a real value for comparison via a relation-specific operator (which can also be viewed as a scoring function).

More formally, denote by x_{e_i} the input for entity e_i and by W the first-layer network parameters. The scoring function for a relation triplet (e_1, r, e_2) can be written as

    S_{(e_1, r, e_2)} = G_r(y_{e_1}, y_{e_2}), where y_{e_1} = f(W x_{e_1}), y_{e_2} = f(W x_{e_2})    (1)

Many choices for the form of the scoring function G_r are available.
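The two-layer scoring framework of Eq. (1) can be sketched as follows. The variable names (`score_triplet`, `entity index arguments`) are illustrative, not from the paper; since multiplying a "one-hot" input by W is just a row lookup, we store the projection as an embedding matrix. A bilinear-diagonal choice of G_r is plugged in purely for concreteness.

```python
import numpy as np

# Sketch of Eq. (1): S_(e1,r,e2) = G_r(y_e1, y_e2), with y_ei = f(W x_ei).
# A one-hot x_ei times W reduces to selecting a row of W.
rng = np.random.default_rng(0)
n_entities, dim = 5, 4
W = rng.normal(size=(n_entities, dim))   # first-layer projection parameters
f = np.tanh                              # elementwise nonlinearity

def score_triplet(e1, e2, G_r):
    """Score a triplet given entity indices and a relation-specific G_r."""
    y1, y2 = f(W[e1]), f(W[e2])
    return G_r(y1, y2)

# Example G_r: a bilinear form with a diagonal relation matrix.
v_r = rng.normal(size=dim)
s = score_triplet(0, 1, lambda y1, y2: float(y1 @ (v_r * y2)))
print(s)
```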
Most of the existing scoring functions in the literature can be unified in terms of a basic linear transformation g_r^a, a bilinear transformation g_r^b, or their combination, where g_r^a and g_r^b are defined as

    g_r^a(y_{e_1}, y_{e_2}) = A_r^T [y_{e_1}; y_{e_2}]  and  g_r^b(y_{e_1}, y_{e_2}) = y_{e_1}^T B_r y_{e_2},    (2)

parametrized by A_r and B_r, respectively. In Table 1, we summarize several popular scoring functions in the literature for a relation triplet (e_1, r, e_2), reformulated in terms of the above two functions. Denote by y_{e_1}, y_{e_2} ∈ R^n two entity vectors. Denote by Q_{r1}, Q_{r2} ∈ R^{m×n} and V_r ∈ R^n the matrix and vector parameters for the linear transformation g_r^a, and by M_r ∈ R^{n×n} and T_r ∈ R^{n×n×m} the matrix and tensor parameters for the bilinear transformation g_r^b. I ∈ R^{n×n} is the identity matrix, and u_r ∈ R^m is an additional parameter for relation r. The scoring function for TransE is derived from -||y_{e_1} - y_{e_2} + V_r||_2^2 as in [2] (constants dropped for unit-norm entity vectors).

| Model | B_r | A_r | Scoring Function G_r |
|---|---|---|---|
| Distance [3] | - | [Q_{r1}; -Q_{r2}] | -||g_r^a(y_{e_1}, y_{e_2})||_1 |
| Single Layer [25] | - | [Q_{r1}; Q_{r2}] | u_r^T tanh(g_r^a(y_{e_1}, y_{e_2})) |
| TransE [2] | I | [V_r; -V_r] | -(2 g_r^a(y_{e_1}, y_{e_2}) - 2 g_r^b(y_{e_1}, y_{e_2}) + ||V_r||_2^2) |
| Bilinear [14] | M_r | - | g_r^b(y_{e_1}, y_{e_2}) |
| NTN [25] | T_r | [Q_{r1}; Q_{r2}] | u_r^T tanh(g_r^a(y_{e_1}, y_{e_2}) + g_r^b(y_{e_1}, y_{e_2})) |

Table 1: Comparison of several multi-relational models in terms of their scoring functions.

This general framework for relationship modeling also applies to the recent deep structured semantic models [12, 22, 23, 9, 28], which learn the relevance or a single relation between a pair of word sequences. The framework above applies when multiple neural network layers are used to project entities and a relation-independent scoring function G_r(y_{e_1}, y_{e_2}) = cos(y_{e_1}(W_r), y_{e_2}(W_r)) is used.
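The TransE row of Table 1 can be checked numerically: expanding -||y_1 - y_2 + V_r||^2 for unit-norm entity vectors yields -(2 g_r^a - 2 g_r^b + ||V_r||^2) minus a constant, where g_r^a = V_r · (y_1 - y_2) and g_r^b = y_1 · y_2. The sketch below verifies this identity; variable names are ours, not the paper's.

```python
import numpy as np

# Verify: with ||y1|| = ||y2|| = 1,
#   -||y1 - y2 + V||^2 = -(2*g_a - 2*g_b + ||V||^2) - 2,
# where g_a = V.(y1 - y2)  (linear term, A_r = [V_r; -V_r])
# and   g_b = y1.y2        (bilinear term, B_r = I).
rng = np.random.default_rng(1)
dim = 10
y1 = rng.normal(size=dim); y1 /= np.linalg.norm(y1)
y2 = rng.normal(size=dim); y2 /= np.linalg.norm(y2)
V = rng.normal(size=dim)

g_a = V @ (y1 - y2)                  # g_r^a
g_b = y1 @ y2                        # g_r^b
lhs = -np.sum((y1 - y2 + V) ** 2)    # negated squared distance
rhs = -(2 * g_a - 2 * g_b + V @ V) - 2.0
print(abs(lhs - rhs))                # ~0 up to floating point
```

The additive constant does not affect the ranking of candidate triplets, which is why Table 1 can drop it.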
The cosine scoring function is a special case of g_r^b with normalized y_{e_1}, y_{e_2} and B_r = I.

The neural network parameters of all the models discussed above can be learned by minimizing a margin-based ranking objective (other objectives, such as mutual information as in [12] and reconstruction loss as in tensor decomposition approaches [4], can also be applied; comparisons among them are beyond the scope of this paper), which encourages the scores of positive relationships (triplets) to be higher than the scores of any negative relationships (triplets). Usually only positive triplets are observed in the data. Given a set of positive triplets T, we can construct a set of negative triplets T' by corrupting either one of the relation arguments:

    T' = {(e_1', r, e_2) | e_1' ∈ E, (e_1', r, e_2) ∉ T} ∪ {(e_1, r, e_2') | e_2' ∈ E, (e_1, r, e_2') ∉ T}.

The training objective is to minimize the margin-based ranking loss

    L(Ω) = Σ_{(e_1,r,e_2) ∈ T} Σ_{(e_1',r,e_2') ∈ T'} max{ S_{(e_1',r,e_2')} - S_{(e_1,r,e_2)} + 1, 0 }    (3)

2 Experiments and Discussion

Datasets and evaluation metrics. We used the WordNet (WN) and Freebase (FB15k) datasets introduced in [2]. WN contains 151,442 triplets with 40,943 entities and 18 relations, and FB15k consists of 592,213 triplets with 14,951 entities and 1,345 relations. We also consider a subset of FB15k (FB15k-401) containing only frequent relations (relations with at least 100 training examples). This results in 560,209 triplets with 14,541 entities and 401 relations. We use link prediction as our prediction task, as in [2]. For each test triplet, each entity is treated in turn as the target entity to be predicted. Scores are computed for the correct entity and all the corrupted entities in the dictionary and are ranked in descending order.
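A minimal sketch of the loss in Eq. (3): for each observed triplet, corrupt the subject or the object over the entity set, skip corruptions that happen to be positive triplets, and hinge the score gap with margin 1. The scoring model here is the diagonal bilinear form purely for concreteness; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n_ent, n_rel, dim = 6, 2, 4
E = rng.normal(size=(n_ent, dim))   # entity vectors y_e
R = rng.normal(size=(n_rel, dim))   # per-relation diagonal parameters

def score(e1, r, e2):
    return float(E[e1] @ (R[r] * E[e2]))

def margin_loss(positives, margin=1.0):
    """Eq. (3): sum over positives and their corruptions of max(S' - S + margin, 0)."""
    pos_set, total = set(positives), 0.0
    for (e1, r, e2) in positives:
        s_pos = score(e1, r, e2)
        for e1c in range(n_ent):               # corrupt the subject
            if (e1c, r, e2) not in pos_set:
                total += max(score(e1c, r, e2) - s_pos + margin, 0.0)
        for e2c in range(n_ent):               # corrupt the object
            if (e1, r, e2c) not in pos_set:
                total += max(score(e1, r, e2c) - s_pos + margin, 0.0)
    return total

print(margin_loss([(0, 0, 1), (2, 1, 3)]))
```

In training one would not enumerate all corruptions; as described below, the paper's implementation samples two negatives per positive at each gradient step.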
We consider Mean Reciprocal Rank (MRR) (the average over all test triplets of the reciprocal rank of the correct entity), HITS@10 (top-10 accuracy), and Mean Average Precision (MAP) (as used in [4]) as the evaluation metrics.

Implementation details. All models were implemented in C# and trained on a GPU. Training used mini-batch stochastic gradient descent with AdaGrad [8]. At each gradient step, we sampled two negative triplets for each positive triplet: one with a corrupted subject entity and one with a corrupted object entity. Entity vectors are renormalized to unit length after each gradient step (an effective technique that empirically improved all the models). For the relation parameters, we used standard L2 regularization. For all models, we set the number of mini-batches to 10, the entity vector dimensionality to d = 100, the regularization parameter to 0.0001, and the number of training epochs to T = 100 on FB15k and FB15k-401 and T = 300 on WN (T was determined from learning curves on which the performance of all models plateaued). The learning rate was initially set to 0.1 and then adapted during training by AdaGrad.

2.1 Model Comparisons

We examine five embedding models in decreasing order of complexity: (1) NTN with 4 tensor slices, as in [25]; (2) Bilinear+Linear, NTN with 1 tensor slice and without the non-linear layer; (3) TransE with the L2 norm (empirically, we found no significant difference between the L1 and L2 norms for the TransE objective), a special case of Bilinear+Linear as described in [2]; (4) Bilinear; and (5) Bilinear-diag, a special case of Bilinear in which the relation matrix is diagonal.
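The two rank-based metrics are straightforward to compute once the rank of the correct entity is known for each test triplet. The ranks below are invented for illustration only; MRR averages their reciprocals, and HITS@10 is the percentage of ranks at or below 10.

```python
# Hypothetical ranks of the correct entity for five test triplets.
ranks = [1, 3, 12, 2, 50]

mrr = sum(1.0 / r for r in ranks) / len(ranks)               # mean reciprocal rank
hits_at_10 = 100.0 * sum(r <= 10 for r in ranks) / len(ranks)  # top-10 accuracy (%)
print(round(mrr, 3), hits_at_10)  # -> 0.387 60.0
```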
| Model | FB15k MRR | FB15k HITS@10 | FB15k-401 MRR | FB15k-401 HITS@10 | WN MRR | WN HITS@10 |
|---|---|---|---|---|---|---|
| NTN | 0.25 | 41.4 | 0.24 | 40.5 | 0.53 | 66.1 |
| Bilinear+Linear | 0.30 | 49.0 | 0.30 | 49.4 | 0.87 | 91.6 |
| TransE (DISTADD) | 0.32 | 53.9 | 0.32 | 54.7 | 0.38 | 90.9 |
| Bilinear | 0.31 | 51.9 | 0.32 | 52.2 | 0.89 | 92.8 |
| Bilinear-diag (DISTMULT) | 0.35 | 57.7 | 0.36 | 58.5 | 0.83 | 94.2 |

Table 2: Comparison of different embedding models.

Table 2 shows the results of all compared methods on all the datasets. In general, we observe that on FB performance increases as model complexity decreases. NTN, the most complex model, provides the worst performance on both FB and WN, which suggests overfitting. Compared to the previously published results of TransE [2], our implementation achieves much better results (53.9% vs. 47.1% on FB15k and 90.9% vs. 89.2% on WN) under the same evaluation metric (HITS@10). We attribute this discrepancy mainly to the different choice of SGD optimization: AdaGrad vs. a constant learning rate. We also found that Bilinear consistently provides comparable or better performance than TransE, especially on WN. Since WN contains many more entities than FB, it may require a more expressive parametrization of relations to handle the richness of its entities. Interestingly, we found that a simple variant of Bilinear, BILINEAR-DIAG, clearly outperforms all baselines on FB and achieves performance comparable to Bilinear on WN.

2.2 Multiplicative vs. Additive Interactions

Note that BILINEAR-DIAG and TRANSE have the same number of model parameters; their difference lies in the choice of composition of the two entity vectors: BILINEAR-DIAG uses a weighted element-wise dot product (a multiplicative operation), while TRANSE uses element-wise subtraction with a bias (an additive operation).
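The contrast between the two compositions can be made concrete. Both use a single d-dimensional parameter vector per relation, so the parameter counts match; only the interaction differs. This is an illustrative sketch with made-up vectors, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
y1, y2 = rng.normal(size=d), rng.normal(size=d)  # entity vectors
w_r = rng.normal(size=d)                          # relation parameter vector

# Bilinear-diag: weighted element-wise product (multiplicative interaction).
s_mult = float(np.sum(w_r * y1 * y2))

# TransE-style: element-wise subtraction with a bias (additive interaction),
# scored as a negated distance so that higher is better.
s_add = -float(np.sum(np.abs(y1 - y2 + w_r)))

print(s_mult, s_add)
```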
To highlight the difference, we henceforth use DISTMULT and DISTADD to refer to BILINEAR-DIAG and TRANSE, respectively. Comparing these two models gives more insight into the effect of the two common choices of compositional operation, multiplication and addition, for modeling entity relations. Overall, we observed superior performance of DISTMULT on all the datasets in Table 2. Table 3 shows the HITS@10 scores on four types of relation categories (as defined in [2]) on FB15k-401, when predicting the subject entity and the object entity respectively. We can see that DISTMULT significantly outperforms DISTADD in almost all categories. More qualitative results can be found in the Appendix.

| Model | 1-to-1 (subj) | 1-to-n (subj) | n-to-1 (subj) | n-to-n (subj) | 1-to-1 (obj) | 1-to-n (obj) | n-to-1 (obj) | n-to-n (obj) |
|---|---|---|---|---|---|---|---|---|
| DISTADD | 70.0 | 76.7 | 21.1 | 53.9 | 68.7 | 17.4 | 83.2 | 57.5 |
| DISTMULT | 75.5 | 85.1 | 42.9 | 55.2 | 73.7 | 46.7 | 81.0 | 58.8 |

Table 3: Results by relation category (one-to-one, one-to-many, many-to-one, and many-to-many), when predicting the subject entity (subj) and the object entity (obj).

2.3 Entity Representations

In the following, we examine the learning of entity representations and introduce two further improvements: using a non-linear projection, and initializing entity vectors with pre-trained phrase vectors. We focus on DISTMULT as our baseline and compare it on FB15k-401 with two modifications: DISTMULT-tanh, which uses f = tanh for the entity projection (when applying the non-linearity, we remove the normalization step on entity parameters during training, as tanh already helps control the scaling freedom), and DISTMULT-tanh-EV-init, which initializes the entity parameters with the 1000-dimensional pre-trained phrase vectors released by word2vec [17]. We also reimplemented the word vector representation and initialization technique introduced in [25]: each entity is represented as the average of its word vectors, and the word vectors are initialized with the 300-dimensional pre-trained word vectors released by word2vec. We denote this method DISTMULT-tanh-WV-init.
Inspired by [4], we design a new evaluation setting in which the predicted entities are automatically filtered according to "entity types" (entities that appear as the subjects/objects of a relation have the same type, defined by that relation). This gives a better picture of model performance when some entity type information is available.

| Model | MRR | HITS@10 | MAP (w/ type checking) |
|---|---|---|---|
| DISTMULT | 0.36 | 58.5 | 64.5 |
| DISTMULT-tanh | 0.39 | 63.3 | 76.0 |
| DISTMULT-tanh-WV-init | 0.28 | 52.5 | 65.5 |
| DISTMULT-tanh-EV-init | 0.42 | 73.2 | 88.2 |

Table 4: Evaluation with pre-trained vectors.

In Table 4, we can see that DISTMULT-tanh-EV-init provides the best performance on all metrics. Surprisingly, we observed a performance drop with DISTMULT-tanh-WV-init. We suspect that this is because word vectors are not appropriate for modeling entities described by non-compositional phrases (more than 73% of the entities in FB15k-401 are person names, locations, organizations, and films). The promising performance of DISTMULT-tanh-EV-init suggests that the embedding model can greatly benefit from pre-trained entity-level vectors.

3 Conclusion

In this paper we present a unified framework for modeling multi-relational representations, scoring, and learning, and conduct an empirical study of several recent multi-relational embedding models under the framework. We investigate the different choices of relation operators based on linear and bilinear transformations, and also the effects of entity representations obtained by incorporating unsupervised vectors pre-trained on extra textual resources. Our results show several interesting findings, enabling
the design of a simple embedding model that achieves new state-of-the-art performance on a popular knowledge base completion task evaluated on Freebase.

Given the recent successes of deep learning in various applications, e.g. [11, 27, 5], our future work will aim to exploit deep structure, possibly including tensor constructs, in computing the neural embedding vectors, e.g. [12, 29, 13]. This will extend the current multi-relational neural embedding model to a deep version that is potentially capable of capturing hierarchical structure hidden in the input data.

References

[1] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, pages 1–27, 2013.
[2] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, 2013.
[3] Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. Learning structured embeddings of knowledge bases. In AAAI, 2011.
[4] Kai-Wei Chang, Wen-tau Yih, Bishan Yang, and Chris Meek. Typed tensor decomposition of knowledge bases for relation extraction. In EMNLP, 2014.
[5] Li Deng, G. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In ICASSP, 2013.
[6] Pedro Domingos. Prospects and challenges for multi-relational data mining. ACM SIGKDD Explorations Newsletter, 5(1):80–83, 2003.
[7] Xin Luna Dong, K. Murphy, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge Vault: A Web-scale approach to probabilistic knowledge fusion. In KDD, 2014.
[8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
[9] Jianfeng Gao, Patrick Pantel, Michael Gamon, Xiaodong He, Li Deng, and Yelong Shen. Modeling interestingness with deep neural networks. In EMNLP, 2014.
[10] Lise Getoor and Ben Taskar, editors. Introduction to Statistical Relational Learning. The MIT Press, 2007.
[11] Geoff Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29:82–97, 2012.
[12] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for Web search using clickthrough data. In CIKM, 2013.
[13] B. Hutchinson, L. Deng, and D. Yu. Tensor deep stacking networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1944–1957, 2013.
[14] Rodolphe Jenatton, Nicolas Le Roux, Antoine Bordes, and Guillaume Obozinski. A latent factor model for highly multi-relational data. In NIPS, 2012.
[15] Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In AAAI, volume 3, page 5, 2006.
[16] Ni Lao, Tom Mitchell, and William W. Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, pages 529–539, 2011.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[18] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, pages 809–816, 2011.
[19] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. Factorizing YAGO: scalable machine learning for linked data. In WWW, pages 271–280, 2012.
[20] Alberto Paccanaro and Geoffrey E. Hinton.
Learning distributed representations of concepts using linear relational embedding. IEEE Transactions on Knowledge and Data Engineering, 13(2):232–244, 2001.
[21] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1-2):107–136, 2006.
[22] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
[23] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. Learning semantic representations using convolutional neural networks for Web search. In WWW, pages 373–374, 2014.
[24] Ajit P. Singh and Geoffrey J. Gordon. Relational learning via collective matrix factorization. In KDD, pages 650–658. ACM, 2008.
[25] Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. Reasoning with neural tensor networks for knowledge base completion. In NIPS, 2013.
[26] Ilya Sutskever, Joshua B. Tenenbaum, and Ruslan Salakhutdinov. Modelling relational data using Bayesian clustered tensor factorization. In NIPS, pages 1821–1828, 2009.
[27] O. Vinyals, Y. Jia, L. Deng, and T. Darrell. Learning with recursive perceptual representations. In NIPS, 2012.
[28] Wen-tau Yih, Xiaodong He, and Christopher Meek. Semantic parsing for single-relation question answering. In ACL, 2014.
[29] D. Yu, L. Deng, and F. Seide. The deep tensor neural network with applications to large vocabulary speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 21(2):388–396, 2013.

Appendix

Figures 1 and 2 illustrate the relation embeddings learned by DISTMULT and DISTADD using t-SNE. We selected 189 relations in the FB15k-401 dataset. The embeddings learned by DISTMULT nicely reflect the clustering structure among these relations (e.g., /film/release region is close to /film/country), whereas the embeddings learned by DISTADD exhibit structure that is harder to interpret.
Figure 1: Relation embeddings (DISTADD)
Figure 2: Relation embeddings (DISTMULT)

Table 5 shows some concrete examples: the top-3 nearest neighbors in the relation space learned by DISTMULT and DISTADD, along with the distance values (Frobenius distance between two relation matrices, or Euclidean distance between two relation vectors). We can see that the nearest neighbors found by DISTMULT are much more meaningful; DISTADD tends to retrieve irrelevant relations that take completely different types of arguments.

| Query relation | DISTMULT (top-3 neighbors) | DISTADD (top-3 neighbors) |
|---|---|---|
| /film distributor/film | /film/distributor (2.0); /production company/films (3.4); /film/production companies (3.4) | /production company/films (2.6); /award nominee/nominated for (2.7); /award winner/honored for (2.9) |
| /film/film set decoration by | /film set designer/film sets designed (2.5); /film/film art direction by (6.8); /film/film production design by (9.6) | /award nominated work/award nominee (2.7); /film/film art direction by (2.7); /award winning work/award winner (2.8) |
| /organization/leadership/role | /leadership/organization (2.3); /job title/company (12.5); /business/employment tenure/title (13.0) | /organization/currency (3.0); /location/containedby (3.0); /university/currency (3.0) |
| /person/place of birth | /location/people born here (1.7); /person/places lived (8.0); /people/marriage/location of ceremony (14.0) | /us county/county seat (2.6); /administrative division/capital (2.7); /educational institution/campuses (2.8) |

Table 5: Examples of nearest neighbors of relation embeddings.
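The nearest-neighbor probe behind Table 5 can be sketched as follows for vector-valued relation parameters (Euclidean distance); for matrix-valued parameters the same code applies to flattened matrices, giving the Frobenius distance. The relation names here are invented placeholders, not Freebase relations.

```python
import numpy as np

# Rank relations by distance in the learned relation-parameter space.
rng = np.random.default_rng(4)
rels = ["rel_a", "rel_b", "rel_c", "rel_d"]
R = rng.normal(size=(len(rels), 8))   # one parameter vector per relation

def nearest(query_idx, k=3):
    """Top-k nearest relations to the query, excluding the query itself."""
    dists = np.linalg.norm(R - R[query_idx], axis=1)
    order = [i for i in np.argsort(dists) if i != query_idx]
    return [(rels[i], float(dists[i])) for i in order[:k]]

print(nearest(0, k=2))
```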
