A Factorization Machine Framework for Testing Bigram Embeddings in Knowledgebase Completion

Embedding-based Knowledge Base Completion models have so far mostly combined distributed representations of individual entities or relations to compute truth scores of missing links. Facts can however also be represented using pairwise embeddings, i.e. embeddings for pairs of entities and relations.

Authors: Johannes Welbl, Guillaume Bouchard, Sebastian Riedel

University College London, London, UK
{j.welbl, g.bouchard, s.riedel}@cs.ucl.ac.uk

Abstract

Embedding-based Knowledge Base Completion models have so far mostly combined distributed representations of individual entities or relations to compute truth scores of missing links. Facts can however also be represented using pairwise embeddings, i.e. embeddings for pairs of entities and relations. In this paper we explore such bigram embeddings with a flexible Factorization Machine model and several ablations from it. We investigate the relevance of various bigram types on the fb15k237 dataset and find relative improvements compared to a compositional model.

1 Introduction

Present-day Knowledge Bases (KBs) such as YAGO (Suchanek et al., 2007), Freebase (Bollacker et al., 2008) or the Google Knowledge Vault (Dong et al., 2014) provide immense collections of structured knowledge. Relationships in these KBs often exhibit regularities, and models that capture these can be used to predict missing KB entries. A common approach to KB completion is via tensor factorization, where a collection of fact triplets is represented as a sparse mode-3 tensor which is decomposed into several low-rank sub-components. Textual relations, i.e. relations between entity pairs extracted from text, can aid the imputation of missing KB facts by modelling them together with the KB relations (Riedel et al., 2013). The general merit of factorization methods for KB completion has been demonstrated by a variety of models, such as RESCAL (Nickel et al., 2011), TransE (Bordes et al., 2013) and DistMult (Yang et al., 2014).
These models learn distributed representations for entities and relations (be it as vector or as matrix) and infer the truth value of a fact by combining the embeddings of its constituents in an appropriate composition function.

Most of these factorization models, however, operate on the level of embeddings for single entities and relations. The implicit assumption is that facts are compositional, i.e. that the subject, relation and object of a fact are its atomic constituents, and that the semantic aspects relevant for imputing its truth can be recovered directly from these constituents when composing their respective embeddings into a score.

For further notation, let E and R be sets of entities and relations, respectively. We denote a fact f stating a relation r ∈ R between subject s ∈ E and object o ∈ E as f = (s, r, o). Our goal is to learn embeddings for larger sub-constituents of f than just s, r, and o: we want to learn embeddings also for the entity-pair bigram (s, o) as well as the relation-entity bigrams (s, r) and (r, o). As an example, consider Freebase facts with relation eating/practicer_of_diet/diet and object Veganism. Overall, only two objects are observed for this relation, and it thus makes sense to learn a joint embedding for the bigram (r, o), instead of distinct embeddings for each atom alone and then having to learn their compatibility.

While Riedel et al. (2013) have trained embeddings only for entity pairs, we will in this paper explore the role of general bigram embeddings for KB completion, i.e. also the embeddings for other possible pairs of entities and relations. This is achieved using a Factorization Machine (FM) framework (Rendle, 2010) that is modular in its feature components, allowing us to selectively add or discard certain bigram embeddings and compare their relative importance. All models are empirically compared and evaluated on the fb15k237 dataset from Toutanova et al. (2015).
In summary, our main contributions are: i) addressing the question of generic bigram embeddings in a KB completion model for the first time; ii) the adaptation of Factorization Machines for this matter; iii) experimental findings comparing different bigram embedding models on fb15k237.

2 Related Work

In the Universal Schema model (model F), Riedel et al. (2013) factorize KB entries together with relations of entity pairs extracted from text, embedding textual relations in the same vector space as KB relations. Singh et al. (2015) extend this model to include a variety of other interactions between entities and relations, using different relation vectors to interact with subject, object or both. Jenatton et al. (2012) also recognize the need to integrate rich higher-order interaction information into the score. Like Nickel et al. (2011), however, their model specifies relationships as relation-specific bilinear forms of entity embeddings. Other embedding methods for KB completion include DistMult (Yang et al., 2014) with a trilinear score, and TransE (Bordes et al., 2013), which offers an intriguing geometrical intuition. Among the aforementioned methods, embeddings are mostly learned for individual subjects, relations or objects; only model F (Riedel et al., 2013) constitutes an exception.

Some methods rely on more expressive composition functions to deal with non-compositionality or interaction effects, such as Neural Tensor Networks (Socher et al., 2013) or the recently introduced Holographic Embeddings (Nickel et al., 2015). Compared to the otherwise used (generalized) dot products, the composition functions of these models enable richer interactions between unit constituent embeddings. However, this comes with the potential disadvantage of presenting less well-behaved optimisation problems and being slower to train.
Factorization Machines have already been applied in a setting similar to ours by Petroni et al. (2015), who use them with contextual features for an Open Relation Extraction task, but without bigrams.

3 Model

3.1 Brief Recall of Factorization Machines

A Factorization Machine (FM) is a quadratic regression model with a low-rank constraint on the quadratic interaction terms.¹ Given a sparse input feature vector φ = (φ_1, ..., φ_n)^T ∈ R^n, the FM output prediction X ∈ R is

    X = \langle v, \phi \rangle + \sum_{i,j=1}^{n} \langle w_i, w_j \rangle \, \phi_i \phi_j    (1)

where v ∈ R^n and w_i, w_j ∈ R^k for all i, j = 1, ..., n are model parameters with k ≪ n, and ⟨·, ·⟩ denotes the dot product. Instead of allowing an individual quadratic interaction coefficient per pair (i, j), the FM assumes that the matrix of quadratic interaction coefficients has low rank k; thus the interaction coefficient for feature pair (i, j) is represented by an inner product of the k-dimensional vectors w_i and w_j. The low-rank constraint (i) provides a strong form of regularisation to this otherwise over-parameterized model, (ii) pools statistical strength for estimating similarly profiled interaction coefficients, and (iii) retains a total number of parameters linear in n. In summary, with a FM one can efficiently harness a large set of sparse features and interactions between them while retaining linear memory complexity.

3.2 Feature Representation for Facts

For the KB completion task we will use a FM with unit and bigram indicator features to learn low-rank embeddings for both. To formalize this, we will refer to the elements of the set U_f = {s, r, o} as units of fact f, and to the elements of B_f = {(s, r), (r, o), (o, s)} as bigrams of fact f. Let ι_u ∈ R^{|E|+|R|} be the one-hot indicator vector that encodes a particular unit² u ∈ (E ∪ R).
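To make the quadratic form in Eq. (1) concrete, the following minimal NumPy sketch computes the FM output; names are illustrative, and we take the sum over all ordered pairs (i, j) as written above, which collapses into a squared norm:

```python
import numpy as np

def fm_output(phi, v, W):
    """FM prediction X of Eq. (1); a minimal sketch, not the authors' code.

    phi : (n,) input feature vector
    v   : (n,) linear weights
    W   : (n, k) low-rank factors; interaction weight for (i, j) is <W[i], W[j]>.
    """
    # Summing <w_i, w_j> * phi_i * phi_j over all ordered pairs equals
    # ||W^T phi||^2, giving O(nk) time instead of O(n^2 k).
    z = W.T @ phi
    return v @ phi + z @ z
```

A brute-force double loop over all pairs (i, j) yields the same value, which is how the low-rank trick can be sanity-checked.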
Furthermore we define ι_{(s,r)} ∈ R^{|E||R|}, ι_{(r,o)} ∈ R^{|R||E|} and ι_{(o,s)} ∈ R^{|E|²} to be the one-hot indicator vectors encoding particular bigrams. Our feature vector φ(f) for fact f = (s, r, o) then consists of simply the concatenation of the indicator vectors for all its units and bigrams:

    \phi(f) = \mathrm{concat}(\iota_s, \iota_r, \iota_o, \iota_{(s,r)}, \iota_{(r,o)}, \iota_{(o,s)})    (2)

This sparse set of features provides a rich representation of a fact, with indicators for subject, relation and object, as well as any pair thereof.

¹ We disregard the more general extension to higher-order interactions described in the original FM paper and only consider the quadratic case. We also omit the global model bias, as we found empirically that it was not helpful for our task.

² For subject and object the same entity embedding is used.

3.3 Scoring a Fact

Harnessing the expressive benefits of a sigmoid link function for relation modelling (Bouchard et al., 2015), we define the truth score of a fact as g(f) = σ(X_f), where σ is the sigmoid function and X_f is the output of the FM model (1) with unit and bigram features φ(f) as defined in (2):

    X_f = \langle \phi(f), v \rangle + \sum_{i,j=1}^{n} \langle w_i, w_j \rangle \, \phi_i(f) \, \phi_j(f)    (3)

Since our feature vector φ(f) is sparse with only six active entries, we can re-express (3) in terms of the activated embeddings, which we index directly by their respective units and bigrams:

    X_f = \sum_{c \in (U_f \cup B_f)} v_c + \sum_{c_1, c_2 \in (U_f \cup B_f)} \langle w_{c_1}, w_{c_2} \rangle    (4)

This score comprises all possible interactions between any of the units and bigrams of f.

3.4 Model Ablations for Investigating Particular Bigram Embeddings

The score (4) can easily be modified and individual summands removed from it. In particular, when discarding all but one summand, model F is recovered, i.e. with c_1 = (s, o); c_2 = r.
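A sketch of the sparse score in Eq. (4), storing only the embeddings of observed units and bigrams in dictionaries (the names and data layout are our illustrative assumptions, not the authors' code); components without a learned embedding back off to zero:

```python
import numpy as np

def fact_score(fact, v, W, k):
    """X_f of Eq. (4): offsets plus all pairwise embedding interactions.

    fact : (s, r, o) triple
    v    : dict mapping units/bigrams to scalar offsets
    W    : dict mapping units/bigrams to k-dim embedding vectors
    Components without a learned embedding fall back to zeros.
    """
    s, r, o = fact
    comps = [s, r, o, (s, r), (r, o), (o, s)]  # U_f and B_f
    zero = np.zeros(k)
    embs = [np.asarray(W.get(c, zero)) for c in comps]
    X = sum(v.get(c, 0.0) for c in comps)
    for w1 in embs:                 # all ordered pairs, as in Eq. (1)
        for w2 in embs:
            X += float(w1 @ w2)
    return X

def truth_score(fact, v, W, k):
    """g(f) = sigmoid(X_f), the truth score of Section 3.3."""
    return 1.0 / (1.0 + np.exp(-fact_score(fact, v, W, k)))
```

Dropping entries of `comps` reproduces the ablations of Section 3.4, e.g. keeping only r and (s, o) recovers a model-F-style score.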
On the other hand, alternatives to model F with bigrams other than entity pairs can be tested by removing all summands except those of a single bigram b ∈ B_f and the remaining complementary unit u ∈ U_f:

    X_f^{u,b} = v_u + v_b + \langle w_u, w_b \rangle    (5)

This general formulation offers us a method for investigating the relative impact of all combinations of bigram vs. unit embeddings besides model F, namely the models with u = s; b = (r, o) and with u = o; b = (s, r).

3.5 Training Objective

Given sets of true training facts Ω⁺ and sampled negative facts Ω⁻, we minimize the following loss:

    - \sum_{f \in \Omega^+} \log(1 + e^{X_f}) + \frac{1}{\eta} \sum_{f \in \Omega^-} \log(1 + e^{X_f})    (6)

where the parametrization of X_f is learned. We use the hyperparameter η ∈ R⁺ to denote the ratio of negative facts sampled per positive fact, so that the contributions of true and false facts are balanced even if there are more negative facts than positives. The loss differs from a standard negative log-likelihood objective with a logistic link, but we found that it performs better in practice. The intuition is that instead of penalizing badly classified positive facts, we put more emphasis (i.e. negative loss) on positive facts that are correctly classified. Since we use L2 regularization and the loss is asymptotically linear, the resulting objective is continuous and bounded from below, guaranteeing a well-defined local minimum.

4 Experiments

The bigram embedding models are tested on fb15k237 (Toutanova et al., 2015), a dataset comprising both Freebase facts and lexicalized dependency-path relationships between entities.

Training Details and Evaluation. We optimized the loss using AdaM (Kingma and Ba, 2015) with minibatches of size 1024, using initial learning rate 1.0, and initialize model parameters from N(0, 1).
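The objective in Eq. (6) can be sketched as follows; this is a minimal NumPy version under our own naming, and the L2 regularizer mentioned above is omitted:

```python
import numpy as np

def bigram_fm_loss(pos_scores, neg_scores, eta):
    """Loss of Eq. (6): minimized by raising X_f on positive facts and
    lowering it on sampled negatives; eta is the negatives-per-positive
    ratio that rebalances the two sums.  (Sketch; L2 term omitted.)"""
    pos = -np.sum(np.log1p(np.exp(np.asarray(pos_scores, dtype=float))))
    neg = np.sum(np.log1p(np.exp(np.asarray(neg_scores, dtype=float)))) / eta
    return pos + neg
```

Note that, as the text explains, the positive term rewards confidently classified true facts rather than penalizing misclassified ones.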
Furthermore, a hyperparameter τ < 1 is introduced, as in Toutanova et al. (2015), to discount the importance of textual mentions in the loss. When sampling a negative fact we alter the object of a given training fact (s, r, o) at random to o′ ∈ E, and repeat this η times, sampling negative facts anew every epoch. There is a small implied risk of sampling positive facts as negatives, but this is rare, and the discounted loss weight of negative samples further mitigates the issue. Hyperparameters (L2 regularisation, η, τ, latent dimension k) are selected in a grid search minimising Mean Reciprocal Rank (MRR) on a fixed random subsample of size 1000 of the validation set. All reported results are for the test set. We use the competitive unit model DistMult as baseline and employ the same ranking evaluation scheme as Toutanova et al. (2015) and Toutanova and Chen (2015), computing filtered MRR and HITS scores whilst ranking true test facts among candidate facts with altered object. Particular bigrams that have not been observed during training have no learned embedding; a 0-embedding is used for these. This nullifies their impact on the score and models the back-off to using nonzero embeddings.

| Model               | τ   | HITS@1 | HITS@3 | HITS@10 | MRR overall | MRR no TM | MRR with TM |
|---------------------|-----|--------|--------|---------|-------------|-----------|-------------|
| DistMult            | 0.0 | 18.2   | 27.0   | 37.9    | 24.8        | 28.0      | 16.2        |
| full FM             | 0.0 | 20.1   | 28.7   | 38.9    | 26.4        | 29.3      | 18.3        |
| (\*) (s, o) vs. r   | 1.0 | 2.1    | 3.8    | 6.5     | 3.5         | 0.0       | 13.1        |
| (\*\*) (r, o) vs. s | 0.1 | 24.9   | 34.8   | 45.8    | 32.0        | 34.7      | 24.8        |
| (\*\*\*) (s, r) vs. o | 0.0 | 9.0  | 17.3   | 29.9    | 15.6        | 17.3      | 10.9        |
| (\*) + (\*\*) + (\*\*\*) | 0.1 | **25.9** | **36.2** | **47.4** | **33.2** | **35.0** | **28.3** |

Table 1: Test set metrics for different models and varying unit and bigram embeddings on fb15k237; all performance numbers in % and best results in bold. The optimal value of τ is indicated as well.

Results. Table 1 gives an overview of the general results for the different models. Clearly, some of the bigram models obtain an improvement over the unit DistMult model.
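The filtered ranking metrics reported in Table 1 can be sketched as follows; this is our simplified reading of the protocol of Toutanova et al. (2015), with illustrative function names:

```python
def filtered_rank(scores, true_obj, other_true_objs):
    """Rank of the true object for a corrupted fact (s, r, ?), ignoring
    other objects known to be true (the 'filtered' setting).

    scores : dict mapping candidate object -> model score
    """
    s_true = scores[true_obj]
    worse = sum(1 for o, sc in scores.items()
                if o != true_obj and o not in other_true_objs and sc > s_true)
    return 1 + worse

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Mean Reciprocal Rank and HITS@k over a list of filtered ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits
```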
In a more fine-grained analysis of model performance, characterized by whether entity pairs of test facts had textual mentions available in training (with TM) or not (without TM), the results exhibit a pattern similar to Toutanova et al. (2015): most models perform worse on test facts with TM; only model F, which can learn very little without textual relations, shows the reverse behavior. A side observation is that several models achieved their highest overall MRR with τ = 0, i.e. when not using TM.

The sum of the three more lightweight bigram models performs better than the full FM, even though the same types of embeddings are used. A possible explanation is that applying the same embedding in several interactions with other embeddings (as in the full FM), instead of in only one interaction (as in (*) + (**) + (***)), makes learning harder, since its multiple functionalities compete.

Another interesting finding is that some bigram types achieve much better results than others, in particular model (**). A possible explanation becomes apparent on closer inspection of the test set: a given test fact f usually contains at least one bigram b ∈ B_f which has never been observed in training. In these cases the bigram embedding is 0 by design and only the offset values are used. The proportions of test facts for which this happens are 73%, 10% and 24% respectively for the bigrams (s, o), (r, o), and (s, r). Thus models (**) and (***) already have a definite advantage over model (*) that originates purely from the nature of the data. A trivial but important lesson is that if we know the relative prevalence of different bigrams (or more generally: sub-tuples) in our dataset, we can exploit this when choosing which sub-tuples to embed.
Finally, for the initial example with relation eating/practicer_of_diet/diet and object Veganism, we indeed find that in all instances model (**) with its (r, o) embedding gives the correct fact in the top 2 predictions, while the purely compositional DistMult model ranks it far outside the top 10. More generally, cases in which only a single object co-appeared with a test-fact relation during training had 95.3% HITS@1 with model (**), but only 52.6% with DistMult. This supports the intuition that bigram embeddings of (r, o) are in fact better suited for cases in which very few objects are possible for a relation.

5 Conclusion

We have demonstrated that FMs provide an approach to KB completion that can naturally incorporate embeddings for bigrams. The FM offers a compact unified framework in which various tensor factorization models can be expressed, including model F. Extensive experiments have demonstrated that bigram models can improve prediction performance substantially over more straightforward unigram models. A surprising but important result is that bigrams other than entity pairs are particularly appealing.

The bigger question behind our work concerns compositionality vs. non-compositionality in a broader class of knowledge bases involving higher-order information such as time, origin or context in the tuples. Deciding which modes should be merged into a higher-order embedding, without having to rely on heavy cross-validation, is an open question.

Acknowledgments

We thank Théo Trouillon, Tim Rocktäschel, Pontus Stenetorp and Thomas Demeester for discussions and hints, as well as the reviewers for comments. This work was supported by an EPSRC studentship, an Allen Distinguished Investigator Award and a Marie Curie Career Integration Award.

References

[Bollacker et al. 2008] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008.
Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250.

[Bordes et al. 2013] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NIPS 26.

[Bouchard et al. 2015] Guillaume Bouchard, Sameer Singh, and Théo Trouillon. 2015. On approximate reasoning capabilities of low-rank vector spaces. In AAAI Spring Symposium on Knowledge Representation and Reasoning (KRR): Integrating Symbolic and Neural Approaches.

[Dong et al. 2014] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 601–610, New York, NY, USA. ACM.

[Jenatton et al. 2012] Rodolphe Jenatton, Nicolas L. Roux, Antoine Bordes, and Guillaume Obozinski. 2012. A latent factor model for highly multi-relational data. In NIPS 25, pages 3167–3175. Curran Associates, Inc.

[Kingma and Ba 2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In The International Conference on Learning Representations (ICLR).

[Nickel et al. 2011] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 809–816, New York, NY, USA. ACM.

[Nickel et al. 2015] Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. 2015. Holographic embeddings of knowledge graphs. Technical report, arXiv, October.
[Petroni et al. 2015] Fabio Petroni, Luciano Del Corro, and Rainer Gemulla. 2015. CORE: Context-aware open relation extraction with factorization machines. In Lluís Màrquez, Chris Callison-Burch, Jian Su, Daniele Pighin, and Yuval Marton, editors, EMNLP, pages 1763–1773. The Association for Computational Linguistics.

[Rendle 2010] Steffen Rendle. 2010. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 995–1000. IEEE.

[Riedel et al. 2013] Sebastian Riedel, Limin Yao, Benjamin M. Marlin, and Andrew McCallum. 2013. Relation extraction with matrix factorization and universal schemas. In Joint Human Language Technology Conference/Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '13), June.

[Singh et al. 2015] Sameer Singh, Tim Rocktäschel, and Sebastian Riedel. 2015. Towards combined matrix and tensor factorization for universal schema relation extraction. In NAACL Workshop on Vector Space Modeling for NLP.

[Socher et al. 2013] Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS 26.

[Suchanek et al. 2007] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: a core of semantic knowledge unifying WordNet and Wikipedia. In WWW '07: Proceedings of the 16th International World Wide Web Conference, Banff, Canada, pages 697–706.

[Toutanova and Chen 2015] Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Workshop on Continuous Vector Space Models and Their Compositionality (CVSC).

[Toutanova et al. 2015] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Empirical Methods in Natural Language Processing (EMNLP).
Association for Computational Linguistics, September.

[Yang et al. 2014] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. CoRR, abs/1412.6575.
