ProjE: Embedding Projection for Knowledge Graph Completion
Authors: Baoxu Shi, Tim Weninger
Department of Computer Science and Engineering, University of Notre Dame

Abstract

With the large volume of new information created every day, determining the validity of information in a knowledge graph and filling in its missing parts are crucial tasks for many researchers and practitioners. To address this challenge, a number of knowledge graph completion methods have been developed using low-dimensional graph embeddings. Although researchers continue to improve these models using an increasingly complex feature space, we show that simple changes in the architecture of the underlying model can outperform state-of-the-art models without the need for complex feature engineering. In this work, we present a shared variable neural network model called ProjE that fills in missing information in a knowledge graph by learning joint embeddings of the knowledge graph's entities and edges, and through subtle, but important, changes to the standard loss function. In doing so, ProjE has a parameter size that is smaller than 11 out of 15 existing methods while performing 37% better than the current-best method on standard datasets. We also show, via a new fact checking task, that ProjE is capable of accurately determining the veracity of many declarative statements.

Knowledge Graphs (KGs) have become a crucial resource for many tasks in machine learning, data mining, and artificial intelligence applications, including question answering [34], entity disambiguation [7], named entity linking [14], fact checking [32], and link prediction [28], to name a few. In our view, KGs are an example of a heterogeneous information network containing entity-nodes and relationship-edges corresponding to RDF-style triples ⟨h, r, t⟩, where h represents a head entity and r is a relationship that connects h to a tail entity t.
KGs are widely used for many practical tasks; however, their correctness and completeness are not guaranteed. Therefore, it is necessary to develop knowledge graph completion (KGC) methods to find missing or errant relationships with the goal of improving the general quality of KGs, which, in turn, can be used to improve or create interesting downstream applications. The KGC task can be divided into two non-mutually exclusive sub-tasks: (i) entity prediction and (ii) relationship prediction. The entity prediction task takes a partial triple ⟨h, r, ?⟩ as input and produces a ranked list of candidate entities as output:

Definition 1 (Entity Ranking Problem). Given a knowledge graph G = {E, R} and an input triple ⟨h, r, ?⟩, the entity ranking problem attempts to find the optimal ordered list such that ∀e_j ∀e_i ((e_j ∈ E⁻ ∧ e_i ∈ E⁺) → e_i ≺ e_j), where E⁺ = {e ∈ {e_1, e_2, ..., e_l} | ⟨h, r, e⟩ ∈ G} and E⁻ = {e ∈ {e_{l+1}, e_{l+2}, ..., e_{|E|}} | ⟨h, r, e⟩ ∉ G}.

Distinguishing between head and tail entities is usually arbitrary, so we can easily substitute ⟨h, r, ?⟩ for ⟨?, r, t⟩. The relationship prediction task aims to find a ranked list of relationships that connect a head entity with a tail entity, i.e., ⟨h, ?, t⟩. When discussing the details of the present work, we focus specifically on the entity prediction task; however, it is straightforward to adapt the methodology to the relationship prediction task by changing the input.

A number of KGC algorithms have been developed in recent years, and the most successful models all have one thing in common: they use low-dimensional embedding vectors to represent entities and relationships. Many embedding models, e.g., Unstructured [3], TransE [4], TransH [35], and TransR [25], use a margin-based pairwise ranking loss function, which measures the score of each possible result as the L_n-distance between h + r and t.
In these models the loss functions are all the same, so models differ in how they transform the entity embeddings h and t with respect to the relationship embeddings r. Instead of simply adding h + r, more expressive combination operators are learned by Knowledge Vault [8] and HolE [29] in order to predict the existence of ⟨h, r, t⟩ in the KG. Other models, such as the Neural Tensor Network (NTN) [33] and the Compositional Vector Space Model (CVSM) [27], incorporate a multilayer neural network solution into the existing models. Unfortunately, due to their extremely large parameter size, these models either (i) do not scale well or (ii) consider only a single relationship at a time [10], thereby limiting their usefulness on large, real-world KGs.

Despite their large model size, the aforementioned methods only use singleton triples, i.e., length-1 paths in the KG. PTransE [24] and RTransE [10] employ extended path information from 2- and 3-hop trails over the knowledge graph. These extended models achieve excellent performance due to the richness of the input data; unfortunately, their model size grows exponentially as the path length increases, which further exacerbates the scalability issues associated with the already high number of parameters of the underlying models.

Another curious finding is that some of the existing models are not self-contained models, i.e., they require pre-trained KG embeddings (RTransE, CVSM), pre-selected paths (PTransE, RTransE), or pre-computed content embeddings of each node (DKRL [36]) before their model training can even begin. TransR and TransH are self-contained models, but their experiments only report results using pre-trained TransE embeddings as input. With these considerations in mind, in the present work we rethink some of the basic decisions made by previous models to create a projection embedding model (ProjE) for KGC.
ProjE has four parts that distinguish it from the related work:

1. Instead of measuring the distance between input triple ⟨h, r, ?⟩ and entity candidates on a unified or a relationship-specific plane, we choose to project the entity candidates onto a target vector representing the input data.

2. Unlike existing models that use transformation matrices, we combine the embedding vectors representing the input data into a target vector using a learnable combination operator. This avoids the addition of a large number of transformation matrices by reusing the entity embeddings.

3. Rather than optimizing the margin-based pairwise ranking loss, we optimize a ranking loss of the list of candidate entities (or relationships) collectively. We further use candidate sampling to handle very large datasets.

4. Unlike many of the related models that require pre-trained data from prerequisite models or explore expensive multi-hop paths through the knowledge graph, ProjE is a self-contained model over length-1 edges.

1 Related Work

A variety of low-dimensional representation-based methods have been developed to work on the KGC task. These methods usually learn continuous, low-dimensional vector representations (i.e., embeddings) for entities W_E and relationships W_R by minimizing a margin-based pairwise ranking loss [24]. The most widely used embedding model in this category is TransE [4], which views relationships as translations from a head entity to a tail entity on the same low-dimensional plane. The energy function of TransE is defined as

E(h, r, t) = ||h + r − t||_{L_n},   (1)

which measures the L_n-distance between a translated head entity h + r and some tail entity t. The Unstructured model [3] is a special case of TransE where r = 0 for all relationships.
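The TransE energy in Eq. 1 can be sketched in a few lines of numpy; the embeddings below are illustrative toy values, not trained ones:

```python
import numpy as np

def transe_energy(h, r, t, n=1):
    """L_n distance between the translated head h + r and the tail t (Eq. 1);
    lower energy means the triple <h, r, t> is more plausible."""
    return np.linalg.norm(h + r - t, ord=n)

# Toy 3-dimensional embeddings (illustrative values only).
h = np.array([0.25, 0.5, 0.75])
r = np.array([0.25, -0.25, 0.25])
t = np.array([0.5, 0.25, 1.0])

print(transe_energy(h, r, t, n=1))  # h + r equals t exactly, so the energy is 0.0
```

A trained model would learn h, r, and t so that true triples have low energy and corrupted triples have high energy.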
Based on the initial idea of treating two entities as a translation of one another (via their relationship) in the same embedding plane, several models have been introduced to improve the initial TransE model. The newest contributions in this line of work focus primarily on changes in how the embedding planes are computed and/or how the embeddings are combined. For example, the entity translations in TransH [35] are computed on a hyperplane that is perpendicular to the relationship embedding. In TransR [25] the entities and relationships are embedded on separate planes and then the entity vectors are translated to the relationship's plane. Structured Embedding (SE) [5] creates two translation matrices for each relationship and applies them to head and tail entities separately. Knowledge Vault [8] and HolE [29], on the other hand, focus on learning a new combination operator instead of simply adding two entity embeddings element-wise.

The aforementioned models are all geared toward link prediction in KGs, and they all minimize a margin-based pairwise ranking loss function L over the training data S:

L(S) = Σ_{(h,r,t) ∈ S} [γ + E(h, r, t) − E(h′, r′, t′)]_+,   (2)

where E(h, r, t) is the energy function of each model, γ is the margin, and (h′, r′, t′) denotes some "corrupted" triple which does not exist in S. Unlike the aforementioned models that focus on different E(h, r, t), TransA [19] introduces an adaptive local margin approach that determines γ by a closed set of entity candidates. Other similar models include RESCAL [30], Semantic Matching Energy (SME) [3], and the Latent Factor Model (LFM) [18].

The Neural Tensor Network (NTN) model [33] is an exception to the basic energy function in Eq. 1. Instead, NTN uses an energy function

E(h, r, t) = u_r^T f(h^T W_r t + W_rh h + W_rt t + b_r),   (3)

where u_r, W_r, W_rh, and W_rt are all relationship-specific variables.
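The margin-based pairwise loss in Eq. 2 can be sketched as follows; this minimal example plugs in the TransE energy from Eq. 1 as E(h, r, t), but any of the energy functions above could be substituted:

```python
import numpy as np

def transe_energy(h, r, t, n=1):
    # L_n distance energy from Eq. 1; lower means more plausible.
    return np.linalg.norm(h + r - t, ord=n)

def margin_ranking_loss(positives, corrupted, gamma=1.0):
    """Margin-based pairwise ranking loss (Eq. 2): each true triple should
    have energy at least `gamma` lower than its corrupted counterpart;
    the hinge [x]_+ = max(0, x) ignores pairs already separated by the margin."""
    loss = 0.0
    for (h, r, t), (hc, rc, tc) in zip(positives, corrupted):
        loss += max(0.0, gamma + transe_energy(h, r, t) - transe_energy(hc, rc, tc))
    return loss

# One positive triple and one corrupted triple (tail replaced), 1-d toy embeddings.
h, r = np.array([0.5]), np.array([0.5])
t_true, t_corrupt = np.array([1.0]), np.array([0.5])
print(margin_ranking_loss([(h, r, t_true)], [(h, r, t_corrupt)]))  # 0.5
```

Here E(pos) = 0 and E(neg) = 0.5, so the corrupted triple sits inside the margin and contributes max(0, 1 + 0 − 0.5) = 0.5 to the loss.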
As a result, the number of parameters in NTN is significantly larger than in other methods. This makes NTN unsuitable for networks with even a moderate number of relationships.

So far, the related models have only considered triples that contain a single relationship. More complex models have been introduced to leverage path and content information in KGs. For instance, the Compositional Vector Space Model (CVSM) [27] composes a sequence of relationship embeddings into a single path embedding using a Recurrent Neural Network (RNN). However, this has two disadvantages: (i) CVSM needs pre-trained relationship embeddings as input, and (ii) each CVSM is specifically trained for only a single relationship type. This makes CVSM perform well in specific tasks, but unsuitable for generalized entity and relationship prediction tasks. RTransE [10] solves the relationship-specific problem in CVSM by using entity and relationship embeddings learned from TransE. However, it is hard to compare RTransE with existing methods because it requires unambiguous, pre-selected paths as inputs, called quadruples ⟨h, r_1, r_2, t⟩, further complicating the model. DKRL, like NTN, uses word embeddings of entity content in addition to multi-hop paths, but relies on the machinery of a Convolutional Neural Network (CNN) to learn entity and relationship embeddings. PTransE [24] is another path-based method that uses path information in its energy function. Simply put, PTransE doubles the number of edges in the KG by creating reverse relationships for every existing relationship in the KG. Then PTransE uses PCRA [37] to select input paths within a given length constraint.

Table 1 shows a breakdown of the parameter complexity of each model. As is typical, we find that more complex models achieve better prediction accuracy, but are also more difficult to train and have trouble scaling.
The proposed method, ProjE, has a number of parameters that is smaller than 11 out of 15 methods and does not require any prerequisite training.

2 Methodology

The present work views the KGC problem as a ranking task and optimizes the collective scores of the list of candidate entities. Because we want to optimize the ordering of candidate entities collectively, we need to project the candidate entities onto the same embedding vector. For this task we learn a combination operator that creates a target vector from the input data. Then, the candidate entities are each projected onto the same target vector, thereby revealing each candidate's similarity score as a scalar.

In this section we describe the ProjE architecture, followed by two proposed variants, their loss functions, and our choice of candidate sampling method. In the experiments section we demonstrate that ProjE outperforms all existing methods despite having a relatively small parameter space. A detailed algorithm description can be found in the Supplementary Material.

Table 1: Parameter size and prerequisites of KGC models in increasing order. ProjE, ranked 5th, is highlighted. n_e, n_r, n_w, k are the number of entities, relationships, words, and embedding size in the KG respectively. z is the hidden layer size. q† represents the number of RNN parameters in RTransE; this value is not specified, but should be 8k² if a normal LSTM is used.
Model            Parameters                       Prerequisites
Unstructured     n_e k                            -
TransE           n_e k + n_r k                    -
HolE             n_e k + n_r k                    -
PTransE          n_e k + n_r k                    PCRA
ProjE            n_e k + n_r k + 5k               -
CVSM             n_e k + n_r k + 2k²              Word2vec
SME (linear)     n_e k + n_r k + 4k²              -
RTransE          n_e k + n_r k + q†               TransE, PCRW
LFM              n_e k + n_r k + 10k²             -
SME (bilinear)   n_e k + n_r k + 2k³              -
TransH           n_e k + 2 n_r k                  -
RESCAL           n_e k + n_r k²                   -
SE               n_e k + 2 n_r k²                 -
TransR           n_e k + n_r (k + k²)             -
DKRL             n_e k + n_r k + n_w k + 2zk      TransE, Word2vec
NTN              n_e k + n_r (zk² + 2zk + 2z)     -

2.1 Model Architecture

The main insight in the development of ProjE is as follows: given two input embeddings, we view the prediction task as a ranking problem where the top-ranked candidates are the correct entities. To generate this ordered list, we project each of the candidates onto a target vector defined by the two input embeddings through a combination operator. Existing models, such as Knowledge Vault, HolE, and NTN, define specific matrix combination operators that combine entities and/or relationships. In common practice, these matrices are expected to be sparse. Because we believe it is unnecessary to have interactions among different feature dimensions at this early stage, we constrain our matrices to be diagonal, which are inherently sparse. The combination operator is therefore defined as

e ⊕ r = D_e e + D_r r + b_c,   (4)

where D_e and D_r are k × k diagonal matrices which serve as global entity and relationship weights respectively, and b_c ∈ R^k is the combination bias. Using this combination operator, we can define the embedding projection function as

h(e, r) = g(W^c f(e ⊕ r) + b_p),   (5)

where f and g are activation functions that we define later, W^c ∈ R^{s×k} is the candidate-entity matrix, b_p is the projection bias, and s is the number of candidate entities.
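Equations 4 and 5 can be sketched with randomly initialized toy parameters; because D_e and D_r are diagonal, they reduce to elementwise products with their diagonals (all sizes and values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k, s = 4, 6                                          # embedding size, # candidates
d_e, d_r = rng.normal(size=k), rng.normal(size=k)    # diagonals of D_e and D_r
b_c, b_p = rng.normal(size=k), 0.1                   # combination / projection bias
W_c = rng.normal(size=(s, k))                        # rows reused from W_E in ProjE

def combine(e, r):
    # e (+) r = D_e e + D_r r + b_c (Eq. 4); diagonal matrices become
    # elementwise products, so only a handful of new parameters are needed.
    return d_e * e + d_r * r + b_c

def project(e, r, f=np.tanh, g=lambda x: x):
    # h(e, r) = g(W_c f(e (+) r) + b_p) (Eq. 5): one score per candidate row.
    return g(W_c @ f(combine(e, r)) + b_p)

scores = project(rng.normal(size=k), rng.normal(size=k))
print(scores.shape)  # (6,): one similarity score per candidate entity
```

In the full model, W_c's rows are shared with the entity embedding matrix W_E, which is why the projection layer adds no new variables beyond the biases and combination weights.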
h(e, r) represents the ranking score vector, where each element represents the similarity between some candidate entity in W^c and the combined input embedding e ⊕ r. Although s is relatively large, due to the use of shared variables, W^c is a candidate-entity matrix containing s rows that already exist in the entity embedding matrix W_E. Simply put, W^c does not introduce any new variables into the model. Therefore, compared to simple models like TransE, ProjE only increases the number of parameters by 5k + 1, where 1, 4k, and k are introduced as the projection bias, combination weights, and combination bias respectively. Later we show that by changing the activation functions, ProjE can be either a pointwise ranking model or a listwise ranking model.

Figure 1: ProjE architecture for entity prediction with example input ⟨?, CityOf, Illinois⟩ and two candidates. ProjE represents a two-layer neural network with a combination layer and a projection (i.e., output) layer. This figure is best viewed in color.

ProjE can be viewed as a neural network with a combination layer and a projection (i.e., output) layer. Figure 1 illustrates this architecture by way of an example. Given a tail entity Illinois and a relationship CityOf, our task is to calculate the scores of each head entity. The blue nodes are row vectors from the entity embedding matrix W_E, and the green nodes are row vectors from the relationship embedding matrix W_R; the orange nodes are the combination operators as diagonal matrices. For clarity we only illustrate two candidates in Fig. 1; however, W^c may contain an arbitrary number of candidate entities. The next step is to define the loss functions used in ProjE.

2.2 Ranking Method and Loss Function

As defined in Defn. 1, we view the KGC problem as a ranking task where all positive candidates precede all negative candidates, and train our model accordingly.
Typically there are two ways to obtain such an ordering: with either 1) the pointwise method, or 2) the listwise method [31]. Although most existing KGC models, including TransE, TransR, TransH, and HolE, use a pairwise ranking loss function during training, their ranking score is calculated independently in what is essentially a pointwise method when deployed. Based on the architecture described in the previous section, we propose two methods: 1) ProjE_pointwise and 2) ProjE_listwise, through the use of different activation functions for g(·) and f(·) in Eq. 5.

First we describe the ProjE_pointwise ranking method. Because the relative order inside each entity set does not affect the prediction power, we can create a binary label vector in which all entities in E⁻ have a score of 0 and all entities in E⁺ have a score of 1. Because we maximize the likelihood between the ranking score vector h(e, r) and the binary label vector, it is intuitive to view this task as a multi-class classification problem. Therefore, the loss function of ProjE_pointwise can be defined in a familiar way:

L(e, r, y) = − Σ_{i ∈ {i | y_i = 1}} log(h(e, r)_i) − Σ_{j ∼ P_y}^{m} log(1 − h(e, r)_j),   (6)

where e and r are the input embedding vectors of a training instance in S, y ∈ R^s is a binary label vector where y_i = 1 means candidate i represents a positive label, and m is the number of negative samples drawn from a negative candidate distribution P_y (described in the next section). Because we view ProjE_pointwise as a multi-class classification problem, we use the sigmoid and tanh activation functions as our choices for g(·) and f(·) respectively. When deployed, the ranking score of the i-th candidate entity is:

h(e, r)_i = sigmoid(W^c[i, :] tanh(e ⊕ r) + b_p),   (7)

where W^c[i, :] represents the i-th candidate in the candidate-entity matrix.
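Given sigmoid scores as in Eq. 7, the pointwise loss of Eq. 6 is a binary cross-entropy over the positives plus the sampled negatives. A minimal numpy sketch with toy scores (the logits below stand in for W^c[i, :] tanh(e ⊕ r) + b_p):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointwise_loss(scores, y, neg_idx):
    """ProjE_pointwise loss (Eq. 6): push positive candidates toward score 1
    and the m sampled negative candidates toward score 0.

    scores:  sigmoid ranking scores h(e, r), one entry per candidate.
    y:       binary label vector (1 marks a positive candidate).
    neg_idx: indices of the sampled negative candidates."""
    pos_term = -np.sum(np.log(scores[y == 1]))
    neg_term = -np.sum(np.log(1.0 - scores[np.asarray(neg_idx)]))
    return pos_term + neg_term

# Toy instance: 4 candidates, the first is true, two sampled negatives.
scores = sigmoid(np.array([2.0, -1.0, -0.5, 0.3]))
y = np.array([1, 0, 0, 0])
print(pointwise_loss(scores, y, neg_idx=[1, 2]))
```

Because the sigmoid is applied to each candidate independently, this loss treats every candidate in isolation, which is exactly the limitation the listwise variant below addresses.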
Recently, softmax regression loss has achieved good results in multi-label image annotation tasks [12, 11]. This is because multi-label image annotation, as well as many other classification tasks, should consider predicted scores collectively. Inspired by this way of thinking, we employ the softmax activation function in order to classify candidate entities collectively, i.e., using a listwise method. In this case we define the loss function of ProjE_listwise as:

L(e, r, y) = − Σ_{i=1}^{|y|} [1(y_i = 1) / Σ_j 1(y_j = 1)] log(h(e, r)_i),   (8)

where the target probability (i.e., the target score) of a positive candidate is 1/(total number of positive candidates of the input instance). Similar to Eq. 7, we replace g(·) and f(·) with softmax and tanh respectively, which can be written equivalently as:

h(e, r)_i = exp(W^c[i, :] tanh(e ⊕ r) + b_p) / Σ_j exp(W^c[j, :] tanh(e ⊕ r) + b_p).   (9)

Later, we perform a comprehensive set of experiments that compare ProjE with more than a dozen related models and discuss the proposed ProjE_pointwise and ProjE_listwise variants in depth.

2.3 Candidate Sampling

Although ProjE limits the number of additional parameters, the projection operation may be costly due to the large number of candidate entities (i.e., the number of rows in W^c). If we reduce the number of candidate entities in the training phase, we can create a smaller working set that only contains a subset of the embedding matrix W_E. With this in mind, we use candidate sampling to reduce the number of candidate entities. Candidate sampling is not a new problem; many recent works have addressed this problem in interesting ways [16, 26, 13]. We experimented with many choices and found that the negative sampling used in Word2Vec [26] resulted in the best performance.
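The listwise loss of Eqs. 8 and 9 can be sketched as a cross-entropy between softmax scores and a target that spreads probability mass 1/|positives| over the positive candidates; the toy logits below stand in for W^c[i, :] tanh(e ⊕ r) + b_p:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))     # shift by the max for numerical stability
    return z / z.sum()

def listwise_loss(logits, y):
    """ProjE_listwise loss (Eq. 8): cross-entropy between the softmax ranking
    scores (Eq. 9) and a target distribution putting 1/|positives| on each
    positive candidate and 0 on every negative one."""
    probs = softmax(logits)
    target = (y == 1) / np.sum(y == 1)
    return -np.sum(target * np.log(probs))

# Toy instance: 4 candidates, two of which are positive.
logits = np.array([1.5, 1.2, -0.3, -1.0])
y = np.array([1, 1, 0, 0])
print(listwise_loss(logits, y))
```

Unlike the pointwise loss, the softmax normalization couples all candidates: raising one candidate's score necessarily lowers the others', which imposes the collective ordering the listwise method is after.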
For a given entity e, relationship r, and a binary label vector y, we compute the projection with all of the positive candidates and only a sampled subset of negative candidates from P_y, following the convention of Word2Vec. For simplicity, P_y can be replaced by a (0, 1) binomial distribution B(1, p_y) shared by all training instances, where p_y is the probability that a negative candidate is sampled and 1 − p_y is the probability that a negative candidate is not sampled. For every negative candidate in y we sample a value from B(1, p_y) to determine whether to include this candidate in the candidate-entity matrix W^c or not. In the Supplementary Material we evaluate the performance of ProjE with different candidate sampling rates p_y ∈ {5%, 25%, 50%, 75%, 95%}. Our experiments show relatively consistent performance using negative sampling rates as low as 25%.

3 Experiments

We evaluate the ProjE model with entity prediction and relationship prediction tasks, and compare the performance against several existing methods using experimental procedures, datasets, and metrics established in the related work. The FB15K dataset is a 15,000-entity subset of Freebase; the Semantic MEDLINE Database (SemMedDB) is a KG extracted from all of PubMed [20]; and DBpedia is a KG extracted from Wikipedia infoboxes [23]. Using DBpedia and SemMedDB, we also introduce a new fact checking task for a practical case study on the usefulness of these models. ProjE is implemented in Python using TensorFlow [1]; the code and data are available at https://github.com/nddsg/ProjE .

Table 2: Entity prediction on FB15K dataset. Missing values indicate scores not reported in the original work.

                             Mean Rank         HITS@10 (%)
Algorithm                    Raw    Filtered   Raw    Filtered
Unstructured                 1074   979        4.5    6.3
RESCAL                       828    683        28.4   44.1
SE                           273    162        28.8   39.8
SME (linear)                 274    154        30.7   40.8
SME (bilinear)               284    158        31.3   41.3
LFM                          283    164        26.0   33.1
TransE                       243    125        34.9   47.1
DKRL (CNN)                   200    113        44.3   57.6
TransH                       212    87         45.7   64.4
TransR                       198    77         48.2   68.7
TransE + Rev                 205    63         47.9   70.2
HolE                         -      -          -      73.9
PTransE (ADD, len-2 path)    200    54         51.8   83.4
PTransE (RNN, len-2 path)    242    92         50.6   82.2
PTransE (ADD, len-3 path)    207    58         51.4   84.6
TransA                       164    58         -      -
ProjE_pointwise              174    104        56.5   86.6
ProjE_listwise               146    76         54.6   71.2
ProjE_wlistwise              124    34         54.7   88.4

Table 3: Relationship prediction on FB15K dataset.

                             Mean Rank         HITS@1 (%)
Algorithm                    Raw    Filtered   Raw    Filtered
TransE                       2.8    2.5        65.1   84.3
TransE + Rev                 2.6    2.3        67.1   86.7
DKRL (CNN)                   2.9    2.5        69.8   89.0
PTransE (ADD, len-2 path)    1.7    1.2        69.5   93.6
PTransE (RNN, len-2 path)    1.9    1.4        68.3   93.2
PTransE (ADD, len-3 path)    1.8    1.4        68.5   94.0
ProjE_pointwise              1.6    1.3        75.6   95.6
ProjE_listwise               1.5    1.2        75.8   95.7
ProjE_wlistwise              1.5    1.2        75.5   95.6

3.1 Settings

For both entity and relationship prediction tasks, we use Adam [21] as the stochastic optimizer with default hyper-parameter settings: β₁ = 0.9, β₂ = 0.999, and ε = 1e−8. During the training phase, we apply an L1 regularizer to all parameters in ProjE and a dropout layer on top of the combination operator to prevent over-fitting. The hyper-parameters in ProjE are the learning rate lr, embedding size k, mini-batch size b, regularizer weight α, dropout probability p_d, and success probability for negative candidate sampling p_y. We set lr = 0.01, b = 200, α = 1e−5, and p_d = 0.5 for both tasks; k = 200, p_y = 0.5 for the entity prediction task; and k = 100, p_y = 0.75 for the relationship prediction task. For all tasks, ProjE was trained for at most 100 iterations, and all parameters were initialized from a uniform distribution U[−6/√k, 6/√k] as suggested by TransE [4]. ProjE can also be initialized with pre-trained embeddings.
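The binomial candidate-sampling scheme of Section 2.3 amounts to keeping every positive candidate and flipping an independent B(1, p_y) coin for each negative; a minimal sketch with a toy label vector:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_candidates(y, p_y):
    """Candidate sampling from Section 2.3: keep every positive candidate,
    and keep each negative candidate independently with probability p_y
    (one B(1, p_y) draw per negative). Returns the indices of the rows
    that would form the reduced candidate-entity matrix W_c."""
    keep_neg = rng.binomial(1, p_y, size=y.shape).astype(bool)
    keep = (y == 1) | ((y == 0) & keep_neg)
    return np.flatnonzero(keep)

# Toy label vector: positives at positions 0 and 4; sample 50% of negatives.
y = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
idx = sample_candidates(y, p_y=0.5)
print(idx)  # always contains 0 and 4, plus roughly half of the negative positions
```

With p_y = 1 this recovers the full candidate set, so the sampling rate trades training cost against coverage of the negatives, consistent with the reported stability down to p_y = 25%.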
Table 4: AUC scores of fact checking test cases on DBpedia and SemMedDB.

                     DBpedia                                                        SemMedDB
Algorithm            CapitalOf  Company CEO  NYT Bestseller  US Civil War  US Vice-President  Disease  Cell
Adamic/Adar          0.387      0.665        0.650           0.642         0.795              0.671    0.755
Semantic Proximity   0.706      0.614        0.641           0.582         0.805              0.871    0.840
SimRank              0.553      0.824        0.695           0.685         0.912              0.809    0.749
AMIE                 0.550      0.669        0.520           0.659         0.987              0.889    0.898
PPR                  0.535      0.579        0.529           0.488         0.683              0.827    0.885
PCRW                 0.550      0.542        0.486           0.488         0.672              0.911    0.765
TransE               0.655      0.728        0.601           0.612         0.520              0.532    0.620
PredPath             0.920      0.747        0.664           0.749         0.993              0.941    0.928
ProjE                0.979      0.845        0.852           0.824         1.000              0.926    0.971

3.2 Entity and Relationship Prediction

We evaluated ProjE's performance on entity and relationship prediction tasks using the FB15K dataset, following the experiment settings in TransE [4] and PTransE [24]. For entity prediction, we aim to predict a missing h (or t) for a given triple ⟨h, r, t⟩ by ranking all of the entities in the KG. To create a test set we replaced the head or tail entity with all entities in the KG, and ranked these replacement entities in descending order. For relationship prediction, we replaced the relationship of each test triple with all relationships in the KG, and ranked these replacement relationships in descending order. Following convention, we use mean rank and HITS@k as evaluation metrics. Mean rank measures the average rank of correct entities/relationships. HITS@k measures whether correct entities/relationships appear within the top-k elements. The filtered mean rank and filtered HITS@k ignore all other true entities/relationships in the result and only look at the target entity/relationship.
For example, if the target relationship between ⟨Springfield, ?, Illinois⟩ is locatedIn, and the top-2 ranked relationships are capitalOf and locatedIn, then the raw mean rank and HITS@1 of this example would be 2 and 0 respectively, but the filtered mean rank and HITS@1 would both be 1, because the filtered metrics ignore the correct capitalOf relationship in the result set.

In addition to ProjE_pointwise and ProjE_listwise, we also evaluate ProjE_wlistwise, which is a slight variation of ProjE_listwise that incorporates instance-level weights (Σ_i 1(y_i = 1)) to increase the importance of N-to-N and N-to-1 (1-to-N) relationships. Table 2 and Tab. 3 show that the three ProjE variants outperform existing methods in most cases. Table 3 contains fewer models than Tab. 2 because many models do not perform the relationship prediction task. We also adapted the pointwise and listwise ranking methods to TransE using the same hyperparameter settings, but the performance did not improve significantly and is not shown here. This indicates that the pointwise and listwise ranking methods are not merely simple tricks that can be added to any model to improve performance.

Surprisingly, although softmax is usually used in mutually exclusive multi-class classification problems and sigmoid is a more natural choice for non-exclusive cases like the KGC task, our results show that both ProjE_listwise and ProjE_wlistwise perform better than ProjE_pointwise in most cases. This is because KGC is a special ranking task, where a good model ought to have the following properties: 1) the score of all positive candidates should be maximized and the score of all negative candidates should be minimized, and 2) the number of positive candidates that are ranked above negative candidates should be maximized.
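The raw versus filtered evaluation protocol described above (cf. the Springfield example) can be sketched in plain Python:

```python
def raw_and_filtered_rank(ranked, target, known_true):
    """Rank of `target` in a ranked candidate list, raw and filtered.

    The filtered rank drops every *other* known-true candidate from the list
    before computing the target's position, so a model is not penalized for
    ranking a different correct answer above the target."""
    raw = ranked.index(target) + 1
    filtered_list = [c for c in ranked if c == target or c not in known_true]
    filtered = filtered_list.index(target) + 1
    return raw, filtered

# The <Springfield, ?, Illinois> example: capitalOf is also a true relationship,
# so the filtered rank of locatedIn improves from 2 to 1.
ranked = ["capitalOf", "locatedIn", "partOf"]
raw, filt = raw_and_filtered_rank(ranked, "locatedIn", {"capitalOf", "locatedIn"})
print(raw, filt)  # 2 1
```

Averaging these ranks over a test set yields the raw and filtered mean ranks reported in Tables 2 and 3; HITS@k is the fraction of test cases whose rank is at most k.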
By maximizing the similarity between the ranking score vector and the binary label vector, ProjE_pointwise meets the first property but fails to meet the second; i.e., ProjE_pointwise does not address the ranking order of all candidates collectively, because sigmoid is applied to each candidate individually. On the other hand, ProjE_listwise and ProjE_wlistwise successfully address both properties by maximizing the similarity between the binary label vector and the ranking score vector, which is an exponentially-normalized ranking score vector that imposes an explicit ordering on the candidate entities collectively. In the Supplementary Material we also examine the stability of the proposed ProjE model and demonstrate that the performance of ProjE increases steadily and smoothly during training.

3.3 Fact Checking

Unlike the entity prediction and relationship prediction tasks that predict randomly sampled triples, we employ a new fact checking task that tests the predictive power of various models on real-world questions. We view the fact checking task as a type of link prediction problem because a fact statement ⟨h, r, t⟩ can naturally be considered as an edge in a KG. We use ProjE_wlistwise with a small change: rather than using entity embeddings directly, the input vector of ProjE consists of the predicate paths between the two entities [32]. We learn the entity embeddings by adding an input layer that converts input predicate paths into the entity embedding. We employ the experimental setup and question set from Shi and Weninger (2016) on the DBpedia and SemMedDB datasets. Specifically, we remove all edges having the same label as the input relationship r and perform fact checking on the modified KG by predicting the existence of r on hundreds of variations of 7 types of questions. For example, the CapitalOf question checks various claims of the capitals of US states.
In this case, we check whether each of the 5 most populous cities within each state is its capital. This results in about 50 × 5 = 250 checked facts with a 20/80 positive-to-negative label ratio. The odds that some fact statement is true are equivalent to the odds that the fact's triple is missing from the KG (rather than purposefully omitted, i.e., a true negative). Results in Tab. 4 show that ProjE outperforms existing fact checking and link prediction models [2, 6, 17, 9, 15, 22] on all but one question type.

4 Conclusions and Future Work

To recap, the contributions of the present work are as follows: 1) we view the KGC task as a ranking problem and project candidate entities onto a vector representing a combined embedding of the known parts of an input triple, ordering the ranking score vector in descending order; 2) we show that by optimizing the ranking score vector collectively using the listwise ProjE variation, we can significantly improve prediction performance; 3) ProjE uses only directly connected, length-1 paths during training, and has a relatively simple 2-layer structure, yet outperforms complex models that have a richer parameter or feature set; and 4) unlike other models (e.g., CVSM, RTransE, DKRL), the present work does not require any pre-trained embeddings and has many fewer parameters than related models. We finally show that ProjE can outperform existing methods on fact checking tasks.

For future work, we will adapt more complicated neural network models such as RNNs and CNNs to the embedding projection model presented here. It is also possible to incorporate rich feature sets from length-2 and length-3 paths, but these would necessarily add additional complexity. Instead, we plan to use information from complex paths in the KG to clearly summarize the many complicated ways in which entities are connected.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] L. A. Adamic and E. Adar. Friends and neighbors on the Web. Social Networks, 25(3):211–230, 2003.
[3] A. Bordes, X. Glorot, J. Weston, and Y. Bengio. Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing. AISTATS, pages 127–135, 2012.
[4] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko. Translating Embeddings for Modeling Multi-relational Data. In NIPS, pages 2787–2795, 2013.
[5] A. Bordes, J. Weston, R. Collobert, and Y. Bengio. Learning Structured Embeddings of Knowledge Bases. AAAI, 2011.
[6] G. L. Ciampaglia, P. Shiralkar, L. M. Rocha, J. Bollen, F. Menczer, and A. Flammini. Computational fact checking from knowledge networks. PLoS ONE, 10(6), 2015.
[7] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP-CoNLL, volume 7, pages 708–716, 2007.
[8] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: a web-scale approach to probabilistic knowledge fusion. In SIGKDD, Aug. 2014.
[9] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek. AMIE: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, pages 413–422, 2013.
[10] A. García-Durán, A. Bordes, and N. Usunier. Composing Relationships with Translations. EMNLP, pages 286–290, 2015.
[11] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894, 2013.
[12] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV, pages 309–316. IEEE, 2009.
[13] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, pages 297–304, 2010.
[14] B. Hachey, W. Radford, J. Nothman, M. Honnibal, and J. R. Curran. Evaluating entity linking with Wikipedia. AI, 194:130–150, 2013.
[15] T. H. Haveliwala. Topic-sensitive PageRank. In WWW, pages 517–526, 2002.
[16] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On Using Very Large Target Vocabulary for Neural Machine Translation. ACL, pages 1–10, 2015.
[17] G. Jeh and J. Widom. SimRank: a measure of structural context similarity. In KDD, pages 538–543, 2002.
[18] R. Jenatton, N. Le Roux, A. Bordes, and G. Obozinski. A latent factor model for highly multi-relational data. NIPS, pages 3176–3184, 2012.
[19] Y. Jia, Y. Wang, H. Lin, X. Jin, and X. Cheng. Locally Adaptive Translation for Knowledge Graph Embedding. In AAAI, 2016.
[20] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, and T. C. Rindflesch. SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinformatics, 28(23):3158–3160, 2012.
[21] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] N. Lao and W. W. Cohen. Relational retrieval using a combination of path-constrained random walks. ML, 81(1):53–67, 2010.
[23] J. Lehmann, R. Isele, and M. Jakob. DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 5(1):167–195, 2014.
[24] Y. Lin, Z. Liu, and M. Sun. Modeling Relation Paths for Representation Learning of Knowledge Bases. EMNLP, pages 705–714, 2015.
[25] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning Entity and Relation Embeddings for Knowledge Graph Completion. In AAAI, pages 2181–2187, 2015.
[26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[27] A. Neelakantan, B. Roth, and A. McCallum.
Compositional V ector Space Models for Knowledge Base Inference. AAAI , 2015. 10 [28] M. Nickel, K. Murphy , V . Tresp, and E. Gabrilovich. A revie w of relational machine learning for knowledge graphs: From multi-relational link prediction to automated kno wledge graph construction. arXiv pr eprint arXiv:1503.00759 , 2015. [29] M. Nickel, L. Rosasco, and P . T omaso. Holographic Embeddings of Knowledge Graphs. In AAAI , 2016. [30] M. Nickel, V . T resp, and H.-P . Kriegel. A Three-W ay Model for Collectiv e Learning on Multi-Relational Data. In ICML , pages 809–816, 2011. [31] H. H. Pareek and P . K. Ravikumar . A representation theory for ranking functions. In NIPS , pages 361–369, 2014. [32] B. Shi and T . W eninger . Fact checking in heterogeneous information netw orks. In WWW , pages 101–102, 2016. [33] R. Socher , D. Chen, C. D. Manning, and A. Y . Ng. Reasoning W ith Neural T ensor Networks for Knowledge Base Completion. In NIPS , pages 926–934, 2013. [34] C. Unger , L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber , and P . Cimiano. T emplate-based question answering ov er rdf data. In WWW , pages 639–648, 2012. [35] Z. W ang, J. Zhang, J. Feng, and Z. Chen. Knowledge Graph Embedding by T ranslating on Hyperplanes. AAAI , pages 1112–1119, 2014. [36] R. Xie, Z. Liu, J. Jia, H. Luan, and M. Sun. Representation Learning of Knowledge Graphs with Entity Descriptions . AAAI , pages 189–205, 2016. [37] T . Zhou, J. Ren, M. Medo, and Y .-C. Zhang. Bipartite network projection and personal recommendation. Phys. Rev . E , 76(4):046–115, Oct. 2007. A A ppendix In this supplement we provide a detailed algorithm description of the proposed ProjE_wlistwise model in Alg. 1, of which ProjE_listwise is a special case. Next, two more experiments are sho wn to demonstrate the training stability and scaling potential of ProjE. A.1 T raining ProjE In Alg. 1, we describe the training process of ProjE_wlistwise. 
For a given training triple set S, we first construct the actual training set by randomly corrupting either the head entity h or the tail entity t, and then generate the corresponding positive and negative candidates from S, using candidate sampling if requested. Then, for each mini-batch in the newly generated training data, we calculate the loss and update the parameters accordingly.

A.2 Model Stability

In order to assess training stability, we plotted the mean rank, filtered mean rank, HITS@10, and filtered HITS@10 over the first 25 training iterations on the FB15K dataset. For the purpose of illustration, we also draw three dashed lines representing the top-3 existing models that achieved the best performance in each metric. As shown in Fig. 2, the performance of all three ProjE variants becomes stable after the first few iterations due to the use of the Adam optimizer. The score variation between iterations is also low, indicating stable training progress. The ProjE_wlistwise variant performed best across all tests, followed by ProjE_listwise and ProjE_pointwise, respectively.

Algorithm 1: ProjE_wlistwise training. ◦ is the Hadamard product and × is the matrix product.

Input: training triples S = {(h, r, t)}, entities E, relations R, embedding dimension k, dropout probability p_d, candidate sampling rate p_y, regularizer parameter α.

Initialize embedding matrices W^E, W^R and combination operators (diagonal matrices) D_eh, D_rh, D_et, D_rt with uniform(−6/√k, 6/√k).

Loop                                        /* one training iteration/epoch */
  S_h ← {}, T_h ← {}, S_t ← {}, T_t ← {}    /* training data */
  for (h, r, t) ∈ S do                      /* construct training data using all training triples */
    e ← random(h, t)
    if e == h then                          /* tail is missing */
      S_h.add([e, r])
      /* all positive tails from S and some sampled negative candidates */
      T_h.add({t′ | (h, r, t′) ∈ S} ∪ sample(E, p_y))
    else                                    /* head is missing */
      S_t.add([e, r])
      /* all positive heads from S and some sampled negative candidates */
      T_t.add({h′ | (h′, r, t) ∈ S} ∪ sample(E, p_y))
    end
  end
  for each (S_h^b, T_h^b, S_t^b, T_t^b) ⊂ (S_h, T_h, S_t, T_t) do   /* mini-batches */
    l ← 0
    for (s_h, t_h, s_t, t_t) ∈ (S_h^b, T_h^b, S_t^b, T_t^b) do      /* training instance */
      o_h ← softmax(W^E[t_t, :] × tanh(dropout(p_d, D_et × (W^E[s_t[0], :])^T + D_rt × (W^R[s_t[1], :])^T + b_c)) + b_p)
      o_t ← softmax(W^E[t_h, :] × tanh(dropout(p_d, D_eh × (W^E[s_h[0], :])^T + D_rh × (W^R[s_h[1], :])^T + b_c)) + b_p)
      l ← l − Σ({1((h, s_t[1], s_t[0]) ∈ S) | h ∈ t_t} ◦ log(o_h))
             − Σ({1((s_h[0], s_h[1], t) ∈ S) | t ∈ t_h} ◦ log(o_t))
    end
    /* L1 regularization */
    l_r ← Regu_1(W^E) + Regu_1(W^R) + Regu_1(D_eh) + Regu_1(D_rh) + Regu_1(D_et) + Regu_1(D_rt)
    update all parameters w.r.t. l + α·l_r
  end
EndLoop

A.3 Candidate Sampling

In order to evaluate the relationship between the sampling rate and model performance, we plotted five different p_y rates from 5% to 95% using the ProjE_wlistwise variant. All settings except p_y = 5% achieved better performance than the top-3 existing methods in each metric. These results demonstrate that we can use ProjE with a relatively small sampling rate (25%), and that ProjE is robust in the presence of different positive-to-negative training data ratios. Indeed, we find that the best results are often achieved under the 25% sampling ratio. This robustness gives ProjE the ability to handle very large datasets by significantly reducing the active working set.

Figure 2: ProjE variants on the FB15K dataset: (a) Mean Rank, (b) Filtered Mean Rank, (c) HITS@10, (d) Filtered HITS@10. Each plot contains three dashed lines representing the top-3 existing models that achieved the best performance in each metric.
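The per-instance score computation in Alg. 1 can be sketched in plain NumPy. This is a minimal sketch, not the authors' implementation: the matrix sizes, the entity/relation indices, and the positive-tail set below are hypothetical, and dropout and regularization are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: |E| entities, |R| relations, embedding dimension k.
num_entities, num_relations, k = 100, 20, 16

# Embedding matrices and diagonal combination operators, initialized
# uniformly in [-6/sqrt(k), 6/sqrt(k)] as in Alg. 1; the diagonal
# matrices are stored as vectors so "D x v" is an elementwise product.
bound = 6.0 / np.sqrt(k)
W_E = rng.uniform(-bound, bound, (num_entities, k))
W_R = rng.uniform(-bound, bound, (num_relations, k))
D_e = rng.uniform(-bound, bound, k)
D_r = rng.uniform(-bound, bound, k)
b_c = np.zeros(k)   # combination bias
b_p = 0.0           # projection bias

def softmax(x):
    z = np.exp(x - x.max())   # shift for numerical stability
    return z / z.sum()

def proje_scores(h, r, candidates):
    """Score candidate tail entities for the query (h, r, ?)."""
    # Combination layer: tanh(D_e e + D_r r + b_c).
    combined = np.tanh(D_e * W_E[h] + D_r * W_R[r] + b_c)
    # Projection layer: dot each candidate embedding with the combined vector.
    logits = W_E[candidates] @ combined + b_p
    return softmax(logits)

candidates = np.arange(num_entities)
scores = proje_scores(h=3, r=5, candidates=candidates)

# Listwise loss: cross-entropy of the softmax scores against the binary
# indicator of known positive tails (a toy positive set here).
positives = np.zeros(num_entities)
positives[[7, 42]] = 1.0
loss = -np.sum(positives * np.log(scores + 1e-12))
```

Because the candidate scores pass through a single softmax, one gradient step moves the whole ranking score vector collectively, which is what distinguishes the listwise variants from pointwise training.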
Figure 3: ProjE_wlistwise with different candidate sampling rates p_y on FB15K: (a) Mean Rank, (b) Filtered Mean Rank, (c) HITS@10, (d) Filtered HITS@10. Each plot contains three dashed lines representing the top-3 existing models that achieved the best performance in each metric.
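The sample(E, p_y) step of Alg. 1 can be sketched as follows. This is a sketch under the assumption that each non-positive entity is kept as a negative candidate independently with probability p_y; the function name and toy entity IDs are hypothetical.

```python
import random

def sample_candidates(all_entities, positive_tails, p_y, rng=random.Random(0)):
    """Build the candidate list for one training pair (h, r): keep every
    known positive tail, and keep each remaining entity as a negative
    candidate independently with probability p_y."""
    positives = set(positive_tails)
    negatives = [e for e in all_entities
                 if e not in positives and rng.random() < p_y]
    return sorted(positives), negatives

entities = range(1000)
pos, neg = sample_candidates(entities, positive_tails=[3, 17], p_y=0.25)
# With p_y = 0.25, roughly a quarter of the 998 non-positive entities are
# retained, shrinking the active working set per training instance.
```

Shrinking the candidate list this way is what lets the softmax and cross-entropy in Alg. 1 run over a few hundred entities instead of the full entity set, which is the scaling behavior examined in Fig. 3.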