Relevance Score of Triplets Using Knowledge Graph Embedding - The Pigweed Triple Scorer at WSDM Cup 2017
📝 Abstract
Collaborative Knowledge Bases such as Freebase and Wikidata mention multiple professions and nationalities for a particular entity. The goal of the WSDM Cup 2017 Triplet Scoring Challenge was to calculate relevance scores between an entity and its professions/nationalities. Such scores are a fundamental ingredient when ranking results in entity search. This paper proposes a novel approach to ensemble an advanced Knowledge Graph Embedding Model with a simple bag-of-words model. The former deals with hidden pragmatics and deep semantics whereas the latter handles text-based retrieval and low-level semantics.
📄 Content
Many entities have multiple professions or nationalities, and it is often desirable to rank the relevance of these individual triplets. The goal of the challenge was to compute a score in the range [0, 7] that measures the relevance of the statement expressed by an individual triplet relative to other triplets from the same relation. Participants were provided with a list of 385,426 entities along with five files:
• profession.kb: all professions for a set of 343,329 entities
• nationality.kb: all nationalities for a set of 301,590 entities
• profession.train: relevance scores for 515 tuples (pertaining to 134 entities) from profession.kb
• nationality.train: relevance scores for 162 tuples (pertaining to 77 entities) from nationality.kb
• wiki-sentences: 33,159,353 sentences from Wikipedia with annotations of the 385,426 entities
Apart from these, the participants were allowed to use any kind or amount of additional data (except for human judgments). The output of this task was relevance scores for all the triplets, 0 being the lowest relevance and 7 the highest.
We used a pipeline-based ensemble model, the system diagram of which is shown in Figure 1.
The first part of our approach makes use of a shallow learning technique for Knowledge Graph Embedding, TransR [4]. TransR is a translation-based embedding model which, given a fact of a knowledge base represented by a triplet (h, r, t), where h, r, and t denote a head entity, a relation, and a tail entity respectively, aims to learn embedding vectors h, r, and t.
In TransR, for each triplet (h, r, t), entity embeddings are set as h, t ∈ R^k and the relation embedding is set as r ∈ R^d. Note that the dimensions of entity embeddings and relation embeddings are not necessarily identical, i.e., possibly k ≠ d. For each relation r, TransR sets a projection matrix M_r ∈ R^{k×d}, which projects entities from the entity space into the relation space. With this mapping matrix, TransR defines the projected vectors of entities as h_r = h M_r and t_r = t M_r.
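The relation-specific projection can be sketched in a few lines of NumPy. The dimensions and random values below are purely illustrative; they only demonstrate that entity and relation spaces may have different sizes.

```python
import numpy as np

# Illustrative dimensions: entity space R^k, relation space R^d (k != d allowed)
k, d = 100, 50
rng = np.random.default_rng(0)

h = rng.normal(size=k)         # head entity embedding
t = rng.normal(size=k)         # tail entity embedding
M_r = rng.normal(size=(k, d))  # projection matrix M_r for relation r

# Project both entities into the relation space: h_r = h M_r, t_r = t M_r
h_r = h @ M_r
t_r = t @ M_r
```

After projection, both h_r and t_r live in the d-dimensional relation space, so they can be compared against the relation vector r directly.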
TransR defines the score function as

f_r(h, t) = ||h_r + r − t_r||₂²

where a lower score indicates a more plausible triplet.
The training phase uses a margin-based ranking loss to encourage discrimination between golden triplets and incorrect triplets:

L = Σ_{(h, r, t) ∈ S} Σ_{(h′, r, t′) ∈ S′} [γ + f_r(h, t) − f_r(h′, t′)]₊

where [x]₊ = max(0, x), S is the set of positive (golden) triplets, and S′ denotes the set of negative triplets generated by randomly replacing heads or tails in the triplets (h, r, t). This step is known as negative sampling. γ is the margin separating positive and negative triplets.
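The score function and the hinge term of the margin loss can be illustrated on a single positive/negative pair. The embeddings below are toy values, not trained TransR parameters: the positive tail is constructed to nearly satisfy the translation h_r + r ≈ t_r, while the negative tail is a random replacement, as in negative sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy relation-space dimension

def f_r(h_r, r, t_r):
    # TransR score f_r(h, t) = ||h_r + r - t_r||_2^2 ; lower = more plausible
    return float(np.sum((h_r + r - t_r) ** 2))

r = rng.normal(size=d)
h_r = rng.normal(size=d)
t_pos = h_r + r + 0.01 * rng.normal(size=d)  # golden tail: near-perfect translation
t_neg = rng.normal(size=d)                   # corrupted tail from negative sampling

gamma = 1.0  # margin
# One term of the ranking loss: [gamma + f_r(pos) - f_r(neg)]_+
loss = max(0.0, gamma + f_r(h_r, r, t_pos) - f_r(h_r, r, t_neg))
```

Minimising this loss pushes golden triplets to score at least γ lower than their corrupted counterparts.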
The role of TransR in our approach is to learn embeddings of entities such that entities related to each other lie closer in Euclidean space. The approach is similar to using a Word2Vec [5] model, except that TransR uses triplet data as training input.
In order to achieve a semantically sound embedding, we introduced additional triplet data from the Wikidata knowledge base. Apart from the triplet dataset provided, we extracted triplets for relations such as place_of_birth, place_of_death, employer, country, capital, etc. Such relations are closely related to the relations profession and nationality. The idea was to provide the model with more knowledge about the entities so that it can learn a semantically well-organised vector embedding.
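The augmentation step amounts to filtering the extra Wikidata triplets down to a whitelist of related relations and merging them with the provided training set. The function name and example triplets below are hypothetical; the paper does not specify the exact merging code.

```python
# Relations from the paper that are closely related to profession/nationality
RELATED_RELATIONS = {
    "place_of_birth", "place_of_death", "employer", "country", "capital",
}

def augment(base_triplets, wikidata_triplets):
    """Keep only Wikidata triplets whose relation is in the whitelist,
    then append them to the provided triplet dataset."""
    extra = [(h, r, t) for (h, r, t) in wikidata_triplets
             if r in RELATED_RELATIONS]
    return base_triplets + extra

base = [("Marie Curie", "profession", "Physicist")]
wikidata = [("Marie Curie", "place_of_birth", "Warsaw"),
            ("Marie Curie", "spouse", "Pierre Curie")]  # spouse is filtered out
training_set = augment(base, wikidata)
```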
We used the trained embeddings to rank all the triplets in the training dataset using the score function f_r(h, t). A profession/nationality more closely related to a particular entity had a relatively smaller score-function value than the entity's other professions/nationalities.
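The ranking step can be sketched as sorting an entity's candidate professions by ascending score. The embeddings and profession names here are placeholders standing in for trained TransR parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

def f_r(h_r, r, t_r):
    # TransR score: smaller value = more relevant triplet
    return float(np.sum((h_r + r - t_r) ** 2))

# Placeholder projected embeddings for one entity and its candidate professions
r_prof = rng.normal(size=d)
h_r = rng.normal(size=d)
candidates = {name: rng.normal(size=d) for name in ("actor", "model", "director")}

# Rank the entity's professions by ascending score function value
ranking = sorted(candidates, key=lambda p: f_r(h_r, r_prof, candidates[p]))
```

The first element of `ranking` is the profession TransR considers most relevant for the entity.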
Keeping in mind that the scores were crowdsourced using human judgement, and after analysing the training files, we made the following observations:
• Similar profession pairs, such as [Actor, Model] or [Singer, Composer, Songwriter], had similar scores in the sample data provided
• In case of multiple nationalities, scores were biased towards the birth country
• TransR predicted the most relevant profession/nationality of a particular entity correctly, but had difficulty ranking the other, not so relevant, professions/nationalities
In order to modify the profession ranking based on these observations, we used an already-trained Word2Vec model to generate a 200×200 similarity matrix for all the given professions. We then used a naive threshold-based linear combination to modify the ranking obtained by TransR.
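A threshold-based linear combination of this kind might look like the sketch below. The threshold and blending weight are hypothetical knobs; the paper does not report its exact coefficients.

```python
import numpy as np

def adjust(scores, sim, top, threshold=0.6, alpha=0.5):
    """Pull the score of any profession whose Word2Vec similarity to the
    top-ranked profession exceeds `threshold` toward that top score.
    scores: relevance scores derived from TransR (lower = more relevant)
    sim:    n x n Word2Vec similarity matrix over the profession labels
    top:    index of the entity's most relevant profession"""
    adjusted = scores.copy()
    for i in range(len(scores)):
        if i != top and sim[top, i] > threshold:
            adjusted[i] = alpha * scores[i] + (1.0 - alpha) * scores[top]
    return adjusted

# Toy 3-profession example: index 0 is the most relevant profession
scores = np.array([0.1, 0.9, 0.8])
sim = np.array([[1.0, 0.8, 0.2],   # profession 1 is similar to the top one
                [0.8, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
new_scores = adjust(scores, sim, top=0)
```

Only the profession similar to the top-ranked one is pulled toward the top score; dissimilar professions keep their original TransR-derived scores.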
Based on observation 3, we improved the ranking of professions which were similar to the most relevant profession of an entity, using the Word2Vec similarity matrix. For example, according to TransR, A. R. Rahman's most relevant profession was singer-songwriter, so we improved the rankings of closely similar professions such as singer and songwriter.
This content is AI-processed based on ArXiv data.