📝 Original Info
- Title: Retrofitting Concept Vector Representations of Medical Concepts to Improve Estimates of Semantic Similarity and Relatedness
- ArXiv ID: 1709.07357
- Date: 2017-09-22
- Authors: Zhiguo Yu (University of Texas School of Biomedical Informatics, Houston, USA); Byron C. Wallace (College of Computer and Information Science, Northeastern University, Boston, USA); Todd Johnson (University of Texas School of Biomedical Informatics, Houston, USA); Trevor Cohen (University of Texas School of Biomedical Informatics, Houston, USA)
📝 Abstract
Estimation of semantic similarity and relatedness between biomedical concepts has utility for many informatics applications. Automated methods fall into two categories: methods based on distributional statistics drawn from text corpora, and methods using the structure of existing knowledge resources. Methods in the former category disregard taxonomic structure, while those in the latter fail to consider semantically relevant empirical information. In this paper, we present a method that retrofits distributional context vector representations of biomedical concepts using structural information from the UMLS Metathesaurus, such that the similarity between vector representations of linked concepts is augmented. We evaluated it on the UMNSRS benchmark. Our results demonstrate that retrofitting of concept vector representations leads to better correlation with human raters for both similarity and relatedness, surpassing the best results reported to date. They also demonstrate a clear improvement in performance on this reference standard for retrofitted vector representations, as compared to those without retrofitting.
📄 Full Content
Retrofitting Concept Vector Representations of Medical Concepts to Improve Estimates of Semantic Similarity and Relatedness
Zhiguo Yu^a, Byron C. Wallace^b, Todd Johnson^a, Trevor Cohen^a
^a The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
^b College of Computer and Information Science, Northeastern University, Boston, Massachusetts, USA
Abstract
Estimation of semantic similarity and relatedness between biomedical concepts has utility for many informatics applications. Automated methods fall into two categories: methods based on distributional statistics drawn from text corpora, and methods using the structure of existing knowledge resources. Methods in the former category disregard taxonomic structure, while those in the latter fail to consider semantically relevant empirical information. In this paper, we present a method that retrofits distributional context vector representations of biomedical concepts using structural information from the UMLS Metathesaurus, such that the similarity between vector representations of linked concepts is augmented. We evaluated it on the UMNSRS benchmark. Our results demonstrate that retrofitting of concept vector representations leads to better correlation with human raters for both similarity and relatedness, surpassing the best results reported to date. They also demonstrate a clear improvement in performance on this reference standard for retrofitted vector representations, as compared to those without retrofitting.
Keywords: Semantic Measures, Word Embedding, Distributional Semantics, Taxonomy
Introduction
Incorporation of semantically related terms and concepts can improve the retrieval [1; 2] and clustering [3] of biomedical documents; enhance literature-based discovery [4; 5]; and support the development of biomedical terminologies and ontologies [6]. However, automated estimation of the semantic relatedness between medical terms in a manner consistent with human judgment remains a challenge in the biomedical domain. Many existing semantic relatedness measures leverage the structure of an ontology or taxonomy (e.g., WordNet, the Unified Medical Language System (UMLS), or the Medical Subject Headings (MeSH)) to calculate, for example, the shortest path between concept nodes [7-9]. Alternatively, vector representations derived from distributional statistics drawn from a corpus of text can be used to calculate the relatedness between concepts [7; 10]. Other corpus-based methods use information content (IC) to estimate the semantic relatedness between two concepts from the probability of these concepts co-occurring [9; 11; 12]. This raises the question of whether knowledge- or corpus-based metrics are most consistent with human judgment.
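In the distributional approach just described, relatedness between two concepts is typically scored as the cosine of the angle between their context vectors. A minimal sketch of that scoring step (the concept names and vector values below are invented for illustration, not taken from the paper):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two distributional context vectors.
    Returns 1.0 for identical directions and 0.0 for orthogonal vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Hypothetical context vectors for two related concepts.
diabetes = np.array([0.8, 0.1, 0.6])
insulin = np.array([0.7, 0.2, 0.5])
score = cosine_similarity(diabetes, insulin)
```

A high cosine score indicates that the two concepts occur in similar distributional contexts in the corpus.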
In 2012, Garla and Brandt [13] evaluated a wide range of lexical semantic measures, including both knowledge-based approaches leveraging the structure of an ontology or taxonomy [7; 14; 15] and distributional (corpus-based) approaches relying on co-occurrence statistics to estimate relatedness between concepts [16; 17]. This systematic investigation used several publicly available benchmarks. The most comprehensive of these is the University of Minnesota Semantic Relatedness Standards (UMNSRS), which contains the largest number and diversity of medical term pairs of any reference standard to date [18]. Medical terms in the set have been mapped to Concept Unique Identifiers (CUIs) in the UMLS, and term pairs have been annotated by human raters for similarity (e.g., Lipitor and Zocor are similar) and relatedness (e.g., Diabetes and Insulin are related). The best Spearman rank correlations for relatedness and similarity on this benchmark reported in [13] are 0.39 and 0.46, respectively.
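Agreement with human raters on such a benchmark is measured by Spearman rank correlation, i.e., the Pearson correlation of the rank-transformed scores. A minimal sketch, assuming no tied scores (in practice a library routine that handles ties, such as scipy.stats.spearmanr, would be used):

```python
import numpy as np

def spearman(human, model):
    """Spearman rank correlation: Pearson correlation of the ranks of
    the human ratings and the model's similarity estimates.
    Assumes no ties among the scores."""
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(1, len(x) + 1)
        return r
    h, m = ranks(np.asarray(human)), ranks(np.asarray(model))
    return float(np.corrcoef(h, m)[0, 1])
```

Because only ranks matter, any monotonic rescaling of the model's scores leaves the correlation unchanged, which is why Spearman correlation is the standard metric for this benchmark.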
Neural network-based models that are trained to predict terms neighboring observed terms, such as the architectures implemented by the word2vec package [19], have gained popularity as a way to obtain distributional vector representations of terms. Vectors induced in this way have been shown to effectively capture analogical relationships between words [20], and under optimized hyperparameter settings these models have been shown to achieve better correlation with human judgment than prior distributional models such as Pointwise Mutual Information (PMI) and Latent Semantic Analysis (LSA) on some word similarity and analogy reference datasets [21; 22]. However, embedding models are trained on terms, not concepts. In 2014, De Vine and colleagues [23] demonstrated that word embedding models trained on sequences of UMLS concepts (rather than sequences of terms) outperformed established corpus-based approaches such as Random Indexing [24] and LSA [25].
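The idea of training on concept sequences can be made concrete: each sentence is first mapped to a sequence of CUIs, and a skip-gram model is then trained to predict context concepts from each target concept. A sketch of how the (target, context) training pairs would be extracted from such sequences (the CUI values below are placeholders for illustration, not drawn from the paper):

```python
def skipgram_pairs(sequences, window=2):
    """Yield (target, context) pairs from concept (CUI) sequences --
    the training examples a skip-gram embedding model is fit on."""
    for seq in sequences:
        for i, target in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield target, seq[j]

# One hypothetical "sentence" rendered as a CUI sequence.
pairs = list(skipgram_pairs([["C0011849", "C0202098", "C0021641"]], window=1))
```

Feeding concept identifiers rather than surface terms means that synonymous terms mapped to the same CUI contribute to a single vector, which is what allows these models to operate at the concept level.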
In 2014, Sajadi et al. reported that a graph-based approach (HITS-sim) leveraging Wikipedia as a network outperformed word2vec trained on the OHSUMED corpus on the UMNSRS benchmark, with
…(Full text truncated)…
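Although the full text is truncated before the methods section, the retrofitting procedure summarized in the abstract follows the general scheme popularized by Faruqui et al.: each concept vector is iteratively pulled toward its neighbors in the knowledge graph while remaining anchored to its original distributional position. A hedged sketch of that generic update (the paper's exact weighting and choice of UMLS edges are not reproduced here):

```python
import numpy as np

def retrofit(vectors, edges, alpha=1.0, beta=1.0, iterations=10):
    """Generic retrofitting: nudge each concept vector toward its
    graph neighbors while staying close to its original position.

    vectors: dict mapping concept id -> 1-D numpy array (left unmodified)
    edges:   dict mapping concept id -> list of linked concept ids
    """
    # Work on copies so the original distributional vectors are preserved.
    new_vecs = {c: v.copy() for c, v in vectors.items()}
    for _ in range(iterations):
        for concept, neighbors in edges.items():
            neighbors = [n for n in neighbors if n in new_vecs]
            if concept not in new_vecs or not neighbors:
                continue
            # Weighted average of the original vector and the neighbors'
            # current vectors: the closed-form minimizer of the per-vector
            # retrofitting objective.
            num = alpha * vectors[concept]
            num = num + beta * sum(new_vecs[n] for n in neighbors)
            new_vecs[concept] = num / (alpha + beta * len(neighbors))
    return new_vecs
```

After a few iterations, vectors of concepts linked in the graph move closer together, which is precisely the "augmented similarity between vector representations of linked concepts" the abstract describes.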
This content is AI-processed based on ArXiv data.