📝 Original Info
- Title: Retrofitting Concept Vector Representations of Medical Concepts to Improve Estimates of Semantic Similarity and Relatedness
- ArXiv ID: 1709.07357
- Date: 2017-09-22
- Authors: Zhiguo Yu (University of Texas School of Biomedical Informatics, Houston, USA); Byron C. Wallace (College of Computer and Information Science, Northeastern University, Boston, USA); Todd Johnson (University of Texas School of Biomedical Informatics, Houston, USA); Trevor Cohen (University of Texas School of Biomedical Informatics, Houston, USA)
📝 Abstract
Estimation of semantic similarity and relatedness between biomedical concepts has utility for many informatics applications. Automated methods fall into two categories: methods based on distributional statistics drawn from text corpora, and methods using the structure of existing knowledge resources. Methods in the former category disregard taxonomic structure, while those in the latter fail to consider semantically relevant empirical information. In this paper, we present a method that retrofits distributional context vector representations of biomedical concepts using structural information from the UMLS Metathesaurus, such that the similarity between vector representations of linked concepts is augmented. We evaluated it on the UMNSRS benchmark. Our results demonstrate that retrofitting of concept vector representations leads to better correlation with human raters for both similarity and relatedness, surpassing the best results reported to date. They also demonstrate a clear improvement in performance on this reference standard for retrofitted vector representations, as compared to those without retrofitting.
📄 Full Content
Retrofitting Concept Vector Representations of Medical Concepts to Improve Estimates of Semantic Similarity and Relatedness
Zhiguo Yu^a, Byron C. Wallace^b, Todd Johnson^a, Trevor Cohen^a
^a The University of Texas School of Biomedical Informatics at Houston, Houston, Texas, USA
^b College of Computer and Information Science, Northeastern University, Boston, Massachusetts, USA
Abstract
Estimation of semantic similarity and relatedness between biomedical concepts has utility for many informatics applications. Automated methods fall into two categories: methods based on distributional statistics drawn from text corpora, and methods using the structure of existing knowledge resources. Methods in the former category disregard taxonomic structure, while those in the latter fail to consider semantically relevant empirical information. In this paper, we present a method that retrofits distributional context vector representations of biomedical concepts using structural information from the UMLS Metathesaurus, such that the similarity between vector representations of linked concepts is augmented. We evaluated it on the UMNSRS benchmark. Our results demonstrate that retrofitting of concept vector representations leads to better correlation with human raters for both similarity and relatedness, surpassing the best results reported to date. They also demonstrate a clear improvement in performance on this reference standard for retrofitted vector representations, as compared to those without retrofitting.
Keywords: Semantic Measures, Word Embedding, Distributional Semantics, Taxonomy
Introduction
Incorporation of semantically related terms and concepts can improve the retrieval [1; 2] and clustering [3] of biomedical documents; enhance literature-based discovery [4; 5]; and support the development of biomedical terminologies and ontologies [6]. However, automated estimation of the semantic relatedness between medical terms in a manner consistent with human judgment remains a challenge in the biomedical domain. Many existing semantic relatedness measures leverage the structure of an ontology or taxonomy (e.g., WordNet, the Unified Medical Language System (UMLS), or the Medical Subject Headings (MeSH)) to calculate, for example, the shortest path between concept nodes [7-9]. Alternatively, vector representations derived from distributional statistics drawn from a corpus of text can be used to calculate the relatedness between concepts [7; 10]. Other corpus-based methods use information content (IC) to estimate the semantic relatedness between two concepts from the probability of these concepts co-occurring [9; 11; 12]. This raises the question of whether knowledge- or corpus-based metrics are most consistent with human judgment.
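In the distributional approach just described, relatedness between two concepts is typically scored as the cosine of the angle between their context vectors. A minimal sketch of that scoring step (the concept names and vector values below are invented for illustration, not taken from the paper):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two distributional context vectors.
    Returns 1.0 for identical directions and 0.0 for orthogonal vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Hypothetical context vectors for two related concepts.
diabetes = np.array([0.8, 0.1, 0.6])
insulin = np.array([0.7, 0.2, 0.5])
score = cosine_similarity(diabetes, insulin)
```

A high cosine score indicates that the two concepts occur in similar distributional contexts in the corpus.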
In 2012, Garla and Brandt [13] evaluated a wide range of lexical semantic measures, including both knowledge-based approaches leveraging the structure of an ontology or taxonomy [7; 14; 15] and distributional (corpus-based) approaches relying on co-occurrence statistics to estimate relatedness between concepts [16; 17]. This systematic investigation used several publicly available benchmarks. The most comprehensive of these is the University of Minnesota Semantic Relatedness Standards (UMNSRS), which contains the largest number and diversity of medical term pairs of any reference standard to date [18]. Medical terms in the set have been mapped to Concept Unique Identifiers (CUIs) in the UMLS, and term pairs have been annotated by human raters for similarity (e.g., Lipitor and Zocor are similar) and relatedness (e.g., Diabetes and Insulin are related). The best Spearman rank correlations for relatedness and similarity on this benchmark reported in [13] are 0.39 and 0.46, respectively.
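Agreement with human raters on such a benchmark is measured by Spearman rank correlation, i.e., the Pearson correlation of the rank-transformed scores. A minimal sketch, assuming no tied scores (in practice a library routine that handles ties, such as scipy.stats.spearmanr, would be used):

```python
import numpy as np

def spearman(human, model):
    """Spearman rank correlation: Pearson correlation of the ranks of
    the human ratings and the model's similarity estimates.
    Assumes no ties among the scores."""
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(1, len(x) + 1)
        return r
    h, m = ranks(np.asarray(human)), ranks(np.asarray(model))
    return float(np.corrcoef(h, m)[0, 1])
```

Because only ranks matter, any monotonic rescaling of the model's scores leaves the correlation unchanged, which is why Spearman correlation is the standard metric for this benchmark.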
Neural network-based models that are trained to predict terms neighboring observed terms, such as the architectures implemented by the word2vec package [19], have gained popularity as a way to obtain distributional vector representations of terms. Vectors induced in this way have been shown to effectively capture analogical relationships between words [20], and under optimized hyperparameter settings these models have been shown to achieve better correlation with human judgment than prior distributional models such as Pointwise Mutual Information (PMI) and Latent Semantic Analysis (LSA) on some word similarity and analogy reference datasets [21; 22]. However, embedding models are trained on terms, not concepts. In 2014, De Vine and colleagues [23] demonstrated that word embedding models trained on sequences of UMLS concepts (rather than sequences of terms) outperformed established corpus-based approaches such as Random Indexing [24] and LSA [25].
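The idea of training on concept sequences can be made concrete: each sentence is first mapped to a sequence of CUIs, and a skip-gram model is then trained to predict context concepts from each target concept. A sketch of how the (target, context) training pairs would be extracted from such sequences (the CUI values below are placeholders for illustration, not drawn from the paper):

```python
def skipgram_pairs(sequences, window=2):
    """Yield (target, context) pairs from concept (CUI) sequences --
    the training examples a skip-gram embedding model is fit on."""
    for seq in sequences:
        for i, target in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield target, seq[j]

# One hypothetical "sentence" rendered as a CUI sequence.
pairs = list(skipgram_pairs([["C0011849", "C0202098", "C0021641"]], window=1))
```

Feeding concept identifiers rather than surface terms means that synonymous terms mapped to the same CUI contribute to a single vector, which is what allows these models to operate at the concept level.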
In 2014, Sajadi et al. reported that a graph-based approach (HITS-sim) leveraging Wikipedia as a network outperformed word2vec trained on the OHSUMED corpus on the UMNSRS benchmark, with
…(Full text truncated)…
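Although the full text is truncated before the methods section, the retrofitting procedure summarized in the abstract follows the general scheme popularized by Faruqui et al.: each concept vector is iteratively pulled toward its neighbors in the knowledge graph while remaining anchored to its original distributional position. A hedged sketch of that generic update (the paper's exact weighting and choice of UMLS edges are not reproduced here):

```python
import numpy as np

def retrofit(vectors, edges, alpha=1.0, beta=1.0, iterations=10):
    """Generic retrofitting: nudge each concept vector toward its
    graph neighbors while staying close to its original position.

    vectors: dict mapping concept id -> 1-D numpy array (left unmodified)
    edges:   dict mapping concept id -> list of linked concept ids
    """
    # Work on copies so the original distributional vectors are preserved.
    new_vecs = {c: v.copy() for c, v in vectors.items()}
    for _ in range(iterations):
        for concept, neighbors in edges.items():
            neighbors = [n for n in neighbors if n in new_vecs]
            if concept not in new_vecs or not neighbors:
                continue
            # Weighted average of the original vector and the neighbors'
            # current vectors: the closed-form minimizer of the per-vector
            # retrofitting objective.
            num = alpha * vectors[concept]
            num = num + beta * sum(new_vecs[n] for n in neighbors)
            new_vecs[concept] = num / (alpha + beta * len(neighbors))
    return new_vecs
```

After a few iterations, vectors of concepts linked in the graph move closer together, which is precisely the "augmented similarity between vector representations of linked concepts" the abstract describes.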
This content is AI-processed based on ArXiv data.