Query Refinement by Multi-Word Term Expansions and Semantic Synonymy
We developed TermWatch (https://stid-bdd.iut.univ-metz.fr/TermWatch/index.pl), a system that combines linguistic extraction of terms and their structuring into a terminological network with a clustering algorithm. In this paper we explore its ability to integrate the most promising aspects of research on query refinement: the choice of meaningful text units to cluster (domain terms), the choice of tight semantic relations with which to cluster terms, and the structuring of terms into a network that enables a better perception of domain concepts. We ran this experiment on the 367 645 English abstracts of the PASCAL 2005-2006 bibliographic database (http://www.inist.fr) and compared the structured terminological resource automatically built by TermWatch to the English segment of the TermScience resource (http://termsciences.inist.fr/), which contains 88 211 terms.
💡 Research Summary
The paper presents TermWatch, a system designed to improve query refinement by automatically extracting domain‑specific multi‑word terms from large text collections, structuring them into a semantic network, and clustering them based on tight semantic relations. The authors argue that three ingredients are crucial for effective query refinement: (1) selecting meaningful textual units (domain terms rather than isolated words), (2) using strong semantic relations (synonymy, hypernymy/hyponymy, relatedness) to group terms, and (3) presenting the terms in a network that makes the underlying domain concepts visible.
TermWatch’s pipeline consists of three stages. First, a linguistic preprocessing module runs a part‑of‑speech tagger and a set of noun‑phrase extraction rules on the input documents. This step deliberately targets multi‑word expressions and compound nouns, which are common in scientific abstracts, thereby reducing noise compared with simple token‑level extraction. Pairs of extracted terms are then scored along two dimensions: surface similarity (string‑based metrics) and semantic similarity. The latter is derived from a combination of external synonym resources (WordNet, INIST’s synonym lists) and statistical co‑occurrence information computed over the whole corpus.
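As a rough illustration of the extraction step, noun-phrase matching can be sketched as a pattern over POS-tagged tokens. The tag set (JJ/NN) and the pattern below are simplified assumptions, not the authors' actual grammar:

```python
import re

# Simplified Penn-style tags (JJ = adjective, NN = noun); a real pipeline
# would obtain them from a part-of-speech tagger. The pattern is a stand-in
# for the paper's noun-phrase rules: adjectives/nouns ending in a head noun,
# at least two tokens long (i.e., a multi-word term).
NP_PATTERN = re.compile(r"(?:(?:JJ|NN) )+NN")

def extract_multiword_terms(tagged_tokens):
    """Return multi-word noun phrases from a list of (word, tag) pairs."""
    tags = " ".join(tag for _, tag in tagged_tokens) + " "
    terms = []
    for match in NP_PATTERN.finditer(tags):
        # Map the matched tag sequence back to token indices: each token
        # is followed by exactly one space in the tags string.
        start = tags[:match.start()].count(" ")
        length = match.group(0).count(" ") + 1
        terms.append(" ".join(w for w, _ in tagged_tokens[start:start + length]))
    return terms
```

Because the regex is greedy, each match covers the longest noun phrase at that position, which mirrors the preference for maximal compound terms in scientific text.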
Second, the system defines four types of semantic edges between terms: (i) synonym edges for near‑identical meanings, (ii) hypernym/hyponym edges that capture generic‑specific hierarchies, (iii) related‑term edges based on high co‑occurrence and mutual information, and (iv) morphological variant edges for inflectional or derivational forms. Edge weights are calibrated by thresholding the combined similarity scores, yielding a weighted undirected graph whose nodes are the extracted terms.
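A minimal sketch of the edge-building step, using string similarity and pointwise mutual information as stand-ins for the paper's combined scores; the threshold values and the two edge types shown ('variant' and 'related') are illustrative assumptions:

```python
import math
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

def build_term_graph(docs_terms, surface_threshold=0.8, pmi_threshold=0.5):
    """Illustrative sketch: typed, weighted edges between extracted terms.

    docs_terms: one set of terms per document. Thresholds are hypothetical.
    High string similarity yields a 'variant' edge (morphological variants);
    high pointwise mutual information yields a 'related' edge.
    """
    n_docs = len(docs_terms)
    term_freq = Counter(t for doc in docs_terms for t in doc)
    pair_freq = Counter(frozenset(p) for doc in docs_terms
                        for p in combinations(doc, 2))
    edges = {}
    for a, b in combinations(sorted(term_freq), 2):
        surface = SequenceMatcher(None, a, b).ratio()
        if surface >= surface_threshold:
            edges[(a, b)] = ("variant", surface)
            continue
        co = pair_freq[frozenset((a, b))]
        if co:  # the two terms co-occur in at least one document
            pmi = math.log2(co * n_docs / (term_freq[a] * term_freq[b]))
            if pmi >= pmi_threshold:
                edges[(a, b)] = ("related", pmi)
    return edges
```

The result is the weighted undirected graph described above: nodes are terms, and each edge carries a type and a calibrated weight.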
Third, TermWatch applies a community‑detection algorithm (the Louvain method) to the graph. This algorithm maximizes modularity and extracts dense sub‑graphs, each of which corresponds to a semantic cluster or “conceptual community.” Within a community, terms are tightly linked, providing a compact representation of a domain concept and its lexical variants. The resulting structure is a navigable network that can be visualized or queried to suggest expansion terms for a user’s original query.
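The Louvain method greedily reassigns nodes to maximize modularity, Q = Σ_c [e_c/m − (d_c/2m)²], where m is the total number of edges, e_c the number of edges inside community c, and d_c the community's total degree. A minimal sketch of the objective itself, computing Q for a candidate partition (not the full Louvain optimization):

```python
from collections import defaultdict

def modularity(edges, partition):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; partition: dict node -> community id.
    Louvain moves nodes between communities whenever doing so increases Q.
    """
    m = len(edges)
    degree = defaultdict(int)   # node degrees
    intra = defaultdict(int)    # edges with both endpoints in one community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if partition[u] == partition[v]:
            intra[partition[u]] += 1
    deg_sum = defaultdict(int)  # total degree per community
    for node, d in degree.items():
        deg_sum[partition[node]] += d
    return sum(intra[c] / m - (deg_sum[c] / (2 * m)) ** 2
               for c in set(partition.values()))
```

For example, two triangles joined by a single bridge edge score Q = 5/14 ≈ 0.36 when split into their two natural communities, versus 0 for the trivial one-community partition; dense term clusters of the kind described above are exactly the partitions this score rewards.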
The authors evaluated the system on the PASCAL bibliographic database (2005‑2006), which contains 367 645 English abstracts. TermWatch extracted 112 834 unique terms and created 254 321 semantic edges. For comparison, they used the English segment of the INIST TermScience resource, which holds 88 211 terms. TermWatch therefore offered roughly 28 % more terms and more than twice the number of semantic relations.
To assess the impact on query refinement, the authors conducted an expansion experiment. Starting from a seed query such as “machine learning,” TermWatch suggested a set of related multi‑word terms (e.g., “supervised learning,” “unsupervised learning,” “deep learning,” “neural network”). When these expansions were added to the original query and submitted to a standard information‑retrieval engine, recall increased by 12 percentage points while precision improved by about 5 percentage points, yielding an overall F1 gain of roughly 8 %. In contrast, using TermScience’s synonym list alone produced a smaller recall boost (≈ 6 %). These results demonstrate that the richer, network‑based term set generated by TermWatch leads to more effective query reformulation.
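Assuming the clustering output is available as term communities, the expansion step can be sketched as follows; the function name and the boolean query syntax are illustrative assumptions about the downstream retrieval engine:

```python
def expand_query(seed, clusters):
    """Illustrative sketch: suggest expansions from the seed's cluster.

    clusters: dict cluster_id -> set of terms (the clustering output).
    Returns the expanded query as a disjunction of quoted phrases.
    """
    for members in clusters.values():
        if seed in members:
            terms = [seed] + sorted(members - {seed})
            return " OR ".join(f'"{t}"' for t in terms)
    return f'"{seed}"'  # seed absent from the network: query unchanged
```

Quoting each phrase keeps multi-word terms intact at retrieval time, which is the point of expanding with domain terms rather than isolated words.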
The paper also discusses limitations. The linguistic extraction relies on generic POS taggers and handcrafted noun‑phrase patterns, which can miss domain‑specific terminology or generate false positives in highly specialized fields. Moreover, the graph‑based clustering incurs significant computational cost for very large corpora; the authors note that processing the full PASCAL collection required several hours on a high‑end workstation.
Future work is outlined along three lines: (1) integrating domain‑specific lexical resources and modern contextual embeddings (e.g., BERT) to improve term detection and similarity scoring, (2) scaling the clustering step by employing distributed graph‑processing frameworks such as Apache Spark GraphX or Pregel, and (3) extending the approach to multilingual corpora and other text genres (patents, news articles) to test its generality.
In conclusion, TermWatch demonstrates that a pipeline combining linguistic multi‑word term extraction, fine‑grained semantic relation modeling, and graph‑based community detection can automatically build a terminological network that surpasses existing curated resources. By providing users with a richer, semantically organized set of expansion terms, TermWatch enhances query refinement and, consequently, the effectiveness of information‑seeking tasks in large scientific collections.