Content-based Text Categorization using Wikitology

A major computational burden in document clustering is the calculation of a similarity measure between a pair of documents. A similarity measure is a function that assigns a real number between 0 and 1 to a pair of documents, depending on the degree of similarity between them: a value of zero means the documents are completely dissimilar, whereas a value of one indicates that they are practically identical. Traditionally, vector-based models have been used to compute document similarity. These models represent documents by the features present in them but, in general, cannot account for the semantics of a document. Documents written in human languages describe contexts, and the words used to describe these contexts are generally semantically related. Motivated by this fact, many researchers have proposed semantics-based similarity measures that annotate text using external thesauruses such as WordNet (a lexical database). In this paper, we define a semantic similarity measure based on documents represented as topic maps. Topic maps are rapidly becoming an industry standard for knowledge representation, with a focus on later search and extraction. Documents are transformed into topic-map-encoded knowledge, and the similarity between a pair of documents is expressed as a correlation between their common patterns. Experimental studies on text-mining datasets show that this new similarity measure is more effective than commonly used similarity measures in text clustering.


💡 Research Summary

The paper tackles one of the most computationally intensive steps in text clustering – the calculation of pairwise document similarity – by moving away from traditional vector‑based approaches that rely solely on term frequencies. Conventional models such as TF‑IDF vectors combined with cosine, Jaccard, or Euclidean distances capture only surface‑level lexical overlap and ignore the rich semantic relationships that naturally occur in human language. While earlier attempts to incorporate semantics have used lexical resources like WordNet, these resources are limited in coverage, especially for emerging terminology and domain‑specific concepts.
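To make the baseline concrete, the following is a minimal stdlib-only sketch of the conventional approach the paper argues against: TF-IDF weighting combined with cosine similarity. It captures only lexical overlap, which is exactly the limitation the summary describes; the tokenizer and corpus here are illustrative, not taken from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF: term frequency weighted by log inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: how many documents each term appears in.
    df = Counter(t for toks in tokenized for t in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Because the score depends only on shared surface tokens, two documents about the same topic that use different vocabulary score zero, which motivates the semantic alternative below.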

To overcome these shortcomings, the authors propose a novel similarity measure built on topic maps, an ISO‑standard knowledge‑representation format that encodes concepts (topics) and the relationships between them as a directed graph. The workflow consists of three main stages: (1) Text preprocessing – tokenization, part‑of‑speech tagging, noun‑phrase extraction, and named‑entity recognition; (2) Mapping to external knowledge – each extracted term is linked to a corresponding Wikipedia‑derived entity using the Wikitology repository, which supplies hierarchical (broader‑narrower), associative, and equivalence relations; (3) Construction of a topic‑map graph – the document is transformed into a multi‑layered graph where nodes represent topics and edges represent the various semantic relations.
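The three stages above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: stage 1 is reduced to plain tokenization (standing in for POS tagging, noun-phrase extraction, and NER), and the Wikitology repository is mocked with a small hypothetical in-memory table, since its actual interface is not described in this summary.

```python
import re

# Hypothetical stand-in for the Wikitology repository: maps a surface term to a
# Wikipedia-derived entity plus hierarchical (broader) and associative relations.
WIKITOLOGY = {
    "clustering": {"entity": "Cluster_analysis",
                   "broader": ["Machine_learning"], "related": ["Data_mining"]},
    "wordnet":    {"entity": "WordNet",
                   "broader": ["Lexical_database"], "related": ["Semantics"]},
}

def extract_terms(text):
    """Stage 1 (simplified): lowercase word tokens stand in for the paper's
    full preprocessing pipeline."""
    return re.findall(r"[a-z]+", text.lower())

def build_topic_map(text):
    """Stages 2-3: link terms to entities and emit a graph as
    (nodes, typed edges), where each edge is (topic, relation, topic)."""
    nodes, edges = set(), set()
    for term in extract_terms(text):
        entry = WIKITOLOGY.get(term)
        if entry is None:
            continue  # term has no entity in the knowledge base
        topic = entry["entity"]
        nodes.add(topic)
        for parent in entry["broader"]:     # hierarchical relation
            nodes.add(parent)
            edges.add((topic, "broader", parent))
        for assoc in entry["related"]:      # associative relation
            nodes.add(assoc)
            edges.add((topic, "related", assoc))
    return nodes, edges
```

In the paper the graph is multi-layered and the entity linking draws on the full Wikipedia-derived repository; this sketch keeps only the shape of the output: topics as nodes, semantic relations as typed edges.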

Similarity between two documents is then computed by identifying common sub‑graphs within their respective topic maps. The authors define three quantitative components for this comparison: (i) the count of shared topics, (ii) the proportion of shared relation types (e.g., “is‑a”, “part‑of”, “related‑to”), and (iii) a sub‑graph matching score that combines node‑matching and edge‑matching metrics. These components are weighted and summed to produce a similarity value in the interval [0, 1].
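The weighted combination of the three components can be illustrated as below. The weights and the Jaccard-style set overlaps are assumptions for the sake of a runnable sketch; the paper's exact component definitions and weighting are not reproduced here. Each topic map is a `(nodes, typed_edges)` pair as produced in the pipeline stage above.

```python
def topic_map_similarity(map_a, map_b, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of three overlap components; illustrative weights only.
    Each map is (set of topic nodes, set of (topic, relation, topic) edges)."""
    (nodes_a, edges_a), (nodes_b, edges_b) = map_a, map_b
    w_topic, w_rel, w_sub = weights

    def jaccard(x, y):
        return len(x & y) / len(x | y) if (x | y) else 0.0

    topic_score = jaccard(nodes_a, nodes_b)            # (i) shared topics
    rel_types_a = {r for _, r, _ in edges_a}
    rel_types_b = {r for _, r, _ in edges_b}
    rel_score = jaccard(rel_types_a, rel_types_b)      # (ii) shared relation types
    # (iii) sub-graph match: equal mix of node overlap and typed-edge overlap
    sub_score = 0.5 * jaccard(nodes_a, nodes_b) + 0.5 * jaccard(edges_a, edges_b)

    return w_topic * topic_score + w_rel * rel_score + w_sub * sub_score
```

Since each component lies in [0, 1] and the weights sum to 1, the combined score stays in [0, 1]: identical maps score 1, fully disjoint maps score 0.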

