A density compensation-based path computing model for measuring semantic similarity

The shortest path between two concepts in a taxonomic ontology is commonly used to represent the semantic distance between them in edge-based semantic similarity measures. Edge counting has long been the default method for path computation because it is simple, intuitive and has low computational complexity. However, a large lexical taxonomy such as WordNet has irregular densities of links between concepts due to its broad domain, and edge counting-based path computation cannot account for this non-uniformity. In this paper, we advocate that path computation can be separated from the edge-based similarity measures to form various general computing models. To solve the problem of non-uniform concept density in a large taxonomic ontology, we propose a new path computing model based on compensation by the local area density of concepts, which is equal to the number of direct hyponyms of the subsumers of the concepts on their shortest path. This path model treats the local area density of concepts as an extension of the edge-based path and converts the local area density divided by their depth into a compensation for the edge-based path with an adjustable parameter, an idea that has been proven to be consistent with information theory. The model is a general path computing model and can be applied in various edge-based similarity algorithms. The experimental results show that the proposed path model improves the average correlation between edge-based measures and human judgments on the Miller and Charles benchmark from less than 0.8 to more than 0.85, and is considerably more efficient than information content (IC) computation in a dynamic ontology, thereby successfully solving the non-uniformity problem of taxonomic ontology.

💡 Research Summary

The paper addresses a well‑known shortcoming of edge‑based semantic similarity measures that rely on the simple counting of edges along the shortest path between two concepts in a taxonomic ontology such as WordNet. While edge counting is attractive because of its intuitive interpretation and low computational cost, large‑scale lexical taxonomies exhibit highly irregular link densities: some sub‑domains are densely populated with many hyponyms, whereas others are sparse. Consequently, the same number of edges can represent very different semantic distances, and the traditional approach fails to capture this non‑uniformity.

To overcome this problem, the authors propose a density‑compensation path model that separates the path‑computation component from the similarity‑scoring component, allowing the former to be swapped into any existing edge‑based similarity formula (e.g., Leacock‑Chodorow, Wu‑Palmer). The key idea is to augment the raw edge count with a term that reflects the local area density of the concepts that lie on the shortest path. Local area density is defined as the sum of the numbers of direct hyponyms of all subsumer nodes (i.e., ancestors) that appear on the path. Intuitively, a path that traverses a densely populated region should be “shorter” in semantic terms than a path of equal length through a sparse region.
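The definition of local area density can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy taxonomy, the helper names (`ancestors`, `shortest_path`, `edge_count`, `local_density`) and the reading of "subsumer nodes on the path" as every path node that is a proper ancestor of either concept are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "bird": "animal",
    "fish": "animal",
    "tree": "plant",
    "oak": "tree",
    "pine": "tree",
}

# Invert the map to get the direct hyponyms (children) of each concept.
CHILDREN = defaultdict(list)
for child, parent in PARENT.items():
    CHILDREN[parent].append(child)


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def shortest_path(a, b):
    """Shortest path from a to b in a tree, via their lowest common subsumer."""
    up_a, up_b = ancestors(a), ancestors(b)
    in_b = set(up_b)
    lcs = next(n for n in up_a if n in in_b)  # first shared ancestor
    left = up_a[: up_a.index(lcs) + 1]        # a ... lcs
    right = up_b[: up_b.index(lcs)]           # b ... child of lcs
    return left + right[::-1]


def edge_count(a, b):
    """Plain edge counting: number of links on the shortest path."""
    return len(shortest_path(a, b)) - 1


def local_density(a, b):
    """Sum of direct-hyponym counts over the subsumers on the shortest path.

    Assumption: a 'subsumer on the path' is any path node that is a
    proper ancestor of a or of b.
    """
    proper = set(ancestors(a)[1:]) | set(ancestors(b)[1:])
    return sum(len(CHILDREN[n]) for n in shortest_path(a, b) if n in proper)
```

In this toy tree, `dog`/`cat` and `oak`/`pine` are both two edges apart, so edge counting cannot tell them apart. Their local densities differ (4 under `animal`, 2 under `tree`), which is the signal the compensation term is meant to exploit.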

The compensation term is computed as

Comp = λ × (LocalDensity / AvgDepth)

where AvgDepth is the average depth of the nodes on the path, and **λ ∈