Exploring semantically-related concepts from Wikipedia: the case of SeRE


In this paper we present our web application SeRE, designed to explore semantically related concepts. Wikipedia and DBpedia are rich data sources from which to extract related entities for a given topic, such as in- and out-links, broader and narrower terms, and categorisation information. We use the Wikipedia full-text body to compute the semantic relatedness of the extracted terms, which results in a list of the entities that are most relevant to a topic. For any given query, the user interface of SeRE visualizes these related concepts, ordered by semantic relatedness, with snippets from Wikipedia articles that explain the connection between the two entities. In a user study we examine how SeRE can be used to find important entities and their relationships for a given topic, and how the classification system can be used for filtering.


💡 Research Summary

The paper presents SeRE (Semantic Relatedness Explorer), a web‑based tool that discovers and visualizes concepts semantically related to a user’s query by exploiting Wikipedia and DBpedia. Unlike many existing semantic search systems that rely primarily on structured data such as DBpedia’s infoboxes or predefined ontology links, SeRE computes relatedness directly from Wikipedia’s full‑text content. The authors adapt the Normalized Google Distance (NGD) to Wikipedia, coining the Wikipedia Normalized Distance (WND). WND uses three counts obtained from Wikipedia’s full‑text search API: the number of hits for term A, the number of hits for term B, and the number of hits for the Boolean “A AND B”. These counts, together with the total number of Wikipedia articles, yield a similarity score between 0 and 1, where higher values indicate stronger semantic association.
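The computation described above can be sketched as follows. This is a minimal illustration, not the authors' code: it applies the published Normalized Google Distance formula to the three hit counts and the article total. NGD itself is a distance (0 means the terms always co-occur), and the summary does not say how SeRE turns it into a similarity, so the conversion `1 - distance`, clamped to the range 0 to 1, is an assumption made here. The function names are hypothetical.

```python
import math


def normalized_distance(hits_a: int, hits_b: int, hits_ab: int, total_articles: int) -> float:
    """NGD formula applied to Wikipedia hit counts (0 = always co-occur)."""
    if hits_a <= 0 or hits_b <= 0 or hits_ab <= 0:
        # No co-occurrence at all: treat the terms as infinitely far apart.
        return math.inf
    log_a, log_b, log_ab = math.log(hits_a), math.log(hits_b), math.log(hits_ab)
    denominator = math.log(total_articles) - min(log_a, log_b)
    if denominator <= 0:
        # Degenerate case: the rarer term occurs in every article.
        return 0.0
    return max(0.0, (max(log_a, log_b) - log_ab) / denominator)


def relatedness(hits_a: int, hits_b: int, hits_ab: int, total_articles: int) -> float:
    """ASSUMED conversion to a similarity in [0, 1]; higher means more related."""
    distance = normalized_distance(hits_a, hits_b, hits_ab, total_articles)
    return max(0.0, 1.0 - distance)
```

With hit counts of 1000, 100, and 10 in a corpus of one million articles, the distance works out to about 0.5, so this assumed similarity is also about 0.5. A pair that never co-occurs scores 0, which matches the summary's mention of discarding zero-score candidates.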

The system workflow proceeds as follows. A user enters a query; the Wikipedia query module returns the ten most relevant articles, and the top result is taken as the canonical Wikipedia concept for the query. From this concept the system retrieves in‑links and out‑links via the Wikipedia API, and broader/narrower terms and category information via the DBpedia SPARQL endpoint. All candidate terms are collected into an array and processed in parallel. For each candidate, the WND score relative to the original concept is calculated. Candidates with a non‑zero score are enriched with a single most‑populated DBpedia category, a thumbnail (from Wikipedia or DBpedia), and textual snippets that explain the relationship. The snippets are obtained either by extracting the sentence containing the link in the original article or, if the link is absent, by issuing a Boolean “original AND candidate” full‑text search and taking the returned excerpts. The final result set is sorted first by the size of the assigned category (i.e., the number of entries in that category) and then by the WND score. An optional caching layer stores previously computed results, allowing the entire pipeline to respond within a few seconds even when hundreds of parallel API calls are issued.
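The scoring and ranking steps of this workflow can be sketched as below. It is an illustration under stated assumptions, not SeRE's implementation: `Candidate`, `rank_candidates`, and the `score_fn` callback are hypothetical names, a thread pool stands in for whatever parallelism the system uses, and both sort keys are assumed to be descending because the summary does not give the direction.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    title: str
    category: str        # the single DBpedia category assigned to the term
    category_size: int   # number of entries in that category
    score: float = 0.0   # relatedness to the query concept


def rank_candidates(
    candidates: Iterable[Candidate],
    score_fn: Callable[[str], float],
    max_workers: int = 8,
) -> List[Candidate]:
    """Score candidates in parallel, drop zero scores, then sort."""
    items = list(candidates)
    # score_fn is a placeholder for the relatedness lookup (API calls in SeRE).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, (c.title for c in items)))
    kept = [replace(c, score=s) for c, s in zip(items, scores) if s > 0]
    # ASSUMPTION: larger categories first, then higher relatedness first.
    kept.sort(key=lambda c: (-c.category_size, -c.score))
    return kept
```

Passing the scorer in as a callback keeps the sketch runnable without network access. A caching layer like the one mentioned above could be approximated by wrapping `score_fn` with `functools.lru_cache`.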

The user interface evolved from an initial circular layout (similar to EyePlorer) to a more space‑efficient ranked list. The top of the interface contains a language selector (English/German), a search box with autocomplete, and an infobox showing the chosen concept’s thumbnail, link, and short description. Below, related concepts appear as small panels ordered by decreasing semantic relatedness. Each panel displays a thumbnail, a link to the Wikipedia article, and a colored bar whose hue ranges from red (high relatedness) to blue (low relatedness). This dual encoding, by position in the list and by bar color, reinforces the relevance ranking. Hovering over a panel opens a popup with the textual snippet and additional links. A dropdown filter lets users restrict the list to specific DBpedia categories, supporting focused exploration. The system is multilingual; because Wikipedia editions differ in content and categorisation, the same query may yield different related concepts in English and in German.
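The color encoding of the bar can be illustrated with a small mapping from a relatedness score to a color. The summary only states that the bar runs from red (high) to blue (low); the linear blend in RGB space and the function name `relatedness_to_hex` are assumptions chosen for this sketch, and the real interface may interpolate differently.

```python
def relatedness_to_hex(score: float) -> str:
    """Map a score in [0, 1] to a hex color between blue (0) and red (1)."""
    clamped = min(1.0, max(0.0, score))   # guard against out-of-range scores
    red = round(255 * clamped)            # more related -> more red
    blue = round(255 * (1.0 - clamped))   # less related -> more blue
    return f"#{red:02x}00{blue:02x}"
```

A fully related concept maps to `#ff0000` and an unrelated one to `#0000ff`, with intermediate scores blending through purple under this assumed interpolation.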

To evaluate usability and effectiveness, the authors conducted a small user study with nine participants (eight male, one female, ages 26–40, all with graduate degrees in computer science). Participants performed three tasks (identifying five key persons in Angela Merkel’s political career, discovering possible relations between Angela Merkel and Jean‑Claude Juncker, and a third, unspecified task), first using Google Search and then using SeRE. After each task they rated their confidence and the perceived difficulty and gave qualitative feedback. The study aimed to answer whether SeRE’s interface is intuitive, how it compares to Google for finding related entities, whether sorting by semantic relatedness aids discovery, and which approach users prefer overall.

Results indicated that participants found SeRE’s visual cues (color coding, ordered list, snippet popups) helpful for understanding relationships and for filtering by category. The ability to see a concise set of highly related concepts reduced the cognitive load compared to scanning a long list of Google results. However, participants noted that Google still delivered a larger quantity of information and sometimes retrieved more up‑to‑date or diverse sources. The response time of SeRE, while generally a few seconds, was occasionally slower than Google, especially when many candidates were processed. Overall, participants expressed a preference for using SeRE when the goal is to explore semantic neighborhoods of a known entity, but they would revert to Google for broader, open‑ended searches.

The paper’s contributions are threefold: (1) introducing a full‑text‑based semantic relatedness metric (WND) that leverages Wikipedia’s rich textual corpus; (2) demonstrating a scalable, live computation pipeline that integrates Wikipedia and DBpedia via parallel API calls and caching; (3) presenting a UI that combines list‑based ranking with visual encoding to support intuitive exploration of related concepts. Limitations include dependence on Wikipedia’s search API rate limits, variability of results across language editions, and the small, highly educated participant pool, which restricts generalizability. Suggested future work includes expanding the evaluation to a larger, more diverse user base, integrating additional knowledge bases (e.g., domain‑specific ontologies) to improve coverage, and refining the relatedness metric by incorporating link‑based measures or embedding techniques.

