Toward Network-based Keyword Extraction from Multitopic Web Documents
In this paper we analyse the selectivity measure, calculated from complex networks, for the task of automatic keyword extraction. Texts collected from different web sources (portals, forums) are represented as directed, weighted co-occurrence networks of words: words are nodes, and a link is established between two nodes if the words directly co-occur within a sentence. We test different centrality measures for ranking nodes as keyword candidates, and promising results are achieved using the selectivity measure. We then propose an approach that extracts word pairs according to the values of the in/out selectivity and weight measures, combined with filtering.
💡 Research Summary
The paper proposes an unsupervised, language‑independent method for extracting keywords from noisy, multitopic web documents by exploiting the selectivity measure of complex word‑co‑occurrence networks. Four Croatian web corpora (two portals, a forum index, and a daily newspaper) were collected, pre‑processed (symbol cleaning, diacritic normalization, punctuation removal) and transformed into directed, weighted graphs: each unique word becomes a node, and a directed edge from word i to word j is created whenever the two appear consecutively in a sentence, with the edge weight equal to the number of such co‑occurrences. The resulting networks contain between 9,500 and 27,700 nodes and 25,000–105,000 edges, depending on the dataset.
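The construction step described above can be sketched in a few lines of Python; the function name and the toy corpus are illustrative assumptions, not taken from the paper:

```python
from collections import defaultdict

def build_cooccurrence_network(sentences):
    """Directed, weighted co-occurrence graph stored as nested dicts:
    edges[i][j] = number of times word j directly follows word i
    within a sentence."""
    edges = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        for i, j in zip(words, words[1:]):  # consecutive word pairs
            edges[i][j] += 1
    return edges

corpus = [
    "the network captures word order",
    "the network captures edge weights",
]
edges = build_cooccurrence_network(corpus)
print(edges["network"]["captures"])  # 2: "network captures" occurs twice
```

A nested-dict adjacency structure keeps the sketch dependency-free; in practice a graph library could store the same directed, weighted edges.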
Traditional centrality metrics—degree, closeness, and betweenness—were first evaluated. When ranking nodes by these measures, the top‑10 lists were dominated by function words (e.g., “i”, “je”, “na”), confirming that classic graph‑based keyword extraction often returns stop‑words in web‑scale, noisy texts.
The authors then focus on selectivity, defined as the ratio of a node’s strength (sum of incident edge weights) to its degree (number of incident edges). For directed graphs, in‑selectivity and out‑selectivity are computed analogously using in‑strength/in‑degree and out‑strength/out‑degree. Selectivity captures how strongly a word is connected to a diverse set of neighbours, thereby highlighting content‑bearing terms. Indeed, the top‑10 nodes by selectivity are predominantly open‑class words (nouns, verbs, adjectives) such as “mladičevi”, “seksualnog”, and “skandal”, which are plausible keyword candidates.
To move from single words to multi‑word expressions, the authors introduce a pair extraction step. For each node, the neighbour with the highest edge weight is identified: for in‑selectivity the neighbour is the one with the strongest outgoing edge, and for out‑selectivity the neighbour with the strongest incoming edge. This yields a set of ordered word‑pairs (tuples) that reflect strong collocations. Three filtering strategies are then applied: (1) a stop‑word filter that discards any tuple containing a function word; (2) a high‑weight filter that retains only tuples where the selectivity value equals the edge weight (i.e., the node’s strength is concentrated on that single neighbour); and (3) a combined filter that enforces both conditions.
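The pair-extraction and filtering steps can be sketched as follows; the stop-word list and function names are illustrative assumptions, and only the out-selectivity direction is shown:

```python
STOP_WORDS = {"i", "je", "na", "u"}  # tiny illustrative stop-word list

def extract_pairs(edges):
    """Pair each node with its strongest outgoing neighbour, keeping
    the winning edge weight and the node's out-selectivity."""
    pairs = []
    for node, nbrs in edges.items():
        if not nbrs:
            continue
        best = max(nbrs, key=nbrs.get)
        selectivity = sum(nbrs.values()) / len(nbrs)
        pairs.append((node, best, nbrs[best], selectivity))
    return pairs

def filter_pairs(pairs, use_stopwords=True, use_weight=True):
    """Filter (1): drop tuples containing a stop word.
    Filter (2): keep tuples whose selectivity equals the edge weight,
    i.e. the node's out-strength is concentrated on that neighbour.
    Enabling both reproduces the combined filter (3)."""
    kept = []
    for a, b, w, sel in pairs:
        if use_stopwords and (a in STOP_WORDS or b in STOP_WORDS):
            continue
        if use_weight and sel != w:
            continue
        kept.append((a, b))
    return kept

edges = {
    "je": {"na": 3, "u": 1},
    "ovjesne": {"jedrilice": 4},
}
print(filter_pairs(extract_pairs(edges)))  # [('ovjesne', 'jedrilice')]
```

The combined filter keeps only "ovjesne jedrilice" here: the "je" tuples contain stop words, and "je" also spreads its weight over two neighbours, so its selectivity (2.0) differs from its strongest edge weight (3).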
The resulting tuples were examined for each corpus. In the “NN” dataset, examples such as “nacionalne novine” (national newspapers), “srpsku nacionalnu” (Serbian national), and “ovjesne jedrilice” (hang gliders) appear among the highest‑ranked pairs, demonstrating that the method successfully isolates meaningful domain‑specific collocations. Similar patterns were observed in the other three corpora, confirming the robustness of the approach across different topics and document types.
Key advantages of the proposed pipeline are its unsupervised nature (no labeled data or supervised learning required) and its minimal linguistic prerequisites—only a stop‑word list is needed for filtering. Consequently, the method can be deployed on large‑scale web crawls where manual annotation is infeasible. The authors acknowledge limitations: rare words with low co‑occurrence counts may receive low selectivity scores and be missed, and the approach does not capture deeper contextual semantics beyond co‑occurrence strength. Future work is suggested to integrate semantic embeddings or multi‑scale clustering to mitigate these issues.
In conclusion, the study demonstrates that selectivity‑based graph analysis is an effective tool for keyword extraction from multitopic, noisy web texts. By focusing on the ratio of strength to degree, the method filters out high‑frequency function words and surfaces content‑rich terms and collocations, offering a practical solution for applications such as automatic summarization, search engine optimization, and large‑scale text mining.