Comparing intermittency and network measurements of words and their dependency on authorship


Many features from texts and languages can now be inferred from statistical analyses using concepts from complex networks and dynamical systems. In this paper we quantify how topological properties of word co-occurrence networks and intermittency (or burstiness) in word distribution depend on the style of authors. Our database contains 40 books from 8 authors who lived in the 19th and 20th centuries, for which the following network measurements were obtained: clustering coefficient, average shortest path lengths, and betweenness. We found that the two factors with the strongest dependency on the authors were the skewness in the distribution of word intermittency and the average shortest paths. Other factors such as the betweenness and the Zipf’s law exponent show only weak dependency on authorship. Also assessed was the contribution from each measurement to authorship recognition using three machine learning methods. The best performance was a ca. 65 % accuracy upon combining complex network and intermittency features with the nearest neighbor algorithm. From a detailed analysis of the interdependence of the various metrics it is concluded that the methods used here are complementary for providing short- and long-scale perspectives of texts, which are useful for applications such as identification of topical words and information retrieval.


💡 Research Summary

The paper investigates how two distinct families of textual descriptors—complex‑network metrics derived from word co‑occurrence graphs and intermittency (burstiness) measures of word occurrence—vary with authorial style. The authors assembled a corpus of 40 novels, five works each from eight writers active between the early 19th and mid‑20th centuries. To control for length effects, every book was truncated to the first 18,200 tokens, which corresponds to the shortest text in the set. Pre‑processing removed stop‑words (articles, prepositions, adverbs) and applied lemmatization via the MXPost part‑of‑speech tagger, thereby collapsing inflectional variants into single lexical items and focusing the analysis on semantically meaningful words.
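The truncation and stop-word steps above can be sketched in a few lines. This is a minimal stand-in, not the paper's actual pipeline: the toy `STOPWORDS` set and the regex tokenizer are illustrative assumptions, and the MXPost-based lemmatization is omitted entirely.

```python
import re

# Toy stop-word list standing in for the paper's removal of articles,
# prepositions, and adverbs; the original work used the MXPost tagger,
# which is not reproduced here.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "very"}

def preprocess(text, limit=18_200):
    """Lower-case, tokenize, truncate to the first `limit` tokens,
    and drop stop-words (lemmatization omitted in this sketch)."""
    tokens = re.findall(r"[a-z']+", text.lower())[:limit]
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The sand of the dunes shifted in the wind."))
# ['sand', 'dunes', 'shifted', 'wind']
```

The `limit` default mirrors the 18,200-token cutoff reported in the paper; everything else is placeholder machinery.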

From each processed text a directed weighted co‑occurrence network was built: vertices represent distinct words, and a directed edge v_i → v_j carries weight w_ij equal to the number of times word v_j immediately follows v_i. An undirected, unweighted adjacency matrix A was also derived for traditional graph‑theoretic calculations. Three local network measures were computed for every vertex: (i) clustering coefficient C_i, quantifying the probability that two neighbors of a word are themselves linked; (ii) average shortest‑path length L_i, i.e., the mean geodesic distance from the word to all other vertices in the unweighted graph; and (iii) betweenness centrality B_i, reflecting how often a word lies on shortest paths between other word pairs. In parallel, intermittency was quantified by examining the distribution of inter‑arrival times for each word; the skewness (the standardized third central moment) of this distribution was taken as a global feature for the whole book.
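The construction just described can be sketched with stdlib-only Python. This is an illustrative reading of the definitions, not the authors' code: it builds the directed weighted edge counts and an undirected neighbor map, then computes C_i, L_i (via BFS), and the inter-arrival skewness for a single word. Betweenness B_i is omitted here for brevity (a library such as networkx provides `betweenness_centrality` for that).

```python
from collections import defaultdict, deque

def cooccurrence_graph(tokens):
    """Directed weighted edges w[(u, v)] counting how often v immediately
    follows u, plus an undirected neighbor map for C_i and L_i."""
    w = defaultdict(int)
    nbrs = defaultdict(set)
    for u, v in zip(tokens, tokens[1:]):
        w[(u, v)] += 1
        nbrs[u].add(v)
        nbrs[v].add(u)
    return w, nbrs

def clustering(nbrs, i):
    """C_i: fraction of neighbor pairs of i that are themselves linked."""
    ns = list(nbrs[i])
    k = len(ns)
    if k < 2:
        return 0.0
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if ns[b] in nbrs[ns[a]])
    return 2 * links / (k * (k - 1))

def avg_shortest_path(nbrs, i):
    """L_i: mean BFS distance from i to every reachable vertex."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in nbrs[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for v, d in dist.items() if v != i]
    return sum(others) / len(others) if others else 0.0

def intermittency_skewness(tokens, word):
    """Skewness (standardized third central moment) of the word's
    inter-arrival time distribution."""
    pos = [k for k, t in enumerate(tokens) if t == word]
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    n = len(gaps)
    if n < 2:
        return 0.0
    mu = sum(gaps) / n
    var = sum((g - mu) ** 2 for g in gaps) / n
    if var == 0:
        return 0.0
    return sum((g - mu) ** 3 for g in gaps) / (n * var ** 1.5)

# Toy token stream, purely for illustration.
tokens = ["a", "b", "c", "a", "b", "a", "c"]
w, nbrs = cooccurrence_graph(tokens)
print(clustering(nbrs, "a"))  # 1.0: a's two neighbors b and c are linked
```

On a real book the per-word values would be aggregated (mean, variance, skewness across the vocabulary) into book-level features, as the summary describes.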

Statistical analysis revealed that the skewness of the intermittency distribution and the average shortest‑path length display the strongest dependence on authorship. Words with high C_i tend to belong to tightly‑bound semantic fields (e.g., “sand”, “excitement”), whereas low‑C_i words appear in diverse contexts, but C_i showed only modest author discrimination. Betweenness B_i and the Zipf exponent of the word‑frequency distribution varied little across authors, indicating limited utility for stylometry. Correlation between L_i and raw frequency N_i was weak (Pearson r = ‑0.36), confirming that L_i captures structural information beyond mere word count.

To assess practical relevance, the authors constructed feature vectors for each book by aggregating the above metrics (e.g., mean, variance, skewness across words) and fed them to three classifiers: k‑nearest neighbours (k‑NN), support‑vector machines (SVM), and random forests. The best result—approximately 65 % correct attribution—was achieved by k‑NN using a combined set of network and intermittency features. Single‑feature models performed substantially worse (often near chance level), underscoring the complementary nature of the two descriptor families. When traditional stylometric features (function‑word frequencies, punctuation, etc.) were added, overall accuracy rose to above 90 %, demonstrating that the proposed physics‑inspired metrics can augment, but not replace, established methods.
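The attribution step can be illustrated with a minimal 1-nearest-neighbor classifier over book-level feature vectors. This is a sketch under stated assumptions: the feature values and author names below are invented placeholders, and the paper's actual experiments involved 40 books, three classifiers, and proper evaluation protocol rather than this toy setup.

```python
import math

def nearest_neighbor_predict(train, query):
    """1-NN with Euclidean distance over (feature_vector, author) pairs.
    Stands in for the paper's nearest-neighbor classifier."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(train, key=lambda item: dist(item[0], query))[1]

# Hypothetical per-book feature vectors, e.g. (mean L_i, intermittency
# skewness) after standardization -- the numbers are illustrative only.
train = [((0.2, 1.1), "Austen"), ((0.9, -0.3), "Dickens")]
print(nearest_neighbor_predict(train, (0.25, 1.0)))  # Austen
```

In practice one would use a library implementation (e.g. scikit-learn's `KNeighborsClassifier`) with cross-validation; the ~65 % figure in the paper comes from combining both feature families in such a setup.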

The authors conclude that short‑scale (co‑occurrence) and long‑scale (burstiness) analyses provide mutually reinforcing perspectives on textual organization. Average shortest‑path length appears to reflect how closely a word is linked to the core vocabulary of a text, while intermittency skewness captures the uneven temporal deployment of content words—both of which are sensitive to an author’s stylistic habits. These insights have implications beyond authorship attribution, including topical‑word identification, information‑retrieval ranking, and the broader study of linguistic complexity. The work thus showcases how concepts from complex networks and dynamical systems can enrich computational linguistics and stylometry.

